All MoM hook events run in a single iteration of the main MoM loop are counted against the job's walltime

Description

MoM sets pjob->ji_qs.ji_stime right before it calls fork() to spawn the job process, and then services the hooks on the child side before becoming the user. Because ji_stime is set BEFORE the execjob_begin (and related) hooks run, the time spent running them is counted as "walltime":

	/* set remaining job structure elements */
	pjob->ji_qs.ji_state = JOB_STATE_RUNNING;
	pjob->ji_qs.ji_substate = JOB_SUBSTATE_PRERUN;
	pjob->ji_qs.ji_stime = time_now;
	pjob->ji_polltime = time_now;
	pjob->ji_wattr[(int)JOB_ATR_mtime].at_val.at_long =
		(long)time_now;
	pjob->ji_wattr[(int)JOB_ATR_mtime].at_flags |= ATR_VFLAG_SET;

	/* np is set from job_nodes_inner */

	/* NULL value passed to hook_input.vnl */
	/* means to assign */
	/* vnode list using pjob->ji_host[]. */
	mom_hook_input_init(&hook_input);
	hook_input.pjob = pjob;

	mom_hook_output_init(&hook_output);
	hook_output.reject_errcode = &hook_errcode;
	hook_output.last_phook = &last_phook;
	hook_output.fail_action = &hook_fail_action;

	switch ((hook_rc=mom_process_hooks(HOOK_EVENT_EXECJOB_BEGIN,
		PBS_MOM_SERVICE_NAME, mom_host,
		&hook_input, &hook_output,
		hook_msg, sizeof(hook_msg), 1))) {

In other words, any time spent in execjob_begin hooks is charged to the job, so users have to take that into account when specifying walltime limits.

The problem can be worse than that. Because time_now (the global that supposedly contains the current time) is only set at the top of the main loop, the user is charged walltime for everything that MoM does within the current loop iteration, including running execjob_end hooks for all the terminating jobs. This can cause walltime to be off by a variable amount, depending on how busy MoM happens to be when the job is starting.
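
To make the effect concrete, here is a minimal, deterministic sketch of the timing skew described above. It is not MoM code: cached_now stands in for the global time_now, and sim_clock, sim_time(), sim_work() and overcharge_seconds() are hypothetical names invented for this illustration. A simulated clock is used instead of time() so the result does not depend on how long the example takes to run.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for MoM's global time_now, refreshed once per loop pass. */
static time_t cached_now;

/* Hypothetical simulated wall clock, so the example is deterministic. */
static time_t sim_clock;

/* Read the simulated clock (plays the role of time(NULL)). */
static time_t sim_time(void)
{
	return sim_clock;
}

/* Pretend that some work took `seconds` of wall-clock time. */
static void sim_work(long seconds)
{
	assert(seconds >= 0);
	sim_clock += seconds;
}

/*
 * Model one pass of the main loop and return how many seconds of
 * walltime are charged to the job before its process really starts.
 */
long overcharge_seconds(long end_hook_secs, long begin_hook_secs)
{
	time_t stime_cached;
	time_t actual_start;

	sim_clock = 1000;            /* arbitrary starting time */
	cached_now = sim_time();     /* set once, at the top of the loop */

	sim_work(end_hook_secs);     /* e.g. execjob_end hooks of other jobs */

	stime_cached = cached_now;   /* start time taken from the stale value */

	sim_work(begin_hook_secs);   /* e.g. execjob_begin hooks for this job */

	actual_start = sim_time();   /* moment the user's process would begin */

	return (long)(actual_start - stime_cached);
}
```

With 30 seconds of earlier work in the loop pass and 12 seconds of execjob_begin hooks, overcharge_seconds(30, 12) returns 42: the job is charged 42 seconds of walltime it never got to use. One possible direction, stated here only as an assumption and not as the agreed fix, would be to record the start time from a fresh clock reading as close as possible to the point where the user's process actually starts.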

Acceptance Criteria

None

Status

Assignee

Minghui Liu

Reporter

Scott Campbell

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Fix versions

Affects versions

14.0.0

Priority

High