pbs_mom is too quick to abort a job

Description

A colleague recently pointed out that a user's job had been aborted by PBS without ever really starting. I looked for a good reason why this might have happened but found little, except that pbs_mom appears willing to abort a job in a number of failure scenarios where a retry seems much more appropriate. Consider these two code chunks pulled from finish_exec() in start_exec.c (13.0 QAB16):

    vtable.v_envp = (char **)malloc(vtable.v_ensize * sizeof(char *));
    if (vtable.v_envp == NULL) {
        log_err(ENOMEM, id, "out of memory");
        starter_return(upfds, downfds, JOB_EXEC_FAIL1, &sjr);
    }

<snip some lines>

    sprintf(buf, "%s/aux/%s", pbs_conf.pbs_home_path, pjob->ji_qs.ji_jobid);
    bld_env_variables(&vtable, variables_else[11], buf);

    if ((nhow = fopen(buf, "w")) == NULL) {
        sprintf(log_buffer, "cannot open %s", buf);
        log_err(errno, id, log_buffer);
        starter_return(upfds, downfds, JOB_EXEC_FAIL1, &sjr);
    }

If my reading of the code is correct, the calls to starter_return() with JOB_EXEC_FAIL1 will eventually cause pbs_server to abort the job (via exec_bail(), scan_for_exiting(), and an Obit being sent). The first chunk is just a malloc failure, possibly an out-of-memory condition that another node might not have, so why abort the job? The second chunk is a failure to open the PBS_NODEFILE, which again might succeed on a different node.
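To make the distinction concrete, here is a minimal sketch of the kind of classification being asked for. It is not taken from the PBS source: the sketch_* names, the outcome codes, the choice of which errno values count as node-local, and the retry cap are all assumptions made for illustration.

```c
#include <errno.h>

/* Hypothetical outcome codes for this sketch only. They are NOT the
 * values PBS uses for JOB_EXEC_FAIL1, JOB_EXEC_FAIL2, and so on. */
enum sketch_exec_result {
    SKETCH_EXEC_OK = 0,
    SKETCH_EXEC_ABORT,  /* failure would probably repeat on any node */
    SKETCH_EXEC_RETRY   /* failure looks node-local; try another node */
};

/* Classify an errno observed while setting up a job on this node.
 * The list of "node-local" errors is an assumption, not PBS policy. */
enum sketch_exec_result sketch_classify_startup_errno(int err)
{
    switch (err) {
    case 0:
        return SKETCH_EXEC_OK;
    case ENOMEM:  /* e.g. malloc() of the environment table failed */
    case ENOSPC:  /* e.g. no space left to write the node file */
    case EMFILE:  /* this process is out of file descriptors */
    case ENFILE:  /* the node's file table is full */
    case EIO:     /* I/O error on this node's disk */
    case EROFS:   /* this node's filesystem is read-only */
        return SKETCH_EXEC_RETRY;
    default:
        return SKETCH_EXEC_ABORT;
    }
}

/* Apply a retry cap so that a job which fails everywhere is still
 * aborted eventually. attempts is the number of starts already tried;
 * max_attempts is a hypothetical site-configurable limit. */
enum sketch_exec_result sketch_decide(int err, int attempts, int max_attempts)
{
    enum sketch_exec_result r = sketch_classify_startup_errno(err);

    if (r == SKETCH_EXEC_RETRY && attempts >= max_attempts)
        return SKETCH_EXEC_ABORT;
    return r;
}
```

Under these assumptions, the malloc failure in the first chunk (ENOMEM) would lead to a retry until the cap is reached, while an error such as EACCES would still abort the job. The retry cap is there so that a job which cannot start anywhere does not bounce between nodes indefinitely.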

There are many such uses of JOB_EXEC_FAIL1, JOB_EXEC_FAIL2, and possibly other codes I haven't found, that result in a job abort. Does Altair have a good reason for doing this, or is it time to let jobs have another crack at it on a different node?

-Customer

Acceptance Criteria

None

Status

Assignee

Unassigned

Reporter

Former user

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Components

Priority

High