PBS kills jobs that are suspended when the pbs_comm is restarted

Description

One of our customers suspended all running jobs on their cluster in order to perform some system activities, namely turning off 16 racks slated for retirement. Part of the activity involved restarting pbs_comm on all cluster rack leaders to update their config files with the now-smaller list of pbs_comm instances to connect to (due to the rack retirement).

It turns out that when we restarted pbs_comm we inadvertently killed about 100 of the 300+ suspended jobs. My post-mortem analysis and testing indicate a bug in im_eof() in mom_comm.c regarding what to do when a mom's pbs_comm connection is dropped. I've attached the code in question from 13.1; our version is nearly identical except for the first if-statement:

if (pjob->ji_qs.ji_substate == JOB_SUBSTATE_PRERUN ||
    pjob->ji_qs.ji_substate == JOB_SUBSTATE_RUNNING) {

I believe the bug in both versions of the if-statement is that it only protects running jobs from being killed due to possibly-transient pbs_comm trouble, but doesn't extend the protection to suspended jobs. My guess is that adding JOB_SUBSTATE_SUSPEND as another option in the if-statement will fix the bug and allow suspended multi-node jobs to survive. Based on the 13.1 code, the modified if-statement might look like:

if ((((pjob->ji_qs.ji_svrflags & JOB_SVFLG_HERE) == 0) && (pjob->ji_qs.ji_substate == JOB_SUBSTATE_PRERUN)) ||
    (pjob->ji_qs.ji_substate == JOB_SUBSTATE_RUNNING) ||
    (pjob->ji_qs.ji_substate == JOB_SUBSTATE_SUSPEND)) {

I was able to reproduce the failure by suspending a multi-node job that spanned two or more pbs_comm instances (head node connected to pbs_comm A, some number of sister nodes attached to pbs_comm B), then stopping/restarting pbs_comm A.

Acceptance Criteria

None

Status

Assignee

Ram Pranesh

Reporter

Former user

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Components

Fix versions

Priority

Critical