Stranded array subjobs after communication hiccup

Description

We ran into odd behavior with the subjobs of an array job. It started with a breakdown in 
communication between the server and the MOM when trying to start two subjobs. 
06/01/2016 17:38:20;0008;Server@pbs-1;Job;42399[1187].pbs-1;Job Run at request of Scheduler@pbs-1 on exec_vnode (r311i0n2:ncpus=16)
06/01/2016 17:38:20;0001;Server@pbs-1;Req;;Could not connect to Mom
06/01/2016 17:38:20;0008;Server@pbs-1;Job;42399[1187].pbs-1;send of job to r311i0n2 failed error = 0
06/01/2016 17:38:20;0008;Server@pbs-1;Job;42399[1187].pbs-1;Unable to Run Job, send to Mom failed
06/01/2016 17:38:20;0008;Server@pbs-1;Job;42399[1188].pbs-1;Job Run at request of Scheduler@pbs-1 on exec_vnode (r311i0n3:ncpus=16)
06/01/2016 17:38:20;0001;Server@pbs-1;Req;;Could not connect to Mom
06/01/2016 17:38:20;0008;Server@pbs-1;Job;42399[1188].pbs-1;send of job to r311i0n3 failed error = 0
06/01/2016 17:38:20;0008;Server@pbs-1;Job;42399[1188].pbs-1;Unable to Run Job, send to Mom failed
Thereafter, during each scheduling cycle where nodes of the appropriate type were available, 
the scheduler kept trying to run the subjobs: 
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[1187].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[1188].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1188].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1188].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
06/01/2016 17:41:15;0080;pbs_sched;Job;42399[].pbs-1;Considering job to run
06/01/2016 17:41:15;0001;pbs_sched;Job;42399[1187].pbs-1;Transient job warning. Job may get held if issue persists:15016
06/01/2016 17:41:15;0040;pbs_sched;Job;42399[1187].pbs-1;Job run
Debug code to issue the "Transient job warning" messages was added…
Note, though, that the scheduler keeps trying to start the same subjob (1187) until it runs out 
of candidate nodes. Meanwhile, the server rejects each run request with: 
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
06/01/2016 17:41:15;0080;Server@pbs-1;Req;req_reject;Reject reply code=15016, aux=0, type=23, from Scheduler@pbs-1
Eventually, these jobs got the scheduler so confused that it core dumped. 
Once the core dump issue was patched around, the scheduler settled down to trying each of 
the two subjobs at most once per cycle. 
Using gdb on the server, we traced error 15016 to this code in req_runjob(): 
	} else if (jt == IS_ARRAY_Single) {
		/* single subjob, if queued, it can be run */
		offset = subjob_index_to_offset(parent, get_index_from_jid(jid));
		if (offset == -1) {
			req_reject(PBSE_UNKJOBID, 0, preq);
			return;
		}
		i = get_subjob_state(parent, offset);
		if (i == -1) {
			req_reject(PBSE_IVALREQ, 0, preq);
			return;
		} else if ((i != JOB_STATE_QUEUED) || (find_job(jid) != NULL)) {
			/* job already running */
			req_reject(PBSE_BADSTATE, 0, preq);	<=== Here
			return;
		}
That is, because the subjobs were instantiated when the server first tried to run them, the 
find_job() part of the condition finds an entry for them, so the server rejects the run request 
with PBSE_BADSTATE (15016). I'm not sure what the fix is: either destroy the subjob instance 
when it cannot be started, or reuse it on the subsequent attempt to run the job. 
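A minimal sketch of the second option (reusing the stranded instance) is below. It is only an 
assumption about what a fix could look like, built from the identifiers visible in the snippet 
above plus a guess that the instantiated subjob is an ordinary job struct whose state is kept in 
ji_qs.ji_state (the local psub and that field access are mine, not from the existing code); it 
simply stops treating a leftover instance that is still queued as "already running": 
	} else if (jt == IS_ARRAY_Single) {
		job *psub;	/* hypothetical local for a leftover subjob instance */

		/* single subjob, if queued, it can be run */
		offset = subjob_index_to_offset(parent, get_index_from_jid(jid));
		if (offset == -1) {
			req_reject(PBSE_UNKJOBID, 0, preq);
			return;
		}
		i = get_subjob_state(parent, offset);
		if (i == -1) {
			req_reject(PBSE_IVALREQ, 0, preq);
			return;
		}
		if (i != JOB_STATE_QUEUED) {
			/* the array's own bookkeeping says the subjob is not queued */
			req_reject(PBSE_BADSTATE, 0, preq);
			return;
		}
		psub = find_job(jid);
		/* assumption: the instance is a regular job struct with its state in ji_qs.ji_state */
		if ((psub != NULL) && (psub->ji_qs.ji_state != JOB_STATE_QUEUED)) {
			/* an instance exists and it really is past the queued state */
			req_reject(PBSE_BADSTATE, 0, preq);
			return;
		}
		/*
		 * Either no instance exists yet, or the instance left behind by the
		 * earlier "send to Mom failed" attempt is still queued (substate 20
		 * in the qstat output below).  Fall through and let the normal run
		 * path reuse that instance instead of rejecting with PBSE_BADSTATE.
		 */
The first option, purging the instance in whatever code path handles the failed send to the 
MOM, would avoid the stranded instance altogether; the sketch above only addresses the 
symptom visible in req_runjob(). 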
FWIW, qstat -f for the subjob: 
Job: 42399[1187].pbs-1 
Job_Name = parallel_jobarray.pbs 
Job_Owner = anon 
job_state = Q 
queue = normal 
server = pbs-1 
Checkpoint = u 
ctime = 1464795958 (Wed Jun 01 08:45:58 PDT 2016) 
Error_Path = anon^array_index^ 
Hold_Types = o 
Join_Path = n 
Keep_Files = n 
Mail_Points = a 
mtime = 1464990673 (Fri Jun 03 14:51:13 PDT 2016) 
Output_Path = anon^array_index^ 
Priority = 50 
qtime = 1464795958 (Wed Jun 01 08:45:58 PDT 2016) 
Rerunable = True 
Resource_List.ncpus = 16 
Resource_List.nodect = 1 
Resource_List.place = scatter:excl 
Resource_List.select = 1:ncpus=16
Resource_List.walltime = 08:00:00 
schedselect = 1:ncpus=16:aoe=sles11:bigmem=False:reboot=free 
substate = 20 
Variable_List = anon 
euser = anon 
egroup = anon 
queue_rank = 23287895 
queue_type = E 
comment = Job held by … on Thu Jun 2 17:34:13 2016 
eligible_time = 00:00:00 
accrue_type = 2 
Submit_arguments = <jsdl-hpcpa:Argument>parallel_jobarray.pbs</jsdl-hpcpa:Argument>

Acceptance Criteria

None

Status

Assignee

Shrinivas Harapanahalli

Reporter

Sam Goosen

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Priority

Low