pbs_mom tries to create an ALPS reservation for any task she starts on a Cray system, so any task beyond the initial job script launch fails

Description

We've been investigating why PAS fails to list files via PBS Pro's PySpawn functionality on a Cray XC system (running 13.0.406.170103).
This appears in the mom logs:
03/30/2017 17:55:25;0100;pbs_mom;Req;;Type 84 request received from root@10.131.255.253:15001, sock=1
03/30/2017 17:55:25;0080;pbs_mom;Job;40.hpc;Resource_List.place = pack
03/30/2017 17:55:25;0080;pbs_mom;Node;alps_create_reservation;Creating ALPS reservation for job.
03/30/2017 17:55:25;0100;pbs_mom;n/a;alps_request;Sending ALPS request: <?xml version="1.0"?> <BasilRequest protocol="1.4" method="RESERVE"> <ReserveParamArray user_name="pbsadmin" batch_id="40.hpc">
<ReserveParam architecture="XT" width="1" depth="1" nppn="1">
<NodeParamArray>
<NodeParam>8</NodeParam>
</NodeParamArray>
</ReserveParam>
</ReserveParamArray>
</BasilRequest>
03/30/2017 17:55:25;0080;pbs_mom;Node;BASIL;ERROR: ALPS error: apsched: resource temporarily unavailable
03/30/2017 17:55:25;0002;pbs_mom;Node;alps_request_parent;TRANSIENT BASIL error from BACKEND: ERROR: ALPS error: apsched: resource temporarily unavailable
03/30/2017 17:55:25;0008;pbs_mom;Job;40.hpc;Transient MPP reservation error on create.
03/30/2017 17:55:25;0001;pbs_mom;Job;40.hpc;task not started, Retry /opt/pbs/default/bin/pbs_python -3
03/30/2017 17:55:25;0001;pbs_mom;Job;40.hpc;req_py_spawn: FAILED pbs_spawn/SpawnWrapper.py ECUIZQIC9TULJM0L2ZEJYPTI hpccray:/var/spool/pas/temp/ECUIZQIC9TULJM0L2ZEJYPTI-script hpccray:/var/spool/pas/temp/ECUIZQIC9TULJM0L2ZEJYPTI-stdin hpccray:/var/spool/pas/temp/ECUIZQIC9TULJM0L2ZEJYPTI-stdout hpccray:/var/spool/pas/temp/ECUIZQIC9TULJM0L2ZEJYPTI-stderr 1 /opt/pbs/default/bin/pbs_python /home/pbsadmin/pbs.40.hpc.x8z -1 task 0000001C err 15010
03/30/2017 17:55:25;0080;pbs_mom;Req;req_reject;Reject reply code=15010, aux=0, type=84, from root@10.131.255.253:15001
A Type 84 request is a PySpawn request:
#define PBS_BATCH_PySpawn 84
It appears that when the mom processes a Type 84 request, she behaves the same way as when a normal batch job starts: she tries to create an ALPS reservation for the "job". That fails because those resources are already in use by the actual job.

Additionally, when the request is not the job's first task, the mom should treat it like a second TM API task started on the mother superior, and that should definitely not create an ALPS reservation.
This is a more general error in set_job: it is called by start_process, so it runs for every task, which means the ALPS interface is broken whenever a job has more than one task on the mother superior. set_job should skip the ALPS reservation unless it is starting the first task (there are many ways to detect that).
The ALPS reservation is also created in an odd place. We don't have the same problem with cpuset creation because that is done in start_exec, which runs only once per job, rather than in set_job, which runs once per task on the node.
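
To illustrate the once-per-job versus once-per-task distinction described above, here is a minimal sketch in C. It is not the pbs_mom source: the struct and every sketch_* function are hypothetical stand-ins, and the "reservation" is only a counter. It shows one possible guard, assuming the job record can remember that a reservation already exists, so that later tasks (such as a PySpawn task) skip the ALPS RESERVE step.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the mom's per-job record.
 * None of these names come from the PBS Pro source. */
struct sketch_job {
    const char *jobid;
    bool alps_reserved;  /* true once a reservation exists for this job */
    int  reserve_calls;  /* counts RESERVE attempts, for illustration only */
    int  tasks_started;  /* number of tasks started on this node */
};

/* Stand-in for sending the ALPS RESERVE request; here it only counts calls. */
static int sketch_alps_reserve(struct sketch_job *job)
{
    job->reserve_calls++;
    return 0; /* 0 means success in this sketch */
}

/* Once-per-job setup: create the reservation only if the job has none yet. */
static int sketch_job_setup(struct sketch_job *job)
{
    if (job->alps_reserved)
        return 0; /* later tasks reuse the existing reservation */
    if (sketch_alps_reserve(job) != 0)
        return -1;
    job->alps_reserved = true;
    return 0;
}

/* Per-task start: runs for every task, but the guard above keeps the
 * reservation step from repeating. */
static int sketch_start_task(struct sketch_job *job)
{
    if (sketch_job_setup(job) != 0)
        return -1;
    job->tasks_started++;
    return 0;
}
```

With this guard, starting two tasks for the same job results in a single reservation attempt. Moving the reservation into a routine that already runs once per job, as is done for cpuset creation, would avoid the need for a flag altogether.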

Acceptance Criteria

None

Status

Assignee

Unassigned

Reporter

Sam Goosen

Severity

3-High

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Priority

High