https://pbspro.atlassian.net/browse/PP-864
Overview:
Cray X* series systems can suspend one or more running jobs in order to run a higher-priority job. PBS needs to modify the suspend pseudo-signal (used by the qsig command and by preemption) to support suspend and resume on Cray X* series systems.
...
- Cray systems with a Gemini interconnect do NOT support suspend/resume
- Cray systems with an Aries interconnect and newer Cray X* series systems DO support suspend/resume
- To enable suspend/resume, set suspendResume 1 in /etc/opt/cray/alps/alps.conf (using xtopview on CLE 5.2 and earlier CLEs) and then restart ALPS
- Please refer to Cray's System Administration Guide for more details about using suspend/resume on Cray X* series systems
- On Cray X* series systems, PBS issues a request to ALPS to switch an ALPS reservation IN (resume) or OUT (suspend).
Interface #1 -
Log messages that will appear in the MoM logs:
MoM log message #1: "ALPS reservation <ALPS reservation ID> SWITCH status is = 'EMPTY'"
...
New error code when ALPS fails to switch reservation from suspend to resume or resume to suspend
- Change Control: Stable
- Details:
- The new error code "15219" is returned by the pbs_sigjob() IFL call when ALPS fails to switch the reservation in MoM.
- The "qsig" command prints the following error message when ALPS fails to switch the reservation:
"qsig: Switching ALPS reservation failed <job id>"
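A wrapper script around qsig could recognize this failure as sketched below. This is a minimal sketch, not part of PBS: the constant name ALPS_SWITCH_FAILED_CODE and both helper functions are hypothetical, and the only details taken from this design are the value 15219 and the quoted error text. It also assumes the job id contains no whitespace.

```python
import re
from typing import Optional

# Value taken from this design; the constant name is hypothetical.
ALPS_SWITCH_FAILED_CODE = 15219

# Matches the qsig error text quoted in this design and captures the job id.
# Assumes the job id is a single whitespace-free token.
_QSIG_SWITCH_FAILED = re.compile(
    r"^qsig: Switching ALPS reservation failed (?P<job_id>\S+)\s*$",
    re.MULTILINE,
)


def failed_switch_job_id(qsig_stderr: str) -> Optional[str]:
    """Return the job id if the text reports a failed ALPS switch, else None."""
    match = _QSIG_SWITCH_FAILED.search(qsig_stderr)
    return match.group("job_id") if match else None


def is_alps_switch_failure(error_code: int) -> bool:
    """Return True if an IFL error code is the ALPS switch failure code."""
    return error_code == ALPS_SWITCH_FAILED_CODE
```

A caller would pass the captured standard error of qsig to failed_switch_job_id(), or the error code obtained after a failed pbs_sigjob() call to is_alps_switch_failure().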
Interface #2 - New mom log messages
- Change Control: Unstable
- Details:
- The following MoM log messages are logged on Cray X* series systems:
- When PBS tries to suspend/resume a job (PBSEVENT_DEBUG2):
"Switching ALPS reservation <ALPS reservation id> to <suspend/resume>"
- When ALPS fails to accept a reservation switch request (PBSEVENT_SYSTEM):
"Failed to switch <OUT/IN> ALPS reservation"
- When PBS issues the ALPS reservation switch request successfully (PBSEVENT_DEBUG2):
"Made the ALPS SWITCH request"
- It is possible to incorrectly get an 'EMPTY' response (which means there is no claim on the ALPS reservation) when in reality there is a claim on the ALPS reservation. PBS prints a log message in this case so that it is possible to see how often the false 'EMPTY' response is received.
...
- The 'EMPTY' response is logged as follows (PBSEVENT_DEBUG2):
"ALPS reservation <ALPS reservation ID> SWITCH status is = 'EMPTY'"
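Because the 'EMPTY' message exists so that its frequency can be observed, a site could tally it from the MoM logs. The following is a rough sketch under stated assumptions: the function name count_empty_responses is hypothetical, the reservation ID is assumed to be a single whitespace-free token, and only the message text itself comes from this design. The sketch counts every logged 'EMPTY' status and cannot tell a false response from a genuine one.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the 'EMPTY' status message quoted in this design.
# Assumes the reservation ID contains no whitespace.
_EMPTY_STATUS = re.compile(
    r"ALPS reservation (?P<resv_id>\S+) SWITCH status is = 'EMPTY'"
)


def count_empty_responses(log_lines: Iterable[str]) -> Counter:
    """Count 'EMPTY' SWITCH status messages per ALPS reservation ID."""
    counts: Counter = Counter()
    for line in log_lines:
        match = _EMPTY_STATUS.search(line)
        if match:
            counts[match.group("resv_id")] += 1
    return counts
```

The function accepts any iterable of lines, so an open MoM log file object can be passed directly; summing the returned counts gives the total number of 'EMPTY' messages in that log.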