https://pbspro.atlassian.net/browse/PP-864
Overview:
Cray X* series systems can suspend one or more jobs in order to run a higher-priority job. PBS needs to modify the suspend pseudo signal (used by the qsig command and by preemption) to support suspend and resume on Cray X* series systems.
Important things to note:
- Cray systems with a Gemini interconnect do NOT support suspend/resume
- Cray systems with an Aries interconnect and newer Cray X* series systems DO support suspend/resume
- To enable suspend/resume, set suspendResume 1 in /etc/opt/cray/alps/alps.conf (using xtopview on CLE 5.2 and earlier CLE releases) and then restart ALPS
- Please refer to Cray's System Administration Guide for more details about using suspend/resume on Cray X* series
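As a quick check before relying on suspend/resume, the setting can be verified by scanning the contents of alps.conf. The following is a minimal sketch, not part of PBS or ALPS: the function name is hypothetical, and it assumes the file uses whitespace-separated "keyword value" lines with "#" starting a comment. Cray's documentation remains the authority on the file format.

```python
def suspend_resume_enabled(conf_text):
    """Return True if the last 'suspendResume <value>' line sets the value to 1.

    Assumption: alps.conf uses whitespace-separated 'keyword value' lines
    and '#' introduces a comment.
    """
    enabled = False
    for raw in conf_text.splitlines():
        # Drop any trailing comment, then split into keyword and value.
        parts = raw.split("#", 1)[0].split()
        if len(parts) >= 2 and parts[0] == "suspendResume":
            enabled = parts[1] == "1"
    return enabled
```

The function takes the file contents as a string, so it can be fed the text of /etc/opt/cray/alps/alps.conf read in whatever way is appropriate for the CLE release in use.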
Interface #1 -
Log messages that will appear in the MoM logs:
MoM log message #1: "ALPS reservation <ALPS reservation ID> SWITCH status is = 'EMPTY'"
- Unstable
- Logged at PBSEVENT_DEBUG2
- ALPS can incorrectly return an 'EMPTY' response (which means there is no claim on the ALPS reservation) when in reality there is a claim on the ALPS reservation. PBS prints this log message so that it is possible to see how often the false 'EMPTY' response is received.
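To see how often the 'EMPTY' response is received, the MoM log can be scanned for the message above. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the reservation ID is assumed to contain no whitespace, and only the message text given above is matched (the rest of each log line is ignored).

```python
import re
from collections import Counter

# Matches the message text documented above; the reservation ID is
# assumed to be a single whitespace-free token.
EMPTY_RE = re.compile(r"ALPS reservation (\S+) SWITCH status is = 'EMPTY'")


def count_empty_responses(log_lines):
    """Count 'EMPTY' SWITCH status messages per ALPS reservation ID."""
    counts = Counter()
    for line in log_lines:
        match = EMPTY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Any iterable of lines works as input, including an open MoM log file. Because the message is logged at PBSEVENT_DEBUG2, it appears only if MoM is logging events at that level.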
Interface #2 -
New error code returned when ALPS fails to switch a reservation from suspend to resume or from resume to suspend
- Stable
- The new error code "15219" is returned by the pbs_sigjob() IFL call when ALPS fails to switch the reservation in MoM.
- The "qsig" command prints the following error message when ALPS fails to switch the reservation:
"qsig: Switching ALPS reservation failed <job id>"
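A script that wraps qsig can detect this failure from the error text. The following is a minimal sketch, not part of PBS: the constant and function names are hypothetical, and it assumes the message is printed on its own line exactly as shown above, with the job ID as a single token.

```python
import re

# Value taken from this document; the constant name is illustrative only.
ALPS_SWITCH_ERROR_CODE = 15219

# Matches the qsig error message documented above, one message per line.
SWITCH_FAIL_RE = re.compile(
    r"^qsig: Switching ALPS reservation failed (\S+)\s*$", re.MULTILINE
)


def failed_switch_jobs(stderr_text):
    """Return the job IDs for which qsig reported an ALPS switch failure."""
    return SWITCH_FAIL_RE.findall(stderr_text)
```

The input would typically be the standard error captured from a command such as "qsig -s suspend <job id>". Callers using the IFL API directly would compare the pbs_sigjob() error code against 15219 instead of parsing text.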