https://pbspro.atlassian.net/browse/PP-864


Overview:

Cray X* series systems can suspend one or more running jobs in order to run a higher-priority job. PBS needs to modify the suspend pseudo-signal (used by the qsig command and by preemption) to support suspend and resume on Cray X* series systems.

...

  • Cray systems with a Gemini interconnect do NOT support suspend/resume
  • Cray systems with an Aries interconnect and newer Cray X* series systems DO support suspend/resume 
  • To enable suspend/resume, set suspendResume 1 in /etc/opt/cray/alps/alps.conf (using xtopview on CLE 5.2 and earlier) and then restart ALPS
    • Please refer to Cray's System Administration Guide for more details about using suspend/resume on Cray X* series systems
  • On Cray X* series systems, PBS issues a request to ALPS to switch an ALPS reservation IN (resume) or OUT (suspend).


Interface #1 -

Log messages that will appear in the MoM logs:

MoM log message #1:  "ALPS reservation <ALPS reservation ID> SWITCH status is = 'EMPTY'"

...

New error code returned when ALPS fails to switch a reservation from suspend to resume or from resume to suspend

  • Change Control: Stable
  • Details:
    • New error code "15219" will be returned by the pbs_sigjob() IFL call when ALPS fails to switch the reservation in the MoM.
    • The "qsig" command will print the following error message when ALPS fails to switch the reservation:
      "qsig: Switching ALPS reservation failed <job id>"

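As a rough illustration of how a wrapper script might react to this failure, the sketch below checks the stderr of qsig for the message quoted above. It assumes qsig is invoked as `qsig -s suspend <job id>`; the constant and function names are hypothetical and are not part of PBS.

```python
import subprocess

# The code and message text come from the interface description above.
# The names below are hypothetical and exist only for this sketch.
ALPS_SWITCH_ERROR_CODE = 15219
ALPS_SWITCH_ERROR_TEXT = "Switching ALPS reservation failed"


def is_alps_switch_failure(stderr_text):
    """Return True if qsig's stderr reports a failed ALPS reservation switch."""
    return any(
        line.startswith("qsig:") and ALPS_SWITCH_ERROR_TEXT in line
        for line in stderr_text.splitlines()
    )


def suspend_job(job_id, qsig="qsig"):
    """Run `qsig -s suspend <job_id>` and return (ok, alps_switch_failed)."""
    result = subprocess.run(
        [qsig, "-s", "suspend", job_id],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, is_alps_switch_failure(result.stderr)
```

A caller could retry or alert an administrator when the second element of the returned tuple is True, instead of treating every qsig failure the same way.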

Interface #2 - New mom log messages

  • Change Control: Unstable
  • Details:
    • The following MoM log messages are logged on Cray X* series systems when
      • PBS tries to suspend/resume a job (PBSEVENT_DEBUG2)-
        "Switching ALPS reservation <ALPS reservation id> to <suspend/resume>"
      • ALPS fails to accept a reservation switch request (PBSEVENT_SYSTEM)-
        "Failed to switch <OUT/IN> ALPS reservation"
      • PBS issues the ALPS reservation switch request successfully (PBSEVENT_DEBUG2)-
        "Made the ALPS SWITCH request"
      • It is possible to incorrectly get an 'EMPTY' response (which means there is no claim on the ALPS reservation) when in reality there is a claim on the ALPS reservation. PBS will print this log message so it is possible to see how often the false 'EMPTY' response is received.
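To see how often the 'EMPTY' response is received, the MoM log can be scanned for the message "ALPS reservation <ALPS reservation ID> SWITCH status is = 'EMPTY'". The following is a minimal sketch, assuming the reservation ID contains no whitespace; the function name and the sample log lines are hypothetical.

```python
import re
from collections import Counter

# Matches the 'EMPTY' status message and captures the ALPS reservation ID.
# Assumption: the reservation ID is a single whitespace-free token.
EMPTY_STATUS = re.compile(r"ALPS reservation (\S+) SWITCH status is = 'EMPTY'")


def count_empty_responses(lines):
    """Count 'EMPTY' SWITCH status messages per ALPS reservation ID."""
    counts = Counter()
    for line in lines:
        match = EMPTY_STATUS.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines; real MoM log lines carry additional prefix fields.
sample = [
    "ALPS reservation 1234 SWITCH status is = 'EMPTY'",
    "Made the ALPS SWITCH request",
    "ALPS reservation 1234 SWITCH status is = 'EMPTY'",
    "ALPS reservation 77 SWITCH status is = 'EMPTY'",
]
counts = count_empty_responses(sample)
```

Because the function accepts any iterable of lines, an open log file object can be passed to it directly.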

Interface 2 -

New error code returned when ALPS fails to switch a reservation from suspend to resume or from resume to suspend

  • Stable
  • New error code "15219" will be returned by the pbs_sigjob() IFL call when ALPS fails to switch the reservation in the MoM.
  • The "qsig" command will print the following error message when ALPS fails to switch the reservation:
    "qsig: Switching ALPS reservation failed <job id>"

...

      • (PBSEVENT_DEBUG2).
        "ALPS reservation <ALPS reservation ID> SWITCH status is = 'EMPTY'"