PP-389: Allow the admin to suspend jobs for node maintenance

Objective:

Some node maintenance does not require powering the node down.  In these cases it is useful to suspend the jobs on those nodes so that they are neither lost nor requeued.  This is accomplished by adding two new pseudo-signals to qsig (similar to suspend and resume).  The first suspends the job and puts all of the job's nodes into a new state.  The second resumes the job; the nodes leave the new state once the last such job on them has been resumed.

Interface 1: New node state: "maintenance"

CHANGE CONTROL: Public/Stable

SYNOPSIS: New node state which will allow the admin to perform maintenance on the node.  

DETAILS:

A node enters the 'maintenance' state when the first job on it is suspended with the new admin-suspend pseudo-signal.  Any other jobs running on a node in the 'maintenance' state continue to run until they are themselves suspended (or they end).  The node leaves the 'maintenance' state when the last admin-suspended job is resumed with the new admin-resume pseudo-signal.  The scheduler will not run new jobs on a node in this state, nor will it resume any suspended jobs on that node.
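
The enter/leave rule above can be sketched as a small model.  This is an illustration only, not PBS code: the class VnodeModel and its method names are hypothetical, and the sketch assumes the vnode would otherwise be in the 'free' state.

```python
class VnodeModel:
    """Toy model of one vnode; names are hypothetical, not PBS internals."""

    def __init__(self, running_jobs):
        self.running = set(running_jobs)  # jobs still running on the vnode
        self.admin_suspended = set()      # jobs suspended with admin-suspend

    @property
    def state(self):
        # 'maintenance' while at least one admin-suspended job remains.
        # Assumption: the vnode is otherwise 'free'.
        return "maintenance" if self.admin_suspended else "free"

    def schedulable(self):
        # The scheduler will not run new jobs on a vnode in 'maintenance'.
        return self.state != "maintenance"

    def admin_suspend(self, job_id):
        if job_id not in self.running:
            raise ValueError(f"{job_id} is not running on this vnode")
        self.running.remove(job_id)
        self.admin_suspended.add(job_id)

    def admin_resume(self, job_id):
        if job_id not in self.admin_suspended:
            raise ValueError(f"{job_id} is not admin-suspended on this vnode")
        self.admin_suspended.remove(job_id)
        self.running.add(job_id)


node = VnodeModel(["1351.mars", "1352.mars"])
node.admin_suspend("1351.mars")        # first admin-suspend: enters maintenance
state_after_first_suspend = node.state
still_running = sorted(node.running)   # 1352.mars keeps running
node.admin_suspend("1352.mars")
node.admin_resume("1351.mars")         # one job is still admin-suspended
state_after_first_resume = node.state
node.admin_resume("1352.mars")         # last admin-resume: leaves maintenance
state_after_last_resume = node.state
```

The state is derived from the set of admin-suspended jobs rather than stored separately, which mirrors the "first job in, last job out" wording of the rule.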

Interface 2: 'admin-suspend' pseudo-signal option to qsig

CHANGE CONTROL: Public/Stable

PERMISSIONS: operator or manager

SYNOPSIS: Suspend a job and put its nodes in the 'maintenance' state

DETAILS: 

The admin gives this pseudo-signal to suspend jobs before starting maintenance on the jobs' nodes.  It is given via the standard -s option to qsig (qsig -s admin-suspend <job id>).  When a job receives the admin-suspend pseudo-signal, two things happen.  First, the job is put in the suspended ('S') state and the job's processes are suspended.  Second, the job's vnodes are put into the 'maintenance' state.

Interface 3: 'admin-resume' pseudo-signal option to qsig

CHANGE CONTROL: Public/Stable

PERMISSIONS: operator or manager

SYNOPSIS: Resume a job which was suspended with the admin-suspend pseudo signal

DETAILS: 

The admin-resume pseudo-signal differs from the resume pseudo-signal.  The resume pseudo-signal does not itself resume the job; it changes the job's substate to tell the scheduler to resume the job.  The admin-resume pseudo-signal resumes the job directly (no waiting for the scheduler).  When the last admin-suspended job on a vnode is admin-resumed, that vnode leaves the 'maintenance' state.

Interface 4: 'maintenance_jobs' vnode attribute

CHANGE CONTROL: Public/Stable

PERMISSIONS: manager read only

PYTHON TYPE: string

SYNOPSIS: New vnode attribute which contains a list of admin-suspended jobs on the vnode

TYPE: Array of strings

DETAILS:

PBS keeps a list of the admin-suspended jobs on each vnode.  This attribute is read-only for managers.

Interactions with normal suspend/resume

  • If a job is suspended via normal means, it cannot be resumed with the admin-resume pseudo-signal.  The request will be rejected with the following error message: "Job can not be resumed with the requested resume signal"
  • If a job is suspended with the admin-suspend pseudo-signal, it cannot be resumed with the resume pseudo-signal.  The request will be rejected with the same error message: "Job can not be resumed with the requested resume signal"
  • If there are multiple jobs on a vnode, mixing the two kinds of suspend signal is not recommended.  If they are mixed, the vnode can return to a schedulable state before all of the normally suspended (non admin-suspended) jobs have been resumed.  The scheduler could then run jobs on the resources owned by those still-suspended jobs.
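
The pairing rules above, together with the difference between resume and admin-resume, can be sketched as follows.  This is a toy model under stated assumptions, not server code: JobModel, SignalRejected and the resume_pending flag are hypothetical names, and the handling of a resume signal sent to a job that is not suspended is not covered here, so the sketch treats it as a no-op.

```python
from dataclasses import dataclass
from typing import Optional

# Error text quoted from the rules above.
REJECT_MSG = "Job can not be resumed with the requested resume signal"


class SignalRejected(Exception):
    """Hypothetical stand-in for the server rejecting a qsig request."""


@dataclass
class JobModel:
    job_id: str
    state: str = "R"                    # 'R' running, 'S' suspended
    suspended_by: Optional[str] = None  # 'suspend' or 'admin-suspend'
    resume_pending: bool = False        # models the substate hint to the scheduler

    def signal(self, name: str) -> None:
        if name in ("suspend", "admin-suspend"):
            if self.state == "R":
                self.state, self.suspended_by = "S", name
        elif name in ("resume", "admin-resume"):
            if self.state != "S":
                return  # assumption: not specified, modelled as a no-op
            matching = ("admin-resume" if self.suspended_by == "admin-suspend"
                        else "resume")
            if name != matching:
                raise SignalRejected(REJECT_MSG)
            if name == "admin-resume":
                # Resumed directly, without waiting for the scheduler.
                self.state, self.suspended_by = "R", None
            else:
                # Only flagged; the scheduler resumes the job later.
                self.resume_pending = True
        else:
            raise ValueError(f"unknown pseudo-signal: {name}")


job = JobModel("1351.mars")
job.signal("admin-suspend")
try:
    job.signal("resume")        # wrong resume signal for this job
    rejected = None
except SignalRejected as err:
    rejected = str(err)
job.signal("admin-resume")      # matching signal: the job runs again at once
```

Recording which signal suspended the job is what makes the mismatch detectable; a job suspended by the normal suspend signal stays in 'S' after resume until the scheduler acts on the flag.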

Misc

  • Before admin-suspending jobs, it is recommended to disable scheduling and wait for the current scheduling cycle to finish.  The scheduler queries the vnode state only at the start of the cycle.  If a vnode moves into 'maintenance' after the cycle starts, the scheduler may still treat the vnode as schedulable, so new jobs may start on it during the current cycle.
  • If an admin wants to perform maintenance on a vnode that has no jobs running on it, they should put the vnode in the offline state and perform maintenance.
  • Any reservations on vnodes in the maintenance state will be marked degraded.  PBS will search for alternate vnodes for the reservations.
  • Subjobs are requeued upon server restart.  Any vnode that had only admin-suspended subjobs will return to the free state after a server restart.
  • As with all pseudo-signals, the new ones do not have a signal number associated with them.  Signal numbers identify OS-defined signals.  Pseudo-signals are PBS constructs that are handled as special cases.
  • If a job is running on some but not all of the vnodes of a multi-vnoded host, only the vnodes the job is running on will be put into maintenance.
  • It is suggested that all jobs on all vnodes of a multi-vnoded host be admin-suspended before starting maintenance.  If not, some vnodes may remain in a schedulable state and have new work started on them during maintenance.
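
The recommended order of operations can be sketched as a helper that builds the command lines without running them.  The function names are hypothetical.  The sketch assumes scheduling is disabled with the qmgr server attribute "scheduling" and is turned back on after the jobs are resumed; waiting for the current scheduling cycle to finish is left as a manual step.

```python
def start_maintenance_commands(job_ids):
    """Command lines, in order, to run before maintenance begins."""
    commands = ['qmgr -c "set server scheduling = False"']
    # Not shown: waiting for the current scheduling cycle to finish,
    # which this sketch leaves as a manual step.
    commands += [f"qsig -s admin-suspend {job_id}" for job_id in job_ids]
    return commands


def end_maintenance_commands(job_ids):
    """Command lines, in order, to run once maintenance is complete."""
    commands = [f"qsig -s admin-resume {job_id}" for job_id in job_ids]
    # Assumption: scheduling is turned back on after the jobs are resumed.
    commands.append('qmgr -c "set server scheduling = True"')
    return commands


for line in start_maintenance_commands(["1351", "1352"]):
    print(line)
```

The job ids passed in should cover every job on every vnode of the host, so that no vnode is left schedulable during maintenance.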

 

 

Example:

## Submit some jobs

[bmann@mars pbspro]$ qsub -l select=1:ncpus=1 -- /bin/sleep 1000
1351.mars
[bmann@mars pbspro]$ qsub -l select=1:ncpus=1 -- /bin/sleep 1000
1352.mars

## See jobs in running state

[bmann@mars pbspro]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1351.mars        STDIN            user             00:00:00 R workq
1352.mars        STDIN            user             00:00:00 R workq

## Find all jobs running on a vnode

$ pbsnodes -v mars | grep jobs
 jobs = 1351.mars/0, 1352.mars/1

 

## admin-suspend jobs

$ qsig -s admin-suspend 1351
$ qsig -s admin-suspend 1352

[bmann@mars pbspro]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1351.mars        STDIN            user             00:00:00 S workq
1352.mars        STDIN            user             00:00:00 S workq

## See the vnode in the new state

$ pbsnodes -v mars | grep state
state = maintenance

<Perform Maintenance>

 

## See which jobs need to be resumed

$ pbsnodes -v mars | grep maintenance_jobs
 maintenance_jobs = 1351.mars, 1352.mars

 

## Resume the first job (and see the vnode state remain in 'maintenance')

$ qsig -s admin-resume 1351
$ pbsnodes -v mars | grep state
state = maintenance

## Resume the last job (and see the vnode state leave 'maintenance')

$ qsig -s admin-resume 1352
$ pbsnodes -v mars | grep state
state = free
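
When many jobs are involved, the job lists printed by pbsnodes can be extracted programmatically.  A minimal sketch, assuming the output lines have the "name = value" form shown in the example above; the function name job_ids_from_line is hypothetical, and the meaning of the trailing "/N" on entries in the jobs line is not specified here, so the sketch simply strips it.

```python
def job_ids_from_line(line):
    """Extract unique job ids from a 'jobs = ...' or 'maintenance_jobs = ...' line."""
    _, _, value = line.partition("=")
    ids = []
    for entry in value.split(","):
        job_id = entry.strip().split("/")[0]  # drop a trailing '/N' if present
        if job_id and job_id not in ids:      # keep order, skip duplicates
            ids.append(job_id)
    return ids


# Sample lines copied from the example session above.
to_suspend = job_ids_from_line(" jobs = 1351.mars/0, 1352.mars/1")
to_resume = job_ids_from_line(" maintenance_jobs = 1351.mars, 1352.mars")

for job_id in to_resume:
    print(f"qsig -s admin-resume {job_id}")
```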