This introduces the node ramp-down feature, which releases sister nodes/vnodes that are no longer needed from a running job, before the job finishes.
Release a particular set of sister nodes from a job:
Syntax: pbs_release_nodes -j <job_identifier> <host1_or_vnode1> [<host2_or_vnode2> ...]
Each 'host_or_vnode' argument is one of the sister nodes/vnodes that appear in the exec_vnode attribute of a running job. Example:
% qsub job.scr
241.borg
% qstat -f 241 | egrep "exec|Resource_List|select"
exec_host = borg[0]/0*0+federer/0*0+lendl/0*2
exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)+(federer:mem=1048576kb:ncpus=1+federer[0]:mem=1048576kb:ncpus=1+federer[1]:ncpus=1)+(lendl:ncpus=2:mem=2097152kb)
Resource_List.mem = 6gb
Resource_List.ncpus = 8
Resource_List.nodect = 3
Resource_List.place = scatter
Resource_List.select = ncpus=3:mem=2gb+ncpus=3:mem=2gb+ncpus=2:mem=2gb
schedselect = 1:ncpus=3:mem=2gb+1:ncpus=3:mem=2gb+1:ncpus=2:mem=2gb
% pbs_release_nodes -j 241 federer[1] lendl
% qstat -f 241 | egrep "exec|Resource_List|select"
exec_host = borg[0]/0*0+federer/0*0 <- no more lendl, since all of the job's vnodes on lendl have been released
exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)+(federer:mem=1048576kb:ncpus=1+federer[0]:mem=1048576kb:ncpus=1) <- federer[1] and lendl removed
Resource_List.mem = 4194304kb <- minus 2gb (from lendl)
Resource_List.ncpus = 5 <- minus 3 cpus (1 from federer[1] and 2 from lendl)
Resource_List.nodect = 2 <- minus 1 chunk (when lendl was taken out, its entire chunk assignment disappeared)
Resource_List.place = scatter
schedselect = 1:mem=2097152kb:ncpus=3+1:mem=2097152kb:ncpus=2
% pbs_release_nodes -j 241 federer federer[0]
% qstat -f 241 | egrep "exec|Resource_List|select"
exec_host = borg[0]/0*0 <- only the primary execution host remains
exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)
Resource_List.mem = 2097152kb
Resource_List.ncpus = 3
Resource_List.nodect = 1
Resource_List.place = scatter
schedselect = 1:mem=2097152kb:ncpus=3
The job's $PBS_NODEFILE contents are also updated when nodes are released. Example:
% qsub -l select=2:ncpus=1:mem=1gb -l place=scatter -I
qsub: waiting for job 247.borg.pbspro.com to start
qsub: job 247.borg.pbspro.com ready
% cat $PBS_NODEFILE
borg.pbspro.com
federer.pbspro.com
% pbs_release_nodes -j 247 federer
% cat $PBS_NODEFILE
borg.pbspro.com
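Because the job owner is authorized to run pbs_release_nodes, a job script can release its own sister nodes between phases. A minimal sketch, assuming a two-phase job where only the first phase is parallel (the mpirun invocation and the phase1/phase2 programs are illustrative placeholders, not part of the feature):
#!/bin/sh
#PBS -l select=2:ncpus=1:mem=1gb
#PBS -l place=scatter
# Phase 1: runs across both assigned hosts (illustrative parallel step).
mpirun --hostfile $PBS_NODEFILE ./phase1
# Phase 2 only needs the primary execution host, so release the sister
# host early; its name is the second line of $PBS_NODEFILE.
sister=$(sed -n '2p' $PBS_NODEFILE)
pbs_release_nodes -j $PBS_JOBID "$sister"
./phase2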
Error reporting
pbs_release_nodes will report an error if any of the specified nodes is managed by the job's mother superior (MS) mom; vnodes on the primary execution host cannot be released.
Example:
% pbs_release_nodes -j 241 borg[0]
pbs_release_nodes: Can't free 'borg[0]' since it's on an MS host
pbs_release_nodes will report an error if executed by a user who is not an admin, manager, operator, or the job owner:
Example:
% pbs_release_nodes -j 248 federer
pbs_release_nodes: Unauthorized Request
pbs_release_nodes will report an error if a specified node is not part of the job:
Example:
% pbs_release_nodes -j 249 lendl
pbs_release_nodes: these nodes are not part of the job: lendl
pbs_release_nodes will report an error if the job is not in a running state:
Example:
% pbs_release_nodes -j 251 lendl
pbs_release_nodes: Request invalid for state of job
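In each of these failure cases the command exits with a nonzero status (assuming the usual PBS command convention), which lets scripts detect a failed release:
% pbs_release_nodes -j 249 lendl || echo "release failed"
pbs_release_nodes: these nodes are not part of the job: lendl
release failed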
pbs_release_nodes will report an error if both the '-a' option and a list of nodes/vnodes are specified on the command line.
Example:
% pbs_release_nodes -j 252 -a federer
usage: pbs_release_nodes [-j job_identifier] host_or_vnode1 host_or_vnode2 ...
usage: pbs_release_nodes [-j job_identifier] -a
usage: pbs_release_nodes --version
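For contrast, the '-a' form used by itself is valid: it releases all of the job's vnodes other than those on the primary execution host. A sketch of that form:
% pbs_release_nodes -j 252 -a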
pbs_release_nodes will report an error message and exit if at least one of the specified hosts or vnodes is a Cray XC series node.
Example:
% pbs_release_nodes -j 253 cray_node
pbs_release_nodes is not currently supported on Cray XC systems nodes: <cray_node>
Nodes can also be released automatically when the job's stageout operation begins, by submitting the job with release_nodes_on_stageout=true.
Example:
% qsub -W stageout=my_stageout@federer:my_stageout.out -W release_nodes_on_stageout=true job.scr
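Assuming release_nodes_on_stageout is alterable like other job attributes (an assumption, not shown in the examples here), it could also be toggled on an already-submitted job via qalter:
% qalter -W release_nodes_on_stageout=false 247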
Each release is also recorded in the accounting logs via a new record type: the 'u' (for update) record.
Details: a 'u' record represents a just-concluded phase of the job, consisting of the set of resources assigned to the job (exec_vnode, exec_host, Resource_List items) and the amount of resources used (resources_used) during that phase of the job.
% qsub -l select=3:ncpus=1:mem=1gb job.scr
242.borg
% qstat -f 242 | egrep "exec|Resource_List|select"
exec_host = borg/0+federer/0+lendl/0
exec_vnode = (borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb)
Resource_List.mem = 3gb
Resource_List.ncpus = 3
Resource_List.nodect = 3
Resource_List.place = scatter
Resource_List.select = 3:ncpus=1:mem=1gb
schedselect = 3:ncpus=1:mem=1gb
% pbs_release_nodes -j 242 lendl
Accounting logs show:
# tail -f /var/spool/PBS/server_priv/accounting/20170123
01/23/2017 18:53:24;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7503 run_count=1 exec_host=borg/0+federer/0+lendl/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb) Resource_List.mem=3gb Resource_List.ncpus=3 Resource_List.nodect=3 Resource_List.place=scatter Resource_List.select=3:ncpus=1:mem=1gb resources_used.cpupercent=5 resources_used.cput=00:04:35 resources_used.mem=4288kb resources_used.ncpus=3 resources_used.vmem=42928kb resources_used.walltime=00:00:26
Another pbs_release_nodes call yields:
% pbs_release_nodes -j 242 federer
# tail -f /var/spool/PBS/server_priv/accounting/20170123
01/23/2017 18:59:35;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215949 qtime=1485215949 etime=1485215949 start=1485215949 session=7773 run_count=1 exec_host=borg/0+federer/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb) Resource_List.mem=2097152kb Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:mem=1048576kb+1:ncpus=1:mem=1048576kb resources_used.cpupercent=3 resources_used.cput=00:03:35 resources_used.mem=2048kb resources_used.ncpus=2 resources_used.vmem=32928kb resources_used.walltime=00:00:26
In addition to the 'u' record, a 'c' (for continue) record is written, reflecting the job's next (remaining) resource assignment after a release. Continuing from the previous example, suppose there is the following release of a vnode:
% pbs_release_nodes -j 242 lendl
Accounting logs show the lendl vnode assignment gone:
# tail -f /var/spool/PBS/server_priv/accounting/20170123
01/23/2017 18:53:24;c;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7503 run_count=1 exec_host=borg/0+federer/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb) Resource_List.mem=2097152kb Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=scatter updated_Resource_List.select=1:ncpus=1:mem=1048576kb+1:ncpus=1:mem=1048576kb resources_used_incr.cpupercent=5 ...
Another pbs_release_nodes call yields a 'c' record with the 'federer' vnode assignment gone:
% pbs_release_nodes -j 242 federer
# tail -f /var/spool/PBS/server_priv/accounting/20170123
01/23/2017 18:59:35;c;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215949 qtime=1485215949 etime=1485215949 start=1485215949 session=7773 run_count=1 exec_host=borg/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb) Resource_List.mem=1048576kb Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:mem=1048576kb
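To pull just these phase records for a job out of the day's accounting log, a simple egrep works (filename per the date-based naming above):
# egrep ";u;242|;c;242" /var/spool/PBS/server_priv/accounting/20170123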
A pbs_release_nodes request causes the server to send a job update to the mother superior (MS) of the job. The MS then walks the list of nodes being released; if a released vnode is the last of the job's vnodes on its host, the MS sends a DELETE_JOB2 request to the sister mom that owns that host. Upon receiving this request, the sister mom kills the job processes on its node and sends the summary accounting information for the job on that node back to the mother superior. Mom_logs will show the following DEBUG messages:
sister mom_logs: "DELETE_JOB2 received"
mother superior mom_logs: "<reporting_sister_host>;cput=YY mem=ZZ"
Special server_logs messages:
"clearing job <job-id> from node <vnode-name>
"Node<sister-mom-hostname>;deallocating 1 cpus from job <job-id>