
Links

Overview

This design enhances the "node ramp down" feature by adding a new option, "-k <node count>" ("k" for "keep"), to the PBS command "pbs_release_nodes". The option lets users or admins retain some of the sister nodes (exec_host) while performing a node ramp down operation. The argument to the option specifies the number of sister nodes to keep.

Technical Details

Interface 1: -k <node count>

Details:

pbs_release_nodes [-j <job ID>] <vnode> [<vnode> [<vnode>] ...]
pbs_release_nodes [-j <job ID>] -a
pbs_release_nodes [-j <job ID>] -k <select statement>
pbs_release_nodes [-j <job ID>] -k <node count>
pbs_release_nodes --version

Example: consider a job submitted as follows:

$ qsub -l select=4:model=abc:ncpus=5+3:model=abc:bigmem=true:ncpus=1+2:model=def:ncpus=32 job.scr
121.pbssrv

Grepping the job's attributes for the assigned vnodes may show:

$ qstat -f 121 | egrep "exec_vnode|exec_host"
exec_host = nd_abc_1/0*5+nd_abc_2/0*5+nd_abc_3/0*5+nd_abc_3/1*5+nd_abc_4_bm/0*1+nd_abc_5_bm/0*1+nd_abc_6_bm/0*1+nd_def_1/0*32+nd_def_2/0*32
exec_vnode = (nd_abc_1:ncpus=5)+(nd_abc_2:ncpus=5)+(nd_abc_3[0]:ncpus=5)+(nd_abc_3[1]:ncpus=5)+(nd_abc_4_bm:ncpus=1)+(nd_abc_5_bm:ncpus=1)+(nd_abc_6_bm:ncpus=1)+(nd_def_1:ncpus=32)+(nd_def_2:ncpus=32)

Here, a total of 9 chunks (vnodes) on 8 distinct hosts are assigned to the job: the mother superior, which holds the first chunk "(nd_abc_1:ncpus=5)", and 7 sister hosts that hold the remaining 8 chunks. (Note that the host "nd_abc_3" appears twice in exec_host, once for each of its two vnodes.)
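
Because "nd_abc_3" appears twice in exec_host, the number of chunk entries and the number of distinct hosts differ. The following is a minimal shell sketch for counting each. The function names are hypothetical helpers, not part of PBS, and the exec_host value is passed in as a plain string (copied from the qstat output above) so the sketch can be tried without a PBS server.

```shell
# Hypothetical helpers (not PBS commands) for inspecting an exec_host value.

# Print the host name of each exec_host entry, one per line, in order.
exec_host_names() {
    printf '%s\n' "$1" | tr '+' '\n' | cut -d/ -f1
}

# Number of entries (chunks) listed in exec_host.
count_chunks() {
    exec_host_names "$1" | wc -l | tr -d ' '
}

# Number of distinct hosts named in exec_host.
count_hosts() {
    exec_host_names "$1" | sort -u | wc -l | tr -d ' '
}

# Value copied from the qstat output of job 121 above.
exec_host='nd_abc_1/0*5+nd_abc_2/0*5+nd_abc_3/0*5+nd_abc_3/1*5+nd_abc_4_bm/0*1+nd_abc_5_bm/0*1+nd_abc_6_bm/0*1+nd_def_1/0*32+nd_def_2/0*32'

# Prints: chunks=9 hosts=8
echo "chunks=$(count_chunks "$exec_host") hosts=$(count_hosts "$exec_host")"
```

On a live system the string would come from the exec_host line of "qstat -f"; note that qstat may wrap long attribute values across lines, which this sketch does not handle.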

The node statuses may look like this:

$ pbsnodes -av
nd_abc_1
    Mom = nd_abc_1.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = abc
    resources_available.ncpus = 5
    resources_assigned.ncpus = 5

nd_abc_2
    Mom = nd_abc_2.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = abc
    resources_available.ncpus = 5
    resources_assigned.ncpus = 5

nd_abc_3[0]
    Mom = nd_abc_3.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = abc
    resources_available.ncpus = 5
    resources_assigned.ncpus = 5

nd_abc_3[1]
    Mom = nd_abc_3.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = abc
    resources_available.ncpus = 5
    resources_assigned.ncpus = 5

nd_abc_4_bm
    Mom = nd_abc_4_bm.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.bigmem = True
    resources_available.model = abc
    resources_available.ncpus = 1
    resources_assigned.ncpus = 1

nd_abc_5_bm
    Mom = nd_abc_5_bm.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.bigmem = True
    resources_available.model = abc
    resources_available.ncpus = 1
    resources_assigned.ncpus = 1

nd_abc_6_bm
    Mom = nd_abc_6_bm.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.bigmem = True
    resources_available.model = abc
    resources_available.ncpus = 1
    resources_assigned.ncpus = 1

nd_def_1
    Mom = nd_def_1.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = def
    resources_available.ncpus = 32
    resources_assigned.ncpus = 32

nd_def_2
    Mom = nd_def_2.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = def
    resources_available.ncpus = 32
    resources_assigned.ncpus = 32
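
To check which vnodes list a given job, before or after a release, the pbsnodes output can be filtered by job ID. This is a minimal sketch, assuming the layout shown above (an unindented vnode name followed by indented "attribute = value" lines) and assuming that multiple entries in "jobs" are separated by commas. vnodes_for_job is a hypothetical helper, and the sample input is abbreviated, illustrative data rather than real output.

```shell
# vnodes_for_job JOBID
# Reads `pbsnodes -av` style text on stdin and prints the name of every
# vnode whose "jobs" attribute contains an entry for JOBID.
vnodes_for_job() {
    awk -v job="$1" '
        # An unindented, non-empty line starts a new vnode record.
        NF > 0 && $0 !~ /^[ \t]/ { vnode = $1; next }
        # Entries look like "121.pbssrv/0"; drop a trailing comma and the "/<index>" part.
        $1 == "jobs" {
            for (i = 3; i <= NF; i++) {
                entry = $i
                sub(/,$/, "", entry)
                sub(/\/.*$/, "", entry)
                if (entry == job) { print vnode; break }
            }
        }
    '
}

# Illustrative sample only; job 300.pbssrv is made up for the demonstration.
sample='nd_abc_1
    state = job-busy
    jobs = 121.pbssrv/0

nd_abc_2
    state = free

nd_def_1
    state = job-busy
    jobs = 300.pbssrv/0, 121.pbssrv/0'

# Prints nd_abc_1 and nd_def_1, one per line.
printf '%s\n' "$sample" | vnodes_for_job 121.pbssrv
```

On a live system the input would be piped in directly, for example "pbsnodes -av | vnodes_for_job 121.pbssrv".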

Now, if we run pbs_release_nodes with the new "-k" option and an argument of "3":

$ pbs_release_nodes -j 121 -k 3

This may release the vnodes (nd_abc_2:ncpus=5)+(nd_abc_3[0]:ncpus=5)+(nd_abc_3[1]:ncpus=5)+(nd_def_1:ncpus=32)+(nd_def_2:ncpus=32) from the job while retaining the vnodes (nd_abc_1:ncpus=5)+(nd_abc_4_bm:ncpus=1)+(nd_abc_5_bm:ncpus=1)+(nd_abc_6_bm:ncpus=1), that is, the mother superior plus three sister nodes.

In its new phase, the job will have the following vnodes associated with it:

$ qstat -f 121 | egrep "exec_vnode|exec_host"
exec_host = nd_abc_1/0*5+nd_abc_4_bm/0*1+nd_abc_5_bm/0*1+nd_abc_6_bm/0*1
exec_vnode = (nd_abc_1:ncpus=5)+(nd_abc_4_bm:ncpus=1)+(nd_abc_5_bm:ncpus=1)+(nd_abc_6_bm:ncpus=1)
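
The released chunks can be derived by comparing exec_vnode before and after the release. This is a minimal sketch, assuming every chunk holds a single vnode, so that "+" only appears between chunks. split_chunks and released_chunks are hypothetical helper names, and the two values are copied from the qstat output in this example.

```shell
# Print each parenthesized chunk of an exec_vnode value on its own line.
# Assumption: one vnode per chunk, so "+" never occurs inside parentheses.
split_chunks() {
    printf '%s\n' "$1" | tr '+' '\n'
}

# released_chunks BEFORE AFTER
# Print the chunks that are in BEFORE but not in AFTER, joined with "+".
released_chunks() {
    split_chunks "$1" | while IFS= read -r chunk; do
        case "+$2+" in
            *"+$chunk+"*) ;;                  # chunk is still assigned to the job
            *) printf '%s\n' "$chunk" ;;      # chunk was released
        esac
    done | paste -sd+ -
}

before='(nd_abc_1:ncpus=5)+(nd_abc_2:ncpus=5)+(nd_abc_3[0]:ncpus=5)+(nd_abc_3[1]:ncpus=5)+(nd_abc_4_bm:ncpus=1)+(nd_abc_5_bm:ncpus=1)+(nd_abc_6_bm:ncpus=1)+(nd_def_1:ncpus=32)+(nd_def_2:ncpus=32)'
after='(nd_abc_1:ncpus=5)+(nd_abc_4_bm:ncpus=1)+(nd_abc_5_bm:ncpus=1)+(nd_abc_6_bm:ncpus=1)'

released_chunks "$before" "$after"
```

The chunk text is matched literally (the quoted part of the case pattern is not treated as a glob), so vnode names containing brackets, such as nd_abc_3[0], compare correctly.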

The same result as in the previous example can be achieved by passing the select string below as the argument to the "-k" option (see Ref[3]).

$ pbs_release_nodes -j 121 -k select=3


API level details:







