PP-724: new "keep <node count>" option for "pbs_release_nodes"

Follow the PBS Pro Design Document Guidelines.

Overview

This enhancement extends the "node ramp down" feature by introducing a new option, "-k <node count>" ("k" for "keep"), to the PBS command "pbs_release_nodes". It allows users or admins to retain some of the sister nodes (listed in the job's exec_host) while performing a node ramp down operation. The number of sister nodes to keep is specified by the argument to the new option.
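For example, assuming a running multi-node job with the (hypothetical) job ID 1234, the following invocation would keep 2 of its sister nodes and release the rest:

$ pbs_release_nodes -j 1234 -k 2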

Technical Details

Interface 1:  -k <node count>

  • Change Control: Stable
  • Synopsis: This new option to "pbs_release_nodes" specifies the number of sister nodes to keep assigned to the job; the remaining sister nodes are released and made available for scheduling other jobs. The number must be a positive integer less than the total number of sister nodes currently assigned to the job. A node here means one instance of a mom's host name entry in the job's exec_host attribute.
  • Permission: as described in Ref 1 above

Details:

  • New Syntax:

pbs_release_nodes [-j <job ID>] <vnode> [<vnode> [<vnode>] ...]
pbs_release_nodes [-j <job ID>] -a
pbs_release_nodes [-j <job ID>] -k <select statement>
pbs_release_nodes [-j <job ID>] -k <node count>
pbs_release_nodes --version
 

  • Example of usage:
    Let's submit a job with a select string:

$ qsub -l select=4:model=abc:ncpus=5+3:model=abc:bigmem=true:ncpus=1+2:model=def:ncpus=32  job.scr
121.pbssrv

Grepping for the assigned vnodes, we may see:

$ qstat -f 121 | egrep "exec_vnode|exec_host"
exec_host = nd_abc_1/0*5+nd_abc_2/0*5+nd_abc_3/0*5+nd_abc_3/1*5+nd_abc_4_bm/0*1+nd_abc_5_bm/0*1+nd_abc_6_bm/0*1+nd_def_1/0*32+nd_def_2/0*32
exec_vnode = (nd_abc_1:ncpus=5)+(nd_abc_2:ncpus=5)+(nd_abc_3[0]:ncpus=5)+(nd_abc_3[1]:ncpus=5)+(nd_abc_4_bm:ncpus=1)+(nd_abc_5_bm:ncpus=1)+(nd_abc_6_bm:ncpus=1)+(nd_def_1:ncpus=32)+(nd_def_2:ncpus=32)

Here, a total of 9 nodes are assigned to the job: one mother superior node, corresponding to the first chunk "(nd_abc_1:ncpus=5)", and 8 sister nodes. (Note that the host "nd_abc_3" appears twice in exec_host, so per the definition above it counts as two nodes.)

The node statuses are:

$ pbsnodes -av
nd_abc_1
    Mom = nd_abc_1.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = abc
    resources_available.ncpus = 5
    resources_assigned.ncpus = 5

nd_abc_2
    Mom = nd_abc_2.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = abc
    resources_available.ncpus = 5
    resources_assigned.ncpus = 5

nd_abc_3[0]
    Mom = nd_abc_3.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = abc
    resources_available.ncpus = 5
    resources_assigned.ncpus = 5

nd_abc_3[1]
    Mom = nd_abc_3.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = abc
    resources_available.ncpus = 5
    resources_assigned.ncpus = 5

nd_abc_4_bm
    Mom = nd_abc_4_bm.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.bigmem = True
    resources_available.model = abc
    resources_available.ncpus = 1
    resources_assigned.ncpus = 1

nd_abc_5_bm
    Mom = nd_abc_5_bm.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.bigmem = True
    resources_available.model = abc
    resources_available.ncpus = 1
    resources_assigned.ncpus = 1

nd_abc_6_bm
    Mom = nd_abc_6_bm.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.bigmem = True
    resources_available.model = abc
    resources_available.ncpus = 1
    resources_assigned.ncpus = 1

nd_def_1
    Mom = nd_def_1.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = def
    resources_available.ncpus = 32
    resources_assigned.ncpus = 32

nd_def_2
    Mom = nd_def_2.pbspro.com
    state = job-busy
    jobs = 121.pbssrv/0
    resources_available.model = def
    resources_available.ncpus = 32
    resources_assigned.ncpus = 32

Now if we run pbs_release_nodes with the new "-k" option and an argument of "3":

$ pbs_release_nodes -j 121 -k 3

the command may release the nodes (nd_abc_2:ncpus=5)+(nd_abc_3[0]:ncpus=5)+(nd_abc_3[1]:ncpus=5)+(nd_def_1:ncpus=32)+(nd_def_2:ncpus=32) from the job, while retaining the nodes (nd_abc_1:ncpus=5)+(nd_abc_4_bm:ncpus=1)+(nd_abc_5_bm:ncpus=1)+(nd_abc_6_bm:ncpus=1), i.e. the mother superior plus 3 sister nodes.

The job will then have the below vnodes associated with it:

$ qstat -f 121 | egrep "exec_vnode|exec_host"
exec_host = nd_abc_1/0*5+nd_abc_4_bm/0*1+nd_abc_5_bm/0*1+nd_abc_6_bm/0*1
exec_vnode = (nd_abc_1:ncpus=5)+(nd_abc_4_bm:ncpus=1)+(nd_abc_5_bm:ncpus=1)+(nd_abc_6_bm:ncpus=1)
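After the release, the freed hosts should be reported as available again. A hypothetical pbsnodes query on one of the released hosts might look like the following (exact fields vary with configuration):

$ pbsnodes nd_def_1
nd_def_1
    Mom = nd_def_1.pbspro.com
    state = free
    resources_available.model = def
    resources_available.ncpus = 32
    resources_assigned.ncpus = 0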

  • Using a select string argument:

The same result as in the previous example can be achieved by passing the below select string as the argument to the "-k" option; see Ref 3.

$ pbs_release_nodes -j 121 -k select=3

  • Errors and Return codes:
    • When the command with the new option executes successfully, the below output is printed on the console, with the exit code set to 0:
             pbs_release_nodes: <sub select string>
    • The -k option cannot be used in conjunction with the "-a" option. If both are supplied, pbs_release_nodes will print the below error along with the usage strings.

      pbs_release_nodes: -a and -k options cannot be used together

    • The -k option cannot be used in conjunction with host/vnode list arguments (<vnode> [<vnode> [<vnode>] ...]). If both are supplied, pbs_release_nodes will print the below error along with the usage strings.

      pbs_release_nodes: cannot supply node list with -k option

    • For all other failures, including the case where the integer argument is not less than the job's current sister node count, the below error is printed, as shown in the sample transcript below.
              pbs_release_nodes: Server returned error 15010 for job
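For illustration, a hypothetical transcript of these error cases against the example job 121 (which has 8 sister nodes) might look as follows; the usage strings printed after the first two errors are omitted:

$ pbs_release_nodes -j 121 -a -k 3
pbs_release_nodes: -a and -k options cannot be used together

$ pbs_release_nodes -j 121 -k 3 nd_def_1
pbs_release_nodes: cannot supply node list with -k option

$ pbs_release_nodes -j 121 -k 8
pbs_release_nodes: Server returned error 15010 for job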
  • Accounting Logs:
    • No new accounting log entries are introduced. See Ref 2 above.
  • Caveats:
    • The order in which nodes/vnodes are released or kept by the "-k <node count>" option is undefined. Hence users/admins and their scripts/tools should not depend on, or try to predict, which nodes/vnodes will be released or kept.
    • If one or more of the nodes targeted for release still have job chunks/processes running on them, the release operation will terminate those chunks/processes abruptly.
    • Combining the previous two caveats: users/admins should be aware that by using this new option, a running job may lose some of its running chunks.
    • Since the mother superior node cannot be ramped down, the user needs to provide a count that is one less than the total number of nodes he/she intends to leave assigned to the job. For example, to end up with 4 nodes including the mother superior, specify "-k 3".
    • Since a host can appear multiple times in exec_host, a host is only truly released from the job after the ramp down operation if all occurrences of that host in the job's exec_host attribute are released. In the previous example, the host "nd_abc_3" appears twice in exec_host, owing to two chunks of the same job being assigned to two of its vnodes. If after the release operation only one instance of "nd_abc_3" is released while the other is kept, then even though the job's resource usage attributes show proper accounting, the chunk thought to be released remains assigned to the vnode, and the vnode's resources will not actually be freed for other jobs.


API level details:

  • The count argument is internally converted to a "select" string parameter, which is passed to "pbs_relnodesjob()" via its "extend" argument (of type "char *").
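A minimal C sketch of this call path, assuming the pbs_relnodesjob() signature from pbs_ifl.h (connection, job ID, node list, extend) and assuming a NULL node list is accepted when only the keep specification applies; the job ID and count are taken from the example above:

#include <stdio.h>
#include <pbs_ifl.h>
#include <pbs_error.h>

int main(void)
{
    char extend[32];
    int keep = 3;                     /* value given to "-k" */
    int conn = pbs_connect(NULL);     /* connect to the default server */

    if (conn < 0) {
        fprintf(stderr, "pbs_connect failed: %d\n", pbs_errno);
        return 1;
    }

    /* "-k 3" is translated to the select string "select=3" and passed
     * through the "extend" parameter; no explicit node list is
     * supplied (assumption: NULL is accepted here). */
    snprintf(extend, sizeof(extend), "select=%d", keep);

    if (pbs_relnodesjob(conn, "121.pbssrv", NULL, extend) != 0)
        fprintf(stderr, "pbs_relnodesjob failed: %d\n", pbs_errno);

    pbs_disconnect(conn);
    return 0;
}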







