Objective

Provide the ability to pad a job's resource nodes request (i.e. request additional chunks of resource for the job), so that if some nodes fail, the job can still start. Any left over nodes not needed by the job can be released back to the server.

Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649

Interface 1: select_reliable_startup job resource

  • Visibility: Public
  • Change Control: Stable
  • Python type: pbs.select
  • Input Format: select_reliable_startup="[N:][chunk specification][+[N:]chunk specification]" (same as a select spec)
    where N is a multiple of chunks, and a chunk specification is of the form:
    "<resource name>=<value>[:<resource name>=<value>...]"
    select_reliable_startup can be set in one of the following ways:
    • qsub -l select_reliable_startup=[N:][chunk specification][+[N:]chunk specification]
    • qalter -l select_reliable_startup=[N:][chunk specification][+[N:]chunk specification]
    • As a PBS directive in a job script:
                   #PBS -l select_reliable_startup=[N:][chunk specification][+[N:]chunk specification]
    • Within a Python hook script, use the pbs.select() type:
                 pbs.event().job.Resource_List["select_reliable_startup"] = pbs.select("[N:][chunk specification][+[N:]chunk specification]")
  • Privilege: only root, PBS admin, or a PBS operator can set the select_reliable_startup value
  • Details:
    This is a builtin resource that is used to cause a job to be started reliably. select_reliable_startup must be a mirror of the original 'select' request, but with extra chunks added to cause assignment of extra nodes. When select_reliable_startup is set, the original 'select' value, along with any default resources missing from it, is first saved to a new resource 'select_requested' (see interface 6), and the 'select_reliable_startup' value replaces the 'select' value, causing the scheduler to allocate the extra nodes to the job. The adjusted assignments are reflected in the Resource_List, exec_vnode, and exec_host values. When this is set, any failure of sister nodes is tolerated throughout the life of the job. That is, the job is allowed to run even after the mother superior mom has detected bad nodes. (A minimal hook sketch appears at the end of this interface.)

    When the job's node assignment is updated, the MS mom will tell the sister moms (via an IM_UPDATE_JOB request) that the nodes mapping has changed. This is needed so that spawning multi-node tasks via the TM interface continues to work.
  • Normal case:
    • The select_reliable_startup value is a mirror of the original select value with the 'N' part of a chunk specification (i.e. [N:][chunk specification][+[N:]chunk specification]) changed to reflect additional chunks.
      • Interface 7 below introduces a helper method for adding more instances of chunks (pbs.select.increment_chunks()).
      • Example: given qsub -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb", where there's 1 chunk of the "ncpus=3:mem=1gb" spec, 1 chunk of the "ncpus=2:mem=2gb" spec, and 1 chunk of the "ncpus=1:mem=3gb" spec, a select_reliable_startup value can bump up the second chunk to have 2 of the "ncpus=2:mem=2gb" spec and the third chunk to have 2 of the "ncpus=1:mem=3gb" spec as follows:
                select_reliable_startup=ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
        The first chunk assigned is always preserved since that is allocated from the MS host, so there's really no need to bump the number of chunks for the first chunk in the original select request.
    • Log/Error messages: If 'select_reliable_startup' is set by a regular, non-privileged user, then there'll be an error message from qsub or qalter as follows:
                     "Only PBS managers and operators are allowed to set 'select_reliable_startup'"

    Interface 2: New server accounting record: 's' for the start record of a job that was submitted with the select_reliable_startup request

    • Visibility: Public
    • Change Control: Stable
    • Synopsis: When a job was reliably started (had a select_reliable_startup value), this new accounting record reflects the adjusted values of 'select', 'exec_vnode', 'exec_host', and Resource_List, along with the values of select_reliable_startup (interface 1) and select_requested (interface 6). (See the sketch after the example record below.)
    • Note: This is a new accounting record; the start-of-job record ('S') remains as before.
    • Example:

      04/07/2016 17:08:09;s;20.borg.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=jobr.scr queue=workq ctime=1460063202 qtime=1460063202 etime=1460063202 start=1460063203 exec_host=borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0 exec_vnode=(borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb)  Resource_List.mem=11gb Resource_List.ncpus=9 Resource_List.nodect=5 Resource_List.place=scatter:excl Resource_List.select=1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb Resource_List.select_requested=1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb session=0 run_count=1
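
    As a rough illustration (not part of the design), these 's' records could be pulled out of the server accounting logs with a short script; the directory below assumes a default PBS_HOME of /var/spool/PBS:

      import glob

      # Assumed default accounting log directory; adjust for the local PBS_HOME.
      ACCT_DIR = "/var/spool/PBS/server_priv/accounting"

      for logfile in sorted(glob.glob(ACCT_DIR + "/*")):
          with open(logfile) as f:
              for line in f:
                  # Record format: <timestamp>;<record type>;<jobid>;<key=value> ...
                  fields = line.rstrip("\n").split(";", 3)
                  if len(fields) == 4 and fields[1] == "s":
                      timestamp, rectype, jobid, details = fields
                      print(jobid, details)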

    Interface 3: sister_join_job_alarm mom config option

    • Visibility: Public
    • Change Control: Stable
    • Details:
      This is the number of seconds that the mother superior mom will wait to receive acknowledgement from all the sister moms for the IM_JOIN_JOB requests sent, if the job is started reliably. The primary pbs_mom will ignore any errors from sister moms, including failed IM_JOIN_JOB requests, and will wait up to this value before proceeding to pre-start the job (calling finish_exec()).
    • Default value: the sum of the 'alarm' values of the enabled execjob_begin hooks. For example, if there are 2 execjob_begin hooks, with the first hook having alarm=30 and the second hook having alarm=20, then the default value of sister_join_job_alarm will be 50 seconds. If there are no execjob_begin hooks, then this is set to 30 seconds.
          To change value, add the following line in mom's config file:
                              $sister_join_job_alarm <# of seconds>
    • Log/Error messages:
      1. When the $sister_join_job_alarm value is set, there'll be a PBSEVENT_SYSTEM level message: "join job <alarm_value>"

    Interface 4: job_launch_delay mom config option

    • Visibility: Public
    • Change Control: Stable
    • Details:
      This is the number of seconds that the mother superior will wait before launching the job, i.e. executing the job script or executable. The wait time can be used to let execjob_prologue hooks finish execution so they can capture or report any node failures, or for the mother superior to notice any communication problems with other nodes. Once this wait time has passed, the execjob_launch hook can proceed to execute.
    • Default value: the sum of the 'alarm' values of the enabled execjob_prologue hooks. For example, if there are 2 execjob_prologue hooks, where the first hook has alarm=30 and the second hook has alarm=60, then the default job_launch_delay value will be 90 seconds. If there are no execjob_prologue hooks, then this is set to 30 seconds.
      To change the value, add the following line in mom's config file:
                   $job_launch_delay <number of seconds>
    • Log/Error messages:
      1. When the $job_launch_delay value is set, there'll be a PBSEVENT_SYSTEM level message: "job_launch_delay <delay_value>"

      2. mom_logs, under PBSEVENT_DEBUG2 level: ";Job;<jobid>;job reliable startup job_launch_delay=<$job_launch_delay value> secs"


    Interface 5: pbs.event().vnode_list_fail[] hook parameter


    • Visibility: Public
    • Change Control: Stable


    • Python Type: dict (dictionary of pbs.vnode objects keyed by vnode name)
    • Details:
      This is a new event parameter for the execjob_prologue and execjob_launch hooks. It contains the list of vnodes and their assigned resources that are managed by unhealthy moms. This can include those vnodes from sister moms that failed to join the job. This dictionary object is keyed by vnode name. One can walk through this list and start offlining the vnodes. For example:

      for vn in e.vnode_list_fail.keys():
          v = e.vnode_list_fail[vn]
          pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
          v.state = pbs.ND_OFFLINE

    • Log/Error messages:
      1. If an execjob_prologue hook or an execjob_launch hook requested to offline a vnode, server_logs would show the message under PBSEVENT_DEBUG2 level:

        ";Server@borg;Node;<node name>;Updated vnode <node_name>'s attribute state=offline per mom hook request" 

    Interface 6: select_requested job resource

    • Visibility: Public
    • Change Control: Stable
    • Python Type: pbs.select
    • Privilege: This is a read-only resource that cannot be set by client or hook.
    • Details:
      This is a new builtin resource that the server uses to save the job's original 'select' specification in a complete form, containing any default resources missing from the original chunks as well as the chunk counts. (See the sketch after the example below.)

    • Example:

    Given: qsub -l select=3:ncpus=1+mem=5gb+ncpus=2:mem=2gb

    Resource_List.select_requested would return (notice the default ncpus=1 in the second chunk, where none was specified):
                   3:ncpus=1+1:mem=5gb:ncpus=1+1:ncpus=2:mem=2gb
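
    As a rough sketch (not part of the design), an execjob_launch hook could log the saved specification next to the padded one; this assumes both values are readable through Resource_List (the interface 8 examples use the shorthand job.select_requested):

      import pbs

      e = pbs.event()
      j = e.job

      # 'select' now holds the padded (reliable-startup) spec, while
      # 'select_requested' holds the original request completed with defaults.
      padded = j.Resource_List["select"]
      original = j.Resource_List["select_requested"]
      pbs.logmsg(pbs.LOG_DEBUG, "select=%s select_requested=%s" % (padded, original))
      e.accept()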

    Interface 7: pbs.select.increment_chunks(increment, first_chunk=False)

    • Visibility: Public
    • Change Control: Stable
    • Return Python Type: pbs.select
    • Details:
      This is a new method in the pbs.select type where 'increment' instances are added to each chunk in the chunk specification. So given a select spec of "[N:][chunk specification][+[N:]chunk specification]", this function returns "[N+increment:][chunk specification][+[N+increment:]chunk specification]". By default, first_chunk=False, meaning no increment is added to the first chunk in the spec. (A short usage sketch follows this interface.)
      Example: given pbs.event().job.Resource_List["select"]=ncpus=2:mem=2gb+ncpus=2:mem=2gb+2:ncpus=1:mem=1gb

                  new_select = pbs.event().job.Resource_List["select"].increment_chunks(2)         ← first_chunk=False by default

      where new_select is now: ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb
      Otherwise, if first_chunk=True, the resulting new select also adds 2 instances to the first chunk:
                  new_select: 3:ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb
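
      A minimal sketch (not part of the design) of a queuejob hook exercising both forms of this method, assuming the job arrives with a 'select' value:

      import pbs

      e = pbs.event()
      sel = e.job.Resource_List["select"]
      if sel is not None:
          # Add 2 extra instances of every chunk except the first (default) ...
          padded = sel.increment_chunks(2)
          # ... or of every chunk including the first.
          padded_all = sel.increment_chunks(2, first_chunk=True)
          pbs.logmsg(pbs.LOG_DEBUG, "padded=%s padded_all=%s" % (padded, padded_all))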

    Interface 8: pbs.event().job.release_nodes() method

    • Visibility: Public
    • Change Control: Stable
    • Return Python type: dict
    • Call syntax 1: pbs.event().job.release_nodes(nodes_list)
      • Input: nodes_list - a dictionary of pbs.vnode objects, keyed by 'vnode_name', of those nodes to release from the job
      • Detail: Release the given nodes in 'nodes_list' that have been assigned to the job. This is a hook front-end to the pbs_release_nodes command (PP-339). It returns a dictionary of pbs.vnode objects representing the nodes that have been released. If an error is encountered performing the action, this method returns None.
    • Call syntax 2 (pruning option): pbs.event().job.release_nodes(keep_select)
      • Input:
        • keep_select - a pbs.select string that should be a subset of the job's original select request, mapping to a set of nodes that should be kept. Mother superior mom will remove node resources from nodes that have been detected to be bad, using nodes that have been seen as healthy and functional as replacements when necessary. A common value to pass is pbs.event().job.select_requested (interface 6).
      • Detail: Release nodes that are assigned to the job in such a way that the remaining nodes still satisfy the given 'keep_select' specification, with no nodes that are known to be bad.
      • Returns a dictionary of pbs.vnode objects representing the nodes that have been released from the job.
      • On a successful release_nodes() call, the accounting records introduced by the node ramp-down feature (PP-339) are generated, and the primary mom notifies the sister moms to also update their internal nodes tables, so future use of the task manager API (e.g. tm_spawn, pbsdsh) will be aware of the change.

    • Examples:

      Given an execjob_prologue hook, a hook writer can release a set of nodes from a job by doing:


      e=pbs.event()
      j = e.job
      j.release_nodes(e.vnode_list_fail)

      Given an execjob_launch hook, a hook writer can specify that nodes should be released in such a way that they satisfy the user's original select request:

      e=pbs.event()
      j = e.job
      rel_nodes = j.release_nodes(keep_select=j.select_requested)
      if rel_nodes is None:   # error occurred
          j.rerun()           # requeue the job
          e.reject("Failed to prune job")

    • Log/Error messages:
      • When a job's assigned nodes get pruned (nodes released to satisfy 'keep_select'), mom_logs will show the following info under PBSEVENT_JOB log level:

        ";Job;<jobid>;pruned from exec_vnode=<original value>"
        ";Job;<jobid>;pruned to exec_node=<new value>"

      • When mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown under PBSEVENT_DEBUG log level unless otherwise noted:

        1. "could not satisfy 1st select chunk (<resc1>=<val1> <resc2>=<val2>... <rescN>=valN) with first available chunk (<resc1>=<val1> <resc2>=<val2>...<rescN>=<valN>"  when first chunk from the keep_select spec could not be satisfied

        2. "could not satisfy the original select chunk (<resc1>=<val1> <resc2>=<val2> ...<rescN>=valN) with first available chunk <resc1>=<val1> <resc2>=<val2>..."  when a secondary (sister) chunk from the keep_select spec could not be satisfied

        3. "job node_list_fail: node <node_name1>" which shows what mom is consulting as the bad_nodes list. (consulted by mom in release_nodes() call).

        4. "job node_list:_good node <node_name1>" which shows what mom is consulting as the good_nodes_list (consulted by mom in release_nodes() call).

      • When a sister mom has updated its internal nodes info, mom_logs on the sister host will show this message at PBSEVENT_JOB level:

            ";pbs_mom;Job;<jobid>;updated nodes info"

      • When mother superior notices that not all acks were received from the sister moms in regards to updating their internal nodes data, mom_logs will show the PBSEVENT_DEBUG2 message: "NOT all job updates to sister moms completed." Note that eventually, all the nodes will automatically complete updating their info.
      • If a sister mom receives a TM request but its nodes data has not been updated yet, mom_logs on the sister host will show, under PBSEVENT_JOB log level:
        "job reliably started but missing updated nodes list. Try spawn() again later."
        NOTE: The client would get an "error on spawn" message while doing tm_spawn.

    • Examples:

    Given a queuejob hook that sets select_reliable_startup to allow another node to be added to the second and third chunks of the spec:

    # First, introduce a queue job hook:
    % cat qjob.py

    import pbs
    e=pbs.event()

    j = e.job
    j.Resource_List["select_reliable_startup"] = j.Resource_List["select"].increment_chunks(1)


    # qmgr -c "c h qjob event=queuejob"
    # qmgr -c "i h qjob application/x-python default qjob.py"

    # Second, introduce an execjob_launch hook so that, before the job officially runs its program, the job's currently assigned resources are pruned to match the user's original 'select' request:

    % cat launch.py

    import pbs
    e=pbs.event()

    j = e.job
    relnodes = j.release_nodes(keep_select=j.select_requested)

    if relnodes is None:          # was not successful pruning the nodes
        j.rerun()                 # rerun (requeue) the job
        e.reject("something went wrong pruning the job back to its original select request")

    # Otherwise, deal with the nodes already detected as bad (here, offline them as in interface 5)
    for vn in e.vnode_list_fail.keys():
        v = e.vnode_list_fail[vn]
        pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
        v.state = pbs.ND_OFFLINE

    # qmgr -c "c h launch event=execjob_launch"
    # qmgr -c "i h launch application/x-python default launch.py"


    And a job of the form:


    % cat jobr.scr
    #PBS -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb"
    #PBS -l place=scatter:excl

    echo $PBS_NODEFILE
    cat $PBS_NODEFILE
    echo END
    echo "HOSTNAME tests"
    echo "pbsdsh -n 0 hostname"
    pbsdsh -n 0 hostname
    echo "pbsdsh -n 1 hostname"
    pbsdsh -n 1 hostname
    echo "pbsdsh -n 2 hostname"
    pbsdsh -n 2 hostname
    echo "PBS_NODEFILE tests"
    for host in `cat $PBS_NODEFILE`
    do
        echo "HOST=$host"
        echo "pbs_tmrsh $host hostname"
        pbs_tmrsh $host hostname
        echo "ssh $host pbs_attach -j $PBS_JOBID hostname"
        ssh $host pbs_attach -j $PBS_JOBID hostname
    done


    When the job first starts, it will get assigned 5 nodes, as the "select_reliable_startup" spec added 2 extra nodes:

    % qstat -f 20
    Job Id: 20.borg.pbspro.com
    ...
    exec_host = borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0
    exec_vnode = (borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb)
    Resource_List.mem = 11gb
    Resource_List.ncpus = 9
    Resource_List.nodect = 5
    Resource_List.place = scatter:excl
    Resource_List.select_reliable_startup = ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
    Resource_List.select = 1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
    Resource_List.select_requested = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

    Suppose federer and sampras went down. Then, just before the job runs its program, the execjob_launch hook executes and prunes the job's node assignment back to the original select request, and the job detail now shows:

    % qstat -f 20
    Job Id: 20.borg.pbspro.com
    ...
    exec_host = borg/0*3+lendl/0*2+agassi/0
    exec_vnode = (borg:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)
    Resource_List.mem = 6gb
    Resource_List.ncpus = 6
    Resource_List.nodect = 3
    Resource_List.place = scatter:excl
    Resource_List.select = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

    A snapshot of the job's output would show the pruned list of nodes:

    /var/spool/PBS/aux/20.borg.pbspro.com <-- updated contents of $PBS_NODEFILE
    borg.pbspro.com
    lendl.pbspro.com
    agassi.pbspro.com
    END

    HOSTNAME tests

    pbsdsh -n 0 hostname
    borg.pbspro.com
    pbsdsh -n 1 hostname
    lendl.pbspro.com
    pbsdsh -n 2 hostname
    agassi.pbspro.com

    PBS_NODEFILE tests
    HOST=borg.pbspro.com
    pbs_tmrsh borg.pbspro.com hostname
    borg.pbspro.com
    ssh borg.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
    borg.pbspro.com
    HOST=lendl.pbspro.com
    pbs_tmrsh lendl.pbspro.com hostname
    lendl.pbspro.com
    ssh lendl.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
    lendl.pbspro.com
    HOST=agassi.pbspro.com
    pbs_tmrsh agassi.pbspro.com hostname
    agassi.pbspro.com
    ssh agassi.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
    agassi.pbspro.com
