
Objective

This document introduces the node ramp down feature, which releases sister nodes/vnodes that are no longer needed from a running job before the job finishes.

Interface 1: New command: 'pbs_release_nodes'

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: Release a specified set of sister nodes or vnodes, or all sister nodes or vnodes assigned to the specified running job. The nodes released will then be made available for scheduling other jobs. 
  • Permission: Only job owner, admin, PBS manager, or PBS operator will be allowed to perform the release nodes action.
  • Details:
    • Two types of actions:
      1.  Release a particular set of sister nodes from a job:
        Syntax: pbs_release_nodes -j <job_identifier> <host1_or_vnode1> [<host2_or_vnode2> ...]

        • The 'host*_or_vnode*' argument is any of the sister nodes/vnodes that appear in the exec_vnode attribute of a running job. Example:
          % qsub job.scr
          241.borg
          % qstat -f 241 | egrep "exec|Resource_List|select"

          exec_host = borg[0]/0*0+federer/0*0+lendl/0*2
          exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)+(federer:mem=1048576kb:ncpus=1+federer[0]:mem=1048576kb:ncpus=1+federer[1]:ncpus=1)+(lendl:ncpus=2:mem=2097152kb)
          Resource_List.mem = 6gb
          Resource_List.ncpus = 8
          Resource_List.nodect = 3
          Resource_List.place = scatter
          Resource_List.select = ncpus=3:mem=2gb+ncpus=3:mem=2gb+ncpus=2:mem=2gb
          schedselect = 1:ncpus=3:mem=2gb+1:ncpus=3:mem=2gb+1:ncpus=2:mem=2gb


          %  pbs_release_nodes -j 241 federer[1] lendl

          % qstat -f 241 | egrep "exec|Resource_List|select"

          exec_host = borg[0]/0*0+federer/0*0 <- no lendl as all assigned vnodes in lendl have been cleared.
          exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)+(federer:mem=1048576kb:ncpus=1+federer[0]:mem=1048576kb:ncpus=1) <- federer[1] and lendl removed.

          Resource_List.mem = 4194304kb <- minus 2gb (from lendl)
          Resource_List.ncpus = 5 <- minus 3 cpus (1 from federer[1] and 2 from lendl)
          Resource_List.nodect = 2 <- minus 1 chunk (when lendl was taken out, its entire chunk assignment disappeared)
          Resource_List.place = scatter
          schedselect = 1:mem=2097152kb:ncpus=3+1:mem=2097152kb:ncpus=2

          (See the sketch at the end of this interface for how these updated values can be derived from the remaining exec_vnode.)


      2. Release all sister nodes from a job:
        Syntax:   pbs_release_nodes -j <job_identifier> -a
        • Example:
          % pbs_release_nodes -j 241 -a
          % qstat -f 241

          exec_host = borg[0]/0*0
          exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)
          Resource_List.mem = 2097152kb
          Resource_List.ncpus =  3
          Resource_List.nodect = 1
          Resource_List.place = scatter
          schedselect = 1:mem=2097152kb:ncpus=3


    • After issuing pbs_release_nodes, a running job's $PBS_NODEFILE content will no longer show the released nodes.
      Example:

      % qsub -l select=2:ncpus=1:mem=1gb -l place=scatter -I
      qsub: waiting for job 247.borg.pbspro.com to start
      qsub: job 247.borg.pbspro.com ready

      % cat $PBS_NODEFILE
      borg.pbspro.com
      federer.pbspro.com
      %  pbs_release_nodes -j 247 federer
      % cat $PBS_NODEFILE
      borg.pbspro.com

    • The server will continue to hold the job on a released node until it receives confirmation that the job has been cleaned up from that node.
    • The PBS licenses will be updated accordingly once the job has been completely taken out of the released node.
    • When a node is released, its final action is to report to the mother superior (MS) its resources_used* values for the job. The released node no longer updates resources_used values for that job, since it is no longer part of the job, but the MS holds onto the reported data and adds it during the final aggregation of resources_used values when the job exits.
    • pbs_release_nodes is not currently supported with nodes/vnodes that are tied to Cray XC systems, as the ALPS reservation cannot currently be modified.
    •  Error reporting

        •  pbs_release_nodes will report an error if any of the nodes specified are managed by a mother superior mom.

          Example:

          % pbs_release_nodes -j 241 borg[0]

            pbs_release_nodes: Can't free 'borg[0]' since it's on an MS host

        • pbs_release_nodes will report an error if executed by a user who is not the job owner, an admin, a PBS manager, or a PBS operator:

          Example:

          %  pbs_release_nodes -j 248 federer
          pbs_release_nodes: Unauthorized Request

        • pbs_release_nodes will report an error if the vnode being released is not part of the job:
          Example:

          %  pbs_release_nodes -j 249 lendl
          pbs_release_nodes: these nodes are not part of the job: lendl

        • pbs_release_nodes will report an error if issued on a job that is not in a running state:
          Example:

          % pbs_release_nodes -j 251 lendl
          pbs_release_nodes: Request invalid for state of job

        • pbs_release_nodes will report an error if both the '-a' option and a list of nodes/vnodes are specified in the command line.

          Example:

          % pbs_release_nodes -j 252 -a federer
          usage: pbs_release_nodes [-j job_identifier] host_or_vnode1 host_or_vnode2 ...
          usage: pbs_release_nodes [-j job_identifier] -a
          pbs_release_nodes --version

        • pbs_release_nodes will report an error message and exit if at least one of the hosts or vnodes specified is a Cray XC series node. The following message is returned:

          Example:

          % pbs_release_nodes -j 253 cray_node

          "pbs_release_nodes is not currently supported on Cray XC systems nodes: <cray_node>"

Interface 2: New job attribute 'release_nodes_on_stageout'

  • Visibility: Public
  • Change Control: Stable
  • Value: 'true' or 'false'
  • Synopsis: When set to 'true', this does the equivalent of 'pbs_release_nodes -a', releasing all the sister vnodes when the stageout operation begins.
  • Example:
    %  qsub -W stageout=my_stageout@federer:my_stageout.out -W release_nodes_on_stageout=true job.scr

Interface 3: New server accounting record: 'u' for update record

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: For every release nodes action, a 'u' (for update) record is written to the accounting logs.

  • Details: The 'u' record represents a just-concluded phase of the job. It consists of the set of resources assigned to the job (exec_vnode, exec_host, Resource_List items), the amount of resources used during that phase (resources_used_incr.* items), a snapshot of the resources-used values over the life of the job so far (resources_used.* items), and the set of resources to be assigned in the new phase of the job (i.e. updated_exec_vnode, updated_exec_host, updated_Resource_List).

  • Example:

% qsub -l select=3:ncpus=1:mem=1gb job.scr
242.borg

% qstat -f 242 | egrep "exec|Resource_List|select"

exec_host = borg/0+federer/0+lendl/0
exec_vnode = (borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb)
Resource_List.mem = 3gb
Resource_List.ncpus = 3
Resource_List.nodect = 3
Resource_List.place = scatter
Resource_List.select = 3:ncpus=1:mem=1gb
schedselect = 3:ncpus=1:mem=1gb

% pbs_release_nodes -j 242 lendl

Accounting logs show:

# tail -f /var/spool/PBS/server_priv/accounting/20170123

01/23/2017 18:53:24;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7503 run_count=1 exec_host=borg/0+federer/0+lendl/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb) Resource_List.mem=3gb Resource_List.ncpus=3 Resource_List.nodect=3 Resource_List.place=scatter Resource_List.select=3:ncpus=1:mem=1gb updated_exec_host=borg/0+federer/0 updated_exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb) updated_Resource_List.mem=2097152kb updated_Resource_List.ncpus=2 updated_Resource_List.nodect=2 updated_Resource_List.place=scatter updated_Resource_List.select=1:ncpus=1:mem=1048576kb+1:ncpus=1:mem=1048576kb resources_used_incr.cpupercent=5 resources_used_incr.cput=00:04:35 resources_used_incr.mem=4288kb resources_used_incr.ncpus=3 resources_used_incr.vmem=42928kb resources_used_incr.walltime=00:00:26 resources_used.cpupercent=10 resources_used.cput=00:10:05 resources_used.mem=8192kb resources_used.ncpus=4 resources_used.vmem=82928kb resources_used.walltime=00:11:26

Another pbs_release_nodes call yields:

% pbs_release_nodes -j 242 federer

# tail -f /var/spool/PBS/server_priv/accounting/20170123

01/23/2017 18:59:35;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215949 qtime=1485215949 etime=1485215949 start=1485215949 session=7773 run_count=1 exec_host=borg/0+federer/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb) Resource_List.mem=2097152kb Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:mem=1048576kb+1:ncpus=1:mem=1048576kb updated_exec_host=borg/0 updated_exec_vnode=(borg[0]:ncpus=1:mem=1048576kb) updated_Resource_List.mem=1048576kb updated_Resource_List.ncpus=1 updated_Resource_List.nodect=1 updated_Resource_List.place=scatter updated_Resource_List.select=1:ncpus=1:mem=1048576kb resources_used_incr.cpupercent=3 resources_used_incr.cput=00:03:35 resources_used_incr.mem=2048kb resources_used_incr.ncpus=2 resources_used_incr.vmem=32928kb resources_used_incr.walltime=00:00:26 resources_used.cpupercent=10 resources_used.cput=00:10:05 resources_used.mem=8192kb resources_used.ncpus=4 resources_used.vmem=82928kb resources_used.walltime=00:12:26

01/23/2017 19:00:00;L;license;floating license hour:3 day:3 month:3 max:10

Interface 4: Additional keywords in the 'E' (end) accounting record

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: The 'E' accounting record will show what was still assigned to the job when it completed (exec_vnode, exec_host, Resource_List items), the amount of resources the job had used by the time it ended (resources_used.* items), and resources_used_incr.* items representing the amount of resources used since the last 'u' record was generated (the just-concluded phase of the job). It is up to a log parser to take all the 'u' records and the 'E' record of a job and either sum up the resources_used_incr.* values (e.g. resources_used_incr.walltime) or average them out where that makes more sense (e.g. resources_used_incr.ncpus); a sketch of such a parser is given in the example below.
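
  • Example: A minimal Python parser sketch (not shipped with PBS). It assumes the accounting record layout shown in the Interface 3 example, i.e. "MM/DD/YYYY HH:MM:SS;<record-type>;<job-id>;key=value key=value ..."; the function names here are hypothetical.

    # Sum resources_used_incr.walltime over all 'u' records and the final 'E'
    # record of one job; adapt the field name to sum (or average) other
    # resources_used_incr.* values as appropriate.
    def hms_to_seconds(hms):
        h, m, s = (int(x) for x in hms.split(':'))
        return h * 3600 + m * 60 + s

    def sum_incr_walltime(log_lines, jobid):
        total = 0
        for line in log_lines:
            try:
                _stamp, rectype, recid, text = line.strip().split(';', 3)
            except ValueError:
                continue                  # not a well-formed accounting record
            if recid != jobid or rectype not in ('u', 'E'):
                continue
            for field in text.split():
                if field.startswith('resources_used_incr.walltime='):
                    total += hms_to_seconds(field.split('=', 1)[1])
        return total

    # Using the accounting file from the Interface 3 example:
    with open('/var/spool/PBS/server_priv/accounting/20170123') as f:
        print(sum_incr_walltime(f, '242.borg'))   # total walltime across all phases, in seconds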

Interface 5: New job attributes 'Resource_List_orig', 'schedselect_orig', 'resources_used_acct', 'exec_host_acct', 'exec_vnode_acct', 'Resource_List_acct', and 'exec_vnode_deallocated'

  • Visibility: Private
  • Change Control: Unstable
  • Synopsis: In this feature for releasing nodes early and providing a trace of the action in the accounting logs, the *_orig, *_acct, and *_deallocated internal job attributes are used to save interim data. These attributes could go away in a future re-implementation, so it is best not to depend on them.

  • Details: 
    • The *_orig attributes hold the original values from before any pbs_release_nodes action was done. These are the values the PBS server allocates when rerunning the job (i.e. qrerun, server restart).
    • The *_acct attributes hold the values in place before the pbs_release_nodes action was taken; they are used for the accounting records.
    • The exec_vnode_deallocated attribute holds the resources assigned to recently released vnodes that have not yet completely removed the job from their system.

Interface 6: Additions to log messages

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • Special mom_logs messages:
      • A pbs_release_nodes request causes the server to send a job update to the mother superior (MS) of the job. The MS in turn looks at the list of nodes being released; when the last vnode from a given host is released, the MS sends a DELETE_JOB2 request to the sister mom that owns that host. Upon receiving this request, the sister mom kills the job processes on the node and sends back to the MS the summary accounting information for the job on that node. The mom_logs will show the following DEBUG messages:

        sister mom_logs: "DELETE_JOB2 received"

        Mother superior log: "<reporting_sister_host>;cput=YY mem=ZZ"

    • Special server_logs messages:

      • When a job has been completely removed from an early-released vnode, the following DEBUG2 messages will be shown:

        "clearing job <job-id> from node <vnode-name>"

        "Node<sister-mom-hostname>;deallocating 1 cpus from job <job-id>"

        A small sketch for pulling these messages out of the server_logs files is shown below.

