Objective

This is to introduce the node ramp down feature, which releases no-longer-needed sister nodes/vnodes early from running jobs, before the jobs finish.

(Forum Discussion: http://community.pbspro.org/t/pp-339-and-pp-647-release-vnodes-early-from-running-jobs)

Interface 1: New command: 'pbs_release_nodes'

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: Release a specified set of sister nodes or vnodes, or all sister nodes or vnodes assigned to the specified running job. The released nodes are then made available for scheduling other jobs.
  • Permission: Only the job owner, admin, PBS manager, or PBS operator will be allowed to perform the release nodes action.
  • Details:
    • Two types of actions:

     1.  Release a particular set of sister nodes from a job:
    Syntax:   pbs_release_nodes [-j <job_identifier>] <host1_or_vnode1> [<host2_or_vnode2> [<host3_or_vnode3>] ...]

    • Without the '-j' option, pbs_release_nodes will use the value of the environment variable $PBS_JOBID as the job identifier, so the command can be called inside a PBS job script, where such a variable exists.

    • The 'host*_or_vnode*' argument is any of the sister nodes/vnodes that appear in the exec_vnode attribute of a running job. Example:
    % qsub job.scr
    241.borg
    % qstat -f 241 | egrep "exec|Resource_List|select"

    exec_host = borg[0]/0*0+federer/0*0+lendl/0*2
    exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)+(federer:mem=1048576kb:ncpus=1+federer[0]:mem=1048576kb:ncpus=1+federer[1]:ncpus=1)+(lendl:ncpus=2:mem=2097152kb)
    Resource_List.mem = 6gb
    Resource_List.ncpus = 8
    Resource_List.nodect = 3
    Resource_List.place = scatter
    Resource_List.select = ncpus=3:mem=2gb+ncpus=3:mem=2gb+ncpus=2:mem=2gb
    schedselect = 1:ncpus=3:mem=2gb+1:ncpus=3:mem=2gb+1:ncpus=2:mem=2gb

    %  pbs_release_nodes -j 241 federer[1] lendl


    % qstat -f 241 | egrep "exec|Resource_List|select"

    exec_host = borg[0]/0*0+federer/0*0  <- no lendl as all assigned vnodes in lendl have been cleared.
    exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)+(federer:mem=1048576kb:ncpus=1+federer[0]:mem=1048576kb:ncpus=1)  <- federer[1] and lendl removed.
    Resource_List.mem = 4194304kb  <- minus 2gb (from lendl)
    Resource_List.ncpus = 5  <- minus 3 cpus (1 from federer[1] and 2 from lendl)
    Resource_List.nodect = 2  <- minus 1 chunk (when lendl was taken out, its entire chunk assignment disappeared)
    Resource_List.place = scatter
    schedselect = 1:mem=2097152kb:ncpus=3+1:mem=2097152kb:ncpus=2

     2.  Release all sister nodes from a job:
    Syntax:   pbs_release_nodes [-j <job_identifier>] -a

    • Without the '-j' option, pbs_release_nodes will use the value of the environment variable $PBS_JOBID as the job identifier, so the command can be called inside a PBS job script, where such a variable exists.
    • Example:
      % pbs_release_nodes -j 241 -a
      % qstat -f 241

      exec_host = borg[0]/0*0
      exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)
      Resource_List.mem = 2097152kb
      Resource_List.ncpus = 3
      Resource_List.nodect = 1
      Resource_List.place = scatter
      schedselect = 1:mem=2097152kb:ncpus=3


      • After issuing pbs_release_nodes, a running job's $PBS_NODEFILE content will no longer show the released nodes.
        Example:

        % qsub -l select=2:ncpus=1:mem=1gb -l place=scatter -I
        qsub: waiting for job 247.borg.pbspro.com to start
        qsub: job 247.borg.pbspro.com ready

        % cat $PBS_NODEFILE
        borg.pbspro.com
        federer.pbspro.com
        %  pbs_release_nodes -j 247 federer
        % cat $PBS_NODEFILE
        borg.pbspro.com

      • The server will continue to hold on to the job on a released node until it receives confirmation that the job has been cleaned up from the node.
      • The PBS licenses will be updated accordingly once the job has been completely taken off the released node.
      • When a node is released, it reports its resources_used* values for the job to the mother superior (MS) as its final action. The released node no longer updates the resources_used values for that job since it is no longer part of the job, but MS holds on to the data, which is added during the final aggregation of resources_used values when the job exits.
      • pbs_release_nodes is not currently supported with nodes/vnodes that are tied to Cray X* series systems, as the ALPS reservation cannot be modified right now. These are the nodes/vnodes whose vntype matches the "cray_" prefix.
      • pbs_release_nodes is also not supported with nodes/vnodes managed by cpuset moms, given that partial release of vnodes may result in leftover cpusets.
      • If cgroups support is enabled and pbs_release_nodes is called to release some but not all of the vnodes from the same mom host, resources on those vnodes that are part of a cgroup would not get automatically released until the entire cgroup is released.
      • Error reporting:

          •  pbs_release_nodes will report an error if any of the nodes specified are managed by a mother superior mom.

            Example:

            % pbs_release_nodes -j 241 borg[0]

              pbs_release_nodes: Can't free 'borg[0]' since it's on a primary execution host

          • pbs_release_nodes will report an error if executed by a non-admin, non-manager, non-operator, or non-job owner user:

            Example:

            %  pbs_release_nodes -j 248 federer
            pbs_release_nodes: Unauthorized Request

            There will also be a server_logs entry of the form:

            07/13/2017 04:29:23;0020;Server@corretja;Job;<job-id>;Unauthorized Request, request type: 90, Object: Job, Name: <jobid>, request from: <requesting user>@<requestor host>

          • pbs_release_nodes will report an error if the vnode being released is not part of the job:
            Example:

            %  pbs_release_nodes -j 249 lendl
            pbs_release_nodes: node(s) requested to be released not part of the job: lendl

          • pbs_release_nodes will report an error if issued on a job that is not in a running state:
            Example:

            % pbs_release_nodes -j 251 lendl
            pbs_release_nodes: Request invalid for state of job

          • pbs_release_nodes will report an error if both the '-a' option and a list of nodes/vnodes are specified in the command line.

            Example:

            % pbs_release_nodes -j 252 -a federer
            usage: pbs_release_nodes [-j job_identifier] host_or_vnode1 host_or_vnode2 ...
            usage: pbs_release_nodes [-j job_identifier] -a
            pbs_release_nodes --version


          • pbs_release_nodes will report an error if it cannot find a job identifier to associate with the release nodes action:
            Example:

            Execute the following from outside a PBS job:

            % pbs_release_nodes -a
               pbs_release_nodes: No jobid given

          • pbs_release_nodes will report an error message and exit if at least one of the hosts or vnodes specified is a Cray X* series node. The following message is returned:

            Example:

            % pbs_release_nodes -j 253 cray_node

            "pbs_release_nodes: not currently supported on Cray X* series nodes: <cray_node>"

          • pbs_release_nodes will report an error message and exit if at least one of the hosts or vnodes specified is managed by a pbs_mom running cpusets (i.e. resources_available.arch='linux_cpuset'). The following message is returned:

            Example:

            % pbs_release_nodes -j 253 cpuset_node

            "pbs_release_nodes: not currently supported on nodes whose resources are part of a cpuset: <cpuset_node>"


      • At every successful pbs_release_nodes call, qstat will show the updated exec_host, exec_vnode, Resource_List* values.

      • When releasing vnodes, if all the assigned vnodes coming from the same mom host have been released, then the job will be completely removed from that mom host. This results in: 1) the execjob_epilogue hook script (if it exists) executing, 2) job processes on that mom host being killed, 3) any job-specific files, including job temporary directories, being removed, and 4) cpusets and Cray ALPS reservations on that mom host being cleared. The execjob_end hook (if it exists) will also execute on the host.

      • If one (or more) but not all of the vnodes assigned to the job from a mom host have been released (a partial release of vnodes), then the job does not get removed from that mom host yet. If the released vnodes have been configured to be shared, they can be reassigned to other jobs.

        • If an exclusively-assigned vnode is released from a job and there are still other vnodes from the same mom host assigned, the released vnode will still not be made available to other jobs. This is reflected in the pbsnodes -av output.

          For example, say a job is submitted as:

          % qsub -l select=ncpus=1+2:ncpus=1 -l place=excl -- /bin/sleep 300
          152.corretja

          Job is seen as running and assigned exclusively to the following vnodes:

          % qstat -f | egrep exec_vnode
          exec_vnode = (corretja:ncpus=1)+(federer[0]:ncpus=1)+(federer[1]:ncpus=1)

          % pbsnodes -av
          corretja
          Mom = corretja.pbspro.com
          state = job-exclusive
          jobs = 152.corretja/0
          resources_available.ncpus = 1
          resources_assigned.ncpus = 1
          federer[0]
          Mom = federer.pbspro.com
          state = job-exclusive
          jobs = 152.corretja/0
          resources_available.ncpus = 1
          resources_assigned.ncpus = 1
          federer[1]
          Mom = federer.pbspro.com
          state = job-exclusive
          jobs = 152.corretja/0
          resources_available.ncpus = 4
          resources_assigned.ncpus = 1


          Suppose we release vnode federer[1]:

          % pbs_release_nodes -j 152 federer[1]

          Even though the job shows federer[1] has been taken out:

          % qstat -f 152 | grep exec_vnode
          exec_vnode = (corretja:ncpus=1)+(federer[0]:ncpus=1)

          pbsnodes would still show the job assigned to the vnode, and the vnode is not made available to other jobs, because another vnode from the same mom host, federer[0], is still assigned:

          % pbsnodes -av
          corretja
          Mom = corretja.pbspro.com
          state = job-exclusive
          jobs = 152.corretja/0
          resources_available.ncpus = 1
          federer[0]
          Mom = federer.pbspro.com
          state = job-exclusive
          jobs = 152.corretja/0
          resources_available.ncpus = 1
          resources_assigned.ncpus = 1
          federer[1]
          Mom = federer.pbspro.com
          state = job-exclusive
          jobs = 152.corretja/0
          resources_available.ncpus = 4
          resources_assigned.ncpus = 1

          If the other vnode is released, then federer[1] would be made available:

          % pbs_release_nodes -j 152 federer[0]


          % pbsnodes -av
          corretja
          Mom = corretja.pbspro.com
          state = job-exclusive
          jobs = 152.corretja/0
          resources_available.ncpus = 1
          resources_assigned.ncpus = 1
          federer[0]
          Mom = federer.pbspro.com
          state = free
          resources_available.ncpus = 1
          resources_assigned.ncpus = 0
          federer[1]
          Mom = federer.pbspro.com
          state = free
          resources_available.ncpus = 4
          resources_assigned.ncpus = 0

          If the vnode has not been assigned exclusively, then other resources (i.e. cpus/mem) from the released vnode can be allocated to other jobs. This is shown in the following scenario:

          % qsub -l select=ncpus=1+2:ncpus=1 -- /bin/sleep 300
          155.corretja

          % qstat -f | grep exec_vnode
          exec_vnode = (corretja:ncpus=1)+(federer[0]:ncpus=1)+(federer[1]:ncpus=1)

          % pbsnodes -av
          corretja
          Mom = corretja.pbspro.com
          state = job-busy
          jobs = 155.corretja/0
          resources_available.ncpus = 1
          resources_assigned.ncpus = 1
          federer[0]
          Mom = federer.pbspro.com
          state = job-busy
          jobs = 155.corretja/0
          resources_available.ncpus = 1
          resources_assigned.ncpus = 1
          federer[1]
          Mom = federer.pbspro.com
          state = free
          jobs = 155.corretja/0
          resources_available.ncpus = 4
          resources_assigned.ncpus = 1


          Now release vnode federer[1]; it will be reflected in exec_vnode, but pbsnodes will still show the job assigned to the vnode:

          % pbs_release_nodes -j 155 federer[1]

          % qstat -f | grep exec_vnode
          exec_vnode = (corretja:ncpus=1)+(federer[0]:ncpus=1)


          But since federer[1] has not been assigned exclusively, and there are 3 other cpus available, the vnode can be assigned to other jobs, up to the number of available cpus:

          Here's another job:
          % qsub -l select=vnode=federer[1] -- /bin/sleep 300
          156.corretja

          bayucan@corretja:~/tmp> qstat
          Job id Name User Time Use S Queue
          ---------------- ---------------- ---------------- -------- - -----
          155.corretja STDIN bayucan 00:00:00 R workq
          156.corretja STDIN bayucan 00:00:00 R workq


          % pbsnodes -av
          corretja
          Mom = corretja.pbspro.com
          state = job-busy
          jobs = 155.corretja/0
          resources_available.ncpus = 1
          resources_assigned.ncpus = 1
          federer[0]
          Mom = federer.pbspro.com
          state = job-busy
          jobs = 155.corretja/0
          resources_available.ncpus = 1
          resources_assigned.ncpus = 1
          federer[1]
          Mom = federer.pbspro.com
          state = free
          jobs = 155.corretja/0, 156.corretja/1
          resources_available.ncpus = 4
          resources_assigned.ncpus = 2


          Now releasing federer[0] from job 155 completely removes the assignment of job 155 from the mom host federer.pbspro.com, which manages these vnodes.

          % pbs_release_nodes -j 155 federer[0]

          % pbsnodes -av
          corretja
          Mom = corretja.pbspro.com
          state = job-busy
          jobs = 155.corretja/0
          resources_available.ncpus = 1
          resources_assigned.ncpus = 1
          federer[0]
          Mom = federer.pbspro.com
          state = free
          resources_available.ncpus = 1
          resources_assigned.ncpus = 0
          federer[1]
          Mom = federer.pbspro.com
          state = free
          jobs = 156.corretja/1
          resources_available.ncpus = 4
          resources_assigned.ncpus = 1

      • API:

        NAME
        pbs_relnodesjob - release a set of sister nodes or vnodes,
        or all sister nodes or vnodes assigned to the specified PBS
        batch job.

        SYNOPSIS
        #include <pbs_error.h>
        #include <pbs_ifl.h>

        int pbs_relnodesjob(int connect, char *job_id, char *node_list,
        char *extend)

        DESCRIPTION
        Issue a batch request to release sister vnodes from a batch job.

        A RelnodesJob batch request is generated and sent to the server over
        the connection specified by connect which is the return value of
        pbs_connect().

        The argument, job_id, identifies the job to release nodes or vnodes
        from; it is specified in the form:
        sequence_number.server

        The parameter, node_list, is a plus ('+') separated list of vnode names
        whose parent mom is a sister mom. If node_list is NULL, then
        this refers to all the sister vnodes assigned to the job.

        The parameter, extend, is reserved for implementation-defined
        extensions.

        DIAGNOSTICS
        When the pbs_relnodesjob() function has been completed successfully
        by a batch server, the routine will return 0 (zero). Otherwise, a
        non-zero error is returned. The error number is
        also set in pbs_errno.
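
        Illustration only (not part of the proposed API): one way to drive pbs_relnodesjob() from a script is via Python's ctypes. The sketch below assumes the IFL shared library is loadable as 'libpbs.so' (the actual name and path depend on the installation) and uses pbs_connect(), pbs_geterrmsg() and pbs_disconnect() from the same IFL API; the release_nodes() helper and the job id in the comments are made up for the example.

        # Illustrative sketch only: call pbs_relnodesjob() via ctypes.
        # Assumption: the PBS IFL shared library is loadable as 'libpbs.so'
        # (the real name/path varies by installation).
        import ctypes

        libpbs = ctypes.CDLL("libpbs.so")
        libpbs.pbs_connect.argtypes = [ctypes.c_char_p]
        libpbs.pbs_connect.restype = ctypes.c_int
        libpbs.pbs_relnodesjob.argtypes = [ctypes.c_int, ctypes.c_char_p,
                                           ctypes.c_char_p, ctypes.c_char_p]
        libpbs.pbs_relnodesjob.restype = ctypes.c_int
        libpbs.pbs_geterrmsg.argtypes = [ctypes.c_int]
        libpbs.pbs_geterrmsg.restype = ctypes.c_char_p
        libpbs.pbs_disconnect.argtypes = [ctypes.c_int]
        libpbs.pbs_disconnect.restype = ctypes.c_int

        def release_nodes(job_id, node_list=None, server=None):
            """Release the '+'-separated node_list (all sister vnodes if None)
            from job_id, mirroring 'pbs_release_nodes [-a]'."""
            conn = libpbs.pbs_connect(server.encode() if server else None)
            if conn < 0:
                raise RuntimeError("pbs_connect failed")
            try:
                rc = libpbs.pbs_relnodesjob(
                        conn, job_id.encode(),
                        node_list.encode() if node_list else None, None)
                if rc != 0:
                    msg = libpbs.pbs_geterrmsg(conn)
                    raise RuntimeError(msg.decode() if msg else "pbs_relnodesjob failed")
            finally:
                libpbs.pbs_disconnect(conn)

        # Example (hypothetical job id):
        # release_nodes("241.borg", "federer[1]+lendl")  # release two sister vnodes
        # release_nodes("241.borg")                      # release all sister vnodes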

    Interface 2: New job attribute 'release_nodes_on_stageout'

    • Visibility: Public
    • Change Control: Stable
    • Value: 'true' or 'false'
    • Synopsis: When set to 'true', this does the equivalent of 'pbs_release_nodes -a', releasing all the sister vnodes when the stageout operation begins.
    • Example:
      %  qsub -W stageout=my_stageout@federer:my_stageout.out -W release_nodes_on_stageout=true job.scr

    • This can also be specified in the server attribute 'default_qsub_arguments' to allow all jobs to be submitted with release_nodes_on_stageout set by default.

    • If there was no stageout parameter specified, then release_nodes_on_stageout is not consulted even if it is set to true.
    • The use of this attribute is not currently supported with nodes/vnodes that are tied to Cray X* series systems. These are nodes/vnodes whose vntype matches the "cray_" prefix.
    • This is also not supported with nodes/vnodes managed by cpuset moms, given that partial release of vnodes may result in leftover cpusets. These are the vnodes whose 'arch' attribute value is "linux_cpuset".
    • If cgroups support is enabled, and this option is used to release some of the vnodes but not all the vnodes from the same mom host, resources on those vnodes that are part of a cgroup would not get automatically released until the entire cgroup is released.
    • This attribute can also be set in a queuejob, modifyjob hook, and the Python type is boolean with valid values 'True' or 'False'.

    Example:
     # cat qjob.py
    import pbs
    e=pbs.event()
    e.job.release_nodes_on_stageout = True
    # qmgr -c "create hook qjob event=queuejob"
    # qmgr -c "import hook application/x-python default qjob.py"
    % qsub job.scr
    23.borg
    % qstat -f 23
    ...
    release_nodes_on_stageout = True

    The following three interfaces relate to new accounting records and apply to what is termed a "phased" job. Think of a job that is running with its assigned resources (exec_vnode/exec_host/Resource_List); that is one phase of the job. Issuing pbs_release_nodes on some of the job's assigned vnodes begins the next, new phase of the job. Accounting info from the just-concluded phase is reflected in the 'u' record, while the reduced set of vnodes and their resources is reflected in the 'c' record, showing the next assigned exec_vnode/exec_host/Resource_List.

    Interface 3: New server accounting record: 'u' for update record

    • Visibility: Public
    • Change Control: Stable
    • Synopsis: For every release nodes action, an accounting_logs record will be written, called the 'u' (for update) record.

    • Details: The 'u' record represents a just-concluded phase of the job, which consists of the set of resources assigned to the job (exec_vnode, exec_host, Resource_List items) and the amount of resources used (resources_used) during that phase of the job.

    • Example:

    % qsub -l select=3:ncpus=1:mem=1gb job.scr

    242.borg

    % qstat -f 242 | egrep "exec|Resource_List|select"

    exec_host = borg/0+federer/0+lendl/0

    exec_vnode = (borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb)

    Resource_List.mem = 3gb

    Resource_List.ncpus = 3

    Resource_List.nodect = 3

    Resource_List.place = scatter

    Resource_List.select = 3:ncpus=1:mem=1gb

    schedselect = 3:ncpus=1:mem=1gb

    % pbs_release_nodes -j 242 lendl

    Accounting logs show:

    # tail -f /var/spool/PBS/server_priv/accounting/20170123

    01/23/2017 18:53:24;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7503 run_count=1 exec_host=borg/0+federer/0+lendl/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb) Resource_List.mem=3gb Resource_List.ncpus=3 Resource_List.nodect=3 Resource_List.place=scatter Resource_List.select=3:ncpus=1:mem=1gb resources_used.cpupercent=5 resources_used.cput=00:04:35 resources_used.mem=4288kb resources_used.ncpus=3 resources_used.vmem=42928kb resources_used.walltime=00:00:26

    Another pbs_release_nodes call yields:

    % pbs_release_nodes -j 242 federer

    # tail -f /var/spool/PBS/server_priv/accounting/20170123

    01/23/2017 18:59:35;u;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7773 run_count=1  exec_host=borg/0+federer/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb) Resource_List.mem=2097152kb Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:mem=1048576kb+1:ncpus=1:mem=1048576kb resources_used.cpupercent=3 resources_used.cput=00:03:35 resources_used.mem=2048kb resources_used.ncpus=2 resources_used.vmem=32928kb resources_used.walltime=00:00:26 
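
    Illustration only: a site log parser could split a 'u' record into its fields as in the Python sketch below. It assumes, as in the sample records above, that attribute values contain no embedded spaces; parse_update_record() is a hypothetical helper, not part of this design.

    # Sketch: parse a PBS accounting 'u' (update) record line into its parts.
    # Assumes, as in the samples above, that values contain no spaces.
    def parse_update_record(line):
        header, _, body = line.partition(";u;")
        if not body:
            return None                      # not a 'u' record
        jobid, _, attrs = body.partition(";")
        fields = {}
        for token in attrs.split():
            key, _, value = token.partition("=")
            fields[key] = value
        return {"time": header, "jobid": jobid, "fields": fields}

    # Example with a trimmed record:
    rec = parse_update_record(
        "01/23/2017 18:53:24;u;242.borg;user=bayucan group=users "
        "Resource_List.nodect=3 resources_used.walltime=00:00:26")
    print(rec["jobid"], rec["fields"]["Resource_List.nodect"])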

    Interface 4: New server accounting record: 'c' for continue record:

    • Visibility: Public
    • Change Control: Stable
    • Synopsis: The 'c' accounting record will show the next assigned exec_vnode, exec_host, and Resource_List values, along with the job attributes, in the new/next phase of the job. It is generated for every release nodes action and is paired with the 'u' accounting record (Interface 3).

      Given the following example:

      % qsub -l select=3:ncpus=1:mem=1gb job.scr

      242.borg

      % qstat -f 242 | egrep "exec|Resource_List|select"

      exec_host = borg/0+federer/0+lendl/0

      exec_vnode = (borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb)+(lendl:ncpus=1:mem=1048576kb)

      Resource_List.mem = 3gb

      Resource_List.ncpus = 3

      Resource_List.nodect = 3

      Resource_List.place = scatter

      Resource_List.select = 3:ncpus=1:mem=1gb

      schedselect = 3:ncpus=1:mem=1gb

      % pbs_release_nodes -j 242 lendl

      Accounting logs show:

      # tail -f /var/spool/PBS/server_priv/accounting/20170123

      01/23/2017 18:53:24;c;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7503 run_count=1 exec_host=borg/0+federer/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb)+(federer:ncpus=1:mem=1048576kb) Resource_List.mem=2097152kb Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=scatter Resource_List.select=1:ncpus=1:mem=1048576kb+1:ncpus=1:mem=1048576kb

      Another pbs_release_nodes call yields a 'c' record with the 'federer' vnode assignment gone as well:

      % pbs_release_nodes -j 242 federer

      # tail -f /var/spool/PBS/server_priv/accounting/20170123

      01/23/2017 18:59:35;c;242.borg;user=bayucan group=users project=_pbs_project_default jobname=STDIN queue=workq ctime=1485215572 qtime=1485215572 etime=1485215572 start=1485215572 session=7773 run_count=1 exec_host=borg/0 exec_vnode=(borg[0]:ncpus=1:mem=1048576kb) Resource_List.mem=1048576kb Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=scatter Resource_List.select=1:ncpus=1:mem=1048576kb


    Interface 5: New server accounting record: 'e' (end) for end of job record for a phased job

    • Visibility: Public
    • Change Control: Stable
    • Synopsis: The 'e' accounting record will show the resources assigned to the job ( exec_vnode, exec_host, Resource_List items), and amount of resources used (resources_used) during that last phase of the job.
    • Details: It will be up to the log parser to take all of the 'u' records and the 'e' record of the job and either sum up the resources_used.* values (e.g. resources_used.walltime) or average them out, whichever makes sense (e.g. resources_used.ncpus). Note that the regular 'E' (end) accounting record will continue to be generated for a job, whether it has released nodes or not, showing the job's values in total at the end of the job.
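
    Illustration only: a minimal Python sketch of the aggregation described above, summing resources_used.walltime and averaging resources_used.ncpus over a job's 'u' and 'e' records. The accounting file path and job id are placeholders, and a real parser may need to scan more than one day file.

    # Sketch: aggregate resources_used across a job's 'u' and 'e' accounting
    # records. Path and job id below are placeholders.
    import re

    def aggregate_phases(accounting_file, jobid):
        total_walltime = 0      # summed across phases (in seconds)
        ncpus_values = []       # averaged across phases
        with open(accounting_file) as f:
            for line in f:
                parts = line.rstrip("\n").split(";", 3)
                if len(parts) < 4 or parts[1] not in ("u", "e") or parts[2] != jobid:
                    continue
                m = re.search(r"resources_used\.walltime=(\d+):(\d+):(\d+)", parts[3])
                if m:
                    hours, mins, secs = map(int, m.groups())
                    total_walltime += hours * 3600 + mins * 60 + secs
                m = re.search(r"resources_used\.ncpus=(\d+)", parts[3])
                if m:
                    ncpus_values.append(int(m.group(1)))
        avg_ncpus = sum(ncpus_values) / len(ncpus_values) if ncpus_values else 0
        return total_walltime, avg_ncpus

    # Example (placeholder path and job id):
    # print(aggregate_phases("/var/spool/PBS/server_priv/accounting/20170123", "242.borg"))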


    Interface 6: Additions to log messages

    • Visibility: Public
    • Change Control: Stable
    • Details:
      • Special mom_logs messages:
        • A pbs_release_nodes request causes the server to send a job update to the mother superior (MS) of the job. The MS in turn looks at the list of nodes being removed. If a released node is the last node assigned to the job from its host, MS sends a new DELETE_JOB2 request to the owning sister mom. Upon receiving this request, the sister mom kills the job processes on the node and sends back to the mother superior the summary accounting information for the job on that node. Mom_logs will show the following DEBUG messages:

          sister mom_logs: "DELETE_JOB2 received"

          Mother superior log: "<reporting_sister_host>;cput=YY mem=ZZ"

      • Special server_logs messages:

        • When a job has been completely removed from an early-released vnode, the following DEBUG2 messages will be shown:

          "clearing job <job-id> from node <vnode-name>"

          "Node<sister-mom-hostname>;deallocating 1 cpu(s) from job <job-id>"

    Interface 7: New server attribute 'show_hidden_attribs'

    • Visibility: Private
    • Change Control: Unstable
    • Value: 'true' or 'false'
    • Synopsis: When set to 'true', this allows qstat -f to also show the values of internal attributes created by the server to implement the node ramp down feature. Example internal job attributes whose values may be shown are exec_vnode_orig, exec_vnode_acct, exec_vnode_deallocated, exec_host_orig, exec_host_acct, Resource_List_orig, and Resource_List_acct.

    • Note: This attribute is provided as an aid to debugging PBS.