
Objective

This enhancement extends PBS reporting of resources_used values. In particular, MoM accumulates resources_used values that are set in a hook, whether for a builtin resource or a custom resource.

Interface 1: For multi-node jobs, report accumulated resources_used values in accounting logs/qstat -f output, for those resources set in a hook.

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: Display accumulated resources_used values in accounting logs and qstat -f output, for resources that are set in an execjob_prologue, execjob_epilogue, or exechost_periodic hook.
  • Details:
    • The resources_used resources 'cput', 'mem', and 'cpupercent' will continue to be aggregated and reported as before.
    • The additional resources that can be accumulated are those set in a hook, which can be a builtin resource (e.g., vmem) or a custom resource.

      • Builtin resource: If a builtin resource is set in a hook, MoM automatically stops any polling it does for that resource's value. The hook then becomes responsible for updating the value.

      • Custom resource: For a custom resource to be set in a hook, the resource must already have been added to PBS in one of two ways:

        1. Via qmgr:

          # qmgr -c "create resource <res_name> type=<res_type>,flag=h"

        2. Via a mom exechost_startup hook as follows: 

          # qmgr -c "create hook start event=exechost_startup"
          # qmgr -c "import hook start application/x-python default start.py" 
          # qmgr -c "export hook start application/x-python default"
          import pbs
          e=pbs.event()
          localnode=pbs.get_local_nodename()

          e.vnode_list[localnode].resources_available['foo_i'] = 7
          e.vnode_list[localnode].resources_available['foo_f'] = 5.0
          e.vnode_list[localnode].resources_available['foo_str'] = "seventyseven"
          e.vnode_list[localnode].resources_available['stra'] = "pears"

    • Aggregation of values: The resource value collected by the mother superior (MS) mom is aggregated with each of the values obtained from the sister moms whose nodes are part of the job.

    • For resources of type float, long, and size, the value will be reported in accounting logs and qstat -f as:

                          resources_used.<resource_name> = <summed total>      

      If for some reason a sister node did not report back the resources_used value for the resource, then the last known value will be used.

    • For resources of type string and string_array, the value is aggregated into a JSON-style format, as follows:

                           resources_used.<resource_name> = {"<node1>": "<str_val>", "<node2>": "<str_val>", ...}

                           NOTE: The quotes are included to disambiguate embedded spaces, commas and brackets.

      If one or more moms did not report on that resource, the last known value sent by each such mom will be used. If a mom has not reported a value at all, then the keyword None will be reported in place of <str_val>:
                           resources_used.<resource_name> = {"<node1>": "<str_val>", "<node2>":None, ...}
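
The aggregation rules above can be sketched in plain Python, outside of PBS. This is only an illustration of the described behavior, not MoM's implementation: the function names aggregate_numeric and aggregate_strings, the node names, and the dictionary-of-last-known-values input are assumptions made for the example. The string output omits the space after the colon, following the qstat -f output shown in the examples.

```python
def aggregate_numeric(last_known):
    """Sum the last known numeric value reported by each mom.

    `last_known` maps node name -> last value that mom reported.
    Assumption for this sketch: a mom that never reported is absent
    from the mapping and contributes nothing to the total.
    """
    return sum(last_known.values())


def aggregate_strings(nodes, last_known):
    """Build the JSON-style per-node value for string/string_array resources.

    Moms that never reported a value appear with the bare keyword None.
    """
    parts = []
    for node in nodes:
        if node in last_known:
            parts.append('"%s":"%s"' % (node, last_known[node]))
        else:
            parts.append('"%s":None' % node)
    return "{" + ",".join(parts) + "}"


# Hypothetical two-node job: node1 is the MS, node2 is a sister.
nodes = ["node1", "node2"]
foo_i = aggregate_numeric({"node1": 9, "node2": 10})
foo_str = aggregate_strings(nodes, {"node1": "nine"})  # node2 never reported

print("resources_used.foo_i =", foo_i)
print("resources_used.foo_str =", foo_str)
```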

Examples:

Given an epilogue hook that runs on all the mom nodes, setting different resources_used values based on whether executing on a MS mom or sister mom:

# qmgr -c "list hook epi"

Hook epi
type = site
enabled = true
event = execjob_epilogue
user = pbsadmin
alarm = 30
order = 1
debug = false
fail_action = none

# qmgr -c "e h epi application/x-python default"
import pbs
e=pbs.event()
pbs.logmsg(pbs.LOG_DEBUG, "executed epilogue hook")
if e.job.in_ms_mom(): #set in MS mom
    e.job.resources_used["vmem"] = pbs.size("9gb")
    e.job.resources_used["foo_i"] = 9
    e.job.resources_used["foo_f"] = 0.09
    e.job.resources_used["foo_str"] = "nine"
    e.job.resources_used["cput"] = 10
    e.job.resources_used["stra"] = '"broccoli,tomatoes"'
else: # set in sister mom
    e.job.resources_used["vmem"] = pbs.size("10gb")
    e.job.resources_used["foo_i"] = 10
    e.job.resources_used["foo_f"] = 0.10
    e.job.resources_used["foo_str"] = "ten"
    e.job.resources_used["cput"] = 20
    e.job.resources_used["stra"] = '"carrots,onions"'

Now with 2 nodes: corretja (server/MS), and nadal:

Submit the following job:

% cat job.scr2
#PBS -l select=2:ncpus=1
pbsdsh -n 1 hostname
sleep 300


bayucan@corretja:~/bugs/pbs_13914> qsub job.scr2
102.corretja.pbspro.com

When the job completes, the following resources_used values are shown:

 

% qstat -f 102

...

resources_used.cpupercent = 0
resources_used.cput = 00:00:30
resources_used.vmem = 19gb
resources_used.foo_f = 0.19
resources_used.foo_i = 19
resources_used.foo_str = {"corretja.pbspro.com":"nine",
"nadal.pbspro.com":"ten"}
resources_used.mem = 0kb
resources_used.ncpus = 2
resources_used.stra = {"corretja.pbspro.com":"broccoli,tomatoes",
"nadal.pbspro.com":"carrots,onions"}
resources_used.walltime = 00:00:05


NOTE: The cput, vmem, foo_f, foo_i, foo_str, and stra values are accumulated from the MS value and the sister value.
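
As a quick check of the numeric totals above, the per-mom values set by the hook can be added by hand. The sketch below does this in plain Python, outside of PBS. The helper fmt_duration is hypothetical, and treating vmem as a whole number of gigabytes is a simplification for this example; cput is taken to be in seconds, which matches the hook setting 10 and 20 and qstat reporting 00:00:30.

```python
def fmt_duration(seconds):
    """Format a number of seconds as HH:MM:SS, the way cput is displayed."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return "%02d:%02d:%02d" % (hours, minutes, secs)


# Values set by the epilogue hook on each mom.
ms = {"vmem_gb": 9, "foo_i": 9, "foo_f": 0.09, "cput": 10}
sister = {"vmem_gb": 10, "foo_i": 10, "foo_f": 0.10, "cput": 20}

# Numeric resources are summed across the moms.
totals = {name: ms[name] + sister[name] for name in ms}

print("resources_used.cput =", fmt_duration(totals["cput"]))
print("resources_used.vmem = %dgb" % totals["vmem_gb"])
print("resources_used.foo_f = %.2f" % totals["foo_f"])
print("resources_used.foo_i =", totals["foo_i"])
```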

The accounting_logs show the same values:
08/03/2016 18:28:13;E;102.corretja.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=job.scr2 queue=workq ctime=1470263288 qtime=1470263288 etime=1470263288 start=1470263288 exec_host=corretja/0+nadal/0 exec_vnode=(corretja:ncpus=1)+(nadal:ncpus=1) Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=free Resource_List.select=2:ncpus=1 session=16986 end=1470263293 Exit_status=143 resources_used.cpupercent=0 resources_used.cput=00:00:30 resources_used.vmem=19gb resources_used.foo_f=0.19 resources_used.foo_i=19 resources_used.foo_str={"corretja.pbspro.com":"nine","nadal.pbspro.com":"ten"} resources_used.mem=0kb resources_used.ncpus=2 resources_used.stra={"corretja.pbspro.com":"broccoli,tomatoes","nadal.pbspro.com":"carrots,onions"} resources_used.walltime=00:00:05 run_count=1

Now suppose the execjob_epilogue hook is changed to set resources_used values only from the MS mom:

# corretja:/home/bayucan/bugs/pbs_13914 # qmgr -c "e h epi application/x-python default"
import pbs
e=pbs.event()
pbs.logmsg(pbs.LOG_DEBUG, "executed epilogue hook")
if e.job.in_ms_mom():
    e.job.resources_used["vmem"] = pbs.size("9gb")
    e.job.resources_used["foo_i"] = 9
    e.job.resources_used["foo_f"] = 0.09
    e.job.resources_used["foo_str"] = "nine"
    e.job.resources_used["cput"] = 10
    e.job.resources_used["stra"] = '"broccoli,tomatoes"'

Submitting the job and then deleting it, to force execjob_epilogue hook execution, results in:

bayucan@corretja:~/bugs/pbs_13914> !qsub
qsub job.scr2
103.corretja.pbspro.com


bayucan@corretja:~/bugs/pbs_13914> qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
103.corretja job.scr2 bayucan 00:00:00 R workq


bayucan@corretja:~/bugs/pbs_13914> qdel 103


bayucan@corretja:~/bugs/pbs_13914> qstat -f -x 103
Job Id: 103.corretja.pbspro.com
Job_Name = job.scr2
Job_Owner = bayucan@corretja.pbspro.com
resources_used.cpupercent = 0
resources_used.cput = 00:00:10
resources_used.vmem = 9gb
resources_used.foo_f = 0.09
resources_used.foo_i = 9
resources_used.foo_str = {"corretja.pbspro.com":"nine",
"nadal.pbspro.com":None}
resources_used.mem = 0kb
resources_used.ncpus = 2
resources_used.stra = {"corretja.pbspro.com":"broccoli,tomatoes",
"nadal.pbspro.com":None}
resources_used.walltime = 00:00:06

NOTE: Since this is a multi-node job and the sister mom on nadal did not set the string or string_array values, None is reported for nadal in those resources.
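
A script that consumes these per-node string values should take into account that the bare None keyword is not valid JSON, so a strict JSON parser would reject the value shown above. One possible approach, offered as a sketch and not as a supported interface, is Python's ast.literal_eval, which accepts this particular value because it is also a valid Python dict literal. The function names parse_per_node and unreported_nodes are hypothetical, and values containing unusual quoting or escape sequences may not parse this way.

```python
import ast


def parse_per_node(value):
    """Parse a per-node resources_used string value into a dict."""
    # qstat -f wraps long values across lines; join them before parsing.
    joined = "".join(line.strip() for line in value.splitlines())
    return ast.literal_eval(joined)


def unreported_nodes(per_node):
    """Return the nodes whose mom never reported a value (shown as None)."""
    return sorted(node for node, val in per_node.items() if val is None)


# resources_used.foo_str for job 103, as wrapped by qstat -f.
value = '''{"corretja.pbspro.com":"nine",
"nadal.pbspro.com":None}'''

per_node = parse_per_node(value)
print(per_node)
print(unreported_nodes(per_node))
```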

Accounting logs show:
08/03/2016 18:36:14;E;103.corretja.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=job.scr2 queue=workq ctime=1470263768 qtime=1470263768 etime=1470263768 start=1470263768 exec_host=corretja/0+nadal/0 exec_vnode=(corretja:ncpus=1)+(nadal:ncpus=1) Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=free Resource_List.select=2:ncpus=1 session=17114 end=1470263774 Exit_status=143 resources_used.cpupercent=0 resources_used.cput=00:00:10 resources_used.vmem=9gb resources_used.foo_f=0.09 resources_used.foo_i=9 resources_used.foo_str={"corretja.pbspro.com":"nine","nadal.pbspro.com":None} resources_used.mem=0kb resources_used.ncpus=2 resources_used.stra={"corretja.pbspro.com":"broccoli,tomatoes","nadal.pbspro.com":None} resources_used.walltime=00:00:06 run_count=1
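
The accounting record above can also be read by a script. The following is a minimal sketch, assuming the record has the shape date;type;job_id;message with space-separated key=value pairs, and assuming no value contains whitespace (true for the record shown here, but a quoted string value with embedded spaces would need a more careful tokenizer). The function name parse_accounting_record is hypothetical, and the record in the code is a trimmed copy of the job 103 entry.

```python
def parse_accounting_record(line):
    """Split 'date;type;job_id;key=value ...' into its parts."""
    timestamp, rec_type, job_id, message = line.split(";", 3)
    fields = {}
    for token in message.split():
        # Split on the first '=' only; values such as exec_vnode
        # contain further '=' characters.
        key, _, val = token.partition("=")
        fields[key] = val
    return timestamp, rec_type, job_id, fields


# Trimmed copy of the job 103 record (most fields omitted for brevity).
record = (
    '08/03/2016 18:36:14;E;103.corretja.pbspro.com;user=bayucan '
    'exec_vnode=(corretja:ncpus=1)+(nadal:ncpus=1) '
    'resources_used.cput=00:00:10 resources_used.vmem=9gb '
    'resources_used.foo_i=9 '
    'resources_used.foo_str={"corretja.pbspro.com":"nine","nadal.pbspro.com":None} '
    'run_count=1'
)

_, rec_type, job_id, fields = parse_accounting_record(record)

# Keep only the resources_used.* entries, keyed by resource name.
used = {key.split(".", 1)[1]: val for key, val in fields.items()
        if key.startswith("resources_used.")}

print(job_id, rec_type)
for name, val in sorted(used.items()):
    print(name, "=", val)
```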



 

 

 
