Objective

This design enhances PBS reporting of resources_used values. In particular, MoM accumulates resources_used values that are set in a hook, whether for a builtin resource or a custom resource.

Interface 1: For multi-node jobs, report accumulated resources_used values in accounting logs/qstat -f output, for those resources set in a hook.

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: Display accumulated resources_used values in accounting logs and qstat -f output, for resources that are set in an execjob_prologue, execjob_epilogue, or exechost_periodic hook.
  • Details:
    • Resources_used resources 'cput', 'mem', 'cpupercent' will continue to be aggregated and reported as before.
    • The additional resources that can be accumulated are those that are set in a hook. Each can be either a builtin resource (e.g. vmem) or a custom resource.

      • Builtin resource: If a builtin resource is set in a hook, MoM automatically stops any polling it was doing for that resource's value. The hook then becomes responsible for updating the value.

      • Custom resource: For a custom resource to be set in a hook, the resource must already have been added to PBS in one of 2 ways:

        1. Via qmgr:

          # qmgr -c "create resource <res_name> type=<res_type>,flag=h"

        2. Via a mom exechost_startup hook as follows: 

          # qmgr -c "create hook start event=exechost_startup"
          # qmgr -c "import hook start application/x-python default start.py" 
          # qmgr -c "export hook start application/x-python default"
          import pbs
          e=pbs.event()
          localnode=pbs.get_local_nodename()

          e.vnode_list[localnode].resources_available['foo_i'] = 7
          e.vnode_list[localnode].resources_available['foo_f'] = 5.0
          e.vnode_list[localnode].resources_available['foo_str'] = "seventyseven"
          e.vnode_list[localnode].resources_available['stra'] = "pears"

    • Aggregation of values: The resource value collected in the mother superior (MS) mom is aggregated with each of the values obtained from the sister moms whose nodes are part of the job.

    • For resources of type float, long, and size, the value will be reported in accounting logs and qstat -f as:

                          resources_used.<resource_name> = <summed total>      

      If a sister node does not report back the resources_used value for the resource, then the last known value from that node will be used.

    • For resources of type string and string_array, the value is aggregated into a JSON-style format, as follows:

                           resources_used.<resource_name> = {"<node1>": "<str_val>", "<node2>": "<str_val>", ...}

                           NOTE: The quotes are included to disambiguate embedded spaces, commas and brackets.

      If one or more moms did not report on that resource, the last known value sent by each such mom will be used. If a mom has not reported a value at all, then the keyword 'None' (unquoted) will be reported as its <str_val>:

                           resources_used.<resource_name> = {"<node1>": "<str_val>", "<node2>":None, ...}
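The two aggregation rules above can be sketched in plain Python. This is only an illustration of the described behavior, not MoM's actual implementation: the function names, the `reports` and `last_known` dictionaries, and the choice to let a node that never reported a numeric value contribute nothing to the sum are assumptions made for this sketch.

```python
import json


def aggregate_numeric(nodes, reports, last_known):
    """Sum a float/long/size resource across the job's nodes.

    reports:    node -> value received in the current update (hypothetical)
    last_known: node -> value received in an earlier update (hypothetical)
    """
    total = 0
    for node in nodes:
        if node in reports:
            total += reports[node]
        elif node in last_known:
            # The node did not report this time: reuse its last known value.
            total += last_known[node]
        # Assumption: a node that never reported adds nothing to the total.
    return total


def aggregate_string(nodes, reports, last_known):
    """Build the JSON-style value for a string or string_array resource."""
    parts = []
    for node in nodes:
        value = reports.get(node, last_known.get(node))
        # A node with no value at all is shown as the unquoted keyword None.
        rendered = "None" if value is None else json.dumps(value)
        parts.append(json.dumps(node) + ":" + rendered)
    return "{" + ",".join(parts) + "}"


nodes = ["corretja.pbspro.com", "nadal.pbspro.com"]
foo_i = aggregate_numeric(nodes, {"corretja.pbspro.com": 9, "nadal.pbspro.com": 10}, {})
foo_str = aggregate_string(nodes, {"corretja.pbspro.com": "nine"}, {})
print(foo_i)
print(foo_str)
```

With both nodes reporting, the numeric values are summed; with only the first node reporting a string, the second node's entry is rendered as `None`.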

  1. Examples:

    Consider an epilogue hook that runs on all the mom nodes and sets different resources_used values depending on whether it executes on the MS mom or a sister mom:


     #: qmgr -c "list hook epi"

    Hook epi
    type = site
    enabled = true
    event = execjob_epilogue
    user = pbsadmin
    alarm = 30
    order = 1
    debug = false
    fail_action = none

    qmgr -c "e h epi application/x-python default"
    import pbs
    e=pbs.event()
    pbs.logmsg(pbs.LOG_DEBUG, "executed epilogue hook")
    if e.job.in_ms_mom(): #set in MS mom
        e.job.resources_used["vmem"] = pbs.size("9gb")
        e.job.resources_used["foo_i"] = 9
        e.job.resources_used["foo_f"] = 0.09
        e.job.resources_used["foo_str"] = "nine"
        e.job.resources_used["cput"] = 10
        e.job.resources_used["stra"] = '"broccoli,tomatoes"'
    else: # set in sister mom
        e.job.resources_used["vmem"] = pbs.size("10gb")
        e.job.resources_used["foo_i"] = 10
        e.job.resources_used["foo_f"] = 0.10
        e.job.resources_used["foo_str"] = "ten"
        e.job.resources_used["cput"] = 20
        e.job.resources_used["stra"] = '"carrots,onions"'

    Now with 2 nodes: corretja (server/MS), and nadal.

    Submit the following job:

    % cat job.scr2
    #PBS -l select=2:ncpus=1
    pbsdsh -n 1 hostname
    sleep 300
    bayucan@corretja:~/bugs/pbs_13914> qsub job.scr2
    102.corretja.pbspro.com

    When the job completes, the following resources_used values are shown:

    resources_used.cpupercent = 0
    resources_used.cput = 00:00:30
    resources_used.vmem = 19gb
    resources_used.foo_f = 0.19
    resources_used.foo_i = 19
    resources_used.foo_str = {"corretja.pbspro.com":"nine",
    "nadal.pbspro.com":"ten"}
    resources_used.mem = 0kb
    resources_used.ncpus = 2
    resources_used.stra = {"corretja.pbspro.com":"broccoli,tomatoes",
    "nadal.pbspro.com":"carrots,onions"}
    resources_used.walltime = 00:00:05


    NOTE: The cput, vmem, foo_f, foo_i, foo_str, and stra values are accumulated from the MS value and the sister value.

    The accounting logs show the same values:
    08/03/2016 18:28:13;E;102.corretja.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=job.scr2 queue=workq ctime=1470263288 qtime=1470263288 etime=1470263288 start=1470263288 exec_host=corretja/0+nadal/0 exec_vnode=(corretja:ncpus=1)+(nadal:ncpus=1) Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=free Resource_List.select=2:ncpus=1 session=16986 end=1470263293 Exit_status=143 resources_used.cpupercent=0 resources_used.cput=00:00:30 resources_used.vmem=19gb resources_used.foo_f=0.19 resources_used.foo_i=19 resources_used.foo_str={"corretja.pbspro.com":"nine","nadal.pbspro.com":"ten"} resources_used.mem=0kb resources_used.ncpus=2 resources_used.stra={"corretja.pbspro.com":"broccoli,tomatoes","nadal.pbspro.com":"carrots,onions"} resources_used.walltime=00:00:05 run_count=1

    Now suppose that the execjob_epilogue hook is changed to set resources_used values only from the MS mom:

    # corretja:/home/bayucan/bugs/pbs_13914 # qmgr -c "e h epi application/x-python default"
    import pbs
    e=pbs.event()
    pbs.logmsg(pbs.LOG_DEBUG, "executed epilogue hook")
    if e.job.in_ms_mom():
        e.job.resources_used["vmem"] = pbs.size("9gb")
        e.job.resources_used["foo_i"] = 9
        e.job.resources_used["foo_f"] = 0.09
        e.job.resources_used["foo_str"] = "nine"
        e.job.resources_used["cput"] = 10
        e.job.resources_used["stra"] = '"broccoli,tomatoes"'

    Submitting a job and then deleting it, to force execjob_epilogue hook execution, results in:

    bayucan@corretja:~/bugs/pbs_13914> !qsub
    qsub job.scr2
    103.corretja.pbspro.com
    bayucan@corretja:~/bugs/pbs_13914> qstat
    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    103.corretja     job.scr2         bayucan          00:00:00 R workq
    bayucan@corretja:~/bugs/pbs_13914> qdel 103
    bayucan@corretja:~/bugs/pbs_13914> qstat -f -x 103
    Job Id: 103.corretja.pbspro.com
    Job_Name = job.scr2
    Job_Owner = bayucan@corretja.pbspro.com
    resources_used.cpupercent = 0
    resources_used.cput = 00:00:10
    resources_used.vmem = 9gb
    resources_used.foo_f = 0.09
    resources_used.foo_i = 9
    resources_used.foo_str = {"corretja.pbspro.com":"nine",
    "nadal.pbspro.com":None}
    resources_used.mem = 0kb
    resources_used.ncpus = 2
    resources_used.stra = {"corretja.pbspro.com":"broccoli,tomatoes",
    "nadal.pbspro.com":None}
    resources_used.walltime = 00:00:06

    NOTE: Because this is a multi-node job and the sister mom on nadal did not update the string and string_array resources, 'None' is reported as nadal's value for those resources.

    Accounting logs show:
    08/03/2016 18:36:14;E;103.corretja.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=job.scr2 queue=workq ctime=1470263768 qtime=1470263768 etime=1470263768 start=1470263768 exec_host=corretja/0+nadal/0 exec_vnode=(corretja:ncpus=1)+(nadal:ncpus=1) Resource_List.ncpus=2 Resource_List.nodect=2 Resource_List.place=free Resource_List.select=2:ncpus=1 session=17114 end=1470263774 Exit_status=143 resources_used.cpupercent=0 resources_used.cput=00:00:10 resources_used.vmem=9gb resources_used.foo_f=0.09 resources_used.foo_i=9 resources_used.foo_str={"corretja.pbspro.com":"nine","nadal.pbspro.com":None} resources_used.mem=0kb resources_used.ncpus=2 resources_used.stra={"corretja.pbspro.com":"broccoli,tomatoes","nadal.pbspro.com":None} resources_used.walltime=00:00:06 run_count=1
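Because the aggregated string values use an unquoted `None`, they are not strict JSON, which matters to any tool that reads these records. The following is a minimal sketch of one way a site script might pull the resources_used values out of an accounting record. The function name is hypothetical, the record is abbreviated from the example above, and the sketch assumes that aggregated values contain no `}` character and that their quoting is compatible with Python literal syntax.

```python
import ast
import re

# key=value pairs; a value is either a {...} group or a run of non-space characters.
_PAIR = re.compile(r'(\S+?)=(\{.*?\}|\S*)')


def parse_resources_used(record):
    """Return (job_id, {resource_name: value}) for one accounting record."""
    # A record has the form "<timestamp>;<record type>;<job id>;<message>".
    _timestamp, _rec_type, job_id, message = record.split(";", 3)
    used = {}
    for key, value in _PAIR.findall(message):
        if not key.startswith("resources_used."):
            continue
        name = key[len("resources_used."):]
        if value.startswith("{"):
            # {"node":"val","node2":None} is a valid Python dict literal,
            # so literal_eval handles the unquoted None directly.
            value = ast.literal_eval(value)
        used[name] = value
    return job_id, used


# Abbreviated from the record for job 103 shown above.
record = (
    "08/03/2016 18:36:14;E;103.corretja.pbspro.com;"
    "user=bayucan group=users "
    "exec_vnode=(corretja:ncpus=1)+(nadal:ncpus=1) "
    "resources_used.cput=00:00:10 resources_used.vmem=9gb "
    'resources_used.foo_str={"corretja.pbspro.com":"nine","nadal.pbspro.com":None} '
    "resources_used.walltime=00:00:06 run_count=1"
)

job_id, used = parse_resources_used(record)
print(job_id)
print(used["foo_str"])
```

Numeric and duration values are left as strings here; converting them (for example, "9gb" to a byte count) is left to the caller.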



 

 

 
