Objective

...

  • Visibility: Public
  • Change Control: Stable
  • Python type: pbs.select
  • Input Format: select_reliable_startup="[N:][chunk specification][+[N:]chunk specification]" (same format as a select spec)
    where N is the number of chunks requested, and a chunk specification is of the form:
    "<resource name>=<value>[:<resource name>=<value>...]"
    select_reliable_startup can be set in one of the following ways:
    • qsub -l select_reliable_startup=[N:][chunk specification][+[N:]chunk specification]
    • qalter -l select_reliable_startup=[N:][chunk specification][+[N:]chunk specification]
    • As a PBS directive in a job script:
                   #PBS -l select_reliable_startup=[N:][chunk specification][+[N:]chunk specification]
    • Within a Python hook script, use the pbs.select() type.
                 pbs.event().job.Resource_List["select_reliable_startup"] = pbs.select("[N:][chunk specification][+[N:]chunk specification]")
  • Privilege: only root, PBS admin, or PBS operator can set the select_reliable_startup value
  • Details:
    This is a builtin resource that is used to cause a job to be started reliably. select_reliable_startup must be a mirror of the original 'select' request, but with extra chunks added to cause assignment of extra nodes. When select_reliable_startup is set, the original 'select' value, along with any default resources missing from it, is first saved to a new resource, 'select_requested' (see interface 6), and the 'select_reliable_startup' value is consulted instead of the 'select' value, causing the scheduler to allocate the extra nodes to the job. The adjusted assignments are reflected in the Resource_List, exec_vnode, and exec_host values, while 'select_requested' preserves what the user originally requested. When 'select_reliable_startup' is set, node failures are tolerated throughout the life of the job; that is, the job is allowed to run even after the primary (mother superior) mom has detected bad nodes.
  • Normal case:
    • The select_reliable_startup value is a mirror of the original select value with the 'N' part of a chunk specification (i.e. [N:][chunk specification][+[N:]chunk specification]) changed to reflect additional chunks.
      • Interface 7 below introduces a helper method for adding more instances of chunks (pbs.select.increment_chunks()).
    • Example: given qsub -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb", where there is 1 chunk of the "ncpus=3:mem=1gb" spec, 1 chunk of the "ncpus=2:mem=2gb" spec, and 1 chunk of the "ncpus=1:mem=3gb" spec, a select_reliable_startup value can bump up the second chunk to have 2 of the "ncpus=2:mem=2gb" spec and the third chunk to have 2 of the "ncpus=1:mem=3gb" spec as follows (a minimal hook sketch follows this interface's description):
      select_reliable_startup=ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
  • Log/Error messages: If 'select_reliable_startup' is set by a regular, non-privileged user, qsub and qalter report the following error:
                   "Only PBS managers and operators are allowed to set 'select_reliable_startup'"

Interface 2: New server accounting record: 's' for the start record of a job that was submitted with the select_reliable_startup request

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: When a job was reliably started (i.e. it had a select_reliable_startup value), this new accounting record reflects the adjusted values of 'select', 'exec_vnode', 'exec_host', and Resource_List, along with the values of select_reliable_startup (interface 1) and select_requested (interface 6).
  • Note: This is a new accounting record; the start-of-job record ('S') remains as before.
  • Example:

    04/07/2016 17:08:09;s;20.borg.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=jobr.scr queue=workq ctime=1460063202 qtime=1460063202 etime=1460063202 start=1460063203 exec_host=borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0 exec_vnode=(borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb) Resource_List.mem=11gb Resource_List.ncpus=9 Resource_List.nodect=5 Resource_List.place=scatter:excl Resource_List.select=1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb Resource_List.select_requested=1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb session=0 run_count=1

...

  • Visibility: Public
  • Change Control: Stable
  • Python Type: dict (dictionary of pbs.vnode objects keyed by vnode name)
  • Details:
    This is a new event parameter for the execjob_prologue and execjob_launch hooks. It contains the list of vnodes and their assigned resources that are managed by unhealthy moms, including vnodes from sister moms that failed to join the job. This dictionary object is keyed by vnode name; one can walk through this list and offline the vnodes. For example:

    import pbs

    e = pbs.event()
    for vn in e.vnode_list_fail.keys():
        v = e.vnode_list_fail[vn]
        pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
        v.state = pbs.ND_OFFLINE

  • Log/Error messages:
    1. If an execjob_prologue or execjob_launch hook requested to offline a vnode, server_logs would show the following message under the PBSEVENT_DEBUG2 log level:

      ";Server@borg;Node;<node name>;Updated vnode <node name>'s attribute state=offline per mom hook request"

Interface 6: select_requested job attribute

...

  • Visibility: Public
  • Change Control: Stable
  • Python Type: pbs.select
  • Privilege: This is a read-only job attribute that cannot be set by a client or a hook.
  • Details:
    This is a new builtin job attribute that the server uses to save the job's original 'select' specification in a complete way, including any default resources missing from the original chunks as well as the chunk counts.
  • Example:

Given: qsub -l select=3:ncpus=1+mem=5gb+ncpus=2:mem=2gb

select_requested would return the following (notice the default ncpus=1 added to the second chunk, and the default chunk count of 1 added to the second and third chunks, where none was specified):

    3:ncpus=1+1:mem=5gb:ncpus=1+1:ncpus=2:mem=2gb
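
As a hedged usage sketch (assuming the attribute is exposed to hooks as pbs.event().job.select_requested, as the release_nodes() interface below suggests), an execjob_launch hook could log the saved original request:

    import pbs

    e = pbs.event()
    # 'select_requested' holds the completed form of the user's
    # original select request (sketch only).
    if e.job.select_requested is not None:
        pbs.logmsg(pbs.LOG_DEBUG,
                   "original select: %s" % (e.job.select_requested,))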

...

  • Visibility: Public
  • Change Control: Stable
  • Return Python Type: pbs.select
  • Details:
    This is a new method of the pbs.select type that adds 'increment' more instances of each chunk in the chunk specification. So given a select spec of "[N:][chunk specification][+[N:]chunk specification]", this function returns "[N+increment:][chunk specification][+[N+increment:]chunk specification]". A missing 'N' value means 1. By default, first_chunk=False, meaning no increment is added to the first chunk in the spec.
  • Example: Given pbs.event().job.Resource_List["select"]=ncpus=2:mem=2gb+ncpus=2:mem=2gb+2:ncpus=1:mem=1gb

                new_select = pbs.event().job.Resource_List["select"].increment_chunks(2)         ← first_chunk=False by default

    where new_select is now: ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb

    Otherwise, if first_chunk=True, the resulting new select also includes the 2 additional increments in the first chunk:

                new_select: 3:ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb
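
Putting interfaces 1 and 7 together, a minimal sketch of a queuejob hook that derives select_reliable_startup from the incoming select (the increment of 1 is illustrative):

    import pbs

    e = pbs.event()
    j = e.job
    sel = j.Resource_List["select"]
    if sel is not None:
        # Add one extra instance of every chunk except the first
        # (first_chunk defaults to False).
        j.Resource_List["select_reliable_startup"] = sel.increment_chunks(1)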

...

  • Visibility: Public
  • Change Control: Stable
  • Return Python type: a dict of released pbs.vnode objects (call syntax 1) or the modified PBS job object (call syntax 2); None if an error was encountered
  • Restriction: This is currently callable only from the mom hooks execjob_launch and execjob_prologue, and makes sense only when executed from the hook run by the primary mom. It is advisable to put this call inside an 'if pbs.event().job.in_ms_mom()' clause.
  • Call syntax 1: pbs.event().job.release_nodes(nodes_list)
    • Input: nodes_list - dictionary of pbs.vnode objects, keyed by vnode name, of those nodes to release from the job
    • Detail: Release the given nodes in 'nodes_list' that have been assigned to the job. This is a hook front-end to the pbs_release_nodes command (PP-339). It returns a dictionary of pbs.vnode objects representing the nodes that have been released. If an error is encountered performing the action, this method returns None.
  • Call syntax 2 (pruning option): pbs.event().job.release_nodes(keep_select)
    • Input:
      • keep_select - a pbs.select string that should be a subset of the job's original select request, mapping to a set of nodes that should be kept. The mother superior (primary) mom will remove node resources from nodes that have been detected to be bad, using nodes that have been seen as healthy and functional as replacements when necessary. A common value to pass is pbs.event().job.select_requested (interface 6).
    • Detail: Release nodes that are assigned to a job in such a way that the job still satisfies the given 'keep_select' specification, with no nodes that are known to be bad.
    • With a successful execution of the release_nodes() call, the accounting records introduced by the node ramp down feature (PP-339) are generated; from execjob_prologue and execjob_launch hooks, the 's' accounting record (interface 2) is generated; and the primary mom notifies the sister moms to also update their internal nodes table, so future use of the task manager API (e.g. tm_spawn, pbsdsh) will be aware of the change.
    • Returns: the modified PBS job object reflecting the new values of the attributes 'exec_host', 'exec_host2', 'exec_vnode', and 'schedselect'; None if an error was encountered.
  • Examples:

    Given an execjob_prologue hook, a hook writer can release a set of nodes from a job by doing:


    import pbs

    e = pbs.event()
    j = e.job
    if j.in_ms_mom():
        j.release_nodes(e.vnode_list_fail)

    Given an execjob_launch hook, a hook writer can specify that nodes should be released in such a way that the job still satisfies the user's original select request:

    import pbs

    e = pbs.event()
    j = e.job
    if j.in_ms_mom():
        rel_nodes = j.release_nodes(keep_select=j.select_requested)
        if rel_nodes is None:   # error occurred
            j.rerun()           # requeue the job
            e.reject("Failed to prune job")

  • Log/Error messages:
    • When a job's assigned nodes get pruned (nodes released to satisfy 'keep_select'), mom_logs will show the following info under the PBSEVENT_JOB log level:

      ";Job;<jobid>;pruned from exec_vnode=<original value>"
      ";Job;<jobid>;pruned to exec_vnode=<new value>"

    • When mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown under the PBSEVENT_DEBUG log level unless otherwise noted:

      1. "could not satisfy 1st select chunk (<resc1>=<val1> <resc2>=<val2>... <rescN>=valN) with first available chunk (<resc1>=<val1> <resc2>=<val2>...<rescN>=<valN>"  when first chunk from the keep_select spec could not be satisfied

      2. "could not satisfy the original select chunk (<resc1>=<val1> <resc2>=<val2> ...<rescN>=valN) with first available chunk <resc1>=<val1> <resc2>=<val2>..."  when a secondary (sister) chunk from the keep_select spec could not be satisfied

      3. "job node_list_fail: node <node_name1>" which shows what mom is consulting as the bad_nodes list. (consulted by mom in release_nodes() call).

      4. "job node_list:_good node <node_name1>" which shows what mom is consulting as the good_nodes_list (consulted by mom in release_nodes() call).

    • When a sister mom has updated its internal nodes info, mom_logs on the sister host would show the following message under the PBSEVENT_JOB log level:

          ";pbs_mom;Job;<jobid>;updated nodes info"

    • When mother superior notices that not all acks were received from the sister moms regarding the update of their internal nodes data, mom_logs would show the following message under the PBSEVENT_DEBUG2 log level: "NOT all job updates to sister moms completed." Eventually, all the nodes automatically complete updating their info.
    • If a sister mom receives a TM request but its nodes data has not been updated yet, mom_logs on the sister host will show, under the PBSEVENT_JOB log level:
      "job reliably started but missing updated nodes list. Try spawn() again later."
      and the client would get an "error on spawn" message while doing tm_spawn.

    • Calling release_nodes() from a hook that is not an execjob_prologue or execjob_launch hook returns None, as this is currently not supported.
  • Examples:

Given a queuejob hook that sets select_reliable_startup to allow another node to be added to the second and third chunks of the spec:

...

Resource_List.select = 1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.select_requested = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

Suppose federer and sampras went down; then just before the job runs its program, the execjob_launch hook executes and prunes the job's node assignment back to the original select request, and the job detail now shows:

...