Objective

Provide the ability to pad a job's resource nodes request (i.e. request additional chunks of resources for the job), so that if some nodes fail, the job can still start. Any leftover nodes not needed by the job can be released back to the server.

Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649

...


Interface 1: New job attribute 'tolerate_node_failures'

  • Visibility: Public
  • Change Control: Stable
  • Value:  'true' or 'false' (default is 'false')
  • Python type: bool
  • Synopsis:  
    • When set to 'true', any node failures are tolerated throughout the life of the job. That is, the job is allowed to run even after the primary (mother superior) mom has detected bad nodes.
    • It can be set via qsub, qalter, or in a Python hook, say a queuejob hook. If set via qalter while the job is already running, the new value is consulted the next time the job is rerun.
    • This can also be specified in the server attribute 'default_qsub_arguments' to allow all jobs to be submitted with the tolerate_node_failures attribute set.
    • This option is best used when the job is assigned extra chunks of nodes using the pbs.select.increment_chunks() method (interface 7).
  • Privilege: user, admin, or operator can set it
  • Examples:
    • Via qsub:
      • qsub -W tolerate_node_failures=true <job_script>
    • Via qalter:
      • qalter -W tolerate_node_failures=false <job_id>
    • Via a hook:
      •  # cat qjob.py
        import pbs
        e=pbs.event()
        e.job.tolerate_node_failures = True
        # qmgr -c "create hook qjob event=queuejob"
        # qmgr -c "import hook application/x-python default qjob.py"
        % qsub job.scr
        23.borg
        % qstat -f 23
        ...
        tolerate_node_failures = True
  • Log/Error messages:
    • When a job has the tolerate_node_failures attribute set to 'true', and a sister mom fails to join the job due to a communication error or an execjob_begin hook rejection, fails to set up the job (e.g. cpuset creation failed), or rejects an execjob_prologue hook, or the primary mom fails to poll a sister mom for status or encounters any communication error with a sister mom, the following mom_logs messages will be shown:
      • DEBUG: "ignoring error as job is tolerant of node failures"
      • DEBUG3: "ignoring POLL error from failed mom <mom_host> as job is tolerant of node failures"
      • DEBUG3: "ignoring lost communication with <mom_host> as job is tolerant of node failures"

...

  • Visibility: Public
  • Change Control: Stable
  • Return Python Type: pbs.select
  • Details:
    This is a new method on the pbs.select type that adds 'increment' to the chunk count of each chunk in the chunk specification. So given a select spec of "[N:]<chunk specification>[+[N:]<chunk specification>...]", this function returns "[N+increment:]<chunk specification>[+[N+increment:]<chunk specification>...]". A missing 'N' value means 1. By default, first_chunk=False, meaning no increment is added to the first chunk in the spec. Example:

           Given pbs.event().job.Resource_List["select"] = "ncpus=2:mem=2gb+ncpus=2:mem=2gb+2:ncpus=1:mem=1gb"

                new_select = pbs.event().job.Resource_List["select"].increment_chunks(2)         ← first_chunk=False by default

           where new_select is now: "ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb"

           Otherwise, if first_chunk=True, the resulting new select also applies the 2 additional increments to the first chunk:

                new_select: "3:ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb"
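           As a usage sketch (hypothetical hook code: the attribute receiving the padded spec is an assumption here, and the elided interfaces may instead use a dedicated attribute such as 'select_reliable_startup'), a queuejob hook could combine interface 1 with this method so submitted jobs get spare chunks:

               import pbs

               e = pbs.event()
               j = e.job
               sel = j.Resource_List["select"]
               if sel is not None:
                   # Tolerate node failures (interface 1) and pad every chunk
                   # after the first with 2 spare chunks (first_chunk=False is
                   # the increment_chunks() default).
                   j.tolerate_node_failures = True
                   j.Resource_List["select"] = sel.increment_chunks(2)   # assumption: padded spec is assigned back to the select request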

...

  • Visibility: Public
  • Change Control: Stable
  • Return Python type: PBS job object (i.e. the modified PBS job object)
  • Restriction: This is currently callable only from the mom hooks execjob_launch and execjob_prologue, and it makes sense only when executed from the hook run by the primary mom. It is advisable to put this call inside an 'if pbs.event().job.in_ms_mom()' clause.
  • Call syntax 1: pbs.event().job.release_nodes(nodes_list)
    • Input: nodes_list - a dictionary of pbs.vnode objects, keyed by vnode name, of the nodes to release from the job
    • Detail: Release the given nodes in 'nodes_list' that have been assigned to the job. This is a hook front-end to the pbs_release_nodes command (PP-339). It returns the modified PBS job object on success; if an error is encountered performing the action, this method returns None.
  • Call syntax 2 (pruning option): pbs.event().job.release_nodes(keep_select)
    • Input:
      • keep_select - a pbs.select string that should be a subset of the job's original select request, mapping to the set of nodes that should be kept. The mother superior mom will remove node resources from nodes that have been detected to be bad (those in pbs.event().vnode_list_fail), using nodes that have been seen as healthy and functional as replacements when necessary. A common value to pass is pbs.event().job.select_requested (interface 6).
    • Detail: Release nodes assigned to the job in such a way that the remaining assignment still satisfies the given 'keep_select' specification, with no nodes that are known to be bad. On a successful release_nodes() call from an execjob_prologue or execjob_launch hook, the 's' accounting record (interface 2) is generated, and the primary mom notifies the sister moms to update their internal nodes tables, so future use of the task manager API (e.g. tm_spawn, pbsdsh) will be aware of the change.
    • Returns: the modified PBS job object reflecting the new values of the attributes 'exec_host', 'exec_host2', 'exec_vnode', and 'schedselect'; None on error.
  • Examples:

    Given an execjob_prologue hook, a hook writer can release the vnodes managed by moms that the primary mom has seen as unhealthy:

        e = pbs.event()
        j = e.job
        if j.in_ms_mom():
            j.release_nodes(nodes_list=e.vnode_list_fail)

    Nodes to release can also be named explicitly:

        e = pbs.event()
        # dictionary keyed by vnode name; only the key names are significant
        rel_vnode_list = {"federer": None, "murray": None}
        # command-line equivalent: pbs_release_nodes federer murray
        pj = e.job.release_nodes(nodes_list=rel_vnode_list)
        if pj is not None:
            # inspect the updated 'exec_vnode', 'exec_host', 'exec_host2' values
            pbs.logmsg(pbs.LOG_DEBUG, "pj.exec_vnode=%s" % (pj.exec_vnode,))

    Given an execjob_launch hook, a hook writer can specify that nodes be released in such a way that the result still satisfies the user's original select request:

        e = pbs.event()
        j = e.job
        if j.in_ms_mom():
            rel_nodes = j.release_nodes(keep_select=j.select_requested)
            if rel_nodes is None:   # error occurred: hold the job, requeue it, and reject the event
                j.Hold_Types = pbs.hold_types("s")
                j.rerun()
                e.reject("unsuccessful at LAUNCH")

    A pruning call can also pass an explicit select string. Given an execjob_prologue hook:

        pj = e.job.release_nodes(keep_select="ncpus=2:mem=2gb+ncpus=2:mem=2gb+ncpus=1:mem=1gb")
        if pj is not None:
            pbs.logmsg(pbs.LOG_DEBUG, "pj.exec_vnode=%s" % (pj.exec_vnode,))
        else:   # a None job object was returned: hold the job, requeue it, and reject the hook event
            e.job.Hold_Types = pbs.hold_types("s")
            e.job.rerun()
            e.reject("unsuccessful at LAUNCH")



  • Log/Error messages:
    • When a job's assigned nodes get pruned (nodes released to satisfy 'keep_select'), mom_logs will show the following info under the PBSEVENT_JOB log level:

      ";Job;<jobid>;pruned from exec_vnode=<original value>"
      ";Job;<jobid>;pruned to exec_vnode=<new value>"
    • When a multinode job's assigned resources have been modified, the primary mom will wait up to 5 seconds for acknowledgements from the sister moms that they have updated their nodes tables. This DEBUG2-level mom_logs message will be shown:

      "waiting up to 5 secs for job update acks from sister moms"

      When not all acknowledgements were received by the primary mom during that 5-second wait, there will be this additional DEBUG2-level mom_logs message:

      "not all job updates to sister moms completed"

      Eventually, all node updates will complete in the background.

    • When the mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown under the PBSEVENT_DEBUG log level unless otherwise noted:
    1. "could not satisfy 1st select chunk (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>) with first available chunk (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>)" when the first chunk from the keep_select spec could not be satisfied

    2. "could not satisfy the original select chunk (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>) with first available chunk <resc1>=<val1> <resc2>=<val2> ..." when a secondary (sister) chunk from the keep_select spec could not be satisfied

    3. "job node_list_fail: node <node_name1>" which shows the bad-nodes list that mom consults in the release_nodes() call

    4. "job node_list_good: node <node_name1>" which shows the good-nodes list that mom consults in the release_nodes() call

  • When a sister mom has updated its internal nodes info, mom_logs on the sister host will show this message at PBSEVENT_JOB level:

        ";pbs_mom;Job;<jobid>;updated nodes info"

  • If a sister mom receives a TM request before its nodes data has been updated, the client will get an "error on spawn" message when doing tm_spawn.

  • Calling release_nodes() from a hook other than an execjob_prologue or execjob_launch hook returns None, as this is currently not supported.
  • Examples:

Given a queuejob hook that sets select_reliable_startup to allow another node to be added to the second and third chunks of the spec:

...