Objective

Provide the ability to pad a job's nodes resource request (i.e. request additional chunks of resources for the job), so that if some nodes fail, the job can still start. Any leftover nodes not needed by the job can be released back to the server.

Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649

Interface 1: New job attribute 'tolerate_node_failures'

  • Visibility: Public
  • Change Control: Stable
  • Value: 'all', 'job_start', or 'none'
  • Python type: str
  • Synopsis:  
    • When set to 'all', this means to tolerate all node failures resulting from communication problems (e.g. polling) between the primary mom and the sister moms assigned to the job, as well as failures due to rejections from execjob_begin or execjob_prologue hook executions by remote moms.
    • When set to 'job_start', this means to tolerate only node failures that occur during job start, such as an assigned sister mom failing to join the job or communication errors between the primary mom and sister moms, up to just before the job executes the execjob_launch hook and/or the top-level shell or executable.
    • When set to 'none', or if the attribute is unset, no node failures are tolerated (default behavior).

    • It can be set via qsub, qalter, or in a Python hook, say a queuejob hook. If set via qalter and the job is already running, the new value is consulted the next time the job is rerun.
    • This can also be specified in the server attribute 'default_qsub_arguments' so that all jobs are submitted with the tolerate_node_failures attribute set.
    • This option is best used when the job is assigned extra nodes using the pbs.event().job.select.increment_chunks() method (interface 7).
    • The 'tolerate_node_failures' job option is currently not supported on Cray systems. If specified, a Cray primary mom will ignore the setting.
  • Privilege: user, admin, or operator can set it
  • Examples:
    • Via qsub:

...

                            qalter -W tolerate_node_failures="job_start" <jobid>

    • Via a hook:

                            # cat qjob.py
                            import pbs
                            e=pbs.event()
                            e.job.tolerate_node_failures = "all"
                            # qmgr -c "create hook qjob event=queuejob"
                            # qmgr -c "import hook application/x-python default qjob.py"
                            % qsub job.scr
                            23.borg
                            % qstat -f 23
                              ...
                              tolerate_node_failures = all

  • Log/Error messages:
    • When a job has the tolerate_node_failures attribute set to 'all' or 'job_start', the following mom_logs messages will appear under these conditions: when a sister mom fails to join the job due to either a communication error or an execjob_begin hook rejection, when a sister mom fails to set up the job (e.g. cpuset creation failure), when a sister mom rejects an execjob_prologue hook, when the primary mom fails to poll a sister mom for status, or on any communication error to a sister mom:
      • DEBUG level: "ignoring from <node_host> error as job is tolerant of node failures"

...

  • Visibility: Public
  • Change Control: Stable
  • Python Type: dict (dictionary of pbs.vnode objects keyed by vnode name)
  • Details:
    This is a new event parameter for the execjob_prologue and execjob_launch hooks. It will contain the list of vnodes and their assigned resources that are managed by unhealthy moms. This can include vnodes from sister moms that failed to join the job, that rejected an execjob_begin or execjob_prologue hook request, or that encountered a communication error while the primary mom was polling the sister mom host. This dictionary object is keyed by vnode name. One can walk through this list and start offlining the vnodes, for example:

    for vn in e.vnode_list_fail:
        v = e.vnode_list_fail[vn]
        pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
        v.state = pbs.ND_OFFLINE

  • Additional Details:
    • Any sister nodes that are able to join the job will be considered as healthy.
    • The success of a join job request may be the result of a check made by a remote execjob_begin hook. After successfully joining the job, the node may further check its status via a remote execjob_prologue hook. A rejection by the remote prologue hook causes the primary mom to treat the sister node as a problem node and mark it as unhealthy. Unhealthy nodes are not selected when pruning a job's request via the pbs.release_nodes(keep_select) call (see interface 8 below).
    • If there's an execjob_prologue hook in place, the primary mom tracks the node hosts that have given an IM_ALL_OKAY acknowledgement for their execution of the execjob_prologue hook. Then, after some 'job_launch_delay' amount of time into job startup (interface 4), the primary mom starts reporting as failed those nodes that have not given a positive acknowledgement during prologue hook execution. This info is communicated to the child mom running on behalf of the job, so that vnodes from the failed hosts are not used when pruning the job (i.e. the pbs.release_nodes(keep_select=X) call). A minimal prologue hook sketch tying these pieces together appears after this list.
    • If, after some time, a node's host comes back with an acknowledgement of successful prologue hook execution, the primary mom adds the host back to the healthy list.
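
    For illustration only, a minimal execjob_prologue hook built on these details might look like the sketch below. The keep_select value is a placeholder; a real hook would derive it from the job (for example, from a resource saved at submission time, as in the reliable job startup case at the end of this document):

        import pbs

        e = pbs.event()

        # only act when some of the job's moms have been reported unhealthy
        if len(e.vnode_list_fail) > 0:
            for vn in e.vnode_list_fail:
                pbs.logmsg(pbs.LOG_DEBUG, "vnode %s reported as failed" % (vn,))

            # prune the job to a healthy set of chunks (placeholder keep_select)
            pj = e.job.release_nodes(keep_select="ncpus=1:mem=1gb")
            if pj is None:
                # pruning failed: requeue the job instead of running it short-handed
                e.job.rerun()
                e.reject("unable to prune job to a healthy set of nodes")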

...

                   ...

                  pbs.event().job.release_nodes(keep_select=...)

...

NOTE: On Windows, where PBS_NODEFILE always appears in pbs.event().env, the following needs to be placed at the top of the execjob_launch hook:


import pbs
e = pbs.event()
if any("mom_open_demux.exe" in s for s in e.argv):
    e.accept()


    • This call makes sense only when the job is node failure tolerant (i.e. tolerate_node_failures=job_start or tolerate_node_failures=all), since that is when the
      lists of healthy and failed nodes are gathered and consulted by release_nodes() to determine which chunks should be assigned or freed.
    • If it is invoked and yet the job is not tolerant of node failures, the following message is displayed in mom_logs under DEBUG level:

                     "<jobid>: no nodes released as job does not tolerate node failures"

  • Detail: Release nodes that are assigned to a job in such a way that the result still satisfies the given 'keep_select' specification, using none of the nodes that are known to be bad (those in pbs.event().vnode_list_fail). On a successful execution of the release_nodes() call from an execjob_prologue or execjob_launch hook, the 's' accounting record (interface 2) is generated, and the primary mom notifies the sister moms to also update their internal nodes tables, so future use of the task manager API (e.g. tm_spawn, pbsdsh) will be aware of the change.

  • If pbs_cgroups is enabled (PP-325 Support Cgroups), the cgroup already created for the job is also updated to match the job's new resources. If the kernel rejects the update to the job's cgroup resources, the job will be aborted on the execution host side and requeued/rerun on the server side.
  • Examples:

           Given an execjob_prologue hook, a hook writer can release a set of nodes from a job by doing:

                pj = e.job.release_nodes(keep_select="ncpus=2:mem=2gb+ncpus=2:mem=2gb+ncpus=1:mem=1gb")
                if pj is not None:
                    pbs.logmsg(pbs.LOG_DEBUG, "pj.exec_vnode=%s" % (pj.exec_vnode,))
                else:
                    # release_nodes() returned None, so hold the job, requeue it,
                    # and reject the hook event
                    e.job.Hold_Types = pbs.hold_types("s")
                    e.job.rerun()
                    e.reject("unsuccessful at LAUNCH")


  • Log/Error messages:
    • When a job's assigned nodes get pruned (nodes released to satisfy 'keep_select'), mom_logs will show the following info under PBSEVENT_JOB log level:

      ";Job;<jobid>;pruned from exec_vnode=<original value>"
      ";Job;<jobid>;pruned to exec_nodevnode=<new value>"

    • When a multinode job's assigned resources have been modified, the primary mom will do a quick 5-second wait for acknowledgements from the sister moms that they have updated their nodes tables. If not all acknowledgements are received by the primary mom during that wait, the following DEBUG2 level mom_logs message appears:

...

                   Seeing this log message means that a job can momentarily receive an error when doing tm_spawn or pbsdsh to a node that has not yet completed the nodes table update.

    • When the mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown at DEBUG2 level:
      • "could not satisfy select chunk (<resc1>=<val1> <resc2>=<val2> ...<rescN>=valN)"
      • "NEED chunks for keep_select (<resc1>=<val1> <resc2>=<val2> ...<rescN>=valN)"
      • "HAVE chunks from job's exec_vnode (<exec_vnode value>)"
    • When a sister mom has updated its internal nodes table due to some nodes getting released as a result of the release_nodes() call, mom_logs on the sister host will show the following message at PBSEVENT_JOB level:

                      ";<jobid>;updated nodes info"
    • Calling release_nodes() from a hook that is not an execjob_prologue or execjob_launch hook returns None, as this is currently not supported.
    • Upon successful execution of release_nodes() call, it is normal to receive messages in the mom_logs of the form:

                          " stream <num> not found to job nodes"
                    "im_eof, No error from addr <ipaddr>:<port> on stream <num>                 which corresponds to the connection stream of a released mom host" stream <num> not found to job nodes"
                    "im_eof, No error from addr <ipaddr>:<port> on stream <num>

                 which corresponds to the connection stream of a released mom host.

Interface 9: new hook event: execjob_resize

  • Visibility: Public
  • Change Control: Stable
  • Python constant: pbs.EXECJOB_RESIZE
  • Event Parameters: 
    • pbs.event().job - This is a pbs.job object representing the job whose resources have been updated. This job object cannot be modified under this hook.
    • pbs.event().vnode_list[] - This is a dictionary of pbs.vnode objects, keyed by vnode name, listing the vnodes that are assigned to the job. The vnode objects in the vnode_list cannot be modified.
  • Restriction: The execjob_resize hook will run under the security context of the Admin user.
  • Details:
    • An execjob_resize event has been introduced primarily as a new event for the pbs_cgroups hook, to be executed when there's an update to the job's assigned resources as a result of the release_nodes() call. This allows pbs_cgroups to act on a change to the job's resources; the action is to update the limits of the job's cgroup. A minimal sketch of such a hook is shown at the end of this interface.
    • If the pbs_cgroups hook, while executing in response to an execjob_resize event, calls pbs.event().reject(<message>), encounters an exception, or terminates due to an alarm call, the following DEBUG2 mom_logs message results, and the job is aborted on the mom side and requeued/rerun on the server side:

      "execjob_resize" request rejected by 'pbs_cgroups'
      <message>

  • New qmgr output:
    • The returned error message from qmgr upon seeing an unrecognized hook event has changed (due to this additional hook event):

      # qmgr -c "set hook <hook_name> event = <bad_event>"

      from:
      invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach or "" for no event

      to:
      invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize or "" for no event

  • External documentation:
    • This hook event is intentionally not added to the external documentation (as of 2021.1.3), because it is intended for use primarily by the cgroups hook.
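  • Example:

    For illustration only, a minimal execjob_resize hook (separate from pbs_cgroups) might simply log the job's updated exec_vnode and the vnodes still assigned to it; the hook name 'resize_logger' used here is just an example:

        # qmgr -c "create hook resize_logger event=execjob_resize"
        import pbs

        e = pbs.event()
        j = e.job    # read-only under this event

        # log the job's updated assignment and walk the remaining assigned vnodes
        pbs.logmsg(pbs.LOG_DEBUG, "%s resized: exec_vnode=%s" % (j.id, j.exec_vnode))
        for vn in e.vnode_list:
            pbs.logmsg(pbs.LOG_DEBUG, "vnode %s still assigned to %s" % (vn, j.id))

        e.accept()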

Case of Reliable Job Startup:

...

j.tolerate_node_failures = "job_start"

Then, save the current value of 'select' in the builtin resource "site".

...