Objective

...

  • Visibility: Public
  • Change Control: Stable
  • Value: 'all', 'job_start', or 'none'
  • Python type: str
  • Synopsis:  
    • When set to 'all', the job tolerates all node failures resulting from communication problems (e.g., polling) between the primary mom and the sister moms assigned to the job, as well as failures due to rejections from execjob_begin or execjob_prologue hook execution by remote moms.
    • When set to 'job_start', the job tolerates only node failures that occur during job start: an assigned sister mom failing to join the job, or communication errors between the primary mom and sister moms that happen just before the job executes the execjob_launch hook and/or the top-level shell or executable.
    • When set to 'none', or if the attribute is unset, no node failures are tolerated (the default behavior).

    • It can be set via qsub, qalter, or in a Python hook (e.g., a queuejob hook). If set via qalter while the job is already running, the new value is consulted the next time the job is rerun.
    • It can also be specified in the server attribute 'default_qsub_arguments' so that all jobs are submitted with the tolerate_node_failures attribute set.
    • This option is best used when the job is assigned extra nodes via the pbs.event().job.select.increment_chunks() method (interface 7).
  • Privilege: user, admin, or operator can set it
  • Examples:
    • Via qsub:

...

                            qalter -W tolerate_node_failures="job_start" <jobid>

    • Via a hook:

                            # cat qjob.py
                            import pbs
                            e=pbs.event()
                            e.job.tolerate_node_failures = "all"
                            # qmgr -c "create hook qjob event=queuejob"
                            # qmgr -c "import hook application/x-python default qjob.py"
                            % qsub job.scr
                            23.borg
                            % qstat -f 23
                              ...
                              tolerate_node_failures = all

...

  • Visibility: Public
  • Change Control: Stable
  • Details:
    This is the number of seconds that the primary mom will wait before launching the job (executing the job script or executable) if the job has tolerate_node_failures set to "all" or "job_start". This wait time can be used to let execjob_prologue hooks finish executing, in order to capture or report any node failures, or to give the mother superior time to notice any communication problems with other nodes. pbs_mom will not necessarily wait the entire time; it proceeds to execute the execjob_launch hook (when specified) once all prologue hook acknowledgements have been received from the sister moms.
  • Default value: the total of the 'alarm' values of all enabled execjob_prologue hooks. For example, if there are two execjob_prologue hooks, where the first hook has alarm=30 and the second has alarm=60, then the default job_launch_delay value will be 90 seconds. If there are no execjob_prologue hooks, the default is 30 seconds.
    To change the value, add the following line to mom's config file:
                   $job_launch_delay <number of seconds>
  • Restriction:
    • This option is currently not supported under Windows. NOTE: Allowing it would cause the primary mom to hang waiting on the job_launch_delay timeout, preventing other jobs from starting, because on Windows jobs are not pre-started in a forked child process, unlike on Linux/Unix systems.
  • Log/Error messages:
  1. When the $job_launch_delay value is set, a PBSEVENT_SYSTEM level message is logged upon mom startup or when mom is sent a HUP signal (kill -HUP):
     "job_launch_delay;<delay_value>"
  2. When the primary mom notices that not all acknowledgements of execjob_prologue hook execution were received from the sister moms, mom_logs shows the following DEBUG2 level message:
     "not all prologue hooks to sister moms completed, but job will proceed to execute"
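The default-value rule above (sum of prologue hook alarms, with a 30-second fallback) can be sketched in plain Python. This is an illustrative sketch, not the actual pbs_mom source; the function name `default_job_launch_delay` is hypothetical.

```python
# Illustrative sketch (not actual pbs_mom code) of how the default
# job_launch_delay is derived from enabled execjob_prologue hooks.

FALLBACK_DELAY = 30  # seconds, used when no execjob_prologue hooks exist


def default_job_launch_delay(prologue_hook_alarms):
    """Return the sum of the 'alarm' values of all enabled
    execjob_prologue hooks, or 30 seconds if there are none."""
    if not prologue_hook_alarms:
        return FALLBACK_DELAY
    return sum(prologue_hook_alarms)


# Matches the example in the text: two hooks with alarm=30 and alarm=60
print(default_job_launch_delay([30, 60]))  # 90
print(default_job_launch_delay([]))        # 30
```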

...

  • Visibility: Public
  • Change Control: Stable
  • Python Type: dict (dictionary of pbs.vnode objects keyed by vnode name)
  • Details:
    This is a new event parameter for the execjob_prologue and execjob_launch hooks. It contains the vnodes, and their assigned resources, that are managed by unhealthy moms. This can include vnodes from sister moms that failed to join the job, that rejected an execjob_begin or execjob_prologue hook request, or that encountered a communication error while the primary mom was polling the sister mom host. The dictionary is keyed by vnode name, and one can walk through this list and start offlining the vnodes, for example:

    for vn in e.vnode_list_fail:
        v = e.vnode_list_fail[vn]
        pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
        v.state = pbs.ND_OFFLINE

  • Additional Details:
    • Any sister nodes that are able to join the job will be considered healthy.
    • The success of a join job request may be the result of a check made by a remote execjob_begin hook. After successfully joining the job, the node's status may be further checked via a remote execjob_prologue hook. A rejection by the remote prologue hook will cause the primary mom to treat the sister node as a problem node and mark it as unhealthy. Unhealthy nodes are not selected when pruning a job's request via the pbs.release_nodes(keep_select) call (see interface 8 below).
    • If a sister node goes radio silent while executing the execjob_prologue hook, the primary mom will eventually detect this as a communication failure and mark the node as unhealthy.

Interface 6: Allow execjob_launch hooks to modify job and vnode attributes

  • Visibility: Public
  • Change Control: Stable
  • Detail: With this feature, execjob_launch hooks are now allowed to modify job and vnode attributes, in particular the job's Execution_Time, Hold_Types, resources_used, and run_count values, as well as vnode object attributes like state and resources_available.
  • Examples:

                                Set a job's Hold_Types in case the hook script rejects the execjob_launch event:

                                pbs.event().job.Hold_Types = pbs.hold_types('s')

                           Set a vnode's state to offline:

...

  • Visibility: Public
  • Change Control: Stable
  • Return Python type: a PBS job object reflecting the new values of attributes such as 'exec_vnode' and Resource_List.* as a result of nodes being released.
  • Input: keep_select - a pbs.select string that should be a subset of the job’s original select request, mapping to a set of nodes that should be kept.
  • Restriction: This is currently callable only from the mom hooks execjob_launch and execjob_prologue, and makes sense only when executed from the hook run by the primary mom.
    • It is advisable to put this call in an 'if pbs.event().job.in_ms_mom()' clause.
    • Also, since the execjob_launch hook will also get called when spawning tasks via pbsdsh or tm_spawn, ensure that the execjob_launch hook invoking the release_nodes() call has 'PBS_NODEFILE' in the pbs.event().env list. The presence of 'PBS_NODEFILE' in the environment ensures that the primary mom is executing on behalf of starting the top-level job, and not spawning a sister task. One can just add at the top of the hook:

      e = pbs.event()
      if 'PBS_NODEFILE' not in e.env:
          e.accept()

    • This call makes sense only when the job is node failure tolerant (i.e., tolerate_node_failures=job_start or tolerate_node_failures=all), since that is when the
      lists of healthy and failed nodes are tracked, to be consulted by release_nodes() when determining which chunks should be kept or freed.
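The two guards described above (primary mom check and PBS_NODEFILE check) can be combined into a single decision. The following is a plain-Python sketch of that logic with a hypothetical helper name `should_call_release_nodes`; inside a real hook, the inputs would come from pbs.event().job.in_ms_mom() and pbs.event().env.

```python
# Hypothetical sketch of the guard logic described above; this helper is
# not part of the PBS hook API. In a real execjob_launch hook, 'in_ms_mom'
# would come from pbs.event().job.in_ms_mom() and 'env' from pbs.event().env.

def should_call_release_nodes(in_ms_mom, env):
    """Only call release_nodes() when running on the primary mom and when
    PBS_NODEFILE is present, i.e. the hook fired for the top-level job
    launch rather than for a pbsdsh/tm_spawn sister task."""
    if not in_ms_mom:
        return False   # sister mom: release_nodes() is not meaningful here
    if 'PBS_NODEFILE' not in env:
        return False   # spawning a sister task, not starting the job
    return True


print(should_call_release_nodes(True, {'PBS_NODEFILE': '/var/spool/pbs/aux/23'}))  # True
print(should_call_release_nodes(True, {}))                                         # False
print(should_call_release_nodes(False, {'PBS_NODEFILE': '/var/spool/pbs/aux/23'})) # False
```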

...

                   Seeing this log message means that a job can momentarily receive an error when doing tm_spawn or pbsdsh to a node that has not yet completed its nodes table update.

    • When the mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown at DEBUG2 level:
      • "could not satisfy select chunk (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>)"
      • "NEED chunks for keep_select (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>)"
      • "HAVE chunks from job's exec_vnode (<exec_vnode value>)"
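The NEED/HAVE messages above describe a matching problem: every chunk requested in keep_select must be satisfiable by some unused chunk from the job's exec_vnode assignment. The sketch below is an illustrative simplification of that check (greedy matching over chunks modeled as resource dicts), not the actual pbs_mom pruning algorithm; the function name `can_satisfy` is hypothetical.

```python
# Illustrative simplification of the chunk matching behind the NEED/HAVE
# log messages; not the actual pbs_mom pruning code. A chunk is modeled
# as a dict of resource name -> numeric amount.

def can_satisfy(keep_select_chunks, exec_vnode_chunks):
    """Greedily match each NEEDed chunk against an unused HAVE chunk that
    offers at least the requested amount of every resource. Returns False
    when some NEED chunk cannot be satisfied (the 'could not satisfy
    select chunk' case)."""
    available = list(exec_vnode_chunks)
    for need in keep_select_chunks:
        for i, have in enumerate(available):
            if all(have.get(r, 0) >= amt for r, amt in need.items()):
                del available[i]   # this HAVE chunk is consumed
                break
        else:
            return False           # no remaining HAVE chunk fits this NEED
    return True


have = [{'ncpus': 4, 'mem': 8}, {'ncpus': 2, 'mem': 4}]
print(can_satisfy([{'ncpus': 2}], have))   # True
print(can_satisfy([{'ncpus': 8}], have))   # False
```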

...