Objective

...

  • Visibility: Public
  • Change Control: Stable
  • Value: 'all', 'job_start', or 'none'
  • Python type: str
  • Synopsis:  
    • When set to 'all', all node failures are tolerated, whether they result from communication problems (e.g. polling) between the primary mom and the sister moms assigned to the job, or from rejections by execjob_begin or execjob_prologue hooks executed by remote moms.
    • When set to 'job_start', only node failures that occur during job start are tolerated, such as an assigned sister mom failing to join the job or communication errors between the primary mom and sister moms, up to just before the job executes the execjob_launch hook and/or the top-level shell or executable.
    • When set to 'none', or if the attribute is unset, no node failures are tolerated (the default behavior).

    • It can be set via qsub, qalter, or in a Python hook (e.g. a queuejob hook). If set via qalter while the job is already running, the new value is consulted the next time the job is rerun.
    • It can also be specified in the server attribute 'default_qsub_arguments' so that all jobs are submitted with the tolerate_node_failures attribute set.
    • This option is best used when a job is assigned extra nodes via the pbs.event().job.select.increment_chunks() method (interface 7).
  • Privilege: user, admin, or operator can set it
  • Examples:
    • Via qsub:

...

                            qalter -W tolerate_node_failures="job_start" <jobid>

    • Via a hook:

                            # cat qjob.py
                            import pbs
                            e=pbs.event()
                            e.job.tolerate_node_failures = "all"
                            # qmgr -c "create hook qjob event=queuejob"
                            # qmgr -c "import hook application/x-python default qjob.py"
                            % qsub job.scr
                            23.borg
                            % qstat -f 23
                              ...
                              tolerate_node_failures = all

  • Log/Error messages:
    • When a job has the tolerate_node_failures attribute set to 'all' or 'job_start', the following mom_logs messages appear in these situations: a sister mom fails to join the job due to a communication error or an execjob_begin hook rejection; a sister mom fails to set up the job (e.g. cpuset creation failure); a sister mom rejects an execjob_prologue hook; the primary mom fails to poll a sister mom for status; or any other communication error with a sister mom:
      • DEBUG level: "ignoring from <node_host> error as job is tolerant of node failures"

...

  • Visibility: Public
  • Change Control: Stable
  • Return Python type: a PBS job object reflecting the new values of attributes such as 'exec_vnode' and Resource_List.* as a result of the nodes being released.
  • Input: keep_select - a pbs.select string that should be a subset of the job’s original select request, mapping to a set of nodes that should be kept.
  • Restriction: This is currently callable only from the mom hooks execjob_launch and execjob_prologue, and it makes sense only when executed in the hook instance run by the primary mom.
    • It is advisable to put this call in an 'if pbs.event().job.in_ms_mom()' clause.
    • Also, since the execjob_launch hook is also invoked when spawning tasks via pbsdsh or tm_spawn, an execjob_launch hook that calls release_nodes() should first check that 'PBS_NODEFILE' is in the pbs.event().env list. The presence of 'PBS_NODEFILE' in the environment ensures that the primary mom is starting the top-level job rather than spawning a sister task. One can add a guard like the sketch below at the top of the hook.
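      A minimal sketch of such a guard, assuming the standard pbs hook module (pbs.event().accept() ends hook execution):

                            # top of an execjob_launch hook that calls release_nodes()
                            import pbs
                            e = pbs.event()
                            if 'PBS_NODEFILE' not in e.env:
                                # not starting the top-level job (e.g. a pbsdsh/tm_spawn task); skip pruning
                                e.accept()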
    • This call makes sense only when the job is node-failure tolerant (i.e. tolerate_node_failures=job_start or tolerate_node_failures=all), since only then are the lists of healthy and failed nodes tracked for release_nodes() to consult when determining which chunks should be kept or freed.
  • Detail: Release nodes that are assigned to a job in such a way that the remaining nodes still satisfy the given 'keep_select' specification, using none of the nodes that are known to be bad (those in pbs.event().vnode_list_fail). On a successful release_nodes() call from an execjob_prologue or execjob_launch hook, the 's' accounting record (interface 2) is generated, and the primary mom notifies the sister moms to update their internal nodes tables, so future use of the task manager API (e.g. tm_spawn, pbsdsh) will be aware of the change.
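    The sketch below shows one way release_nodes() might be called from an execjob_launch hook under these rules; the keep_select value, the reject-on-failure handling, and treating a None return as an unsuccessful prune are illustrative assumptions, not part of this interface.

                            # illustrative execjob_launch hook body (not part of this interface)
                            import pbs

                            e = pbs.event()
                            j = e.job

                            # act only on the primary (mother superior) mom, and only when
                            # starting the top-level job rather than a pbsdsh/tm_spawn task
                            if not j.in_ms_mom():
                                e.accept()
                            if 'PBS_NODEFILE' not in e.env:
                                e.accept()

                            # vnodes already known to be bad are listed in vnode_list_fail
                            for vn in e.vnode_list_fail.keys():
                                pbs.logmsg(pbs.LOG_DEBUG, "vnode %s marked as failed" % vn)

                            # keep_select must be a subset of the job's original select request;
                            # the value here is a made-up example
                            keep = "ncpus=2:mem=2gb+ncpus=2:mem=2gb"
                            pruned = j.release_nodes(keep_select=keep)
                            if pruned is None:
                                e.reject("could not prune job to keep_select=%s" % keep)
                            else:
                                pbs.logmsg(pbs.LOG_DEBUG,
                                           "pruned exec_vnode=%s" % pruned.exec_vnode)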

...

                   Seeing this log message means that a job may momentarily receive an error when doing tm_spawn or pbsdsh to a node that has not yet completed its nodes table update.

    • When the mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown at DEBUG2 level:
      • "could not satisfy select chunk (<resc1>=<val1> <resc2>=<val2> ...<rescN>=valN) 

      • "NEED chunks for keep_select (<resc1>=<val1> <resc2>=<val2> ...<rescN>=valN)
      • "HAVE chunks from job's exec_vnode (<exec_vnode value>

...