Objective

...

  • Visibility: Public
  • Change Control: Stable
  • Value: 'all', 'job_start', or 'none' (default is 'none')
  • Python type: str
  • Synopsis:  
    • When set to 'all', all node failures resulting from communication problems (e.g. polling) between the primary mom and the sister moms assigned to the job, as well as from rejections by execjob_begin or execjob_prologue hook execution on remote moms, are tolerated. That is, the job is allowed to run even after the primary mom has detected bad nodes.
    • When set to 'job_start', only node failures that occur during job start are tolerated, such as an assigned sister mom failing to join the job, or communication errors between the primary mom and the sister moms that happen just before the job executes the execjob_launch hook and/or the top-level shell or executable.
    • When set to 'none', or if the attribute is unset, no node failures are tolerated (default behavior).
    • It can be set via qsub, qalter, or in a Python hook, say a queuejob hook. If set via qalter and the job is already running, the new value will be consulted the next time the job is rerun.
    • This can also be specified in the server attribute 'default_qsub_arguments' to allow all jobs to be submitted with the tolerate_node_failures attribute set (see the qmgr example below).
    • This option is best used when the job is assigned extra nodes using the pbs.event().job.select.increment_chunks() method (interface 7).
  • Privilege: user, admin, or operator can set it
  • Examples:
    • Via qsub:

                            qsub -W tolerate_node_failures="all" <job_script>

    • Via qalter:

                            qalter -W tolerate_node_failures="job_start" <jobid>

    • Via a hook:

                            # cat qjob.py
                            import pbs
                            e=pbs.event()
                            e.job.tolerate_node_failures = "all"
                            # qmgr -c "create hook qjob event=queuejob"
                            # qmgr -c "import hook application/x-python default qjob.py"
                            % qsub job.scr
                            23.borg
                            % qstat -f 23
                              ...
                              tolerate_node_failures = all
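
    • Via qmgr, by setting the server's default_qsub_arguments attribute so that jobs get the attribute by default (a sketch of one possible setting; the exact shell quoting is an assumption and may vary):

                            qmgr -c "set server default_qsub_arguments='-W tolerate_node_failures=all'"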

  • Log/Error messages:
    • When a job has the tolerate_node_failures attribute set to 'all' or 'job_start', the following mom_logs message will appear under these conditions: a sister mom fails to join the job due to either a communication error or an execjob_begin hook rejection, a sister mom fails to set up the job (e.g. a cpuset creation failure), a sister mom rejects an execjob_prologue hook, the primary mom fails to poll a sister mom for status, or any other communication error with a sister mom occurs:
      • DEBUG level: "ignoring error as job is tolerant of node failures"

...

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: When a job has the tolerate_node_failures attribute set to 'all' or 'job_start', there will be a new accounting record (the 's' record shown in the example below) reflecting the adjusted (pruned) values of the job's assigned resources, as a result of the call to pbs.event().job.release_nodes() inside an execjob_prologue or execjob_launch hook.
  • Note: This is a new accounting record; the start of job record ('S') remains as before.
  • Example:

    04/07/2016 17:08:09;s;20.corretja.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=jobr.scr queue=workq ctime=1460063202 qtime=1460063202 etime=1460063202 start=1460063203  exec_host=corretja/0*3+lendl/0*2+nadal/0 exec_vnode=(corretja:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(nadal:ncpus=1:mem=3145728kb) Resource_List.mem=6291456kb Resource_List.ncpus=6 Resource_List.nodect=3 Resource_List.place=scatter:excl Resource_List.select=1:ncpus=3:mem=1048576kb+1:ncpus=2:mem=2097152kb+1:ncpus=1:mem=3145728kb Resource_List.site=ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb resource_assigned.mem=24gb resource_assigned.ncpus=9

...

  • Visibility: Public
  • Change Control: Stable
  • Details:
    This is the number of seconds that the primary mom will wait to receive acknowledgement from all the sister moms for the IM_JOIN_JOB requests sent, if the job's tolerate_node_failures attribute is set to 'all' or 'job_start'. That is, just before the job officially launches its program (script/executable), the primary pbs_mom will ignore any errors from sister moms, including failed IM_JOIN_JOB requests. Once all the IM_JOIN_JOB requests have been acknowledged, or when the 'sister_join_job_alarm' wait time has been exceeded, pre-starting the job (calling finish_exec()) continues.
  • Default value: the sum of the 'alarm' values of all enabled execjob_begin hooks. For example, if there are 2 execjob_begin hooks, with the first hook having alarm=30 and the second alarm=20, then the default value of sister_join_job_alarm will be 50 seconds. If there are no execjob_begin hooks, this is set to 30 seconds.
        To change the value, add the following line in mom's config file:
                            $sister_join_job_alarm <# of seconds>
  • Log/Error messages:

...

  • Visibility: Public
  • Change Control: Stable
  • Details:
    This is the number of seconds that the primary mom will wait before launching the job's program (script or executable), if the job has tolerate_node_failures set to 'all' or 'job_start'. This wait time allows execjob_prologue hooks to finish execution and capture or report any node failures, and gives the mother superior time to notice any communication problems with other nodes. pbs_mom will not necessarily wait for the entire time, but will proceed to execute the execjob_launch hook (when specified) once all prologue hook acknowledgements have been received from the sister moms.
  • Default value: the sum of the 'alarm' values of all enabled execjob_prologue hooks. For example, if there are 2 execjob_prologue hooks, where the first hook has alarm=30 and the second has alarm=60, then the default job_launch_delay value will be 90 seconds. If there are no execjob_prologue hooks, this is set to 30 seconds.
    To change the value, add the following line in mom's config file:
                   $job_launch_delay <number of seconds>
  • Log/Error messages:

...

  • Visibility: Public
  • Change Control: Stable
  • Return Python Type: pbs.select
  • Details:
    This is a new method in the pbs.select type where 'increment_spec' specifies the number of chunks to add to each chunk (except for the first chunk, which is assigned to the primary mom) in the chunk specification. So given a select spec of "[N:][chunk specification][+[N:]chunk specification]", this method returns "[N+increment:][chunk specification][+[N+increment:]chunk specification]". A missing 'N' value means 1.
    The first chunk is the single chunk inside the first item (in the plus-separated specs) that is assigned to the primary mom; it is left as is. For instance, given a chunk spec of "3:ncpus=2+2:ncpus=4", this can be viewed as "(1:ncpus=2+2:ncpus=2)+(2:ncpus=4)", and the increment specs described below apply to the chunks after the initial, single chunk "1:ncpus=2" and to all the succeeding chunks.
  • Input:
    • if 'increment_spec' is a number (int or long), then it will be the amount to add to the number of chunks (that is not the first chunk) specified for each chunk in the pbs.select spec.
    • if 'increment_spec' is a numeric string (int or long), then it will also be the amount to add to the number of chunks (that is not the first chunk) specified for each chunk in the pbs.select spec.
    • if 'increment_spec' is a numeric string that ends with a percent sign (%), then this will be the percent amount of chunks by which to increase each chunk (that is not the first chunk) in the pbs.select spec. The resulting amount is rounded up (i.e. ceiling) (e.g. 1.23 rounds up to 2).
    • Finally, if 'increment_spec' is a dictionary with elements of the form:
                       {<chunk_index_to_select_spec> : <increment>, ...}
      where <chunk_index_to_select_spec> starts at 0 for the first chunk, and <increment> can be numeric, a numeric string, or a percent increase value. This allows individually specifying the amount by which to increase the number of chunks. Note that for the first chunk in the list (0th index), the increment applies to the chunks beyond the initial single chunk, which is assigned to the primary mom.
  • Example:

    Given:
      sel=pbs.select("ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")

    Calling sel.increment_chunks(2) would return a pbs.select value mapping to:
      "1:ncpus=3:mem=1gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=3gb"

    Calling sel.increment_chunks("3") would return a pbs.select value mapping to:
      "1:ncpus=3:mem=1gb+4:ncpus=2:mem=2gb+5:ncpus=1:mem=3gb"

    Calling sel.increment_chunks("23.5%") would return a pbs.select value mapping to:
      "1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"
    with the first chunk, which is a single chunk, left as is, and the second and third chunks increased by 23.5%, resulting in 1.24 rounded up to 2, and 2.47 rounded up to 3.

    Calling sel.increment_chunks({0: 0, 1: 4, 2: "50%"}) would return a pbs.select value mapping to:
      "1:ncpus=3:mem=1gb+5:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"
    where there is no increase (0) for chunk 1, an additional 4 chunks for chunk 2, and a 50% increase for chunk 3, resulting in 3.

    Given:
      sel=pbs.select("5:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")

    then calling sel.increment_chunks("50%") or sel.increment_chunks({0: "50%", 1: "50%", 2: "50%"}) would return a pbs.select value mapping to:
      "7:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"
    as for the first chunk, the initial single chunk of "1:ncpus=3:mem=1gb" is left as is, with the "50%" increase applied to the remaining chunks "4:ncpus=3:mem=1gb" and then added back to the single chunk to make 7, while chunks 2 and 3 are increased to 2 and 3, respectively.

Interface 8: pbs.event().job.release_nodes(keep_select) method

...

                   Seeing this log message means that a job can momentarily receive an error when doing tm_spawn or pbsdsh to a node that has not completed the nodes table update yet.

    • When the mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown at DEBUG2 level:
      • "could not satisfy select chunk (<resc1>=<val1> <resc2>=<val2> ...<rescN>=<valN>)"
      • "NEED chunks for keep_select (<resc1>=<val1> <resc2>=<val2> ...<rescN>=<valN>)"
      • "HAVE chunks from job's exec_vnode (<exec_vnode value>)"

...

In order to have a job reliably start, we'll need a queuejob hook that makes the job tolerate node failures by setting the 'tolerate_node_failures' attribute to 'job_start', adding extra chunks to the job's select specification using the pbs.event().job.select.increment_chunks() method (interface 7), while saving the job's original select value in a builtin resource, say "site", and an execjob_launch hook that will call pbs.event().job.release_nodes() to prune the job's select value back to the original.
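
A minimal sketch of such a queuejob hook, assuming the select specification is accessed through Resource_List["select"], the original request is saved in the builtin "site" resource, and one extra chunk per non-first chunk is enough padding (the actual hook used in this design may differ), could look like:

                            # qjob.py (illustrative sketch)
                            import pbs

                            e = pbs.event()
                            j = e.job

                            # tolerate node failures that occur during job start only
                            j.tolerate_node_failures = "job_start"

                            # remember the original select spec in the builtin "site" resource,
                            # so an execjob_launch hook can later prune back to it
                            j.Resource_List["site"] = str(j.Resource_List["select"])

                            # pad the request with extra chunks (here, 1 extra per non-first chunk)
                            j.Resource_List["select"] = j.Resource_List["select"].increment_chunks(1)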

...

j.tolerate_node_failures = "job_start"

Then, save the current value of 'select' in the builtin resource "site".

...

% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb)
Resource_List.mem = 11gb
Resource_List.ncpus = 9
Resource_List.nodect = 5
Resource_List.place = scatter:excl
Resource_List.select = ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.site = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

tolerate_node_failures = job_start

Suppose federer and sampras went down. Then, just before the job runs its program, the execjob_launch hook executes and prunes the job's node assignment back to the original select request (a minimal sketch of such a hook is shown at the end of this section), and the job detail now shows:

...
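
For reference, a minimal sketch of an execjob_launch hook that performs this pruning, assuming the original select was saved in the "site" resource by the queuejob hook above and that the job's in_ms_mom() test is used to restrict the call to the primary mom (the actual hook in this design may include additional error handling), could look like:

                            # launch.py (illustrative sketch)
                            import pbs

                            e = pbs.event()
                            j = e.job

                            # only the primary (mother superior) mom prunes the node assignment
                            if j.in_ms_mom():
                                keep = j.Resource_List["site"]
                                if keep:
                                    # release the extra vnodes, keeping only enough to satisfy
                                    # the original request (interface 8)
                                    j.release_nodes(keep_select=keep)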