Objective

...

Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649

...

Interface 1: New job attribute 'tolerate_node_failures'

  • Visibility: Public
  • Change Control: Stable
  • Value:  'true' or 'false' (default is 'false')
  • Python type: bool
  • Synopsis:  
    • When set to 'true', any node failures are tolerated throughout the life of the job. That is, the job is allowed to run even after the primary (mother superior) mom has detected bad nodes.
    • It can be set via qsub or qalter, or in a Python hook, e.g. a queuejob hook. If set via qalter and the job is already running, the new value will be consulted the next time the job is rerun.
    • This can also be specified in the server attribute 'default_qsub_arguments' to allow all jobs to be submitted with tolerate_node_failures attribute set.
    • This option is best used when job is assigned extra nodes using pbs.event().job.select.increment_chunks() method (interface 7).
  • Privilege: user, admin, or operator can set it
  • Examples:
    • Via qsub:

                            qsub -W tolerate_node_failures=true <job_script>

    • Via qalter:

                            qalter -W tolerate_node_failures=false <job_id>

    • Via a hook:

                            # cat qjob.py
                            import pbs
                            e=pbs.event()
                            e.job.tolerate_node_failures = True

...

                            # qmgr -c "create hook qjob event=queuejob"
                            # qmgr -c "import hook application/x-python default qjob.py"
                            % qsub job.scr
                            23.borg
                            % qstat -f 23

...

                              ...
                              tolerate_node_failures = True

  • Log/Error messages:
    • When a job has the tolerate_node_failures attribute set to 'true', the following mom_logs messages will appear under these conditions: a sister mom fails to join the job due to either a communication error or an execjob_begin hook rejection, a sister mom fails to set up the job (e.g. cpuset creation failure), a sister mom rejects an execjob_prologue hook, the primary mom fails to poll a sister mom for status, or any communication error to a sister mom occurs:
      • DEBUG level: "ignoring error as job is tolerant of node failures"
      • DEBUG3 level: "ignoring POLL error from failed mom <mom_host> as job is tolerant of node failures"
      • DEBUG3 level: "ignoring lost communication with <mom_host> as job is tolerant of node failures"

Interface 2: New server accounting record: 's' (secondary start record) for when a job's assigned resources get pruned during job startup

  • Visibility: Public
  • Change Control: Stable
  • Synopsis: When a job has the tolerate_node_failures attribute set to 'true', this new accounting record will reflect the adjusted (pruned) values of the job's assigned resources ('exec_vnode', 'exec_host', and Resource_List.*), as a result of the call to pbs.event().job.release_nodes() inside an execjob_prologue or execjob_launch hook.
  • Note: This is a new accounting record; the start of job record ('S') remains as before.
  • Example:

    04/07/2016 17:08:09;s;20.corretja.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=jobr.scr queue=workq ctime=1460063202 qtime=1460063202 etime=1460063202 start=1460063203 exec_host=corretja/0*3+federer/0*2+nadal/0 exec_vnode=(corretja:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(nadal:ncpus=1:mem=3145728kb) Resource_List.mem=6291456kb Resource_List.ncpus=6 Resource_List.nodect=3 Resource_List.place=scatter:excl Resource_List.select=1:ncpus=3:mem=1048576kb+1:ncpus=2:mem=2097152kb+1:ncpus=1:mem=3145728kb Resource_List.site=1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb session=0 run_count=1 resource_assigned.mem=24gb resource_assigned.ncpus=9

Interface 3: sister_join_job_alarm mom config option

  • Visibility: Public
  • Change Control: Stable
  • Details:
    This is the number of seconds that the primary mom will wait to receive acknowledgement from all the sister moms for the IM_JOIN_JOB requests sent, if the job's tolerate_node_failures attribute is set to 'true'. That is, when a node-failure-tolerant job runs, just before the job officially launches its program (script/executable), the primary pbs_mom will ignore any errors from the sister moms, including failed IM_JOIN_JOB requests. Once all the IM_JOIN_JOB requests have been acknowledged, or the 'sister_join_job_alarm' wait time has been exceeded, pre-starting the job (calling finish_exec()) continues.
  • Default value: set to the total of the 'alarm' values associated with enabled execjob_begin hooks. For example, if there are 2 execjob_begin hooks, with the first hook having alarm=30 and the second having alarm=20, then the default value of sister_join_job_alarm will be 50 seconds. If there are no execjob_begin hooks, then this is set to 30 seconds.
        To change the value, add the following line in mom's config file:
                            $sister_join_job_alarm <number of seconds>
  • Log/Error messages:
    1. When the $sister_join_job_alarm value is specified, there'll be a PBSEVENT_SYSTEM level message:

                                  "sister_join_job_alarm;<alarm_value>"

    2. When a job has the tolerate_node_failures attribute set to 'true' and it's a multi-node job, then the following message is logged in mom_logs under DEBUG2 level:

                                  "Job;<job-id>;job waiting up to <alarm value> secs ($sister_join_job_alarm) for all sister moms to join"

    3. When not all join job requests from the sister moms have been acknowledged within the $sister_join_job_alarm time limit, then the following mom_logs message appears at DEBUG2 level:

                                  "sister_join_job_alarm wait time <alarm_value> secs exceeded"

Interface 4: job_launch_delay mom config option

  • Visibility: Public
  • Change Control: Stable
  • Details:
    This is the number of seconds that the primary mom will wait before launching the job (executing the job script or executable) if the job has tolerate_node_failures set to 'true'. This wait time can be used to let execjob_prologue hooks finish execution and capture or report any node failures, or for the primary mom to notice any communication problems with other nodes. pbs_mom will not necessarily wait for the entire time, but will proceed to execute the execjob_launch hook (when specified) once all prologue hook acknowledgements have been received from the sister moms.
  • Default value: set to the total of the 'alarm' values associated with enabled execjob_prologue hooks. For example, if there are 2 execjob_prologue hooks, where the first hook has alarm=30 and the second has alarm=60, then the default job_launch_delay value will be 90 seconds. If there are no execjob_prologue hooks, then this is set to 30 seconds.
    To change value, add the following line in mom's config file:
                   $job_launch_delay <number of seconds>
  • Log/Error messages:
    1. When the $job_launch_delay value is set, there'll be a PBSEVENT_SYSTEM level message:

                               "job_launch_delay;<delay_value>"

    2. Before officially launching a node-failure-tolerant job, the primary mom will wait up to 'job_launch_delay' seconds for any report of failed sister moms, which will later be used to determine the entries of the vnode_list_fail parameter (see interface 5) in the execjob_launch hook (if any). The following DEBUG2 level log message will be shown:

                  "Job;<job-id>;waiting up to <job_launch_delay_value> secs ($job_launch_delay) for mom hosts status and prologue hooks ack"

    3. When the primary mom notices that not all acks were received from the sister moms in regards to execjob_prologue hook execution, then mom_logs would show the DEBUG2 level message:

                                "not all

...

prologue hooks to sister moms completed

...

"

...

Interface 5: pbs.event().vnode_list_fail[] hook parameter

  • Visibility: Public
  • Change Control: Stable
  • Python Type: dict (dictionary of pbs.vnode objects keyed by vnode name)
  • Details:
    This is a new event parameter for the execjob_prologue and execjob_launch hooks. It will contain the list of vnodes and their assigned resources that are managed by unhealthy moms. This can include vnodes from sister moms that failed to join the job, that rejected an execjob_begin or execjob_prologue hook request, or that encountered a communication error while the primary mom was polling the sister mom host. This dictionary object is keyed by vnode name. One can walk through this list and offline the vnodes, for example:

    import pbs

    e = pbs.event()
    for vn in e.vnode_list_fail.keys():
        v = e.vnode_list_fail[vn]
        pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
        v.state = pbs.ND_OFFLINE

  • Log/Error messages:
    1. If an execjob_prologue hook or an execjob_launch hook requested to offline a vnode, server_logs would show the message under PBSEVENT_DEBUG2 level:

      ";Server@borg;Node;<node name>;Updated vnode <node_name>'s attribute state=offline per mom hook request" 

Interface 6: Allow execjob_launch hooks to modify job and vnode attributes

  • Visibility:   Public
  • Change Control: Stable
  • Detail: With this feature, execjob_launch hooks are now allowed to modify job and vnode attributes, in particular the job's Execution_Time, Hold_Types, resources_used, and run_count values. The same applies to vnode object attributes like state and resources_available.
  • Examples:

                           Set a job's Hold_Types in case the hook script rejects the execjob_launch event:

                               

...

pbs.event().job.Hold_Types = pbs.hold_types('s')
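
A fuller sketch of the same idea: an execjob_launch hook that holds the job, requeues it, and then rejects the event. This mirrors the release_nodes() failure handling shown under interface 8; the hook body below is illustrative, not an additional part of this interface.

                               import pbs

                               e = pbs.event()
                               # hold the job, requeue it, then reject the launch event
                               e.job.Hold_Types = pbs.hold_types("s")
                               e.job.rerun()
                               e.reject("requeueing the job from the execjob_launch hook")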

                         

...

Interface 7: pbs.select.increment_chunks(increment, first_chunk=False)

  • Visibility: Public
  • Change Control: Stable
  • Return Python Type: pbs.select
  • Details:
    This is a new method in the pbs.select type where 'increment' number of chunks are added to each chunk in the chunk specification. So given a select spec of "[N:][chunk specification][+[N:]chunk specification]", this function will return "[N+increment:][chunk specification][+[N+increment:]chunk specification]". A missing 'N' value means 1. By default, first_chunk=False means no increment is added to the first chunk in the spec. Example:

           Given: pbs.event().job.Resource_List["select"] = ncpus=2:mem=2gb+ncpus=2:mem=2gb+2:ncpus=1:mem=1gb

                new_select = pbs.event().job.Resource_List["select"].increment_chunks(2)      # first_chunk=False by default

           where new_select is now: ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb

           Otherwise, if first_chunk=True, then the resulting new select also includes 2 additional increments to the first chunk:

                new_select: 3:ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb
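
           As a quick illustration of typical usage, here is a minimal queuejob hook sketch that pads every chunk except the first with one extra chunk (the increment of 1 is an illustrative choice; the reliable job startup case at the end of this document does the same thing):

                import pbs

                e = pbs.event()
                j = e.job

                # pad every chunk except the first with one extra chunk
                # (first_chunk=False is the default)
                sel = j.Resource_List["select"]
                if sel is not None:
                    j.Resource_List["select"] = sel.increment_chunks(1)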

Interface 8: pbs.event().job.release_nodes() method

  • Visibility: Public
  • Change Control: Stable
  • Return Python type: PBS job object (i.e. the modified PBS job object)
  • Restriction: This is currently callable only from the mom hooks execjob_launch and execjob_prologue, and makes sense only when executed from the hook run by the primary mom. It is advisable to put this call in an 'if pbs.event().job.in_ms_mom()' clause.
  • Call syntax 1: pbs.event().job.release_nodes(nodes_list)
    • Input: nodes_list - a dictionary of pbs.vnode objects, keyed by vnode name, of those nodes to release from the job
    • Detail: Release the given nodes in 'nodes_list' that have been assigned to the job. This is a hook front-end to the pbs_release_nodes command (PP-339). If an error is encountered performing the action, this method returns None.
    • Returns: the modified PBS job object reflecting the updated values of some of the attributes like 'exec_vnode'.
    • Examples:

                     e = pbs.event()

                     rel_vnode_list = {"federer": None, "murray": None}    # dictionary keyed by vnode names; the key name is what is significant
                     pj = e.job.release_nodes(nodes_list=rel_vnode_list)   # command line equivalent: pbs_release_nodes federer murray
                     if pj != None:                                        # can check the updated values of 'exec_vnode', 'exec_host', 'exec_host2'
                         pbs.logmsg(pbs.LOG_DEBUG, "pj.exec_vnode=%s" % (pj.exec_vnode,))

                     NOTE: One can also just do e.job.release_nodes(nodes_list=e.vnode_list_fail) to release the vnodes managed by moms that were seen by the primary mom as unhealthy.
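
                     For completeness, here is a minimal sketch of how this call might be wrapped inside an execjob_prologue hook, following the in_ms_mom() advice in the Restriction above (the hook body is illustrative, not part of the interface definition):

                     import pbs

                     e = pbs.event()
                     if e.job.in_ms_mom():
                         # release the vnodes managed by moms seen as unhealthy by the primary mom
                         pj = e.job.release_nodes(nodes_list=e.vnode_list_fail)
                         if pj is None:
                             e.reject("failed to release the unhealthy vnodes")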

...

 Set a vnode's state to offline:

                               pbs.event().vnode_list[<node_name>].state = pbs.ND_OFFLINE

  • Log/Error messages:

                           In previous versions of PBS, when a job or vnode attribute/resource was set in an execjob_launch hook, the hook rejected the request and returned the following message:

                                     "Can only set progname, argv, env event parameters under execjob_launch hook"

                           Now, setting vnode and job attributes is allowed and no longer gives the above message. If something else gets set in the hook, like a server attribute, then this will now be the new DEBUG2 level mom_logs message:

                                     "Can only set progname, argv, env event parameters as well as job, resource, vnode under execjob_launch hook."


  • Call syntax 2: pbs.event().job.release_nodes(keep_select)
    • Input: keep_select - a pbs.select string that should be a subset of the job's original select request, mapping to the set of nodes that should be kept. The primary (mother superior) mom will remove node resources from nodes that have been detected to be bad, using nodes that have been seen as healthy and functional as replacements when necessary. A common value to pass is the job's original select specification (for example, saved in the builtin resource "site" as in the reliable job startup case below).
    • Detail: Release nodes that are assigned to a job in such a way that the job still satisfies the given 'keep_select' specification, with no nodes that are known to be bad (i.e. in pbs.event().vnode_list_fail). Upon successful execution of a release_nodes() call from an execjob_prologue or execjob_launch hook, the 's' accounting record (interface 2) is generated, and the primary mom will notify the sister moms to also update their internal nodes table, so that future use of the task manager API (e.g. tm_spawn, pbsdsh) will be aware of the change.
    • Returns: the modified PBS job object reflecting the new values of some of the attributes like 'exec_vnode' and Resource_List.* as a result of nodes getting released. If an error is encountered performing the action, this method returns None.
  • Examples:

           Given an execjob_prologue hook, a hook writer can release a set of nodes from a job by doing:

                pj = e.job.release_nodes(keep_select="ncpus=2:mem=2gb+ncpus=2:mem=2gb+ncpus=1:mem=1gb")
                if pj != None:
                    pbs.logmsg(pbs.LOG_DEBUG, "pj.exec_vnode=%s" % (pj.exec_vnode,))
                else:               # returned None job object, so we can put a hold on the job and requeue it, rejecting the hook event
                    e.job.Hold_Types = pbs.hold_types("s")
                    e.job.rerun()
                    e.reject("unsuccessful at LAUNCH")


  • Log/Error messages:
    • When a job's assigned nodes get pruned (nodes released to satisfy 'keep_select'), mom_logs will show the following info under PBSEVENT_JOB log level:

      ";Job;<jobid>;pruned from exec_vnode=<original value>"
      ";Job;<jobid>;pruned to exec_node=<new value>"

    • When a multinode job's assigned resources have been modified, the primary mom will do a quick 5-second wait for an acknowledgement from the sister moms that they have updated their nodes table. There'll be this DEBUG2 level mom_logs message:

...

                   Eventually, all node updates will complete in the background. 

    • When the primary mom fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown in DEBUG2 level:
      • "could not satisfy the original select chunk (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>) with first available chunk (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>)" when the first chunk from the keep_select spec could not be satisfied
      • "NEED chunks for keep_select (<resc1>=<val1> <resc2>=<val2> ... <rescN>=<valN>)" and "HAVE chunks from job's exec_vnode (<exec_vnode value>)" when a secondary (sister) chunk from the keep_select spec could not be satisfied
  • "job node_list_fail: node <node_name1>" which shows what mom is consulting as the bad_nodes list. (consulted by mom in release_nodes() call).

  • "job node_list:_good node <node_name1>" which shows what mom is consulting as the good_nodes_list (consulted by mom in release_nodes() call).

  • When a sister mom updated its internal nodes info, then mom_logs on the sister host would show the message in PBSEVENT_JOB level:

        ";pbs_mom;Job;<jobid>;updated nodes info"

  • When mother superior notice that not all acks were received from the sister moms in regards to updating their internal nodes data, then mom_logs would show the PBSEVENT_DEBUG2 message: "not all job updates to sister moms completed." Note that eventually, all the nodes would automatically complete updating its info.
  • If a sister mom receives a TM request but its nodes data have not been updated yet, the client would get an "error on spawn" message while doing tm_spawn.

    • Calling release_nodes() from a hook that is not an execjob_prologue or execjob_launch hook would return None, as this is currently not supported.
    • Upon successful execution of a release_nodes() call, it is normal to receive messages in the mom_logs of the form:

                    "stream <num> not found to job nodes"
                    "im_eof, No error from addr <ipaddr>:<port> on stream <num>"

                 which correspond to the connection streams of released mom hosts.

Case of Reliable Job Startup:

In order to have a job reliably start, we'll need a queuejob hook that makes the job tolerate node failures by setting the 'tolerate_node_failures' attribute to 'true', adds extra chunks to the job's select specification using the pbs.event().job.select.increment_chunks() method (interface 7) while saving the job's original select value into the builtin resource "site", and an execjob_launch hook that calls pbs.event().job.release_nodes() to prune the job's select value back to the original.

NOTE: In the future, we would allow any custom resource to be created and used to save the 'select' value. It's just that currently, custom resources populating Resource_List are not propagated from the server to the mom, and they need to be since the mom hook will use the value.

                  

First, introduce a queuejob hook:
% cat qjob.py

import pbs
e = pbs.event()

j = e.job

j.tolerate_node_failures = True

Then, save the current value of 'select' in the builtin resource "site":

e.job.Resource_List["site"] = str(e.job
j.Resource_List["site"])

Next, add extra chunks to the current select:

new_select = e.job.Resource_List["select"].increment_chunks(1)
e.job.Resource_List["select"] = new_select

Now instantiate the queuejob hook:
# qmgr -c "c h qjob event=queuejob"
# qmgr -c "i h qjob application/x-python default qjob.py"

Second, introduce an execjob_launch hook so that before the job officially runs its program, the job's currently assigned resources are pruned to match the original 'select' request of the user:

...

import pbs
e=pbs.event()

j = e.job
pj = j.release_nodes(keep_select=e.job.Resource_List["site"])

if pj is None:          # was not successful pruning the nodes

...

   e.reject("something went wrong pruning the job back to its original select request")

# Otherwise, free up the nodes detected already as bad

for vn in e.vnode_list_fail.keys():
    v = e.vnode_list_fail[vn]
    pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
    v.state = pbs.ND_OFFLINE

Instantiate the launch hook:

# qmgr -c "c h launch event=execjob_launch"
# qmgr -c "i h launch application/x-python default launch.py"


And say a job is of the form:


% cat jobr.scr
#PBS -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb"
#PBS -l place=scatter:excl

...


When the job first starts, it will get assigned 5 nodes, as the select specification was modified by the queuejob hook, causing 2 extra nodes to be assigned:

% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb)
Resource_List.mem = 11gb
Resource_List.ncpus = 9
Resource_List.nodect = 5
Resource_List.place = scatter:excl
Resource_List.select = 1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.site = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

tolerate_node_failures = True

Suppose federer and sampras went down. Then, just before the job runs its program, the execjob_launch hook executes and prunes the job's node assignment back to the original select request, and the job detail now shows:

% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+lendl/0*2+agassi/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)
Resource_List.mem = 6gb
Resource_List.ncpus = 6
Resource_List.nodect = 3
Resource_List.place = scatter:excl
Resource_List.select = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

Resource_List.site = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

A snapshot of the job's output would show the pruned list of nodes:

...