Objective

Provide the ability to pad a job's node resource request (i.e., request additional chunks of resources for the job) so that if some nodes fail, the job can still start. Any leftover nodes not needed by the job can be released back to the server.

Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649

Interface 1: New job attribute 'tolerate_node_failures'

                            qsub -W tolerate_node_failures="all" <job_script>

                            qalter -W tolerate_node_failures="job_start" <jobid>

                            # cat qjob.py
                            import pbs
                            e=pbs.event()
                            e.job.tolerate_node_failures = "all"
                            # qmgr -c "create hook qjob event=queuejob"
                            # qmgr -c "import hook application/x-python default qjob.py"
                            % qsub job.scr
                            23.borg
                            % qstat -f 23
                              ...
                              tolerate_node_failures = all

Interface 2: New server accounting record: 's' (secondary start record), written when the job's assigned resources get pruned during job startup

Interface 3: sister_join_job_alarm mom config option

  1. When the $sister_join_job_alarm value is specified, a PBSEVENT_SYSTEM level message is shown when mom starts up or is kill -HUPed:

         "sister_join_job_alarm;<alarm_value>"

  2. When not all join job requests from sister moms have been acknowledged within the $sister_join_job_alarm time limit, the following mom_logs message appears at DEBUG2 level:

         "sister_join_job_alarm wait time <alarm_value> secs exceeded"

Interface 4: job_launch_delay mom config option

  1. When the $job_launch_delay value is set, a PBSEVENT_SYSTEM level message is shown upon mom startup or when mom is kill -HUPed:

         "job_launch_delay;<delay_value>"

  2. When the primary mom notices that not all acks were received from the sister moms in regards to execjob_prologue hook execution, mom_logs shows the following DEBUG2 level message:

         "not all prologue hooks to sister moms completed, but job will proceed to execute"
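
As a hedged sketch, the two options above might be set in mom's config file (PBS_HOME/mom_priv/config), assuming its usual one-directive-per-line format and illustrative values, with pbs_mom then kill -HUPed so the file is re-read:

    $sister_join_job_alarm 30
    $job_launch_delay 35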

Interface 5: pbs.event().vnode_list_fail[] hook parameter
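
                           A hedged usage sketch, assuming pbs.event().vnode_list_fail[] behaves like the existing pbs.event().vnode_list[] (a mapping keyed by vnode name whose entries expose vnode attributes) and lists the vnodes detected as failed for the job; offlining the entries here mirrors the Interface 6 example:

                                import pbs
                                e = pbs.event()
                                for vn in e.vnode_list_fail:
                                    pbs.logmsg(pbs.LOG_DEBUG, "vnode reported as failed: %s" % vn)
                                    e.vnode_list_fail[vn].state = pbs.ND_OFFLINE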

Interface 6: Allow execjob_launch hooks to modify job and vnode attributes

                           Set a job's Hold_Types in case the hook script rejects the execjob_launch event:

                                pbs.event().job.Hold_Types = pbs.hold_types('s')

                           Set a vnode's state to offline:

                               pbs.event().vnode_list[<node_name>].state = pbs.ND_OFFLINE
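
                           As a hedged illustration, these two settings could be combined in an execjob_launch hook as follows; the failure check and messages are placeholders for site-specific logic:

                                import pbs

                                e = pbs.event()
                                j = e.job

                                # illustrative check; a real hook would test a site-specific launch condition
                                launch_ok = False

                                if not launch_ok:
                                    # hold the job so it is not immediately re-run after the reject
                                    j.Hold_Types = pbs.hold_types('s')
                                    # take the job's assigned vnodes offline for later inspection
                                    for vn in e.vnode_list:
                                        e.vnode_list[vn].state = pbs.ND_OFFLINE
                                    e.reject("execjob_launch preparation failed; job held and vnodes offlined")

                                e.accept()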

                           In previous versions of PBS, when a job or vnode attribute/resource is set in an execjob_launch hook, the hook rejects the request and returns the following message:

                                      "Can only set progname, argv, env event parameters under execjob_launch hook"

                           Now, setting vnode and job attributes is allowed and no longer gives the above message. If something else gets set in the hook, such as a server attribute, the following new DEBUG2 level mom_logs message is shown:

                                      "Can only set progname, argv, env event parameters as well as job, resource, vnode under execjob_launch hook."

Interface 7: pbs.select.increment_chunks(increment_spec)

Given:
      sel=pbs.select("ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")

Calling sel.increment_chunks(2) would return a string:
     "1:ncpus=3:mem=1gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=3gb"

Calling sel.increment_chunks("3") would return a string:
     "1:ncpus=3:mem=1gb+4:ncpus=2:mem=2gb+5:ncpus=1:mem=3gb"

Calling sel.increment_chunks("23.5%"), would return a pbs.select value mapping to:
      "1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"

with the first chunk, which is a single chunk, left as is, and the second and third chunks increased by 23.5%, resulting in 1.24 rounded up to 2, and 2.47 rounded up to 3.

Calling sel.increment_chunks({0: 0, 1: 4, 2: "50%"}), would return a pbs.select value mapping to:
     "1:ncpus=3:mem=1gb+5:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"

where there is no increase (0) for chunk 1, 4 additional chunks for chunk 2, and a 50% increase for chunk 3, resulting in 3.

               Given:
                         sel=pbs.select("5:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")

               Then calling sel.increment_chunks("50%") or sel.increment_chunks({0: "50%", 1: "50%", 2: "50%"}) would return a pbs.select value mapping to:
                          "7:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"
                as for the first chunk, the initial single chunk of "1:ncpus=3:mem=1gb" is left as is, the "50%" increase is applied to the remaining "4:ncpus=3:mem=1gb" (giving 6), and that is added back to the single chunk to make 7, while chunks 2 and 3 are increased to 2 and 3, respectively.
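
                As a hedged illustration, the example calls above could appear inside a hook script as follows (expected results shown as comments, taken from the examples; whether a call returns a str or a pbs.select value follows the descriptions above):

                     import pbs

                     sel = pbs.select("ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")

                     # flat increment: the first (single) chunk is left as is, the others are padded by 2
                     s1 = sel.increment_chunks(2)        # "1:ncpus=3:mem=1gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=3gb"

                     # percentage increment: per-chunk counts are rounded up
                     s2 = sel.increment_chunks("23.5%")  # "1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"

                     # per-chunk specification: dict keyed by 0-based chunk index
                     s3 = sel.increment_chunks({0: 0, 1: 4, 2: "50%"})  # "1:ncpus=3:mem=1gb+5:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"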

Interface 8: pbs.event().job.release_nodes(keep_select) method

                   if 'PBS_NODEFILE' not in pbs.event().env:
                       pbs.event().accept()

                   ...

                   pbs.event().job.release_nodes(keep_select=...)

NOTE: On Windows, where PBS_NODEFILE always appears in pbs.event().env, put the following at the top of the execjob_launch hook:

    if any("mom_open_demux.exe" in s for s in e.argv):
        e.accept()

When release_nodes() is called for a job that does not tolerate node failures, no nodes are released and mom_logs shows:

      "<jobid>: no nodes released as job does not tolerate node failures"

The following mom_logs message may also appear:

      "not all job updates to sister moms completed"

Seeing this log message means that a job can momentarily receive an error when doing tm_spawn or pbsdsh to a node that has not yet completed the nodes table update.

After nodes are released, mom_logs may also show messages such as:

      " stream <num> not found to job nodes"
      "im_eof, No error from addr <ipaddr>:<port> on stream <num>"

which correspond to the connection stream of a released mom host.

Interface 9: new hook event: execjob_resize
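
Following the same qmgr pattern used for the other hooks in this document, a hedged skeleton for registering a hook on this new event; the hook body is only an illustrative placeholder, and the availability of pbs.event().job in this event is an assumption here:

# cat resize.py
import pbs
e = pbs.event()
pbs.logmsg(pbs.LOG_DEBUG, "execjob_resize event for job %s" % e.job.id)
e.accept()

# qmgr -c "create hook resize event=execjob_resize"
# qmgr -c "import hook resize application/x-python default resize.py"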

Case of Reliable Job Startup:

In order to have a job start reliably, we need a queuejob hook that makes the job tolerate node failures by setting its 'tolerate_node_failures' attribute to 'job_start', adds extra chunks to the job's select specification using the increment_chunks() method on the job's Resource_List["select"] value, and saves the job's original select value into the builtin resource "site". We also need an execjob_launch hook that calls pbs.event().job.release_nodes() to prune the job's assigned resources back to the original select value.

NOTE: In the future, we would allow any custom resource to be created and used to save the 'select' value. Currently, however, custom resources populating Resource_List are not propagated from the server to the mom, and they need to be, since the mom hook will use the value.

First, introduce a queuejob hook:
% cat qjob.py
import pbs
e = pbs.event()
j = e.job
j.tolerate_node_failures = "job_start"

Then, save the current value of 'select' in the builtin resource "site":

e.job.Resource_List["site"] = str(e.job.Resource_List["select"])

Next, add extra chunks to the current select:

new_select = e.job.Resource_List["select"].increment_chunks(1)
e.job.Resource_List["select"] = new_select
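
Putting these three steps together, the complete qjob.py (just combining the fragments above) would be:

% cat qjob.py
import pbs

e = pbs.event()
j = e.job

# make the job tolerate node failures during startup
j.tolerate_node_failures = "job_start"

# save the user's original select request in the builtin resource "site"
j.Resource_List["site"] = str(j.Resource_List["select"])

# pad the select specification with extra chunks
j.Resource_List["select"] = j.Resource_List["select"].increment_chunks(1)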

Now instantiate the queuejob hook:
# qmgr -c "c h qjob event=queuejob"
# qmgr -c "i h qjob application/x-python default qjob.py"

Next, introduce an execjob_launch hook so that, before the job officially runs its program, the job's currently assigned resources are pruned to match the user's original 'select' request:

% cat launch.py
import pbs
e = pbs.event()

if 'PBS_NODEFILE' not in e.env:
    e.accept()

j = e.job
pj = j.release_nodes(keep_select=e.job.Resource_List["site"])

if pj is None:       # pruning the nodes was not successful
    j.rerun()        # rerun (requeue) the job
    e.reject("something went wrong pruning the job back to its original select request")

Instantiate the launch hook:

# qmgr -c "c h launch event=execjob_launch"
# qmgr -c "i h launch application/x-python default launch.py"


And say a job is of the form:


% cat jobr.scr
#PBS -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb"
#PBS -l place=scatter:excl

echo $PBS_NODEFILE
cat $PBS_NODEFILE
echo END
echo "HOSTNAME tests"
echo "pbsdsh -n 0 hostname"
pbsdsh -n 0 hostname
echo "pbsdsh -n 1 hostname"
pbsdsh -n 1 hostname
echo "pbsdsh -n 2 hostname"
pbsdsh -n 2 hostname
echo "PBS_NODEFILE tests"
for host in `cat $PBS_NODEFILE`
do
    echo "HOST=$host"
    echo "pbs_tmrsh $host hostname"
    pbs_tmrsh $host hostname
    echo "ssh $host pbs_attach -j $PBS_JOBID hostname"
    ssh $host pbs_attach -j $PBS_JOBID hostname
done


When the job first starts, it gets assigned 5 nodes, since the modified select specification causes 2 extra nodes to be assigned:

% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb)
Resource_List.mem = 11gb
Resource_List.ncpus = 9
Resource_List.nodect = 5
Resource_List.place = scatter:excl
Resource_List.select = ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.site = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

tolerate_node_failures = job_start

Suppose federer and sampras went down. Then, just before the job runs its program, the execjob_launch hook executes and prunes the job's node assignment back to the original select request, and the job detail now shows:

% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+lendl/0*2+agassi/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)
Resource_List.mem = 6gb
Resource_List.ncpus = 6
Resource_List.nodect = 3
Resource_List.place = scatter:excl
Resource_List.select = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

Resource_List.site = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

A snapshot of the job's output would show the pruned list of nodes:

/var/spool/PBS/aux/20.borg.pbspro.com <-- updated contents of $PBS_NODEFILE
borg.pbspro.com
lendl.pbspro.com
agassi.pbspro.com
END

HOSTNAME tests

pbsdsh -n 0 hostname
borg.pbspro.com
pbsdsh -n 1 hostname
lendl.pbspro.com
pbsdsh -n 2 hostname
agassi.pbspro.com

PBS_NODEFILE tests
HOST=borg.pbspro.com
pbs_tmrsh borg.pbspro.com hostname
borg.pbspro.com
ssh borg.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
borg.pbspro.com
HOST=lendl.pbspro.com
pbs_tmrsh lendl.pbspro.com hostname
lendl.pbspro.com
ssh lendl.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
lendl.pbspro.com
HOST=agassi.pbspro.com
pbs_tmrsh agassi.pbspro.com hostname
agassi.pbspro.com
ssh agassi.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
agassi.pbspro.com