Objective

Provide the ability to pad a job's node/resource request (i.e., request additional chunks of resources for the job), so that if some nodes fail, the job can still start. Any leftover nodes not needed by the job can be released back to the server.

Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649

Interface 1: select_reliable_startup job resource

Interface 2: New server accounting record: 's' for the start record of a job that was submitted with the select_reliable_startup request

Interface 3: sister_join_job_alarm mom config option

Interface 4: job_launch_delay mom config option
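For illustration only, both of these options would presumably be set in mom's config file; the "$option value" syntax and the specific values below (in seconds) are assumptions of this sketch:

% cat $PBS_HOME/mom_priv/config
$sister_join_job_alarm 30
$job_launch_delay 35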

Interface 5: pbs.event().vnode_list_fail[] hook parameter
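As a sketch of how an execjob hook might consume this parameter (only vnode_list_fail itself is defined by this interface; the logging shown is illustrative):

import pbs
e = pbs.event()
# Walk the vnodes that mom has flagged as failed for this job
for vn in e.vnode_list_fail.keys():
    pbs.logmsg(pbs.LOG_DEBUG, "vnode %s reported as failed" % (vn,))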

Interface 6: select_requested job attribute

Given: qsub -l select=3:ncpus=1+mem=5gb+ncpus=2:mem=2gb

        select_requested would return the following (notice the default chunk count of 1 prepended to the second and third chunks, and the default ncpus=1 added to the second chunk, since none was specified):

                3:ncpus=1+1:mem=5gb:ncpus=1+1:ncpus=2:mem=2gb
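A hook can consult this attribute directly; a minimal sketch (assuming an execjob_launch context, as in the examples further below):

import pbs
e = pbs.event()
j = e.job
# Log the user's original select request as seen by the hook
pbs.logmsg(pbs.LOG_DEBUG, "select_requested=%s" % (j.select_requested,))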

Interface 7: pbs.select.increment_chunks(increment, first_chunk=False)
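As an illustration of the method's effect (a sketch; the expected string mirrors the qstat output shown later in this document):

# Inside a queuejob hook:
import pbs
j = pbs.event().job
# Given select = "ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb",
# increment every chunk except the first by 1:
new_sel = j.Resource_List["select"].increment_chunks(1)
# expected result: ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
# (the first chunk is untouched because first_chunk defaults to False)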

Interface 8: pbs.event().job.release_nodes() method

          Given an execjob_launch hook, a hook writer can specify that nodes be released in such a way that the remaining assignment satisfies the user's original select request. For example:

            import pbs

            e = pbs.event()
            j = e.job

            if j.in_ms_mom():
                rel_nodes = j.release_nodes(keep_select=j.select_requested)
                if rel_nodes is None:   # error occurred
                    j.rerun()           # requeue the job
                    e.reject("Failed to prune job")

Given a queuejob hook that sets select_reliable_startup so that an extra node is added to the second and third chunks of the spec:

# First, introduce a queuejob hook:
% cat qjob.py

import pbs
e=pbs.event()

j = e.job
j.Resource_List["select_reliable_startup"] = j.Resource_List["select"].increment_chunks(1)


# qmgr -c "c h qjob event=queuejob"
# qmgr -c "i h qjob application/x-python default qjob.py"

# Second, introduce an execjob_launch hook so that, before the job officially runs its program, the job's currently assigned resources are pruned to match the user's original 'select' request:

% cat launch.py

import pbs

e = pbs.event()
j = e.job

relnodes = j.release_nodes(keep_select=j.select_requested)

if relnodes is None:    # pruning the nodes was not successful
    j.rerun()           # rerun (requeue) the job
    e.reject("something went wrong pruning the job back to its original select request")

# Otherwise, handle the nodes already detected as bad (see Interface 5).
# Offlining them, as done here, is one possible action (an assumption of this sketch).
for vn in e.vnode_list_fail.keys():
    pbs.logmsg(pbs.LOG_DEBUG, "launch: offlining failed vnode %s" % (vn,))
    e.vnode_list_fail[vn].state = pbs.ND_OFFLINE

# qmgr -c "c h launch event=execjob_launch"
# qmgr -c "i h launch application/x-python default launch.py"


And a job of the form:


% cat jobr.scr
#PBS -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb"
#PBS -l place=scatter:excl

echo $PBS_NODEFILE
cat $PBS_NODEFILE
echo END
echo "HOSTNAME tests"
echo "pbsdsh -n 0 hostname"
pbsdsh -n 0 hostname
echo "pbsdsh -n 1 hostname"
pbsdsh -n 1 hostname
echo "pbsdsh -n 2 hostname"
pbsdsh -n 2 hostname
echo "PBS_NODEFILE tests"
for host in `cat $PBS_NODEFILE`
do
    echo "HOST=$host"
    echo "pbs_tmrsh $host hostname"
    pbs_tmrsh $host hostname
    echo "ssh $host pbs_attach -j $PBS_JOBID hostname"
    ssh $host pbs_attach -j $PBS_JOBID hostname
done


When the job first starts, it gets assigned 5 nodes, since "select_reliable_startup" was set to add an extra node to each of the second and third chunks:

% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb)
Resource_List.mem = 11gb
Resource_List.ncpus = 9
Resource_List.nodect = 5
Resource_List.place = scatter:excl
Resource_List.select_reliable_startup = ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.select = 1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.select_requested = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

Suppose federer and sampras went down. Then, just before the job runs its program, the execjob_launch hook executes and prunes the job's node assignment back to the original select request, and the job detail now shows:

% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+lendl/0*2+agassi/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)
Resource_List.mem = 6gb
Resource_List.ncpus = 6
Resource_List.nodect = 3
Resource_List.place = scatter:excl
Resource_List.select = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb

A snapshot of the job's output would show the pruned list of nodes:

/var/spool/PBS/aux/20.borg.pbspro.com <-- updated contents of $PBS_NODEFILE
borg.pbspro.com
lendl.pbspro.com
agassi.pbspro.com
END

HOSTNAME tests

pbsdsh -n 0 hostname
borg.pbspro.com
pbsdsh -n 1 hostname
lendl.pbspro.com
pbsdsh -n 2 hostname
agassi.pbspro.com

PBS_NODEFILE tests
HOST=borg.pbspro.com
pbs_tmrsh borg.pbspro.com hostname
borg.pbspro.com
ssh borg.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
borg.pbspro.com
HOST=lendl.pbspro.com
pbs_tmrsh lendl.pbspro.com hostname
lendl.pbspro.com
ssh lendl.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
lendl.pbspro.com
HOST=agassi.pbspro.com
pbs_tmrsh agassi.pbspro.com hostname
agassi.pbspro.com
ssh agassi.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
agassi.pbspro.com