Provide the ability to pad a job's node resource request (i.e., request additional chunks of resources for the job), so that if some nodes fail, the job can still start. Any leftover nodes not needed by the job can be released back to the server.
Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649
04/07/2016 17:08:09;s;20.borg.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=jobr.scr queue=workq ctime=1460063202 qtime=1460063202 etime=1460063202 start=1460063203 exec_host=borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0 exec_vnode=(borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb) Resource_List.mem=6gb Resource_List.ncpus=6 Resource_List.nodect=3 Resource_List.place=scatter:excl Resource_List.select=ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb Resource_List.select_requested=1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb session=0 run_count=1
Before officially launching a tolerant job, mom will wait up to 'job_launch_delay' seconds for any report of failed sister moms, which is later used to determine the entries of the vnode_list_fail parameter in the execjob_launch hook (if any). The following DEBUG2 level log message will be shown: "Job;<job-id>;waiting up to <job_launch_delay_value> secs ($job_launch_delay) for mom hosts status and prologue hooks ack"
For example, a hook can offline the vnodes reported in pbs.event().vnode_list_fail:
import pbs
e = pbs.event()
for vn in e.vnode_list_fail.keys():
    v = e.vnode_list_fail[vn]
    pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
    v.state = pbs.ND_OFFLINE
";Server@borg;Node;<node name>;Updated vnode <node_name>'s attribute state=offline per mom hook request"
Given: qsub -l select=3:ncpus=1+mem=5gb+ncpus=2:mem=2gb
select_requested would return (note the default ncpus=1 added to the second chunk, where none was specified, and the default chunk count of 1): 3:ncpus=1+1:mem=5gb:ncpus=1+1:ncpus=2:mem=2gb
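For instance, a hook can log the normalized request alongside the job's currently assigned select (a minimal sketch; select_requested is read here as a job member, matching its use in the release_nodes() examples below):
import pbs
e = pbs.event()
j = e.job
# log the user's original (normalized) select next to the currently assigned select
pbs.logmsg(pbs.LOG_DEBUG, "select=%s select_requested=%s" % (j.Resource_List["select"], j.select_requested))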
Given pbs.event().job.Resource_List["select"]=ncpus=2:mem=2gb+ncpus=2:mem=2gb+2:ncpus=1:mem=1gb
new_select = pbs.event().job.Resource_List["select"].increment_chunks(2)   # first_chunk=False by default
where new_select is now: ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb
Otherwise, if 'first_chunk=True', the resulting new select also adds 2 increments to the first chunk:
new_select: 3:ncpus=2:mem=2gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=1gb
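As a sketch, a queuejob hook could use this to pad every chunk, including the first (illustrative values only, based on the increment_chunks() behavior described above):
import pbs
e = pbs.event()
j = e.job
# pad every chunk, including the first, by 2 extra nodes and store the padded
# spec in select_reliable_startup (the resource used later in this design)
j.Resource_List["select_reliable_startup"] = j.Resource_List["select"].increment_chunks(2, first_chunk=True)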
Examples:
Given an execjob_prologue hook, a hook writer can release a set of nodes from a job by doing:
import pbs
e = pbs.event()
j = e.job
if j.in_ms_mom():
    j.release_nodes(e.vnode_list_fail)
Given an execjob_launch hook, a hook writer can specify that nodes be released in such a way that the result satisfies the user's original select request:
import pbs
e = pbs.event()
j = e.job
if j.in_ms_mom():
    rel_nodes = j.release_nodes(keep_select=j.select_requested)
    if rel_nodes is None:  # error occurred
        j.rerun()  # requeue the job
        e.reject("Failed to prune job")
";Job;<jobid>;pruned from exec_vnode=<original value>"
";Job;<jobid>;pruned to exec_node=<new value>"
When mother superior fails to prune the currently assigned chunk resources, the following detailed mom_logs messages are shown at PBSEVENT_DEBUG log level unless otherwise noted:
"could not satisfy 1st select chunk (<resc1>=<val1> <resc2>=<val2>... <rescN>=valN) with first available chunk (<resc1>=<val1> <resc2>=<val2>...<rescN>=<valN>" when first chunk from the keep_select spec could not be satisfied
"could not satisfy the original select chunk (<resc1>=<val1> <resc2>=<val2> ...<rescN>=valN) with first available chunk <resc1>=<val1> <resc2>=<val2>..." when a secondary (sister) chunk from the keep_select spec could not be satisfied
"job node_list_fail: node <node_name1>" which shows what mom is consulting as the bad_nodes list. (consulted by mom in release_nodes() call).
"job node_list:_good node <node_name1>" which shows what mom is consulting as the good_nodes_list (consulted by mom in release_nodes() call).
When a sister mom has updated its internal nodes info, mom_logs on the sister host will show the following message at PBSEVENT_JOB level:
";pbs_mom;Job;<jobid>;updated nodes info"
If a sister mom receives a TM request but its nodes data have not been updated yet, the client would get an "error on spawn" message while doing tm_spawn.
Given a queuejob hook that sets select_reliable_startup so that another node is added to the second and third chunks of the spec:
# First, introduce a queue job hook:
% cat qjob.py
import pbs
e = pbs.event()
j = e.job
# pad the select spec: add 1 extra node to every chunk after the first
j.Resource_List["select_reliable_startup"] = j.Resource_List["select"].increment_chunks(1)
# qmgr -c "c h qjob event=queuejob"
# qmgr -c "i h qjob application/x-python default qjob.py"
# Second, introduce an execjob_launch hook so that before the job officially runs its program, the job's currently assigned resources are pruned to match the user's original 'select' request:
% cat launch.py
import pbs
e = pbs.event()
j = e.job
relnodes = j.release_nodes(keep_select=j.select_requested)
if relnodes is None:  # was not successful pruning the nodes
    j.rerun()  # rerun (requeue) the job
    e.reject("something went wrong pruning the job back to its original select request")
# Otherwise, offline the vnodes already detected as bad
for vn in e.vnode_list_fail.keys():
    v = e.vnode_list_fail[vn]
    pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
    v.state = pbs.ND_OFFLINE
# qmgr -c "c h launch event=execjob_launch"
# qmgr -c "i h launch application/x-python default launch.py"
And a job of the form:
% cat jobr.scr
#PBS -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb"
#PBS -l place=scatter:excl
echo $PBS_NODEFILE
cat $PBS_NODEFILE
echo END
echo "HOSTNAME tests"
echo "pbsdsh -n 0 hostname"
pbsdsh -n 0 hostname
echo "pbsdsh -n 1 hostname"
pbsdsh -n 1 hostname
echo "pbsdsh -n 2 hostname"
pbsdsh -n 2 hostname
echo "PBS_NODEFILE tests"
for host in `cat $PBS_NODEFILE`
do
echo "HOST=$host"
echo "pbs_tmrsh $host hostname"
pbs_tmrsh $host hostname
echo "ssh $host pbs_attach -j $PBS_JOBID hostname"
ssh $host pbs_attach -j $PBS_JOBID hostname
done
When the job first starts, it gets assigned 5 nodes, as 'select_reliable_startup' added 2 extra nodes (one each for the second and third chunks):
% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb)
Resource_List.mem = 11gb
Resource_List.ncpus = 9
Resource_List.nodect = 5
Resource_List.place = scatter:excl
Resource_List.select_reliable_startup = ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.select = 1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.select_requested = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb
Suppose federer and sampras went down. Then, just before the job runs its program, the execjob_launch hook executes and prunes the job's node assignment back to the original select request, and the job detail now shows:
% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+lendl/0*2+agassi/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)
Resource_List.mem = 6gb
Resource_List.ncpus = 6
Resource_List.nodect = 3
Resource_List.place = scatter:excl
Resource_List.select = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb
A snapshot of the job's output would show the pruned list of nodes:
/var/spool/PBS/aux/20.borg.pbspro.com <-- updated contents of $PBS_NODEFILE
borg.pbspro.com
lendl.pbspro.com
agassi.pbspro.com
END
HOSTNAME tests
pbsdsh -n 0 hostname
borg.pbspro.com
pbsdsh -n 1 hostname
lendl.pbspro.com
pbsdsh -n 2 hostname
agassi.pbspro.com
PBS_NODEFILE tests
HOST=borg.pbspro.com
pbs_tmrsh borg.pbspro.com hostname
borg.pbspro.com
ssh borg.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
borg.pbspro.com
HOST=lendl.pbspro.com
pbs_tmrsh lendl.pbspro.com hostname
lendl.pbspro.com
ssh lendl.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
lendl.pbspro.com
HOST=agassi.pbspro.com
pbs_tmrsh agassi.pbspro.com hostname
agassi.pbspro.com
ssh agassi.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
agassi.pbspro.com