Provide the ability to pad a job's node resources request (i.e., request additional chunks of resources for the job) so that, if some nodes fail, the job can still start. Any leftover nodes not needed by the job can be released back to the server.
Forum: http://community.pbspro.org/t/pp-928-reliable-job-startup/649
When set to 'none', or if the attribute is unset, no node failures are tolerated (the default behavior).
qsub -W tolerate_node_failures="all" <job_script>
qalter -W tolerate_node_failures="job_start" <jobid>
# cat qjob.py
import pbs
e=pbs.event()
e.job.tolerate_node_failures = "all"
# qmgr -c "create hook qjob event=queuejob"
# qmgr -c "import hook application/x-python default qjob.py"
% qsub job.scr
23.borg
% qstat -f 23
...
tolerate_node_failures = all
04/07/2016 17:08:09;s;20.corretja.pbspro.com;user=bayucan group=users project=_pbs_project_default jobname=jobr.scr queue=workq ctime=1460063202 qtime=1460063202 etime=1460063202 start=1460063203 exec_host=corretja/0*3+lendl/0*2+nadal/0 exec_vnode=(corretja:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(nadal:ncpus=1:mem=3145728kb) Resource_List.mem=6291456kb Resource_List.ncpus=6 Resource_List.nodect=3 Resource_List.place=scatter:excl Resource_List.select=1:ncpus=3:mem=1048576kb+1:ncpus=2:mem=2097152kb+1:ncpus=1:mem=3145728kb Resource_List.site=ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb resource_assigned.mem=24gb resource_assigned.ncpus=9
import pbs
e = pbs.event()
# offline each vnode that reported a failure
for vn in e.vnode_list_fail:
    v = e.vnode_list_fail[vn]
    pbs.logmsg(pbs.LOG_DEBUG, "offlining %s" % (vn,))
    v.state = pbs.ND_OFFLINE
If, after some time, a node's host comes back with an acknowledgement of successful prologue hook execution, the primary mom adds the host back to the healthy list.
Set a job's Hold_Types in case the hook script rejects the execjob_launch event:
pbs.event().job.Hold_Types = pbs.hold_types('s')
Set a vnode's state to offline:
pbs.event().vnode_list[<node_name>].state = pbs.ND_OFFLINE
In previous versions of PBS, when a job or vnode attribute/resource was set in an execjob_launch hook, the hook rejected the request and returned the following message:
"Can only set progname, argv, env event parameters under execjob_launch hook"
Now, setting vnode and job attributes is allowed and no longer gives the above message. If something else gets set in the hook, such as a server attribute, the following DEBUG2 level mom_logs message is produced instead:
"Can only set progname, argv, env event parameters as well as job, resource, vnode under execjob_launch hook."
Given:
sel=pbs.select("ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")
Calling sel.increment_chunks(2) would return a string:
"1:ncpus=3:mem=1gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=3gb"
Calling sel.increment_chunks("3") would return a string:
"1:ncpus=3:mem=1gb+4:ncpus=2:mem=2gb+5:ncpus=1:mem=3gb"
Calling sel.increment_chunks("23.5%"), would return a pbs.select value mapping to:
"1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"
where the first chunk, being a single chunk, is left as is, and the second and third chunks are increased by 23.5%, resulting in 1.24 rounded up to 2 and 2.47 rounded up to 3.
Calling sel.increment_chunks({0: 0, 1: 4, 2: "50%"}), would return a pbs.select value mapping to:
"1:ncpus=3:mem=1gb+5:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"
where chunk 1 gets no increase (0), chunk 2 gets 4 additional chunks, and chunk 3 gets a 50% increase, resulting in 3.
Given:
sel=pbs.select("5:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")
Then calling sel.increment_chunks("50%") or sel.increment_chunks({0: "50%", 1: "50%", 2: "50%"}) would return a pbs.select value mapping to:
"7:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb"
For the first chunk, the initial single chunk "1:ncpus=3:mem=1gb" is left as is, the 50% increase is applied to the remaining "4:ncpus=3:mem=1gb" (giving 6), and the single chunk is added back to make 7; chunks 2 and 3 are increased to 2 and 3, respectively.
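To summarize the forms of the argument, here is a sketch of how these calls might look inside a hook; the pbs.logmsg() calls are only there to display the results, which are noted in the comments from the examples above:
import pbs
sel = pbs.select("ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb")
# integer increment -> 1:ncpus=3:mem=1gb+3:ncpus=2:mem=2gb+4:ncpus=1:mem=3gb
pbs.logmsg(pbs.LOG_DEBUG, "int: %s" % (sel.increment_chunks(2),))
# numeric string increment -> 1:ncpus=3:mem=1gb+4:ncpus=2:mem=2gb+5:ncpus=1:mem=3gb
pbs.logmsg(pbs.LOG_DEBUG, "str: %s" % (sel.increment_chunks("3"),))
# percentage increment, rounded up per chunk -> 1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb
pbs.logmsg(pbs.LOG_DEBUG, "pct: %s" % (sel.increment_chunks("23.5%"),))
# per-chunk dictionary keyed by chunk index -> 1:ncpus=3:mem=1gb+5:ncpus=2:mem=2gb+3:ncpus=1:mem=3gb
pbs.logmsg(pbs.LOG_DEBUG, "dict: %s" % (sel.increment_chunks({0: 0, 1: 4, 2: "50%"}),))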
if 'PBS_NODEFILE' not in pbs.event().env:
    pbs.event().accept()
...
pbs.event().job.release_nodes(keep_select=...)
NOTE: On Windows, where PBS_NODEFILE would always appear in pbs.event().env, you need to put the following at the top of the execjob_launch hook:
if any("mom_open_demux.exe") in s for s in e.argv):
e.accept()
"<jobid>: no nodes released as job does not tolerate node failures"
Examples:
Given an execjob_prologue hook, a hook writer can release a set of nodes from a job by doing:
import pbs
e = pbs.event()
pj = e.job.release_nodes(keep_select="ncpus=2:mem=2gb+ncpus=2:mem=2gb+ncpus=1:mem=1gb")
if pj is not None:
    pbs.logmsg(pbs.LOG_DEBUG, "pj.exec_vnode=%s" % (pj.exec_vnode,))
else:  # returned None job object: hold the job, requeue it, and reject the hook event
    e.job.Hold_Types = pbs.hold_types("s")
    e.job.rerun()
    e.reject("unsuccessful at LAUNCH")
";Job;<jobid>;pruned from exec_vnode=<original value>"
";Job;<jobid>;pruned to exec_vnode=<new value>"
When a multinode job's assigned resources have been modified, the primary mom does a quick 5-second wait for acknowledgements from the sister moms that they have updated their nodes table. If not all acknowledgements are received by the primary mom during that 5-second wait, the following DEBUG2 level mom_logs message appears:
"not all job updates to sister moms completed"
Seeing this log message means that a job can momentarily receive an error when doing tm_spawn or pbsdsh to a node that has not yet completed the nodes table update.
"could not satisfy select chunk (<resc1>=<val1> <resc2>=<val2> ...<rescN>=valN)
" stream <num> not found to job nodes"
"im_eof, No error from addr <ipaddr>:<port> on stream <num>
which corresponds to the connection stream of a released mom host.
If the pbs_cgroups hook is executing in response to an execjob_resize event and the hook calls pbs.event().reject(<message>), encounters an exception, or terminates due to an alarm call, the following DEBUG2 mom_logs message results; the job is aborted on the mom side and requeued/rerun on the server side:
"execjob_resize" request rejected by 'pbs_cgroups'
<message>
The error message returned by qmgr upon seeing an unrecognized hook event has changed (due to this additional hook event):
# qmgr -c "set hook <hook_name> event = <bad_event>"
from:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach or "" for no event
to:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize or "" for no event
In order to have a job start reliably, we need a queuejob hook that makes the job tolerate node failures by setting the 'tolerate_node_failures' attribute to 'job_start', adds extra chunks to the job's select specification using the pbs.event().job.Resource_List["select"].increment_chunks() method, and saves the job's original select value in the builtin resource "site". We also need an execjob_launch hook that calls pbs.event().job.release_nodes() to prune the job's select value back to the original.
NOTE: In the future, we may allow any custom resource to be created and used to save the 'select' value. Currently, however, custom resources populating Resource_List are not propagated from the server to the mom, and they need to be, since the mom hook uses the value.
First, introduce a queuejob hook:
% cat qjob.py
import pbs
e=pbs.event()
j = e.job
j.tolerate_node_failures = "job_start"
Then, save the current value of 'select' in the builtin resource "site":
e.job.Resource_List["site"] = str(e.job.Resource_List["select"])
Next, add extra chunks to the current select:
new_select = e.job.Resource_List["select"].increment_chunks(1)
e.job.Resource_List["select"] = new_select
Now instantiate the queuejob hook:
# qmgr -c "c h qjob event=queuejob"
# qmgr -c "i h qjob application/x-python default qjob.py"
Next, introduce an execjob_launch hook so that, before the job officially runs its program, the job's currently assigned resources are pruned to match the user's original 'select' request:
% cat launch.py
import pbs
e=pbs.event()
if 'PBS_NODEFILE' not in e.env:
    e.accept()
j = e.job
pj = j.release_nodes(keep_select=e.job.Resource_List["site"])
if pj is None:  # was not successful pruning the nodes
    j.rerun()   # rerun (requeue) the job
    e.reject("something went wrong pruning the job back to its original select request")
Instantiate the launch hook:
# qmgr -c "c h launch event=execjob_launch"
# qmgr -c "i h launch application/x-python default launch.py"
And say a job is of the form:
% cat jobr.scr
#PBS -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb"
#PBS -l place=scatter:excl
echo $PBS_NODEFILE
cat $PBS_NODEFILE
echo END
echo "HOSTNAME tests"
echo "pbsdsh -n 0 hostname"
pbsdsh -n 0 hostname
echo "pbsdsh -n 1 hostname"
pbsdsh -n 1 hostname
echo "pbsdsh -n 2 hostname"
pbsdsh -n 2 hostname
echo "PBS_NODEFILE tests"
for host in `cat $PBS_NODEFILE`
do
    echo "HOST=$host"
    echo "pbs_tmrsh $host hostname"
    pbs_tmrsh $host hostname
    echo "ssh $host pbs_attach -j $PBS_JOBID hostname"
    ssh $host pbs_attach -j $PBS_JOBID hostname
done
When the job first starts, it gets assigned 5 nodes, since the select specification was modified by the queuejob hook, causing 2 extra nodes to be assigned:
% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+federer/0*2+lendl/0*2+agassi/0+sampras/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(federer:ncpus=2:mem=2097152kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)+(sampras:ncpus=1:mem=3145728kb)
Resource_List.mem = 11gb
Resource_List.ncpus = 9
Resource_List.nodect = 5
Resource_List.place = scatter:excl
Resource_List.select = ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
Resource_List.site = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb
tolerate_node_failures = job_start
Suppose federer and sampras went down. Then, just before the job runs its program, the execjob_launch hook executes and prunes the job's node assignment back to the original select request, and the job detail now shows:
% qstat -f 20
Job Id: 20.borg.pbspro.com
...
exec_host = borg/0*3+lendl/0*2+agassi/0
exec_vnode = (borg:ncpus=3:mem=1048576kb)+(lendl:ncpus=2:mem=2097152kb)+(agassi:ncpus=1:mem=3145728kb)
Resource_List.mem = 6gb
Resource_List.ncpus = 6
Resource_List.nodect = 3
Resource_List.place = scatter:excl
Resource_List.select = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb
Resource_List.site = 1:ncpus=3:mem=1gb+1:ncpus=2:mem=2gb+1:ncpus=1:mem=3gb
A snapshot of the job's output would show the pruned list of nodes:
/var/spool/PBS/aux/20.borg.pbspro.com <-- updated contents of $PBS_NODEFILE
borg.pbspro.com
lendl.pbspro.com
agassi.pbspro.com
END
HOSTNAME tests
pbsdsh -n 0 hostname
borg.pbspro.com
pbsdsh -n 1 hostname
lendl.pbspro.com
pbsdsh -n 2 hostname
agassi.pbspro.com
PBS_NODEFILE tests
HOST=borg.pbspro.com
pbs_tmrsh borg.pbspro.com hostname
borg.pbspro.com
ssh borg.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
borg.pbspro.com
HOST=lendl.pbspro.com
pbs_tmrsh lendl.pbspro.com hostname
lendl.pbspro.com
ssh lendl.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
lendl.pbspro.com
HOST=agassi.pbspro.com
pbs_tmrsh agassi.pbspro.com hostname
agassi.pbspro.com
ssh agassi.pbspro.com pbs_attach -j 20.borg.pbspro.com hostname
agassi.pbspro.com