PBS already provides several mom hook events that fire when a job first enters pbs_mom for execution (EXECJOB_BEGIN), when the job does its initial setup (EXECJOB_PROLOGUE), when the job is requested to terminate early (EXECJOB_PRETERM), when the job starts its cleanup (EXECJOB_EPILOGUE), and when the job finally leaves pbs_mom (EXECJOB_END). This design proposes one more: a hook event that fires when a job exits prematurely during startup. Such an event is especially useful when a site has written an execjob_begin (or execjob_prologue) hook that performs system setup, such as pre-creating files for a job, and relies on an execjob_epilogue or execjob_end hook to clean those files up after the job ends. If the job ends prematurely, the epilogue or end hook may not always execute. The new hook event, called EXECJOB_ABORT, handles this situation.
Forum: http://community.pbspro.org
An execjob_abort hook is executed by the sister mom when it encounters an error while attempting to join a job, where error conditions include:
errors during job setup
failure to create a cpuset
failure in the mkdir() call for a temp dir/file
failure in mkjobdir() call
communication errors with the primary mom
An execjob_abort hook is also executed by the sister mom on behalf of a job that the primary mom has requested to be aborted because the primary mom encountered problems starting the job.
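As a rough illustration of the cleanup case described above, the following sketch removes a per-job directory that an execjob_begin hook might have pre-created. The directory layout, the base path /var/tmp/site_scratch, and the helper names job_scratch_dir and cleanup_job_files are assumptions made for this example only; they are not part of PBS. The pbs module is importable only inside a hook, so the import is guarded.

```python
import os
import shutil

def job_scratch_dir(job_id, base_dir):
    """Return the hypothetical per-job directory a begin hook would create."""
    return os.path.join(base_dir, "job_" + job_id)

def cleanup_job_files(job_id, base_dir):
    """Remove the per-job directory if it exists; return True if removed."""
    path = job_scratch_dir(job_id, base_dir)
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False

try:
    import pbs  # available only when run by pbs_mom as a hook
except ImportError:
    pbs = None

if pbs is not None:
    e = pbs.event()
    # Assumed site-specific base path; adjust to the local setup.
    cleanup_job_files(e.job.id, "/var/tmp/site_scratch")
    e.accept()
```

Keeping the cleanup logic in plain functions lets a site exercise it outside of pbs_mom before attaching it to the execjob_abort event.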
A call to pbs.event().accept() means the hook code has executed cleanly, but there will be no changes to job attributes, resources, or vnodes in the vnode list. The following message would appear in the MoM log at log event class PBSEVENT_DEBUG2:
“execjob_abort request rejected by ”
If the execjob_abort hook script encounters an unexpected error causing an unhandled exception, the following messages would appear in the MoM logs at event class PBSEVENT_DEBUG2:
“execjob_end hook encountered an exception, request rejected”
“alarm call while running execjob_end hook '', request rejected”
import pbs
e = pbs.event()
# Run the cleanup only on sister moms, not on the primary (mother superior) mom.
if not e.job.in_ms_mom():
    # <do cleanup code>
    pass
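Because an unhandled exception in the hook script produces the "request rejected" messages listed earlier, a site may prefer to catch cleanup failures itself and log them. The sketch below shows one possible way to do that. The helpers safe_cleanup and site_cleanup are hypothetical names introduced for this example, and the use of job.in_ms_mom() and pbs.logmsg() reflects the PBS hook API as commonly documented, which should be confirmed against the installed version.

```python
def safe_cleanup(cleanup_fn, job_id):
    """Run cleanup_fn(job_id); return (ok, message) instead of raising."""
    try:
        cleanup_fn(job_id)
    except Exception as exc:  # deliberate catch-all so the hook never raises
        return False, "cleanup for %s failed: %s" % (job_id, exc)
    return True, "cleanup for %s done" % job_id

def site_cleanup(job_id):
    """Hypothetical placeholder for site-specific cleanup work."""
    pass

try:
    import pbs  # available only when run by pbs_mom as a hook
except ImportError:
    pbs = None

if pbs is not None:
    e = pbs.event()
    if not e.job.in_ms_mom():
        ok, msg = safe_cleanup(site_cleanup, e.job.id)
        pbs.logmsg(pbs.LOG_DEBUG, msg)
    e.accept()
```

Returning a status and message rather than raising keeps the decision about how to report a failed cleanup in one place.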
The error message that qmgr returns for an unrecognized hook event has changed to include the additional hook event:
# qmgr -c "set hook <hook_name> event = <bad_event>"
from:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize or "" for no event
to:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize,execjob_abort or "" for no event