PBS already provides several mom hook events that fire at different points in a job's life on pbs_mom: when the job first enters pbs_mom for execution (EXECJOB_BEGIN), when the job does its initial setup (EXECJOB_PROLOGUE), when the job is asked to terminate early (EXECJOB_PRETERM), when the job starts its cleanup (EXECJOB_EPILOGUE), and when the job finally leaves pbs_mom (EXECJOB_END). This design proposes one more: a hook event that fires when a job exits prematurely during startup. Such an event is especially useful when a site has written an execjob_begin (or execjob_prologue) hook that performs system setup, such as pre-creating files for a job, and relies on an execjob_epilogue or execjob_end hook to clean those files up after the job ends. If the job ends prematurely, the epilogue or end hook may not always execute. The new hook event, called EXECJOB_ABORT, handles that situation.
Forum: http://community.pbspro.org/t/a-new-hook-event-execjob-abort/1460
Why add a new execjob_abort hook event instead of calling the existing execjob_end hook for all abort cases?
A new execjob_abort hook is introduced, rather than reusing the execjob_end hook, to preserve backward compatibility. Some sites already have execjob_end hooks in place and would be surprised to see those hooks called additionally when a sister mom fails to join the job, or when the primary mom fails to start a job.
An end hook's purpose is not always cleanup; it may simply print end-of-job statistics, which a site would not expect to see when a sister mom fails even to become part of the job.
An execjob_abort hook is executed by the sister mom when it encounters an error while attempting to join a job, where error conditions include:
errors during job setup
failure to create a cpuset
failure in mkdir() call for temp dir/file
failure in mkjobdir() call
communication errors with the primary mom
An execjob_abort hook is also executed by a sister mom on behalf of a job that the primary mom has asked to be aborted because the primary mom encountered problems starting the job.
A call to pbs.event().accept() means the hook code has executed cleanly, but this hook will not cause changes to job attributes, resources, or vnodes in the vnode list.
A call to pbs.event().reject() means the hook code was not able to fully accomplish its task. The following message would appear in the MoM log at log event class PBSEVENT_DEBUG2:
“execjob_abort request rejected by ”
If the execjob_abort hook script encounters an unexpected error causing an unhandled exception, the following messages would appear in the MoM logs at event class PBSEVENT_DEBUG2:
“execjob_abort hook encountered an exception, request rejected”
“alarm call while running execjob_abort hook '', request rejected”
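One way to keep a cleanup failure on the reject() path rather than the unhandled-exception path is to catch errors from the cleanup step explicitly. The following is a minimal sketch, not taken from the PBS sources: finish_hook and do_cleanup are hypothetical names, and the pbs import is guarded because that module is only available when the script runs as a hook under pbs_mom.

```python
try:
    import pbs  # provided by PBS only when running as a hook
except ImportError:
    pbs = None  # lets the helper below be exercised outside PBS


def finish_hook(event, cleanup):
    """Run cleanup(), then accept the event, or reject it with the error text.

    Catching Exception here turns a cleanup failure into an explicit
    reject() instead of an unhandled exception in the hook script.
    """
    try:
        cleanup()
    except Exception as err:
        event.reject("cleanup failed: %s" % err)
        return
    event.accept()


def do_cleanup():
    # Hypothetical placeholder: a site would remove whatever its
    # execjob_begin or execjob_prologue hook created for the job.
    pass


if pbs is not None:
    finish_hook(pbs.event(), do_cleanup)
```

With this shape, a failing cleanup step would be expected to produce the "request rejected by" log message rather than the exception message.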
import pbs

e = pbs.event()
# Clean up only on a sister mom, i.e. when this mom is not the primary mom.
if not e.job.in_ms_mom():
    pass  # <do cleanup code>
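As an illustration of what the cleanup step might contain, the sketch below removes a per-job scratch directory of the kind an execjob_begin hook could have pre-created. The directory layout (/var/tmp/pbs_site_jobs/<job id>) and the helper name remove_job_dir are assumptions made for this example, not part of PBS. The pbs import is guarded because the module only exists inside a running hook.

```python
import os
import shutil

try:
    import pbs  # provided by PBS only when running as a hook
except ImportError:
    pbs = None

SCRATCH_ROOT = "/var/tmp/pbs_site_jobs"  # hypothetical site-chosen location


def remove_job_dir(job_id, root=SCRATCH_ROOT):
    """Delete the per-job directory under root; return True if one was removed."""
    path = os.path.join(root, job_id)
    if not os.path.isdir(path):
        return False  # nothing was created, or it was already cleaned up
    shutil.rmtree(path)
    return True


if pbs is not None:
    e = pbs.event()
    remove_job_dir(e.job.id)
    e.accept()
```

The call to remove_job_dir would sit where the cleanup placeholder appears in the example above. Returning False for a missing directory makes the cleanup safe to run even when the job aborted before the setup hook created anything.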
The error message that qmgr returns for an unrecognized hook event has changed to include the additional hook event:
# qmgr -c "set hook <hook_name> event = <bad_event>"
from:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize or "" for no event
to:
invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize,execjob_abort or "" for no event
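The effect of the change can be pictured with a small sketch of the check. This is not the qmgr or server implementation, only an illustration that reproduces the message format quoted above; VALID_EVENTS and check_event are hypothetical names.

```python
# Event names as listed in the updated message, with execjob_abort appended.
VALID_EVENTS = (
    "queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,"
    "execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,"
    "execjob_end,exechost_periodic,execjob_launch,exechost_startup,"
    "execjob_attach,execjob_resize,execjob_abort"
).split(",")


def check_event(name):
    """Return None for a recognized event name, else the error text."""
    if name == "" or name in VALID_EVENTS:
        return None
    return ('invalid argument (%s) to event. Should be one or more of: '
            '%s or "" for no event' % (name, ",".join(VALID_EVENTS)))
```

Under these assumptions, check_event("execjob_abort") returns None, while an unknown name such as "yaya" yields the longer message that now ends with execjob_abort in the list.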