Execjob_abort hook

Objective

PBS already provides several mom hook events that respond when a job first enters pbs_mom for execution (EXECJOB_BEGIN), when the job does its initial setup (EXECJOB_PROLOGUE), when the job is asked to terminate early (EXECJOB_PRETERM), when the job starts its cleanup (EXECJOB_EPILOGUE), and when the job finally leaves pbs_mom (EXECJOB_END). This design proposes one more: a hook event that responds when a job exits prematurely during startup. Such an event is especially useful when a site has written an execjob_begin (or execjob_prologue) hook that performs system setup, such as pre-creating files for a job, and relies on an execjob_epilogue or execjob_end hook to clean those files up after the job ends. If the job ends prematurely, the epilogue or end hook may not always execute. The new hook event, EXECJOB_ABORT, handles this situation.

Forum: http://community.pbspro.org/t/a-new-hook-event-execjob-abort/1460

Why add a new execjob_abort hook event instead of calling the existing execjob_end hook for all abort cases?

  • The execjob_end hook is called by both the primary and sister moms at the end of a job, either after the job runs to completion or when the job stops after being interrupted by qdel or by a communication problem between the server and the sister moms.
  • A sister mom may not always execute an execjob_end hook on behalf of a job, especially when the sister mom has problems joining the job.

A new execjob_abort hook has been introduced, rather than reusing the execjob_end hook, to preserve backward compatibility. Some sites may already have execjob_end hooks in place and would be surprised to see those hooks called additionally when a sister mom fails to join a job or when the primary mom fails to start a job.

An end hook's purpose is not always cleanup; it may simply print end-of-job statistics, which a site would not expect to see when a sister mom fails to even become part of the job.

Interface 1: new mom hook event: execjob_abort

  • Visibility: Public
  • Change Control: Stable
  • Python constant: pbs.EXECJOB_ABORT
  • Event Parameters: 
    • pbs.event().job - This is a pbs.job object representing the job that is ending prematurely. This job object cannot be modified under this hook.
    • pbs.event().vnode_list[] - This is a dictionary of pbs.vnode objects, keyed by vnode name, listing the vnodes that are assigned to the job. The vnode objects in the vnode_list cannot be modified.
  • Restriction: The execjob_abort hook runs under the security context of the Admin user.
  • Details:
    • An execjob_abort hook is executed by the primary mom when a job has problems starting up and needs to be aborted. Sample failure conditions include:
      • execjob_prologue hook rejections (from the primary mom or a sister mom), whether explicit or implicit due to unhandled exceptions
      • execjob_launch hook rejections from the primary mom (whether explicit or implicit due to unhandled exceptions) before the top-level job script executes
      • errors in fork() calls when starting the child job process
      • failure to save task information on disk for later checkpoint recovery
      • communication pipe and socket errors
      • failure to restart the job from a checkpoint file image
      • failure to create a cpuset
      • failure to set up ptys for an interactive job
    • An execjob_abort hook is executed by a sister mom when it encounters an error while attempting to join a job. Error conditions include:
      • errors during job setup
      • failure to create a cpuset
      • failure in the mkdir() call for a temp dir/file
      • failure in the mkjobdir() call
      • problems obtaining the owning user's credential
      • communication errors with the primary mom

    • An execjob_abort hook is also executed by a sister mom on behalf of a job that the primary mom has asked to be aborted because the primary mom encountered problems starting the job.

    • A call to pbs.event().accept() means the hook code has executed cleanly. Accepting the event does not cause changes to job attributes, resources, or vnodes in the vnode list.

    • A call to pbs.event().reject() means the hook code was not able to fully accomplish its task. The following message appears in the MoM log at log event class PBSEVENT_DEBUG2:

       “execjob_abort request rejected by ”

    • If the execjob_abort hook script encounters an unexpected error that causes an unhandled exception, the following messages appear in the MoM logs at event class PBSEVENT_DEBUG2:

       “execjob_abort hook encountered an exception, request rejected”

      “alarm call while running execjob_abort hook '', request rejected”
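
The accept and reject behaviour described above can be sketched as a small hook body. This is a minimal sketch under stated assumptions, not a reference implementation: the scratch directory layout (SCRATCH_ROOT plus the job id) and the helper names remove_job_scratch and handle_abort are hypothetical. Only pbs.event(), its job and vnode_list parameters, and accept()/reject() come from the interface above. The pbs import is guarded so that the logic can also be exercised outside pbs_mom.

```python
import os
import shutil

# Hypothetical per-site location where an execjob_begin hook pre-created files.
SCRATCH_ROOT = "/var/tmp/site_scratch"


def remove_job_scratch(scratch_root, job_id):
    """Delete the job's scratch directory; return True if something was removed."""
    path = os.path.join(scratch_root, job_id)
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


def handle_abort(event, scratch_root=SCRATCH_ROOT):
    """Clean up after an aborted job, then accept or reject the event.

    The job and vnode objects are only read here; under execjob_abort
    they cannot be modified.
    """
    job_id = event.job.id
    vnode_names = sorted(event.vnode_list.keys())  # used only in the message
    try:
        remove_job_scratch(scratch_root, job_id)
    except OSError as err:
        # reject() signals that the hook could not finish its task.
        event.reject("cleanup failed for %s on %s: %s"
                     % (job_id, ",".join(vnode_names), err))
        return
    event.accept()


try:
    import pbs  # available only when pbs_mom runs this file as a hook
except ImportError:
    pbs = None

if pbs is not None:
    handle_abort(pbs.event())
```

Keeping the cleanup in a plain function that takes the event as an argument makes it possible to test the hook with a stand-in event object before installing it.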

  • Additional details:
    • A primary mom that fails to start a job now causes both an execjob_abort hook and an execjob_end hook to execute. In contrast, a sister mom that fails to join a job causes only an execjob_abort hook to execute.
    • When both the execjob_abort and execjob_end hooks execute, the execjob_abort hook is always called first.
    • Normally, a job is requeued after the abort hook executes, but there may be cases where the job exits completely: when an earlier execjob_begin or execjob_prologue hook instructed the job to be deleted via the pbs.event().job.delete() call, or when an earlier execjob_launch hook resulted in a rejection.
    • Normally, a job exits completely after the execjob_end hook runs. However, the job may be requeued if an earlier execjob_epilogue hook instructed the job to be requeued via the pbs.event().job.rerun() call.
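    • NOTE: A site that has cleanup code in an execjob_end hook can simply add the same cleanup code to an execjob_abort hook, but let it execute only when the hook is called by a sister mom (on the primary mom, the execjob_end hook still runs after the abort hook). This can be done as follows:

                     import pbs

                     e = pbs.event()

                     # in_ms_mom() is True only on the primary (mother superior) mom
                     if not e.job.in_ms_mom():
                         # <do cleanup code>
                         pass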
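
The rules in the additional details above can be summarised as a small decision model. This is purely illustrative: hooks_run_on_startup_failure and job_outcome_after_abort are hypothetical helper names, not part of PBS, and they encode only the cases stated in this design (a primary mom failing to start a job, a sister mom failing to join one, and the requeue-versus-exit rule for the abort hook).

```python
def hooks_run_on_startup_failure(is_primary_mom):
    """Return the hook events executed, in order, when job startup fails."""
    if is_primary_mom:
        # Primary mom failed to start the job: abort hook first, then end hook.
        return ["execjob_abort", "execjob_end"]
    # Sister mom failed to join the job: only the abort hook runs.
    return ["execjob_abort"]


def job_outcome_after_abort(delete_requested=False, launch_rejected=False):
    """Return 'exit' or 'requeue' for a job whose startup was aborted.

    delete_requested: an earlier execjob_begin or execjob_prologue hook
                      called pbs.event().job.delete().
    launch_rejected:  an earlier execjob_launch hook rejected the job.
    """
    if delete_requested or launch_rejected:
        return "exit"
    return "requeue"
```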

                   

  • New qmgr output:
    • The returned error message from qmgr upon seeing an unrecognized hook event has changed (due to this additional hook event):

      # qmgr -c "set hook <hook_name> event = <bad_event>"

      from:
      invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize or "" for no event

      to:
      invalid argument (yaya) to event. Should be one or more of: queuejob,modifyjob,resvsub,movejob,runjob,provision,periodic,resv_end,execjob_begin,execjob_prologue,execjob_epilogue,execjob_preterm,execjob_end,exechost_periodic,execjob_launch,exechost_startup,execjob_attach,execjob_resize,execjob_abort or "" for no event

  • External dependency:
    • The pbs_cgroups hook will be modified to add an execjob_abort handler that calls the cleanup code used by the execjob_end handler.





