...

  • Visibility: Public
  • Change Control: Stable
  • Details:
    The execjob_prologue hook now recognizes the hook attribute value fail_action = "offline_vnodes". When this value is set, the vnodes managed by the executing mom are automatically offlined if the hook ends prematurely, either because of an unhandled exception or because its alarm expires.
  • Example:

    # cat prolo.py
    import pbs
    import time
    time.sleep(60) # long running hook

    # qmgr -c "create hook prolo event=execjob_prologue,alarm=10,fail_action=offline_vnodes"
    # qmgr -c "import hook prolo application/x-python default prolo.py"

    Given the following vnodes:
    # pbsnodes -av
    ricardo
    Mom = ricardo.pbspro.com
    Port = 15002
    ntype = PBS
    state = free
    pcpus = 4
    resources_available.arch = linux

    ricardo[1]
    Mom = ricardo.pbspro.com
    Port = 15002
    ntype = PBS
    state = free

    Submit job:

    % qsub job.scr
    <job-id>

    The job starts, but the prologue hook exceeds its 10-second alarm (it sleeps for 60 seconds), so the job fails and is requeued.

    pbsnodes now shows both vnodes offline:
    % pbsnodes -av
    ricardo
    Mom = ricardo.pbspro.com
    Port = 15002
    ntype = PBS
    state = offline
    pcpus = 4
    comment = offlined by hook 'prolo' due to hook error

    ricardo[1]
    Mom = ricardo.pbspro.com
    Port = 15002
    ntype = PBS
    state = offline
    resources_available.arch = linux
    comment = offlined by hook 'prolo' due to hook error
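
To confirm which vnodes a failed hook took offline, the pbsnodes listing above can be scanned for the offline state together with the hook's comment. The following is a minimal sketch, not part of PBS: the helper names `parse_pbsnodes` and `offlined_by_hook` are hypothetical, and it assumes the simplified "name line followed by `key = value` lines" layout shown in this example.

```python
# Hypothetical helpers (not part of PBS): scan pbsnodes-style text for vnodes
# that a given hook has offlined. Assumes the layout shown in this example.

def parse_pbsnodes(text):
    """Parse pbsnodes-style text into {vnode_name: {attribute: value}}."""
    vnodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line:
            if current is None:
                continue  # attribute line before any vnode name; skip it
            key, value = line.split(" = ", 1)
            vnodes[current][key.strip()] = value.strip()
        else:
            # A line without " = " is treated as the start of a new vnode.
            current = line
            vnodes[current] = {}
    return vnodes


def offlined_by_hook(vnodes, hook_name):
    """Return sorted names of vnodes that are offline with the hook's comment."""
    marker = "offlined by hook '%s'" % hook_name
    return sorted(
        name
        for name, attrs in vnodes.items()
        if "offline" in attrs.get("state", "").split(",")
        and marker in attrs.get("comment", "")
    )


# Sample text copied from the example output above.
SAMPLE = """\
ricardo
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = offline
pcpus = 4
comment = offlined by hook 'prolo' due to hook error

ricardo[1]
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = offline
resources_available.arch = linux
comment = offlined by hook 'prolo' due to hook error
"""

if __name__ == "__main__":
    for name in offlined_by_hook(parse_pbsnodes(SAMPLE), "prolo"):
        print(name)
```

In practice the text would come from running `pbsnodes -av` on the server host; the sample string here only mirrors the listing in this example.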

  • Log/Error messages:
    • Setting fail_action to "offline_vnodes" on a hook that has none of the mom hook events 'execjob_begin', 'exechost_startup', or (now) 'execjob_prologue' results in the following error message printed to STDERR:

      # qmgr -c "set hook <server_hook_name> fail_action=offline_vnodes"

      "Can't set hook fail_action value to 'offline_vnodes': hook event must contain at least one of execjob_begin, exechost_startup, or execjob_prologue"

      NOTE: The above message is also returned by pbs_geterrmsg() after a pbs_manager() call that operates on a hook's 'fail_action' attribute.
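
The event check described above can be illustrated with a small stand-alone sketch. This is not the server's implementation: the function name `check_fail_action` is hypothetical, and the sketch models only the "offline_vnodes" value discussed in this section, using the event names and message text quoted above.

```python
# Hypothetical illustration of the validation rule; not PBS source code.

# Mom hook events that allow fail_action=offline_vnodes, per the message above.
ALLOWED_EVENTS = ("execjob_begin", "exechost_startup", "execjob_prologue")


def check_fail_action(hook_events, fail_action):
    """Return None if the setting is acceptable, else the error string."""
    if fail_action != "offline_vnodes":
        return None  # other values are outside the scope of this sketch
    if any(event in ALLOWED_EVENTS for event in hook_events):
        return None
    return (
        "Can't set hook fail_action value to '%s': hook event must contain "
        "at least one of execjob_begin, exechost_startup, or execjob_prologue"
        % fail_action
    )


if __name__ == "__main__":
    # A hook with only a server-side event is rejected.
    print(check_fail_action(["queuejob"], "offline_vnodes"))
    # A hook that includes execjob_prologue is accepted (prints None).
    print(check_fail_action(["execjob_prologue"], "offline_vnodes"))
```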

...