...
- Visibility: Public
- Change Control: Stable
- Details:
The execjob_prologue hook now recognizes the hook attribute setting fail_action = "offline_vnodes". With this setting, the vnodes managed by the executing mom are automatically offlined when the hook ends prematurely due to an unhandled exception or when its alarm expires.
- Example:
# cat prolo.py
import pbs
import time
time.sleep(60) # long-running hook, exceeds the 10-second alarm set below
# qmgr -c "create hook prolo event=execjob_prologue,alarm=10,fail_action=offline_vnodes"
# qmgr -c "import hook prolo application/x-python default prolo.py"
Given the following vnodes:
# pbsnodes -av
ricardo
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = free
pcpus = 4
resources_available.arch = linux
ricardo[1]
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = free
Submit job:
% qsub job.scr
<job-id>
The job runs but fails because the prologue hook times out, and the job is requeued.
pbsnodes now shows the vnodes offline:
% pbsnodes -av
ricardo
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = offline
pcpus = 4
comment = offlined by hook 'prolo' due to hook error
ricardo[1]
Mom = ricardo.pbspro.com
Port = 15002
ntype = PBS
state = offline
resources_available.arch = linux
comment = offlined by hook 'prolo' due to hook error
- Log/Error messages:
Setting the fail_action value to "offline_vnodes" on a hook that has none of the mom hook events 'execjob_begin', 'exechost_startup', or now 'execjob_prologue' results in the following error message on STDERR:
# qmgr -c "set hook <server_hook_name> fail_action=offline_vnodes"
"Can't set hook fail_action value to 'offline_vnodes': hook event must contain at least one of execjob_begin, exechost_startup, or execjob_prologue"
NOTE: The above message is also returned by pbs_geterrmsg() after a pbs_manager() call that operates on a hook and its 'fail_action' attribute.
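The rule described above, together with the offlining behavior from the example, can be modeled in a few lines of Python. This is only an illustrative sketch, not PBS source code: the names check_fail_action, offline_vnodes, and OFFLINE_VNODES_EVENTS are hypothetical, and the vnode dictionaries are a simplification of what pbsnodes reports.

```python
# Illustrative model only; none of these names come from PBS itself.

# Hook events that permit fail_action=offline_vnodes, per the text above.
OFFLINE_VNODES_EVENTS = ("execjob_begin", "exechost_startup", "execjob_prologue")


def check_fail_action(hook_events, fail_action):
    """Return None if the setting is acceptable, otherwise the error text."""
    if fail_action != "offline_vnodes":
        return None  # other fail_action values are outside this sketch
    if any(event in OFFLINE_VNODES_EVENTS for event in hook_events):
        return None
    return (
        "Can't set hook fail_action value to '%s': hook event must contain "
        "at least one of execjob_begin, exechost_startup, or execjob_prologue"
        % fail_action
    )


def offline_vnodes(vnodes, hook_name):
    """Return a copy of vnodes with every vnode offlined and commented."""
    comment = "offlined by hook '%s' due to hook error" % hook_name
    return {
        name: dict(attrs, state="offline", comment=comment)
        for name, attrs in vnodes.items()
    }
```

For instance, check_fail_action(["execjob_end"], "offline_vnodes") returns the error text, while the same call with ["execjob_prologue"] returns None. Applying offline_vnodes to the two vnodes of the example with hook name "prolo" yields the state and comment values shown in the second pbsnodes listing.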
...