I'm noticing that with the latest PTL code, if a PTL test starts or restarts mom, the pbs_mom left behind after the test completes fails to execute mom hooks. It gives these messages:
09/20/2017 18:23:55;0001;pbs_mom;Svr;pbs_mom;run_hook, execv of /opt/pbs/bin/pbs_python resulted in nonzero exit status=1
09/20/2017 18:23:55;0008;pbs_mom;Job;13.corretja;Internal server error encountered. Skipping hook execjob_hook1
This is after creating the hook with: qmgr -c "create hook execjob_hook1 event=execjob_prologue,enabled=t"
When I debugged this, I found that when mom executes pbs_python <execjob_hook1 script>, it fails while trying to load $PBS_CONF_FILE, which is set to a /tmp/<filename> that PTL created and has since deleted. The workaround is to manually restart mom so that it reads the default /etc/pbs.conf file.
For instance, in my hook script, I have:
now restart mom
And this leads PTL to do:
2017-09-20 18:23:06,714 INFO running init script to start pbs mom on corretja.pbspro.com using /etc/pbs.conf init_cmd=['sudo', 'PBS_CONF_FILE=/tmp/PtlPbsIZ2gc8', '/opt/pbs/libexec/pbs_init.d', 'start']
2017-09-20 18:23:06,714 INFOCLI2 corretja: /tmp/PtlPbs3TL4zN
The left-behind pbs_mom is still using the above PBS_CONF_FILE, which has since been deleted.
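To confirm that a left-behind pbs_mom is in this state, one option is to inspect the environment it was started with. The following is only a rough sketch, not part of PTL: the helper names (parse_environ, stale_conf_file, check_pid) are made up for illustration, and it assumes a Linux host where /proc/<pid>/environ is readable (usually as root). Note that this file reflects the environment at the time the process was started.

```python
import os


def parse_environ(raw):
    """Parse the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def stale_conf_file(env, exists=os.path.exists):
    """Return the PBS_CONF_FILE path if it is set but missing on disk, else None."""
    path = env.get("PBS_CONF_FILE")
    if path and not exists(path):
        return path
    return None


def check_pid(pid):
    """Hypothetical helper: check one process (e.g. pbs_mom) by PID. Linux only."""
    with open("/proc/%d/environ" % pid, "rb") as f:
        return stale_conf_file(parse_environ(f.read()))
```

Calling check_pid with the PID of the running pbs_mom would return the deleted /tmp path if mom was started with a PBS_CONF_FILE that no longer exists, and None otherwise.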