After a PTL test starts or restarts mom, the mom left behind fails to execute mom hooks

Description

With the latest PTL code, if a PTL test starts or restarts mom, the pbs_mom left behind after the test completes fails to execute mom hooks. It logs these messages:

09/20/2017 18:23:55;0001;pbs_mom;Svr;pbs_mom;run_hook, execv of /opt/pbs/bin/pbs_python resulted in nonzero exit status=1
09/20/2017 18:23:55;0008;pbs_mom;Job;13.corretja;Internal server error encountered. Skipping hook execjob_hook1

This happens after creating the hook with qmgr -c "create hook execjob_hook1 event=execjob_prologue,enabled=t".

When I debugged this, I found that when mom executes pbs_python <execjob_hook1 script>, pbs_python fails while trying to load $PBS_CONF_FILE, which is set to a /tmp/<filename> that PTL created and has since deleted. The workaround is to manually restart mom so that it reads the default /etc/pbs.conf file.
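The failure mode described above can be illustrated with a minimal sketch. This is not pbs_mom or PTL code: the child process below merely stands in for pbs_python, and the helper name run_with_conf and the temporary file are hypothetical. It assumes only that a process started with PBS_CONF_FILE pointing at a temporary file keeps passing that path to its children after the file is gone.

```python
import os
import subprocess
import sys
import tempfile


def run_with_conf(conf_path):
    """Run a child that opens $PBS_CONF_FILE and return its exit status.

    The child is a stand-in for pbs_python loading its configuration.
    """
    env = dict(os.environ, PBS_CONF_FILE=conf_path)
    child = "import os; open(os.environ['PBS_CONF_FILE']).close()"
    result = subprocess.run(
        [sys.executable, "-c", child],
        env=env,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# Stand-in for the temporary configuration file created by the test harness.
fd, tmp_conf = tempfile.mkstemp(prefix="PtlPbs")
os.close(fd)

rc_before = run_with_conf(tmp_conf)  # the file still exists
os.remove(tmp_conf)                  # the harness cleans up after the test
rc_after = run_with_conf(tmp_conf)   # the inherited path is now stale

print("exit status while file exists:", rc_before)
print("exit status after deletion:", rc_after)
```

The second run exits with a nonzero status because the path in the environment no longer exists, which is consistent with the "nonzero exit status" message mom logs for run_hook.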

For instance, in my PTL test script, I have:

    # now restart mom
    self.mom.start()

And this leads PTL to do:

2017-09-20 18:23:06,714 INFO running init script to start pbs mom on corretja.pbspro.com using /etc/pbs.conf init_cmd=['sudo', 'PBS_CONF_FILE=/tmp/PtlPbsIZ2gc8', '/opt/pbs/libexec/pbs_init.d', 'start']
2017-09-20 18:23:06,714 INFOCLI2 corretja: /tmp/PtlPbs3TL4zN

The pbs_mom left behind is still using the PBS_CONF_FILE set above, which has since been deleted.
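One way to confirm this on a running system is to inspect the environment the leftover pbs_mom was started with. The sketch below assumes a Linux host where /proc/<pid>/environ is available and readable (reading it for a root-owned pbs_mom usually requires root). The function names are hypothetical and not part of PTL or PBS.

```python
import os


def parse_environ(raw):
    """Parse the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def stale_conf_file(env):
    """Return the PBS_CONF_FILE path if it is set but missing on disk, else None."""
    path = env.get("PBS_CONF_FILE")
    if path and not os.path.exists(path):
        return path
    return None


def check_pid(pid):
    """Report a stale PBS_CONF_FILE for the given process ID (Linux only)."""
    with open("/proc/%d/environ" % pid, "rb") as f:
        return stale_conf_file(parse_environ(f.read()))
```

For example, calling check_pid with the process ID of the leftover pbs_mom would return the deleted /tmp path if the daemon is affected, and None if its PBS_CONF_FILE is unset or still present.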

Acceptance Criteria

None

Status

Assignee

Sanket Borle

Reporter

Al Bayucan

Severity

3-High

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Fix versions

Priority

Critical