Intermittent failures in test test_job_array_comment in pbs_job_array_comment.py due to race conditions between mom hook copy and job run

Description

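The test test_job_array_comment creates a mom hook that rejects and deletes subjobs. Sometimes there is a race between the hook creation and the job submission that follows it, which results in failures. When the subjob runs before the hook has been copied to the mom, it is not rejected, and the expected comment "Subjob failed" never appears.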

The test fails as follows:
2017-12-11 02:00:36,948 INFO expect on server x50-centos7: comment ~ Job Array Began at .* && job_state = B job 0[].x50-centos7
2017-12-11 02:00:37,450 INFOCLI x50-centos7: /opt/pbs/bin/qmgr -c set server scheduling=True
2017-12-11 02:00:37,992 INFOCLI x50-centos7: /opt/pbs/bin/qstat -f 0[].x50-centos7
2017-12-11 02:00:38,037 INFO expect on server x50-centos7: comment ~ Job Array Began at .* && job_state ~ B job 0[].x50-centos7 attempt: 2 ... OK
2017-12-11 02:00:38,038 INFOCLI x50-centos7: /opt/pbs/bin/qstat -x -f 0[0].x50-centos7
2017-12-11 02:00:38,581 INFO expect on server x50-centos7: no data for comment = Subjob failed job 0[0].x50-centos7
2017-12-11 02:00:38,581 INFOCLI x50-centos7: /opt/pbs/bin/qstat -x -f 0[0].x50-centos7
2017-12-11 02:00:39,124 INFO expect on server x50-centos7: no data for comment = Subjob failed job 0[0].x50-centos7 attempt: 2
2017-12-11 02:00:39,124 INFOCLI x50-centos7: /opt/pbs/bin/qstat -x -f 0[0].x50-centos7
2017-12-11 02:00:39,667 INFO expect on server x50-centos7: no data for comment = Subjob failed job 0[0].x50-centos7 attempt: 3
2017-12-11 02:00:39,668 INFOCLI x50-centos7: /opt/pbs/bin/qstat -x -f 0[0].x50-centos7
2017-12-11 02:00:40,211 INFO expect on server x50-centos7: no data for comment = Subjob failed job 0[0].x50-centos7 attempt: 4
2017-12-11 02:00:40,212 INFOCLI x50-centos7: /opt/pbs/bin/qstat -x -f 0[0].x50-centos7
2017-12-11 02:00:40,756 INFO expect on server x50-centos7: no data for comment = Subjob failed job 0[0].x50-centos7 attempt: 5
2017-12-11 02:00:40,756 INFOCLI x50-centos7: /opt/pbs/bin/qstat -x -f 0[0].x50-centos7
2017-12-11 02:00:40,906 INFO job: executable set to /bin/sleep with arguments: 100
2017-12-11 02:00:40,907 INFO expect on server x50-centos7: comment = Subjob failed job 0[0].x50-centos7 attempt: 6 got: comment = Job Array Began at Mon Dec 11 at 02:00
2017-12-11 02:00:41,408 INFOCLI x50-centos7: /opt/pbs/bin/qmgr -c set server scheduling=True
2017-12-11 02:00:42,062 INFOCLI x50-centos7: /opt/pbs/bin/qstat -x -f 0[0].x50-centos7
2017-12-11 02:00:42,107 INFO expect on server x50-centos7: comment = Subjob failed job 0[0].x50-centos7 attempt: 7 got: comment = Subjob finished
The server log from the same time window shows:
12/11/2017 02:00:37;0080;Server@x50-centos7;Node;0[0].x50-centos7;vnode x50-centos7's parent mom x50-centos7.pbspro.com:15002 has a pending copy hook or delete hook request

Suggested fix:
The test needs to add a sufficient delay so that the hook creation completes before the job is submitted. The test should also log_match for a message like "successfully sent hook file <filename> to <mom_hostname>" before proceeding further.
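
The wait-for-message idea can be sketched in plain Python as below. This is only an illustration of the polling pattern, not the test's actual code: the helper name wait_for_log_match, its parameters, and the log path passed to it are hypothetical, and only the message text comes from this ticket.

```python
import re
import time


def wait_for_log_match(log_path, pattern, timeout=30.0, interval=0.5):
    """Poll log_path until a line matches pattern or the timeout expires.

    Hypothetical helper for illustration. Returns the matching line
    (without its trailing newline), or None if nothing matched in time.
    """
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(log_path, "r", errors="replace") as log:
                for line in log:
                    if regex.search(line):
                        return line.rstrip("\n")
        except FileNotFoundError:
            # The log may not exist yet; keep polling until the deadline.
            pass
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


def hook_copy_confirmed(server_log_path, mom_hostname, timeout=30.0):
    """Return True once the server log reports the hook file was sent.

    The message format is taken from the ticket text; the filename part
    is matched loosely because it is not known in advance.
    """
    pattern = r"successfully sent hook file .+ to " + re.escape(mom_hostname)
    return wait_for_log_match(server_log_path, pattern, timeout) is not None
```

In the test, the job array would be submitted only after hook_copy_confirmed returns True. Within PTL, the framework's own log_match call would presumably take the place of this helper. A bounded poll like this avoids both a fixed sleep that is too short on slow hosts and an unbounded wait when the copy never happens.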

Acceptance Criteria

None

Status

Assignee

Latha Subramanian

Reporter

Latha Subramanian

Severity

3-High

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Fix versions

Priority

Medium