File descriptor leak in PBS mom while deleting jobs

Description

  • I intermittently observe a file descriptor leak in PBS mom when deleting 2000 hung jobs with the qdel -Wforce command while job history is enabled.

  • The expected outcome is that all jobs are deleted without error. In the observed behavior, a few jobs are not deleted and qdel reports: "qdel: Request invalid for state of job 9141.pbspro-master".

  • The jobs that could not be deleted were in the 'Q' state with the comment "Not Running: PBS Error: Execution server rejected request".

  • On closer inspection, the jobs were not deleted because mom had hit the open file limit.

mom log snippet:
09/28/2017 23:26:29;0028;pbs_mom;Job;9141.pbspro-master;No Password Entry for User pbsuser
09/28/2017 23:26:29;0001;pbs_mom;Svr;pbs_mom;Too many open files (24) in job_save, error opening for full save

09/28/2017 23:26:29;0008;pbs_mom;Job;9139.pbspro-master;kill_job
09/28/2017 23:26:29;0001;pbs_mom;Svr;pbs_mom;Too many open files (24) in remtree, opendir: /var/spool/pbs/mom_priv/jobs/9139.pbspro-master.TK
09/28/2017 23:26:29;0100;pbs_mom;Req;;Type 18 request received from root@10.8.100.29:15001, sock=1
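The "(24)" in the log lines above is errno 24, EMFILE, which a process gets once it has used up its per-process descriptor limit. The following is a minimal sketch, not PBS code, of how pipes that are opened but never closed lead to that error. The function name and the lowered limit of 64 are assumptions chosen for illustration.

```python
import errno
import os
import resource


def leak_pipes_until_emfile(limit=64):
    """Open pipes without closing them until the descriptor limit is hit.

    Hypothetical helper for illustration only. Temporarily lowers the soft
    RLIMIT_NOFILE so the failure is reached quickly, then restores it.
    Returns (errno of the failure, number of pipes opened before it).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)  # the soft limit may not exceed the hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    leaked = []
    err = None
    try:
        while True:
            # Each call consumes two descriptors (read end and write end).
            leaked.append(os.pipe())
    except OSError as exc:
        err = exc.errno
    finally:
        # Release everything and restore the original limit.
        for read_end, write_end in leaked:
            os.close(read_end)
            os.close(write_end)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return err, len(leaked)


if __name__ == "__main__":
    err, pipes = leak_pipes_until_emfile()
    print(errno.errorcode.get(err), err, "after", pipes, "pipes")
```

Once the limit is reached, every later open or opendir call in the same process fails in the same way, which matches the job_save and remtree failures in the log.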

  • In this scenario, only a few jobs run at any one time. When a job runs, 2 pipes are set up to connect the job back to the main mom, so only a few pipes should be open. In the failed scenario, however, the lsof output for mom shows a large number of pipes.

  • I have attached the output of the lsof command to the ticket (lsof_mom.txt).
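The pipe count that lsof reports can also be obtained on Linux by reading /proc/<pid>/fd directly, which is convenient for sampling mom repeatedly while the jobs are being deleted. This is a rough sketch under that Linux-only assumption. The function name is hypothetical, and reading another user's process normally requires root.

```python
import os
from collections import Counter


def count_fd_kinds(pid):
    """Classify the open descriptors of `pid` by their /proc symlink target.

    Hypothetical helper for illustration only (Linux only). Pipes appear as
    "pipe:[inode]" and sockets as "socket:[inode]"; everything else is
    counted as "file".
    """
    fd_dir = f"/proc/{pid}/fd"
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        if target.startswith("pipe:"):
            kinds["pipe"] += 1
        elif target.startswith("socket:"):
            kinds["socket"] += 1
        else:
            kinds["file"] += 1
    return kinds


if __name__ == "__main__":
    # Inspect the current process; pass the pbs_mom PID to inspect mom.
    print(dict(count_fd_kinds(os.getpid())))
```

With only a few jobs running and 2 pipes per job, the "pipe" count should stay small. A count that keeps growing as jobs are deleted would be consistent with the leak described above.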

Acceptance Criteria

None

Status

Assignee

Unassigned

Reporter

zulekha mahalty

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Affects versions

18.1.1

Priority

Low