I am intermittently observing a file descriptor leak in PBS mom when force-deleting 2000 hung jobs with the qdel -Wforce command while job history is enabled.
The expected outcome is that all the jobs are deleted without any error, but the observed behavior is that a few jobs are not deleted and qdel reports the error "qdel: Request invalid for state of job 9141.pbspro-master".
The jobs that could not be deleted were in the 'Q' state with the comment "Not Running: PBS Error: Execution server rejected request".
On a closer look, these jobs were not deleted because pbs_mom had hit its open file limit, as the following mom log entries show:
09/28/2017 23:26:29;0028;pbs_mom;Job;9141.pbspro-master;No Password Entry for User pbsuser
09/28/2017 23:26:29;0001;pbs_mom;Svr;pbs_mom;Too many open files (24) in job_save, error opening for full save
09/28/2017 23:26:29;0001;pbs_mom;Svr;pbs_mom;Too many open files (24) in remtree, opendir: /var/spool/pbs/mom_priv/jobs/9139.pbspro-master.TK
09/28/2017 23:26:29;0100;pbs_mom;Req;;Type 18 request received from email@example.com:15001, sock=1
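For reference, the "(24)" in the log entries above is the errno value. A minimal Python sketch for decoding it and printing the open file limit is below. The helper names describe_errno and nofile_limits are made up for illustration, and the limit reported is that of the process running the script, not of pbs_mom itself; on Linux the limit of a running mom can be read from /proc/<pid>/limits instead.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, message) for an errno value such as 24."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def nofile_limits():
    """Return the (soft, hard) open-file limit of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    name, message = describe_errno(24)
    print(f"errno 24 = {name}: {message}")
    soft, hard = nofile_limits()
    print(f"RLIMIT_NOFILE soft={soft} hard={hard}")
```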
In this scenario only a few jobs are running at any one time. When a job runs, 2 pipes are set up to connect the job back to the main mom, so only a handful of pipes should be open. In the failing scenario, however, the lsof output for mom shows a large number of pipes.
I have attached the output of the lsof command to the ticket (lsof_mom.txt).
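To make the pipe count easier to compare between runs, here is a rough Python sketch that tallies the TYPE column of saved lsof output (pipes normally appear as FIFO). It assumes the file has the default lsof header line containing a TYPE column and that the columns before TYPE are never blank; the function name count_fd_types is hypothetical.

```python
from collections import Counter


def count_fd_types(lsof_text):
    """Count open files by the TYPE column of default lsof output."""
    lines = [ln for ln in lsof_text.splitlines() if ln.strip()]
    if not lines:
        return Counter()
    # Locate the TYPE column from the header instead of hard-coding its position.
    type_idx = lines[0].split().index("TYPE")
    counts = Counter()
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) > type_idx:
            counts[fields[type_idx]] += 1
    return counts


if __name__ == "__main__":
    # lsof_mom.txt is the attachment mentioned above.
    with open("lsof_mom.txt") as fh:
        for fd_type, n in count_fd_types(fh.read()).most_common():
            print(f"{fd_type:8} {n}")
```

If the FIFO count keeps growing across repeated qdel -Wforce runs while the number of running jobs stays small, that would be consistent with pipes not being closed.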