
Server should not ping the Scheduler indefinitely when the Scheduler is down

Description

The Server keeps pinging the Scheduler indefinitely when the Scheduler is down.
This results in endless server logging of the following message, roughly every two seconds, at the default log level:
11/15/2017 11:59:30;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 11:59:32;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 11:59:34;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 11:59:36;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler

Suggestion: the Scheduler should report its health to the Server.
Consider the case where the Scheduler runs on a different host: network issues alone can then trigger the same endless logging.
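
One way to bound the retry and logging behavior, sketched in Python purely for illustration: back off exponentially between contact attempts and log only the first failure and each power-of-two failure after it. Nothing here is taken from the PBS source; `SchedulerContact`, `connect`, and `log` are hypothetical names, and the 2-second base delay is an assumption based on the spacing of the log entries in this report.

```python
class SchedulerContact:
    """Hypothetical helper: bounded retry delay plus throttled logging."""

    def __init__(self, connect, log, base_delay=2.0, max_delay=300.0):
        self.connect = connect        # assumed: callable returning True on success
        self.log = log                # assumed: callable taking one message string
        self.base_delay = base_delay  # assumed: matches the ~2 s spacing in the logs
        self.max_delay = max_delay    # assumed cap, chosen arbitrarily
        self.failures = 0             # consecutive failed attempts

    def next_delay(self):
        # Double the wait after each consecutive failure, capped at max_delay.
        exponent = max(self.failures - 1, 0)
        return min(self.base_delay * (2 ** exponent), self.max_delay)

    def try_contact(self):
        if self.connect():
            if self.failures:
                self.log("Scheduler reachable again after %d failed attempts"
                         % self.failures)
            self.failures = 0
            return True
        self.failures += 1
        # Log attempts 1, 2, 4, 8, ... only, so the log grows logarithmically.
        if (self.failures & (self.failures - 1)) == 0:
            self.log("Could not contact Scheduler (attempt %d), next retry in %.0fs"
                     % (self.failures, self.next_delay()))
        return False
```

The caller would wait `next_delay()` seconds before calling `try_contact()` again. A health report sent by the Scheduler, as suggested above, could replace the polling entirely; this sketch only limits the cost of polling.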

The following simple scenario reproduces the issue:
1) Bring down the scheduler
2) Submit a job or kick off a scheduling cycle
3) Check the server logs for the endlessly repeated message

[root@centos1 ~]# ps -ef | grep pbs
root 60317 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_comm
root 60340 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_mom
root 60343 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_sched
root 60604 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 60659 1 0 11:34 ? 00:00:00 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
postgres 60676 60659 0 11:34 ? 00:00:00 postgres: postgres pbs_datastore 192.168.150.101(35081) idle
root 60677 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
root 60698 8751 0 11:36 pts/1 00:00:00 grep --color=auto pbs
[root@centos1 ~]# kill -9 60343
[root@centos1 ~]# qsub -- /bin/sleep 100
0.centos1
[root@centos1 ~]# qstat -s

centos1:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
0.centos1       root     workq    STDIN          --   1   1     --    -- Q    --

[root@centos1 functional]# tail -f /var/spool/pbs/server_logs/20171115
11/15/2017 12:14:54;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:14:56;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:14:57;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:14:59;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:15:01;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:15:03;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:15:05;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:15:07;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:15:09;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:15:11;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 12:15:13;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
^C
[root@centos1 functional]#
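
To put a number on the repetition in a log excerpt like the one above, a small Python sketch can count the matching entries and compute the average gap between them. It assumes the semicolon-separated layout visible in this report (timestamp first, message text as the sixth field); `summarize_contact_failures` is a hypothetical helper, not part of PBS.

```python
from datetime import datetime


def summarize_contact_failures(lines, needle="Could not contact Scheduler"):
    """Return (count, mean_gap_seconds) for log lines whose message has needle."""
    stamps = []
    for line in lines:
        # Assumed layout: timestamp;code;server;type;object;message
        fields = line.strip().split(";", 5)
        if len(fields) < 6 or needle not in fields[5]:
            continue
        stamps.append(datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S"))
    if not stamps:
        return 0, None
    if len(stamps) == 1:
        return 1, None
    span = (stamps[-1] - stamps[0]).total_seconds()
    return len(stamps), span / (len(stamps) - 1)
```

Passing an open file handle for `/var/spool/pbs/server_logs/20171115` as `lines` would summarize the whole day's log.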

Acceptance Criteria

None

Status

Assignee

Unassigned

Reporter

Varun Sonkar

Severity

2-Medium

Story Points

1

Components

Priority

Low