PBS Pro / PP-1083

Server should not ping the Scheduler indefinitely when the Scheduler is down

    Details

    • Type: RFE
    • Status: Open
    • Priority: Low
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Scheduler, Server
    • Labels:
      None
    • Severity:
      2-Medium
    • Sprint:
    • Story Points:
      1

      Description

      The Server keeps pinging the Scheduler indefinitely when the Scheduler is down.
      This results in unbounded server logging, with the following message repeated at the default log level:
      11/15/2017 11:59:30;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 11:59:32;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 11:59:34;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 11:59:36;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
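
      The repetition rate can be measured directly from the excerpt above. The following is a minimal sketch, not part of PBS Pro: it assumes each record is six semicolon-separated fields with the timestamp first and the message last, as the sample lines appear to be, and the helper names (parse_line, contact_failures, gaps_seconds) are hypothetical.

```python
from datetime import datetime

# The four records quoted in this report, copied verbatim.
SAMPLE_LOG = """\
11/15/2017 11:59:30;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 11:59:32;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 11:59:34;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
11/15/2017 11:59:36;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
"""


def parse_line(line):
    """Return (timestamp, message) for one record, or None if it does not parse."""
    # Assumption: six ';'-separated fields; the message may itself contain ';'.
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    try:
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, parts[5]


def contact_failures(text):
    """Collect the timestamps of every 'Could not contact Scheduler' record."""
    stamps = []
    for line in text.splitlines():
        record = parse_line(line)
        if record and "Could not contact Scheduler" in record[1]:
            stamps.append(record[0])
    return stamps


def gaps_seconds(stamps):
    """Seconds between consecutive failure records."""
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


stamps = contact_failures(SAMPLE_LOG)
print(len(stamps), "failures, gaps in seconds:", gaps_seconds(stamps))
```

      On the sample this reports one failure record roughly every two seconds, which is the rate at which the log grows while the Scheduler stays down.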

      The suggestion is that the Scheduler should report its health to the Server.
      Consider the scenario where the Scheduler runs on a different host: network issues alone can then result in unbounded logging.
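
      One possible shape for this request is sketched below, purely as an illustration and not as the PBS Pro implementation. It combines two ideas: a heartbeat record that the Scheduler refreshes, and a Server-side contact loop that backs off and gives up after a fixed number of attempts instead of retrying forever. All names and defaults (SchedulerHealth, contact_with_backoff, max_attempts, base_delay, max_delay) are hypothetical.

```python
class SchedulerHealth:
    """Tracks the last heartbeat the Scheduler sent (hypothetical helper)."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence tolerated
        self.last_seen = None       # no heartbeat received yet

    def heartbeat(self, now):
        """Record that the Scheduler reported in at time `now`."""
        self.last_seen = now

    def is_alive(self, now):
        """True only if a heartbeat arrived within the last `timeout` seconds."""
        return self.last_seen is not None and now - self.last_seen <= self.timeout


def contact_with_backoff(try_contact, log, sleep,
                         max_attempts=5, base_delay=2.0, max_delay=60.0):
    """Try to reach the Scheduler a bounded number of times.

    try_contact, log and sleep are injected callables so the loop can be
    exercised without a real Scheduler. Returns True on success, False once
    the attempt budget is spent.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if try_contact():
            return True
        log(f"Could not contact Scheduler (attempt {attempt}/{max_attempts})")
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)   # exponential backoff, capped
    log("Scheduler marked down; suspending contact attempts")
    return False


# Simulate a Scheduler that never answers: the loop ends on its own.
messages, delays = [], []
reached = contact_with_backoff(lambda: False, messages.append, delays.append)
print(reached, delays, len(messages))
```

      With a design along these lines, the Server would write a bounded number of records per outage and could resume contact once is_alive reports a fresh heartbeat.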

      The following simple scenario reproduces the problem:
      1) Bring down the Scheduler
      2) Submit a job or kick off a scheduling cycle
      3) Check the server logs for the endlessly repeated message

      [root@centos1 ~]# ps -ef | grep pbs
      root 60317 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_comm
      root 60340 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_mom
      root 60343 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_sched
      root 60604 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
      postgres 60659 1 0 11:34 ? 00:00:00 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
      postgres 60676 60659 0 11:34 ? 00:00:00 postgres: postgres pbs_datastore 192.168.150.101(35081) idle
      root 60677 1 0 11:34 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
      root 60698 8751 0 11:36 pts/1 00:00:00 grep --color=auto pbs
      [root@centos1 ~]# kill -9 60343
      [root@centos1 ~]# qsub -- /bin/sleep 100
      0.centos1
      [root@centos1 ~]# qstat -s

      centos1:
                                                                  Req'd  Req'd   Elap
      Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
      --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
      0.centos1       root     workq    STDIN          --   1   1     --    -- Q    --

      [root@centos1 functional]# tail -f /var/spool/pbs/server_logs/20171115
      11/15/2017 12:14:54;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:14:56;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:14:57;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:14:59;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:15:01;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:15:03;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:15:05;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:15:07;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:15:09;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:15:11;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      11/15/2017 12:15:13;0001;Server@centos1;Svr;Server@centos1;Operation now in progress (115) in contact_sched, Could not contact Scheduler
      ^C
      [root@centos1 functional]#

            People

            • Assignee:
              Unassigned
            • Reporter:
              Varun Sonkar (varunsonkar)
            • Votes:
              0
            • Watchers:
              4
