PP-832: PBSPro failover secondary server fails to continuously check whether it needs to start a scheduler locally


Forum Discussion

Issue Description:

Currently, in a failover setup, the secondary server checks whether it can communicate with the scheduler on the primary host when it takes over. This check is done only at take-over and is never repeated.

If it cannot communicate, the secondary server spawns a scheduler process on the secondary host; otherwise, it proceeds to use the scheduler on the primary host.

In the latter case, if the scheduler process on the primary host later stops communicating (because of a crash, the host going down, etc.), the secondary server has no scheduler process to communicate with and scheduling halts.
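The failure mode described above can be illustrated with a small toy model. This is only a sketch: the class and method names (`SecondaryServer`, `take_over`, `can_schedule`) are hypothetical and do not correspond to PBS Pro source code. It assumes reachability of the primary-host scheduler can be represented as a boolean.

```python
class SecondaryServer:
    """Toy model of the current take-over logic (hypothetical names)."""

    def __init__(self) -> None:
        self.local_scheduler_running = False
        self.uses_primary_scheduler = False

    def take_over(self, primary_scheduler_reachable: bool) -> None:
        # The reachability check happens here, once, and is never repeated.
        if primary_scheduler_reachable:
            self.uses_primary_scheduler = True
        else:
            self.local_scheduler_running = True

    def can_schedule(self, primary_scheduler_reachable: bool) -> bool:
        # A local scheduler keeps working regardless of the primary host.
        if self.local_scheduler_running:
            return True
        # Otherwise scheduling depends on the primary-host scheduler
        # still being reachable at this moment.
        return self.uses_primary_scheduler and primary_scheduler_reachable


server = SecondaryServer()
server.take_over(primary_scheduler_reachable=True)
scheduling_before = server.can_schedule(primary_scheduler_reachable=True)
# The primary-host scheduler later stops responding; nothing re-checks.
scheduling_after = server.can_schedule(primary_scheduler_reachable=False)
```

In this model `scheduling_before` is true and `scheduling_after` is false, because no code path ever starts a local scheduler after take-over.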

The solutions proposed below focus only on the default scheduler and do not cover the multi-sched scenario.

Solution:

Follow the steps below:

  1. At the time of taking over, the secondary server spawns a local scheduler process.
  2. When PBS on the primary comes up, the primary server should send SCH_CONFIGURE to the local scheduler.
  3. On receiving SCH_CONFIGURE, the scheduler re-reads the usage and configuration.
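The three steps can be sketched as a toy model. Everything except the command name `SCH_CONFIGURE` is a hypothetical name introduced for illustration, and the command is represented as a plain string rather than its real value. The text does not spell out which host's scheduler "local" refers to in step 2, so the sketch simply passes in whichever scheduler instance should receive the command.

```python
SCH_CONFIGURE = "SCH_CONFIGURE"  # command name from the text; real value not modeled


class Scheduler:
    """Hypothetical stand-in for a scheduler process."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.reload_count = 0

    def handle(self, command: str) -> None:
        # Step 3: on SCH_CONFIGURE, re-read usage and configuration.
        # The counter stands in for the actual re-read.
        if command == SCH_CONFIGURE:
            self.reload_count += 1


class SecondaryServer:
    def __init__(self, host: str) -> None:
        self.host = host
        self.scheduler = None

    def take_over(self) -> Scheduler:
        # Step 1: always spawn a local scheduler, with no reachability check.
        self.scheduler = Scheduler(self.host)
        return self.scheduler


class PrimaryServer:
    def come_up(self, scheduler: Scheduler) -> None:
        # Step 2: tell the scheduler to reconfigure itself.
        scheduler.handle(SCH_CONFIGURE)


secondary = SecondaryServer("secondary-host")
spawned = secondary.take_over()

target = Scheduler("primary-host")  # assumed recipient of SCH_CONFIGURE
PrimaryServer().come_up(target)
```

Because the secondary server no longer depends on a one-time reachability check, a scheduler is always available on the secondary host after take-over.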