

PP-832

Forum Discussion

Issue Description:

Currently, in a failover setup, the secondary server checks, at the time of taking over, whether it can communicate with the scheduler on the primary host. This check is done only at take-over and never again.

If it cannot communicate, the secondary server spawns a scheduler process on the secondary host; otherwise, it proceeds to use the scheduler on the primary host.

In the latter case, if the scheduler process on the primary host later stops communicating (due to a crash, the host going down, etc.), there is no scheduler process to communicate with and scheduling halts.

The solutions proposed below focus only on the default scheduler and do not cover the multi-sched scenario.

Solution 1:

Follow these steps:

  1. At the time of taking over, the secondary server checks whether it can communicate with the scheduler on the primary host.
    1. If it can, it proceeds to use the scheduler on the primary host.
    2. If not, it spawns a local scheduler process.
  2. If the scheduler on the primary host goes down while the secondary server is active, the secondary server spawns a local scheduler.
  3. The PBS init script should always restart the scheduler on the primary host.
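
The control flow of steps 1 and 2 can be sketched as a small Python model. This is only an illustration under stated assumptions, not PBS code: the class name `SecondaryServer` and the callbacks `probe_primary` and `spawn_local` are hypothetical stand-ins for whatever the server actually uses to test communication and start a scheduler. Step 3 (the init script restarting the scheduler on the primary host) is outside the model.

```python
from typing import Callable, Optional


class SecondaryServer:
    """Hypothetical model of the secondary server's scheduler choice (Solution 1)."""

    def __init__(self,
                 probe_primary: Callable[[], bool],
                 spawn_local: Callable[[], None]) -> None:
        # probe_primary: returns True if the scheduler on the primary host answers.
        # spawn_local: starts a scheduler process on the secondary host.
        # Both are assumptions made for this sketch.
        self.probe_primary = probe_primary
        self.spawn_local = spawn_local
        self.scheduler: Optional[str] = None  # "primary" or "local" after take-over

    def take_over(self) -> str:
        """Step 1: choose a scheduler at the time of taking over."""
        if self.probe_primary():
            self.scheduler = "primary"
        else:
            self.spawn_local()
            self.scheduler = "local"
        return self.scheduler

    def check_scheduler(self) -> str:
        """Step 2: while active, fall back to a local scheduler if the
        scheduler on the primary host stops responding."""
        if self.scheduler == "primary" and not self.probe_primary():
            self.spawn_local()
            self.scheduler = "local"
        return self.scheduler
```

In this model `check_scheduler` would be called repeatedly while the secondary server is active; how often, and by what mechanism, is left open here.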


Solution 2:

Follow these steps:

  1. At the time of taking over, the secondary server checks whether it can communicate with the scheduler on the primary host.
    1. If it can, it sends the SCH_QUIT signal to the scheduler on the primary host and then spawns a local scheduler process.
    2. If not, it spawns a local scheduler process.
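
In Solution 2 the secondary server always ends up with a local scheduler; the only difference between the two branches is whether the scheduler on the primary host is told to quit first. A minimal Python sketch of that flow follows. The function name and the three callbacks are hypothetical and chosen for illustration; only SCH_QUIT itself comes from the description above.

```python
from typing import Callable


def take_over_solution2(probe_primary: Callable[[], bool],
                        send_sch_quit: Callable[[], None],
                        spawn_local: Callable[[], None]) -> bool:
    """Hypothetical take-over routine for Solution 2.

    probe_primary: True if the scheduler on the primary host answers.
    send_sch_quit: asks the scheduler on the primary host to quit (SCH_QUIT).
    spawn_local:   starts a scheduler process on the secondary host.
    Returns whether the scheduler on the primary host was reachable.
    """
    reachable = probe_primary()
    if reachable:
        # Stop the scheduler on the primary host before starting a local one.
        send_sch_quit()
    # In both branches a local scheduler is spawned.
    spawn_local()
    return reachable
```

Because the scheduler is always local after take-over, this variant does not depend on the scheduler on the primary host staying up afterwards.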