PP-35, PP-729: PBS Failover - STONITH

STONITH - Shoot The Other Node In The Head

PBSPro Community Discussion: http://community.pbspro.org/t/pp-35-pp-729-pbs-failover-stonith/658

Interface 1: Support for calling a new external script named "stonith"

  • Visibility: Public
  • Change Control: UNSTABLE
  • Synopsis: The secondary server calls an external script named "stonith", if it exists, to bring down the primary server node before going active.
  • Details: In a failover setup, when the secondary server takes control of the PBS complex, a split-brain scenario can occur in certain cases, leaving both the primary and secondary data services running. Two active data services can make the PBS complex very unstable. The STONITH script gives admins a way to prevent this. When the secondary server decides to become active, based on multiple internal checks, it calls the external STONITH script (written by the admin). The primary server does not call STONITH when it becomes active; the admin has to bring the primary up again manually. The host name of the primary server, as specified in pbs.conf, is passed to the script as an argument. The script is expected to return 0 for success and non-zero for failure. On failure, the secondary server does not become active and waits 10 seconds before repeating the cycle.
    • Location: $PBS_HOME/server_priv/stonith
    • Script Permission: 750
    • Default: The script does not exist at the specified location.
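
To illustrate the contract described in Interface 1 (primary host name passed as the only argument, exit code 0 for success, non-zero for failure), a site's stonith script might look roughly like the minimal Python sketch below. It assumes the script may be any executable file. The fencing command /usr/local/sbin/site_power_off and the 60-second timeout are hypothetical placeholders, not part of PBS; each site would substitute its own power-control or fencing tool.

```python
#!/usr/bin/env python3
"""Illustrative STONITH script sketch; not shipped with PBS."""
import subprocess
import sys


def build_fence_command(primary_host):
    # Hypothetical site-specific command that powers off the given host.
    # Replace with whatever fencing tool the site actually uses.
    return ["/usr/local/sbin/site_power_off", primary_host]


def main(argv, run=subprocess.run):
    # The primary host name is expected as the single argument.
    if len(argv) != 2:
        print("usage: stonith <primary-hostname>", file=sys.stderr)
        return 1
    primary_host = argv[1]
    try:
        # The 60-second timeout is an assumption for this sketch.
        result = run(build_fence_command(primary_host), timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"fencing {primary_host} failed: {exc}", file=sys.stderr)
        return 1
    if result.returncode != 0:
        print(f"fencing {primary_host} failed with exit code "
              f"{result.returncode}", file=sys.stderr)
        return 1
    return 0  # 0 signals success

# When installed as the actual script, finish with:
#     sys.exit(main(sys.argv))
```

Failure text is written to STDERR so that it can be picked up by the error-message logging described in Interface 4. The `run` parameter exists only so the sketch can be exercised without a real fencing tool.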

Interface 2: New server log message: Executing STONITH script to bring down primary at <hostname>

  • Visibility: PBS Public
  • Change Control: Stable
  • Details: pbs_server log, Info
      • The above message is logged just before the STONITH script is executed.

Interface 3: New server log message: STONITH script execution failed

  • Visibility: PBS Public
  • Change Control: Stable
  • Details: pbs_server log, Critical
      • The above message is logged if the STONITH script returns failure or fails to execute.

Interface 4: New server log message: Error message returned from STONITH script: <errmsg>

  • Visibility: PBS Public
  • Change Control: Stable
  • Details: pbs_server log, Info
      • The above message is logged if the STONITH script wrote any failure messages to STDOUT or STDERR.

Interface 5: New server log message: Secondary will attempt taking over again

  • Visibility: PBS Public
  • Change Control: Stable
  • Details: pbs_server log, Info
      • If the STONITH script fails, the secondary will try again to become active; the above message is logged in that case.

Interface 6: New server log message: STONITH script executed successfully

  • Visibility: PBS Public
  • Change Control: Stable
  • Details: pbs_server log, Info
      • If the STONITH script returns success after execution, the above message is logged.

Interface 7: New server log message: Skipping STONITH

  • Visibility: PBS Public
  • Change Control: Stable
  • Details: pbs_server log, Info
      • If the STONITH script is not present in server_priv, the above message is logged every time the secondary goes active.
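
The log messages in Interfaces 2 through 7 can be read together as one control flow. The Python sketch below models that flow under stated assumptions; it is not the pbs_server implementation. The function names, the `log` callback, and the relative order of the messages within a failed attempt are hypothetical; only the message texts, their severities, and the 10-second wait come from this document.

```python
import os
import subprocess
import time

RETRY_DELAY_SECONDS = 10  # wait stated in the Interface 1 details


def try_stonith(script_path, primary_host, log,
                run=subprocess.run, exists=os.path.exists):
    """Return True if the secondary may go active, False if it must retry."""
    if not exists(script_path):
        log("Info", "Skipping STONITH")
        return True
    log("Info",
        f"Executing STONITH script to bring down primary at {primary_host}")
    try:
        # Capture STDOUT and STDERR together so either can be logged.
        result = run([script_path, primary_host], stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT, text=True)
        rc, output = result.returncode, (result.stdout or "").strip()
    except OSError:
        rc, output = -1, ""  # the script failed to execute at all
    if rc == 0:
        log("Info", "STONITH script executed successfully")
        return True
    log("Critical", "STONITH script execution failed")
    if output:
        log("Info", f"Error message returned from STONITH script: {output}")
    log("Info", "Secondary will attempt taking over again")
    return False


def take_over(script_path, primary_host, log, run=subprocess.run,
              sleep=time.sleep, exists=os.path.exists, max_attempts=None):
    """Repeat the cycle until STONITH succeeds or is skipped."""
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if try_stonith(script_path, primary_host, log, run, exists):
            return True
        sleep(RETRY_DELAY_SECONDS)
    return False
```

In this model, a missing script never blocks takeover, while a failing script holds the secondary inactive until a later attempt succeeds. The `max_attempts` parameter is an addition for illustration only; the text above describes the cycle as simply repeating.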

Interface 8: New server log message: pbs_status_db exit code <rc>

  • Visibility: PBS Public
  • Change Control: Stable
  • Details: pbs_server log, Info
    • While the server is initialising the connection information, it queries the database status; the above message logs the exit code of the pbs_status_db call.

Interface 9: New server log message: Dataservice connection failed due to timeout

  • Visibility: PBS Public
  • Change Control: Stable
  • Details: pbs_server log, Error
    • The above message is logged when establishing a connection with the data service fails due to a timeout.