============ PRIMARY ============
[siddhs@x50-centos7 ~]$ cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_SERVER=x50-centos7
PBS_PRIMARY_SERVER=x50-centos7.pbspro.com
PBS_SECONDARY_SERVER=x40-64-centos64.pbspro.com
PBS_MOM_HOME=/r/failover/siddhs_stonith
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_HOME=/r/failover/siddhs_stonith
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

============ SECONDARY ============
[siddhs@x40-64-centos64 ~]$ cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_SERVER=x50-centos7
PBS_PRIMARY_SERVER=x50-centos7.pbspro.com
PBS_SECONDARY_SERVER=x40-64-centos64.pbspro.com
PBS_MOM_HOME=/r/failover/siddhs_stonith
PBS_START_SERVER=1
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=0
PBS_HOME=/r/failover/siddhs_stonith
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

==============================================================================================================================
=============================================== CASE 1: NO STONITH SCRIPT FOUND ==============================================
==============================================================================================================================
[siddhs@x50-centos7 ~]$ ls -lrt /r/failover/siddhs_stonith/
total 14
-rw-r--r--.  1 root    root  19 Sep 19 01:18 pbs_environment
drwxr-xr-x.  2 root    root 512 Sep 19 01:18 aux
drwx------.  2 root    root 512 Sep 19 01:18 checkpoint
drwxrwxrwt.  2 root    root 512 Sep 19 01:18 undelivered
drwxr-xr-x.  2 root    root 512 Sep 19 01:19 server_logs
-rw-r--r--.  1 root    root  22 Sep 19 01:19 pbs_version
drwxr-xr-x.  2 root    root 512 Sep 19 01:19 comm_logs
drwxr-xr-x.  2 root    root 512 Sep 19 01:19 mom_logs
drwxr-x--x.  4 root    root 512 Sep 19 01:19 mom_priv
drwxr-xr-x.  2 root    root 512 Sep 19 01:19 sched_logs
drwxr-x---.  2 root    root 512 Sep 19 01:19 sched_priv
drwxr-x---.  7 root    root 512 Sep 19 02:17 server_priv
drwx------. 16 pbsdata root 512 Sep 19 02:17 datastore
drwxrwxrwt.  2 root    root 512 Sep 19 02:17 spool
[siddhs@x50-centos7 ~]$ ls -lrt /r/failover/siddhs_stonith/stonith
ls: cannot access /r/failover/siddhs_stonith/stonith: No such file or directory
[siddhs@x50-centos7 ~]$ ps -eaf | grep pbs
root      1531     1  0 Sep16 ?        00:00:09 /opt/execPBS/pbs/sbin/pbs_mom
root      2199     1  0 02:21 ?        00:00:00 /opt/pbs/sbin/pbs_comm
root      2229     1  0 02:21 ?        00:00:00 /opt/pbs/sbin/pbs_mom
root      2243     1  0 02:21 ?        00:00:00 /opt/pbs/sbin/pbs_sched
root      2374     1  0 02:21 ?        00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
pbsdata   2395     1  0 02:21 ?        00:00:00 /opt/pbs/pgsql/bin/postgres -D /r/failover/siddhs_stonith/datastore -p 15007
pbsdata   2402  2395  0 02:21 ?        00:00:00 postgres: logger process
pbsdata   2404  2395  0 02:21 ?        00:00:00 postgres: checkpointer process
pbsdata   2405  2395  0 02:21 ?        00:00:00 postgres: writer process
pbsdata   2406  2395  0 02:21 ?        00:00:00 postgres: wal writer process
pbsdata   2407  2395  0 02:21 ?        00:00:00 postgres: autovacuum launcher process
pbsdata   2408  2395  0 02:21 ?        00:00:00 postgres: stats collector process
pbsdata   2412  2395  0 02:21 ?        00:00:00 postgres: pbsdata pbs_datastore 10.8.100.42(58271) idle
root      2414     1  0 02:21 ?        00:00:00 /opt/pbs/sbin/pbs_server.bin
siddhs    2531 17980  0 02:22 pts/0    00:00:00 grep --color=auto pbs
[siddhs@x50-centos7 ~]$ sudo kill -9 2414
[siddhs@x50-centos7 ~]$ sudo /etc/init.d/pbs status
pbs_server is not running
pbs_mom is pid 2229
pbs_sched is pid 2243
pbs_comm is 2199
[siddhs@x40-64-centos64 ~]$ sudo /etc/init.d/pbs status
pbs_server is pid 5960
pbs_comm is 1353

============================================================== SERVER LOG ==============================================================
09/19/2017 02:23:39;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary attempting to connect with Primary one last time before taking over
09/19/2017 02:23:39;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Skipping STONITH
09/19/2017 02:23:48;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;connected to PBS dataservice@x50-centos7.pbspro.com
09/19/2017 02:23:48;0001;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Server@x40-64-centos64, PBS data service is up on the primary instance, attempting to stop it
09/19/2017 02:24:06;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Starting PBS dataservice
09/19/2017 02:24:09;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;connected to PBS dataservice@x40-64-centos64.pbspro.com

[siddhs@x50-centos7 ~]$ sudo /etc/init.d/pbs restart
Restarting PBS
Stopping PBS
PBS mom - was pid: 2229
PBS sched - was pid: 2243
PBS comm - was pid: 2199
Waiting for shutdown to complete
Starting PBS
/opt/pbs/sbin/pbs_comm ready (pid=3068), Proxy Name:x50-centos7.pbspro.com:17001, Threads:4
PBS comm
PBS mom
Creating usage database for fairshare.
PBS sched
Notifying Secondary Server that we are taking over
Have taken control from Secondary Server
Connecting to PBS dataservice....connected to PBS dataservice@x50-centos7.pbspro.com
Using license server at 6200@x80-lmx
PBS server

==============================================================================================================================
========================================== CASE 2: STONITH SCRIPT NOT AN EXECUTABLE ==========================================
==============================================================================================================================
[siddhs@x50-centos7 ~]$ sudo /etc/init.d/pbs status
pbs_server is pid 5419
pbs_mom is pid 5246
pbs_sched is pid 5249
pbs_comm is 5225
[siddhs@x40-64-centos64 ~]$ sudo /etc/init.d/pbs status
pbs_server is pid 7927
pbs_comm is 5225
[siddhs@x50-centos7 ~]$ ps -eaf | grep pbs
root      1531     1  0 Sep16 ?        00:00:10 /opt/execPBS/pbs/sbin/pbs_mom
root      5225     1  0 02:53 ?        00:00:00 /opt/pbs/sbin/pbs_comm
root      5246     1  0 02:53 ?        00:00:00 /opt/pbs/sbin/pbs_mom
root      5249     1  0 02:53 ?        00:00:00 /opt/pbs/sbin/pbs_sched
root      5380     1  0 02:53 ?        00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
pbsdata   5401     1  0 02:53 ?        00:00:00 /opt/pbs/pgsql/bin/postgres -D /r/failover/siddhs_stonith/datastore -p 15007
pbsdata   5408  5401  0 02:53 ?        00:00:00 postgres: logger process
pbsdata   5410  5401  0 02:53 ?        00:00:00 postgres: checkpointer process
pbsdata   5411  5401  0 02:53 ?        00:00:00 postgres: writer process
pbsdata   5412  5401  0 02:53 ?        00:00:00 postgres: wal writer process
pbsdata   5413  5401  0 02:53 ?        00:00:00 postgres: autovacuum launcher process
pbsdata   5414  5401  0 02:53 ?        00:00:00 postgres: stats collector process
pbsdata   5418  5401  0 02:53 ?        00:00:00 postgres: pbsdata pbs_datastore 10.8.100.42(58312) idle
root      5419     1  0 02:53 ?        00:00:00 /opt/pbs/sbin/pbs_server.bin
siddhs    5633 17980  0 02:57 pts/0    00:00:00 grep --color=auto pbs
[siddhs@x50-centos7 ~]$ sudo kill -9 5419
[siddhs@x50-centos7 ~]$ sudo /etc/init.d/pbs status
pbs_server is not running
pbs_mom is pid 5246
pbs_sched is pid 5249
pbs_comm is 5225

============================================================== SERVER LOG ==============================================================
09/19/2017 03:03:32;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary attempting to connect with Primary one last time before taking over
09/19/2017 03:03:32;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Executing STONITH script to bring down primary at x50-centos7
09/19/2017 03:03:32;0001;Server@x40-64-centos64;Svr;Server@x40-64-centos64;STONITH script execution failed, script exit code: 126
09/19/2017 03:03:32;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;sh: /r/failover/siddhs_stonith/server_priv/stonith: Permission denied, exit_code: 126.
09/19/2017 03:03:32;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary will attempt taking over again.
09/19/2017 03:03:42;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary attempting to connect with Primary one last time before taking over
09/19/2017 03:03:42;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Executing STONITH script to bring down primary at x50-centos7
09/19/2017 03:03:42;0001;Server@x40-64-centos64;Svr;Server@x40-64-centos64;STONITH script execution failed, script exit code: 126
09/19/2017 03:03:42;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;sh: /r/failover/siddhs_stonith/server_priv/stonith: Permission denied, exit_code: 126.
09/19/2017 03:03:42;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary will attempt taking over again.
09/19/2017 03:03:52;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary attempting to connect with Primary one last time before taking over
09/19/2017 03:03:52;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Executing STONITH script to bring down primary at x50-centos7
09/19/2017 03:03:52;0001;Server@x40-64-centos64;Svr;Server@x40-64-centos64;STONITH script execution failed, script exit code: 126
09/19/2017 03:03:52;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;sh: /r/failover/siddhs_stonith/server_priv/stonith: Permission denied, exit_code: 126.
09/19/2017 03:03:52;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary will attempt taking over again.
09/19/2017 03:04:02;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary attempting to connect with Primary one last time before taking over
09/19/2017 03:04:02;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Executing STONITH script to bring down primary at x50-centos7
09/19/2017 03:04:02;0001;Server@x40-64-centos64;Svr;Server@x40-64-centos64;STONITH script execution failed, script exit code: 126
09/19/2017 03:04:02;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;sh: /r/failover/siddhs_stonith/server_priv/stonith: Permission denied, exit_code: 126.
09/19/2017 03:04:02;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary will attempt taking over again.
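
NOTE: Exit code 126 in the Case 2 log is the shell's conventional status for a command that was found but could not be executed. The sketch below reproduces that status outside PBS. It assumes the script is launched through sh, which the "sh: ... Permission denied" log text suggests but does not prove. The temporary file is a hypothetical stand-in for server_priv/stonith, and nothing here touches PBS.

```shell
# Hypothetical stand-in for server_priv/stonith; not the real script.
script=$(mktemp)
printf '#!/bin/bash\necho "stand-in STONITH script"\n' > "$script"

# Mode 644: readable, but no execute bit for anyone (the Case 2 condition).
chmod 644 "$script"

# Launch through sh and capture the status without aborting on failure.
rc=0
sh -c "$script" || rc=$?
echo "exit code: $rc"

rm -f "$script"
```

Because no execute bit is set at all, the same status is returned even when the commands are run as root.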

==============================================================================================================================
======================================== CASE 3: STONITH SCRIPT EXECUTES SUCCESSFULLY ========================================
==============================================================================================================================
[siddhs@x50-centos7 ~]$ sudo /etc/init.d/pbs status
pbs_server is not running
pbs_mom is pid 8445
pbs_sched is pid 8459
pbs_comm is 8415
[siddhs@x50-centos7 ~]$ sudo ls -lrt /r/failover/siddhs_stonith/server_priv/stonith
-rwxr-x---. 1 root root 89 Sep 19 03:45 /r/failover/siddhs_stonith/server_priv/stonith
[siddhs@x50-centos7 ~]$ sudo cat /r/failover/siddhs_stonith/server_priv/stonith
#! /bin/bash
hostname
whoami
echo "Successful exectution of STONITH script"
exit
[siddhs@x50-centos7 ~]$ ps -eaf | grep pbs
root      1531     1  0 Sep16 ?        00:00:10 /opt/execPBS/pbs/sbin/pbs_mom
root      8415     1  0 03:19 ?        00:00:00 /opt/pbs/sbin/pbs_comm
root      8445     1  0 03:19 ?        00:00:00 /opt/pbs/sbin/pbs_mom
root      8459     1  0 03:19 ?        00:00:00 /opt/pbs/sbin/pbs_sched
root      8590     1  0 03:19 ?        00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
pbsdata   8611     1  0 03:19 ?        00:00:00 /opt/pbs/pgsql/bin/postgres -D /r/failover/siddhs_stonith/datastore -p 15007
pbsdata   8618  8611  0 03:19 ?        00:00:00 postgres: logger process
pbsdata   8620  8611  0 03:19 ?        00:00:00 postgres: checkpointer process
pbsdata   8621  8611  0 03:19 ?        00:00:00 postgres: writer process
pbsdata   8622  8611  0 03:19 ?        00:00:00 postgres: wal writer process
pbsdata   8623  8611  0 03:19 ?        00:00:00 postgres: autovacuum launcher process
pbsdata   8624  8611  0 03:19 ?        00:00:00 postgres: stats collector process
pbsdata   8629  8611  0 03:19 ?        00:00:00 postgres: pbsdata pbs_datastore 10.8.100.42(58344) idle
root      8630     1  0 03:19 ?        00:00:00 /opt/pbs/sbin/pbs_server.bin
siddhs    8634  7796  0 03:19 pts/1    00:00:00 grep --color=auto pbs
[siddhs@x50-centos7 ~]$ sudo kill -9 8630

============================================================== SERVER LOG ==============================================================
09/19/2017 03:20:52;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary attempting to connect with Primary one last time before taking over
09/19/2017 03:20:52;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Executing STONITH script to bring down primary at x50-centos7
09/19/2017 03:20:52;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;STONITH script executed successfully
09/19/2017 03:20:52;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;x40-64-centos64.pbspro.com Successful exectution of STONITH script, exit_code: 0.
09/19/2017 03:21:02;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;connected to PBS dataservice@x50-centos7.pbspro.com
09/19/2017 03:21:02;0001;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Server@x40-64-centos64, PBS data service is up on the primary instance, attempting to stop it
09/19/2017 03:21:31;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Starting PBS dataservice
09/19/2017 03:21:34;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;connected to PBS dataservice@x40-64-centos64.pbspro.com

==============================================================================================================================
==================================== CASE 4: STONITH SCRIPT DOES NOT EXECUTE SUCCESSFULLY ====================================
==============================================================================================================================
[siddhs@x50-centos7 ~]$ sudo cat /r/failover/siddhs_stonith/server_priv/stonith
#! /bin/bash
echo "Unsuccessful exectution of STONITH script"
exit 1
[siddhs@x50-centos7 ~]$
[siddhs@x50-centos7 ~]$ sudo /etc/init.d/pbs start
Starting PBS
PBS comm already running.
PBS mom already running.
PBS scheduler already running.
Notifying Secondary Server that we are taking over
Have taken control from Secondary Server
Connecting to PBS dataservice....connected to PBS dataservice@x50-centos7.pbspro.com
Using license server at 6200@x80-lmx
PBS server
[siddhs@x50-centos7 ~]$ sudo /etc/init.d/pbs status
pbs_server is pid 9169
pbs_mom is pid 8445
pbs_sched is pid 8459
pbs_comm is 8415
[siddhs@x50-centos7 ~]$ ps -eaf | grep pbs
root      1531     1  0 Sep16 ?        00:00:10 /opt/execPBS/pbs/sbin/pbs_mom
root      8415     1  0 03:19 ?        00:00:00 /opt/pbs/sbin/pbs_comm
root      8445     1  0 03:19 ?        00:00:00 /opt/pbs/sbin/pbs_mom
root      8459     1  0 03:19 ?        00:00:00 /opt/pbs/sbin/pbs_sched
root      9130     1  0 03:25 ?        00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
pbsdata   9151     1  0 03:25 ?        00:00:00 /opt/pbs/pgsql/bin/postgres -D /r/failover/siddhs_stonith/datastore -p 15007
pbsdata   9158  9151  0 03:25 ?        00:00:00 postgres: logger process
pbsdata   9160  9151  0 03:25 ?        00:00:00 postgres: checkpointer process
pbsdata   9161  9151  0 03:25 ?        00:00:00 postgres: writer process
pbsdata   9162  9151  0 03:25 ?        00:00:00 postgres: wal writer process
pbsdata   9163  9151  0 03:25 ?        00:00:00 postgres: autovacuum launcher process
pbsdata   9164  9151  0 03:25 ?        00:00:00 postgres: stats collector process
pbsdata   9168  9151  0 03:25 ?        00:00:00 postgres: pbsdata pbs_datastore 10.8.100.42(58349) idle
root      9169     1  0 03:26 ?        00:00:00 /opt/pbs/sbin/pbs_server.bin
siddhs    9369  7796  0 03:31 pts/1    00:00:00 grep --color=auto pbs
[siddhs@x50-centos7 ~]$ sudo kill -9 9169
[siddhs@x50-centos7 ~]$ sudo /etc/init.d/pbs status
pbs_server is not running
pbs_mom is pid 8445
pbs_sched is pid 8459
pbs_comm is 8415

============================================================== SERVER LOG ==============================================================
09/19/2017 03:31:56;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary attempting to connect with Primary one last time before taking over
09/19/2017 03:31:56;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Executing STONITH script to bring down primary at x50-centos7
09/19/2017 03:31:56;0001;Server@x40-64-centos64;Svr;Server@x40-64-centos64;STONITH script execution failed, script exit code: 1
09/19/2017 03:31:56;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Unsuccessful exectution of STONITH script, exit_code: 1.
09/19/2017 03:31:56;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary will attempt taking over again.
09/19/2017 03:32:06;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary attempting to connect with Primary one last time before taking over
09/19/2017 03:32:06;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Executing STONITH script to bring down primary at x50-centos7
09/19/2017 03:32:06;0001;Server@x40-64-centos64;Svr;Server@x40-64-centos64;STONITH script execution failed, script exit code: 1
09/19/2017 03:32:06;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Unsuccessful exectution of STONITH script, exit_code: 1.
09/19/2017 03:32:06;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary will attempt taking over again.
09/19/2017 03:32:16;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary attempting to connect with Primary one last time before taking over
09/19/2017 03:32:16;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Executing STONITH script to bring down primary at x50-centos7
09/19/2017 03:32:16;0001;Server@x40-64-centos64;Svr;Server@x40-64-centos64;STONITH script execution failed, script exit code: 1
09/19/2017 03:32:16;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Unsuccessful exectution of STONITH script, exit_code: 1.
09/19/2017 03:32:16;0002;Server@x40-64-centos64;Svr;Server@x40-64-centos64;Secondary will attempt taking over again
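
NOTE: Cases 1 and 2 depend only on the state of the script file, so that state can be checked before a failover test is started. The sketch below classifies a script path the way those two cases distinguish it. The function name check_stonith and the throwaway directory are hypothetical and not part of PBS. A check like this cannot predict Case 4, where the script runs but exits non-zero.

```shell
# Hypothetical helper: classify the state of a STONITH script path.
# Prints "missing", "not-executable", or "ok".
check_stonith() {
    path=$1
    if [ ! -e "$path" ]; then
        echo "missing"            # Case 1: log shows "Skipping STONITH"
    elif [ ! -x "$path" ]; then
        echo "not-executable"     # Case 2: log shows exit code 126
    else
        echo "ok"                 # Outcome then depends on the script's exit code
    fi
}

# Exercise the helper against a throwaway directory
# (a stand-in for $PBS_HOME/server_priv, not the real one).
dir=$(mktemp -d)
r_missing=$(check_stonith "$dir/stonith")

printf '#!/bin/bash\nexit 0\n' > "$dir/stonith"
chmod 644 "$dir/stonith"
r_noexec=$(check_stonith "$dir/stonith")

chmod 750 "$dir/stonith"
r_ok=$(check_stonith "$dir/stonith")

echo "$r_missing $r_noexec $r_ok"
rm -rf "$dir"
```

On the test hosts above, the path to check would presumably be /r/failover/siddhs_stonith/server_priv/stonith, run with sufficient privilege to read server_priv.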