We've noticed that if you stop the primary server (but not the scheduler), then after failover the secondary keeps using the scheduler on the primary.
But if you then reboot the primary, the secondary will not start a scheduler locally: it decides whether it needs a local scheduler only once, immediately after it has taken over, and not in the main server loop.
You end up with a secondary that will never schedule at all. Of course customers consider the failover mechanism "broken".
The sequence to trigger this:
-qterm -t quick on primary
-Secondary takes over, schedules using primary scheduler
-Kill the primary's scheduler.
What does work:
-qterm -t quick -s on the primary (or /etc/init.d/pbs stop, or reboot, or yanking the power,...)
-Secondary takes over and, when its initial attempt to use the primary's scheduler fails, decides to start a local scheduler
We should, on EVERY failure to contact the scheduler in the main server loop, consider starting a new scheduler locally if we see that we are the secondary and that we were using the scheduler on the primary.
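A rough sketch of that check, written as a small Python model rather than as actual server code. Every name here (ServerState, contact_scheduler, start_local_scheduler, main_loop_iteration) is hypothetical and only illustrates the decision being made on each loop pass instead of once at takeover; the real server would have to fork a scheduler and reconnect to it.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    # All fields are illustrative; they do not mirror real server structures.
    is_secondary: bool                # this server is the failover secondary
    is_active: bool                   # this server has taken over
    using_primary_sched: bool         # currently pointed at the primary's scheduler
    local_sched_running: bool = False


def contact_scheduler(state: ServerState, primary_sched_up: bool) -> bool:
    """Return True if the scheduler this server points at is reachable."""
    if state.using_primary_sched:
        return primary_sched_up
    return state.local_sched_running


def start_local_scheduler(state: ServerState) -> None:
    """Stand-in for starting a scheduler on this host and switching to it."""
    state.local_sched_running = True
    state.using_primary_sched = False


def main_loop_iteration(state: ServerState, primary_sched_up: bool) -> bool:
    """One pass of the main loop; returns True if a scheduler could be reached."""
    if contact_scheduler(state, primary_sched_up):
        return True
    # Proposed change: re-evaluate on EVERY contact failure, not only at takeover.
    if state.is_secondary and state.is_active and state.using_primary_sched:
        start_local_scheduler(state)
        return contact_scheduler(state, primary_sched_up)
    return False
```

In this model, an active secondary that was happily using the primary's scheduler switches to a local one on the first loop pass after that scheduler disappears, which is the case the failing sequence above never recovers from.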
Actually, it would be better to simply always start a scheduler on the secondary and connect to that, even though that is not the documented behaviour. It will always be a faster scheduler, since it is only going to be used if we're the active server.
Critical, since it can lead to situations in which a failover server doesn't correctly take over scheduling services.