offline_vnodes and pbsnodes -o should offline a vnode belonging to more than one MoM only when all of its MoMs are offline

Currently, when a hook running on a MoM fails and its fail_action is "offline_vnodes", all of the vnodes belonging to that MoM are also marked offline. Unfortunately, this happens even when other up/free MoMs also report those vnodes.

Likewise, when an administrator marks a MoM offline using pbsnodes -o <mom1>, all of the vnodes belonging to that MoM are also marked offline, even when other up/free MoMs also report those vnodes.

This is particularly destructive on a Cray system, because a vnode that belongs to more than one MoM is marked offline even though the other MoMs associated with it are perfectly fine. As a result, that vnode's resources are no longer available for running jobs.


I propose a change in behavior:
Only offline the MoM that had a problem running the hook, or the host named in "pbsnodes -o".
A vnode that belongs to more than one MoM will be marked offline only when all of the MoMs that report it are offline.


For example, take a system that has mom1 and mom2. There are three vnodes, vn1, vn2, and vn3, that each have both mom1 and mom2 as their MoMs (i.e. both are listed in the Mom attribute).
Most jobs want to run on vn1, vn2, and vn3. A hook runs on mom1 and fails, with fail_action "offline_vnodes".

With the buggy behavior: mom1, vn1, vn2, and vn3 all get marked offline.
      Only jobs that need only mom2 can run; all other jobs cannot run.

With the fixed behavior: only mom1 gets marked offline.
      mom2, vn1, vn2, and vn3 are still available to run jobs.

With the fixed behavior, if mom1 is already offline and a hook with fail_action "offline_vnodes" then fails on mom2: mom2, vn1, vn2, and vn3 all get marked offline (mom1 is still offline).
      No more jobs can run.
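
The scenario above can be replayed with a small Python model. This is only an illustrative sketch: the VNODE_MOMS table and the function names offline_buggy and offline_fixed are hypothetical and are not taken from the PBS source. They simply mirror the mom1/mom2 and vn1/vn2/vn3 example.

```python
# Hypothetical model of the example: each vnode maps to the MoMs that report it.
VNODE_MOMS = {
    "vn1": {"mom1", "mom2"},
    "vn2": {"mom1", "mom2"},
    "vn3": {"mom1", "mom2"},
}


def offline_buggy(failed_mom, offline):
    """Current behavior: offline the MoM and every vnode it reports."""
    result = set(offline) | {failed_mom}
    for vnode, moms in VNODE_MOMS.items():
        if failed_mom in moms:
            result.add(vnode)
    return result


def offline_fixed(failed_mom, offline):
    """Proposed behavior: offline a vnode only when all of its MoMs are offline."""
    result = set(offline) | {failed_mom}
    for vnode, moms in VNODE_MOMS.items():
        # moms <= result is True only if every reporting MoM is already offline.
        if failed_mom in moms and moms <= result:
            result.add(vnode)
    return result


after_buggy = offline_buggy("mom1", set())
after_fixed_first = offline_fixed("mom1", set())
after_fixed_second = offline_fixed("mom2", after_fixed_first)

print(sorted(after_buggy))         # ['mom1', 'vn1', 'vn2', 'vn3']
print(sorted(after_fixed_first))   # ['mom1']
print(sorted(after_fixed_second))  # ['mom1', 'mom2', 'vn1', 'vn2', 'vn3']
```

The three printed sets correspond to the three cases listed above, in the same order.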

Design detail
Modify the server to check whether a vnode has more than one MoM and, if so, to check that all of those MoMs are offline before marking the vnode offline.
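
The check itself might look roughly like the following Python sketch. The function name should_offline_vnode and its parameters are assumptions made for illustration only; the real change would live in the server code and use its own data structures.

```python
def should_offline_vnode(vnode_moms, offline_moms):
    """Decide whether a vnode of the MoM being offlined should go offline too.

    vnode_moms   -- names listed in the vnode's Mom attribute (hypothetical input)
    offline_moms -- names of MoMs that are offline, including the one that
                    just failed its hook or was named in "pbsnodes -o"
    """
    moms = set(vnode_moms)
    if len(moms) > 1:
        # Multi-MoM vnode: offline only when every reporting MoM is offline.
        return moms.issubset(offline_moms)
    # Single-MoM vnode: existing behavior, the vnode follows its MoM.
    return True
```

For instance, should_offline_vnode(["mom1", "mom2"], {"mom1"}) returns False, while the same call with {"mom1", "mom2"} as the offline set returns True.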



