...


I propose a change in behavior:
Only offline the Mom that had a problem running the hook, or the host requested via "pbsnodes -o".
Vnodes that belong to more than one Mom are marked offline only when all other Moms that report them are also offline. In other words, if the problem Mom is the last Mom still up for a vnode, that vnode is marked offline as well.


For example, take a system that has mom1 and mom2.  There are three vnodes, vn1, vn2, and vn3, that each have both mom1 and mom2 as their Mom (i.e. both are listed in the Mom attribute).
Most jobs want to run on vn1, vn2, and vn3.  A hook with fail_action "offline_vnodes" runs on mom1 and fails.

With the buggy behavior: mom1, vn1, vn2, vn3 all get marked offline
      only jobs that need nothing but mom2 can run; all other jobs cannot run.

With the fixed behavior: mom1 gets marked offline
      mom2, vn1, vn2, vn3 are still available to run jobs.
With the fixed behavior, if mom1 is already offline and a hook with fail_action "offline_vnodes" then fails on mom2: mom2, vn1, vn2, vn3 all get marked offline (mom1 is still offline)
      no more jobs can run.

Design detail
Modify the server to check if a vnode has more than one Mom, and if so, check if all the Moms are offline before marking the vnode as offline.
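
The check described above can be sketched in a few lines of Python. This is only an illustrative model, not the server code: the `Cluster` class, its `vnode_moms` mapping, and the `offline_mom` method are hypothetical names invented for this sketch, and each Mom is treated as a plain name in the offline set rather than as a real vnode object.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model: which Moms report each vnode, and what is offline."""

    # vnode name -> names of the Moms listed in that vnode's Mom attribute
    vnode_moms: dict[str, list[str]]
    # names of Moms and vnodes currently marked offline
    offline: set[str] = field(default_factory=set)

    def offline_mom(self, mom: str) -> set[str]:
        """Offline `mom`, then offline only those vnodes whose Moms are all offline.

        Returns the set of names newly marked offline by this call.
        """
        newly_offline: set[str] = set()
        if mom not in self.offline:
            self.offline.add(mom)
            newly_offline.add(mom)

        for vnode, moms in self.vnode_moms.items():
            if mom not in moms or vnode in self.offline:
                continue
            # A shared vnode stays up while at least one of its Moms is up.
            if all(m in self.offline for m in moms):
                self.offline.add(vnode)
                newly_offline.add(vnode)
        return newly_offline


# The example from this page: vn1, vn2, vn3 are reported by both mom1 and mom2.
cluster = Cluster({name: ["mom1", "mom2"] for name in ("vn1", "vn2", "vn3")})

first = cluster.offline_mom("mom1")   # hook fails on mom1
second = cluster.offline_mom("mom2")  # later, hook fails on mom2 as well
```

In this model the first call offlines only mom1, because each shared vnode still has mom2 up; the second call offlines mom2 together with vn1, vn2, and vn3, because no Mom that reports them remains up.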



...
