  • Please provide your feedback in the Developer Forum
  • A pull request link will be added here once a pull request exists:

Currently, when a hook running on a MoM fails and the fail_action is "offline_vnodes", all of the vnodes belonging to that MoM are also marked offline. Unfortunately, this happens even when other up/free MoMs are also reporting those vnodes.

Likewise, when an administrator marks a MoM offline using pbsnodes -o <mom1>, all of the vnodes belonging to that MoM are also marked offline, even when other up/free MoMs are also reporting those vnodes.

This is particularly destructive on a Cray system, because a vnode that belongs to more than one MoM is marked offline even though the other MoMs associated with that vnode are perfectly fine. The effect is that no vnode resources remain available for running jobs.


I propose a change in behavior:
Only offline the problem vnode(s), that is, the vnode(s) where the hook failed to run or that were named in "pbsnodes -o".
The exception: if the problem vnode is the MoM of other vnodes, and it was the last of their MoMs still up, then also offline the problem/requested vnode's children.
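
The proposed rule can be sketched in a few lines of Python. This is only an illustration of the decision logic, not PBS server code: the function name, the dictionary layout, and the idea of passing in a set of "up" MoMs are all hypothetical, and the sketch assumes each MoM's own vnode carries the same name as the MoM.

```python
from typing import Dict, Set


def vnodes_to_offline(problem_mom: str,
                      vnode_moms: Dict[str, Set[str]],
                      up_moms: Set[str]) -> Set[str]:
    """Return the vnodes to mark offline under the proposed behavior.

    problem_mom -- MoM whose hook failed or that was named in "pbsnodes -o"
    vnode_moms  -- hypothetical map: vnode name -> MoMs listed in its Mom attribute
    up_moms     -- hypothetical set of MoMs currently up (may include problem_mom)
    """
    # Always offline the problem vnode itself (assumed to share the MoM's name).
    offline = {problem_mom}

    for vnode, moms in vnode_moms.items():
        if vnode == problem_mom or problem_mom not in moms:
            continue  # not a child of the problem MoM
        # Other MoMs that report this vnode and are still up.
        other_up = (moms - {problem_mom}) & up_moms
        if not other_up:
            # The problem MoM was the last one up, so the child goes offline too.
            offline.add(vnode)

    return offline


# Hypothetical layout: three vnodes, each reported by two MoMs.
example_vnode_moms = {
    "vn1": {"mom1", "mom2"},
    "vn2": {"mom1", "mom2"},
    "vn3": {"mom1", "mom2"},
}
```

With both MoMs up, `vnodes_to_offline("mom1", example_vnode_moms, {"mom1", "mom2"})` returns only `{"mom1"}`; if mom2 is already down, the same call with `up_moms={"mom1"}` also returns vn1, vn2, and vn3.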


For example, take a system that has mom1 and mom2. There are three vnodes, vn1, vn2, and vn3, each of which has both mom1 and mom2 as its MoM (i.e. both are listed in the Mom attribute).
Most jobs actually want to run on vn1, vn2, and vn3. A hook runs on mom1 and fails, with fail_action "offline_vnodes".

...