Currently, when a hook running on a MoM fails and its fail_action is "offline_vnodes", all of the vnodes belonging to that MoM are also marked offline. Unfortunately, this happens even when other up/free MoMs are also reporting those vnodes.

Likewise, when an administrator marks a MoM offline using pbsnodes -o <mom1>, all of the vnodes belonging to that MoM are also marked offline, even when other up/free MoMs are also reporting those vnodes.

This is particularly destructive on a Cray system, because a vnode that belongs to more than one MoM gets marked offline even though the other MoMs associated with the vnode are perfectly fine. As a result, that vnode's resources are no longer available for running jobs.


I propose a change in behavior:
Only offline the vnode(s) that had a problem running the hook or that were named in the "pbsnodes -o" request.
Exception: if that vnode is a MoM of other vnodes, and it is the last of their MoMs that was still up, then also offline its child vnodes.


For example, take a system that has mom1 and mom2.  There are 3 vnodes, vn1, vn2, and vn3, that each have both mom1 and mom2 as their MoM (i.e. both are listed in the Mom attribute).
Most jobs actually want to run on vn1, vn2, vn3.  A hook runs on mom1 and fails, with fail_action "offline_vnodes".

With the current (buggy) behavior: mom1, vn1, vn2, vn3 all get marked offline.
      Only jobs that need nothing but mom2 can run; all other jobs cannot run.

With the fixed behavior: only mom1 gets marked offline.
      mom2, vn1, vn2, vn3 are still available to run jobs.

With the fixed behavior, if mom1 is already offline and a hook with fail_action "offline_vnodes" then fails on mom2: mom2, vn1, vn2, vn3 all get marked offline (mom1 is still offline).
      No more jobs can run.
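
The proposed rule can be sketched as a small Python model. This is only an illustration of the decision logic, not PBS server code: the function name offline_mom, the vnode_moms dictionary and the offline set are hypothetical names made up for this sketch, and each MoM is treated simply as a name that can be offline or not.

```python
def offline_mom(failed_mom, vnode_moms, offline):
    """Return the names the proposed rule would newly mark offline.

    failed_mom -- MoM whose hook failed or that was passed to "pbsnodes -o"
    vnode_moms -- hypothetical model: vnode name -> MoM names in its Mom attribute
    offline    -- set of names already offline (updated in place)
    """
    newly_offline = []
    if failed_mom not in offline:
        offline.add(failed_mom)
        newly_offline.append(failed_mom)

    for vnode, moms in vnode_moms.items():
        # A child vnode goes offline only when every MoM reporting it is offline,
        # i.e. failed_mom was the last of its MoMs that was still up.
        if failed_mom in moms and vnode not in offline:
            if all(mom in offline for mom in moms):
                offline.add(vnode)
                newly_offline.append(vnode)
    return newly_offline


# The example system: vn1, vn2 and vn3 all list mom1 and mom2 as their MoMs.
vnode_moms = {
    "vn1": ["mom1", "mom2"],
    "vn2": ["mom1", "mom2"],
    "vn3": ["mom1", "mom2"],
}
offline = set()

first = offline_mom("mom1", vnode_moms, offline)   # hook fails on mom1
second = offline_mom("mom2", vnode_moms, offline)  # later, hook fails on mom2
print(first)   # ['mom1']
print(second)  # ['mom2', 'vn1', 'vn2', 'vn3']
```

The first call offlines only mom1 because mom2 is still up for every vnode. The second call offlines the vnodes as well, because mom2 was the last MoM still up for them.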



