PP-610: On a Cray X-series, periodically synchronize PBS with ALPS inventory.

PP-610 - Getting issue details... STATUS

Forum discussion (EDD review).

Overview:
PBS and ALPS can sometimes get out of sync. The purpose of the synchronization hook is to check to see if the information
that PBS has is out of sync with what ALPS is reporting. When the hook detects that PBS and ALPS are out of sync, the hook 
will HUP the Mom. The hook will only do its work on Cray X-series Moms.


Interface 1: PBS hook PBS_alps_inventory_check

  • Visibility: Public
  • Change Control: Experimental
  • Details: 
    • This is a periodic hook that runs on the execution host.
    • The Hook is enabled by default when run on a Cray X* series machine.
      • The hook is disabled by default on all other platforms.
    • The hook runs as the Administrator and executes every 300 seconds.
    • The timeout for the Hook is 90 seconds.
        

Interface 2: Mom log entry: ALPS Inventory Check: apstat command cannot be found at <path>

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details: 
    • Recorded when apstat is not present in the expected/default location, indicating a Cray system issue.
    • Log level: PBSEVENT_ADMIN.

Interface 3: Mom log entry: No <host name> file found on this host

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details: 
    • Recorded when hook attempts to read the /etc/xthostname file to determine Cray hostname, but the hostname file is missing.
    • Log level: PBSEVENT_ADMIN.


Interface 4Mom log entry: Processing ALPS inventory for crayhost <host name>

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details:
    • Recorded at the start of hook processing.
    • Log level: PBSEVENT_ADMIN.


Interface 5Mom log entry: ALPS Inventory Check: No vnodes reported by PBS

  • Visibility: Public
  • Change Control: Experimental
  • Details: 
    • Recorded when no vnodes have been created by Mom, due to lack of inventory information.
    • This happens when mom_priv/config does not have alps_client set. Can also happen when compute nodes 
      are in Interactive mode.
    • Log level: PBSEVENT_ADMIN.


Interface 6Mom log entry: ALPS Inventory Check: No eligible login nodes to perform inventory check

  • Visibility: Public
  • Change Control: Experimental
  • Details:
    • Recorded when no nodes with vntype 'cray_login' are present.
    • Log level: PBSEVENT_ADMIN.


Interface 7: Mom log entry: ALPS Inventory Check: Login node '<name>' is in charge of verification, skipping check on '<name>'

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details: 
    • The mom installed on a login node reports inventory; additional moms, if any, do not.
    • The first instance of 'name' is the hostname of the login node responsible for performing the inventory query. The second 
      instance of 'name' is the hostname of the current/local mom.
    • Log level: PBSEVENT_ADMIN.

Interface 8: Mom log entry: ALPS Inventory Check: No nodes reported by apstat

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details: 
    • Recorded when apstat fails to report nodes due to a systems or apsched issue.
    • Log level: PBSEVENT_ADMIN.

Interface 9: Mom log entry: ALPS Inventory Check: apstat query: <#>s pbsnodes query: <#>s

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details: 
    • Records apstat and pbsnodes query durations in seconds.
    • Log level: PBSEVENT_ADMIN.

Interface 10: Mom log entry: ALPS Inventory Check: Compute node(s) defined in ALPS, but not in PBS: <list of nodes>

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details: 
    • Recorded when PBS and ALPS are out of sync i.e. ALPS has information that PBS does not have.
    • Log level: PBSEVENT_ADMIN.


Interface 11: Mom log entry: ALPS Inventory Check: Compute node(s) defined in PBS, but not in ALPS: <list of nodes>

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details: 
    • Recorded when PBS and ALPS are out of sync i.e. PBS reports nodes that ALPS does not.
    • Log level: PBSEVENT_ADMIN.


Interface 12: Mom log entry: ALPS Inventory Check: Internal error in retrieving path to mom_priv

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details:
    • Recorded when mom_priv directory is not in the expected/default location (PBS_HOME), indicating a PBS installation issue.
    • Log level: PBSEVENT_ADMIN.

Interface 13: Mom log entry: ALPS Inventory Check: PBS and ALPS are in sync

  • Visibility: Public
  • Change Control: Experimental
  • Details: 
    • Recorded when both PBS and ALPS are in sync i.e. they report the same number of compute nodes 
      in the Cray cluster.
    • Log level: PBSEVENT_ADMIN.


Interface 14: Mom log entry: ALPS Inventory Check: Failure in refreshing nodes on login node (<name>)

  • Visibility: PBS Private
  • Change Control: Experimental
  • Details: 
    • Recorded when the Hook is unable to HUP the Mom and successfully refresh nodes.
    • Log level: PBSEVENT_ADMIN.