External Interface Design for Ramp Rate Limiting and Powering Nodes On/Off

Forum discussion link: http://community.pbspro.org/t/external-design-document-for-pp-824-cray-ramp-rate-limiting/693

Interface design:

  • Interface 1: Changes to Power hook.
    • Change control: Stable
    • Synopsis: Power hook PBS_power
    • Details: Enabling or disabling the pbshook PBS_power enables or disables the power management feature.
      • Setting the hook attribute "enabled" through qmgr enables or disables both the hook and the power management feature.
      • The hook uses the server periodic event to ramp nodes up or down and to power nodes on or off.
      • Example:
        • qmgr -c "s pbshook PBS_power enabled=true"
        • qmgr -c "s pbshook PBS_power enabled=false"
      • This change replaces the existing way of enabling the power feature, which was to set the server attribute power_provisioning.
        • The server attribute power_provisioning is now read-only; it is enabled or disabled automatically when the power hook is enabled or disabled.
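For sites that toggle the feature from a script, the two qmgr commands in Interface 1 could be wrapped roughly as follows. This is a minimal sketch: the function names `power_hook_command` and `set_power_hook` and the `dry_run` flag are hypothetical, and only the qmgr command strings come from this document.

```python
import shlex
import subprocess


def power_hook_command(enabled: bool) -> list[str]:
    """Return the qmgr argument list that enables or disables PBS_power."""
    value = "true" if enabled else "false"
    return ["qmgr", "-c", f"s pbshook PBS_power enabled={value}"]


def set_power_hook(enabled: bool, dry_run: bool = True) -> str:
    """Return the shell-quoted command; run it only when dry_run is False."""
    cmd = power_hook_command(enabled)
    if not dry_run:
        # Assumes qmgr is on PATH and the caller has PBS manager privileges.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


if __name__ == "__main__":
    print(set_power_hook(True))   # dry run: prints the command only
    print(set_power_hook(False))
```

The dry-run default keeps the sketch safe to try on a machine without PBS installed.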
  • Interface 2: Hook configuration file 
    • Change Control: Stable
    • Synopsis: JSON hook config file
    • Details: The configuration file allows the administrator to set global parameters, which are used to modify the behavior of the PBS_power hook across all nodes in the PBS Pro complex. The file must conform to JSON syntax. A sample configuration file is displayed and described below:

      {
        "power_ramp_rate_enable": "True",
        "power_on_off_enable": "False",
        "node_idle_limit": "1000",
        "min_node_down_delay": "600",
        "max_jobs_analyze_limit": "80"
      }

    • Parameters:

      • Parameter name: power_ramp_rate_enable
        • Default value: False
        • Description: When enabled, PBS performs ramp rate limiting across a PBS cluster running on a Cray CLE 6.0 platform. Ramped-up nodes are kept at sleep state C1; ramped-down nodes are put into sleep state C6.
      • Parameter name: power_on_off_enable
        • Default value: False
        • Description: When enabled, PBS powers on and off those nodes whose node attribute poweroff_eligible is true.
      • Parameter name: node_idle_limit
        • Default value: 1800
        • Description: How long a node must remain idle before it is considered for ramp-down or power-off.
      • Parameter name: min_node_down_delay
        • Default value: 1800
        • Description: The time that must pass before a powered-off node can be considered for being brought back up.
      • Parameter name: max_jobs_analyze_limit
        • Default value: 100
        • Description: The maximum number of jobs analyzed for power-on/ramp-up. Only jobs whose estimated start_time and exec_vnode have been set are considered. For these attributes to be set, strict_ordering must be set to true and jobs must be submitted with a walltime.
      • Parameter name: max_concurrent_nodes
        • Default value: 5
        • Description: The number of nodes that can be powered on/off or ramped up/down at a time. For ramp rate limiting, the hook sleeps X seconds (where 1 <= X <= 10) between successive sleep-state levels while stepping up or down. If a node supports 5 sleep-state levels, the hook can wait up to 50 seconds for a single node in the worst case. When increasing this value, also consider increasing the PBS_power hook frequency and alarm time so that hook instances do not overlap or time out.
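The configuration handling described for Interface 2 can be sketched as below. This is an illustration under stated assumptions, not the hook's actual code: the names `DEFAULTS`, `parse_power_config`, and `worst_case_ramp_seconds` are hypothetical, rejecting unknown parameters is a choice made for this sketch, and the string-to-value coercion is inferred from the sample file, which quotes every value.

```python
import json

# Defaults taken from the parameter descriptions in this document.
DEFAULTS = {
    "power_ramp_rate_enable": False,
    "power_on_off_enable": False,
    "node_idle_limit": 1800,
    "min_node_down_delay": 1800,
    "max_jobs_analyze_limit": 100,
    "max_concurrent_nodes": 5,
}


def parse_power_config(text: str) -> dict:
    """Merge a JSON config string over DEFAULTS, coercing quoted values."""
    raw = json.loads(text)
    config = dict(DEFAULTS)
    for name, value in raw.items():
        if name not in DEFAULTS:
            # Sketch-only policy: treat unknown keys as an error.
            raise ValueError(f"unknown parameter: {name}")
        if isinstance(DEFAULTS[name], bool):
            # The sample file stores booleans as the strings "True"/"False".
            config[name] = str(value).strip().lower() == "true"
        else:
            config[name] = int(value)
    return config


def worst_case_ramp_seconds(sleep_levels: int, max_step_sleep: int = 10) -> int:
    """Upper bound on the wait for one node: one sleep per sleep-state level."""
    return sleep_levels * max_step_sleep


SAMPLE = """
{
  "power_ramp_rate_enable": "True",
  "power_on_off_enable": "False",
  "node_idle_limit": "1000",
  "min_node_down_delay": "600",
  "max_jobs_analyze_limit": "80"
}
"""

if __name__ == "__main__":
    print(parse_power_config(SAMPLE))
    print(worst_case_ramp_seconds(5))
```

Because the sample file omits max_concurrent_nodes, the merged result falls back to the documented default of 5 for that parameter.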

  • Interface 3: New node attribute: poweroff_eligible
    • Change control: Stable
    • Synopsis: Node attribute for power control.
    • Details: This new node attribute controls whether a node may be powered off.
      • PBS type: Boolean
      • Python type: Boolean
      • Default value: False
      • Managers have set permission. All users have read permission.
      • To change the value from its default, use qmgr:
        • qmgr -c "set node <node_name> poweroff_eligible=True"
  • Interface 4: New node attribute: last_state_change_time
    • Change control: Stable
    • Synopsis: Read only node attribute to capture timestamp.
    • Details: This new node attribute is updated with a timestamp whenever the node changes from its current state to a new state.
      • Managers and Operators have read permission.
      • The node status command pbsnodes converts the internal date format (seconds since epoch) to a human-readable format and displays the value of this attribute in "MON DD YY HH:MM:SS" format.
      • PBS type: long
      • Python type: int
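The conversion from seconds since epoch to the displayed format might look like the sketch below. It rests on assumptions not confirmed by this document: MON is read as the abbreviated month name, YY as a two-digit year, and UTC is used only to keep the example reproducible. The function name `format_node_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def format_node_timestamp(epoch_seconds: int) -> str:
    """Render seconds since the epoch as 'MON DD YY HH:MM:SS' in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    # %b = abbreviated month name, %y = two-digit year (assumed reading).
    return moment.strftime("%b %d %y %H:%M:%S")


if __name__ == "__main__":
    print(format_node_timestamp(0))
```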
  • Interface 5: New node attribute: last_used_time
    • Change Control: Stable
    • Synopsis: Read only node attribute to capture timestamp.
    • Details: This new node attribute is updated with a timestamp at the end of any job or reservation.
      • If a node is released early from a running job, this timestamp is updated.
      • The node status command pbsnodes converts the internal date format (seconds since epoch) to a human-readable format and displays the value of this attribute in "MON DD YY HH:MM:SS" format.
      • The attribute is reset when the node is ramped up.
      • Managers and Operators have read permission.
      • For vnodes, this attribute is first set to the current timestamp when the vnodes are created or when the nodes are rebooted.
      • This attribute can now be used in sched_config as a node_sort_key, which sorts nodes by their last used time.
      • PBS type: long
      • Python type: int
      • Example:
        • node_sort_key: "last_used_time HIGH"
        • node_sort_key: "last_used_time LOW"
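The effect of the two node_sort_key settings can be illustrated with a small sketch. It assumes that HIGH sorts in descending order (most recently used node first) and LOW in ascending order; the function name `sort_by_last_used` and the dictionary layout of a node are hypothetical.

```python
def sort_by_last_used(nodes: list[dict], order: str) -> list[dict]:
    """Sort node records by last_used_time; HIGH = descending, LOW = ascending."""
    if order not in ("HIGH", "LOW"):
        raise ValueError(f"order must be HIGH or LOW, got {order!r}")
    return sorted(nodes, key=lambda n: n["last_used_time"],
                  reverse=(order == "HIGH"))


if __name__ == "__main__":
    sample_nodes = [
        {"name": "nid24", "last_used_time": 300},
        {"name": "nid25", "last_used_time": 100},
        {"name": "nid26", "last_used_time": 200},
    ]
    for order in ("HIGH", "LOW"):
        print(order, [n["name"] for n in sort_by_last_used(sample_nodes, order)])
```

Under this reading, LOW places the longest-idle nodes first, and HIGH places the most recently used nodes first.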
  • Interface 6: New node state: sleep
    • Change Control: Stable
    • Synopsis: New node state indicating that a node has been ramped down or powered off by PBS.
    • Details: This new node state is set when nodes are ramped down or powered off by PBS through the power ramp rate limiting or power on/off feature.
      • A server periodic hook (the pbshook PBS_power, provided as part of the PBS package) runs every freq seconds, takes a list of vnodes to ramp down or power off, and marks them with the new sleep node state.
      • At most max_concurrent_nodes nodes are ramped down or powered off every freq seconds, where freq is the server periodic hook frequency.
      • The scheduler can now consider nodes in the sleep state for running jobs.
      • The server periodic hook can ramp up or power on nodes that are in the sleep state as required. The requirement is calculated by analyzing the jobs' estimated start times and exec_vnode lists.
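One way the periodic hook's selection logic could work is sketched below. The document does not specify the exact policy, so everything here is an assumption made for illustration: the function names, the dictionary layout of nodes and jobs, the rule that only nodes in state free are put to sleep, and the ordering (longest idle first, earliest estimated start first).

```python
def vnode_names(exec_vnode: str) -> list[str]:
    """Extract vnode names from a string such as '(vn1:ncpus=2)+(vn2:ncpus=2)'."""
    return [part.strip("()").split(":", 1)[0] for part in exec_vnode.split("+")]


def pick_nodes_to_sleep(nodes, now, config, power_off=False):
    """Choose idle nodes to ramp down (or power off) in one hook run."""
    idle = [
        n for n in nodes
        if n["state"] == "free"
        and now - n["last_used_time"] >= config["node_idle_limit"]
        # Power-off applies only to nodes marked poweroff_eligible.
        and (n["poweroff_eligible"] or not power_off)
    ]
    idle.sort(key=lambda n: n["last_used_time"])  # longest idle first
    return [n["name"] for n in idle[: config["max_concurrent_nodes"]]]


def pick_nodes_to_wake(jobs, nodes, now, config):
    """Choose sleeping nodes needed by upcoming jobs in one hook run."""
    by_name = {n["name"]: n for n in nodes}
    upcoming = sorted(jobs, key=lambda j: j["estimated_start"])
    wanted = []
    for job in upcoming[: config["max_jobs_analyze_limit"]]:
        for name in vnode_names(job["exec_vnode"]):
            node = by_name.get(name)
            if node is None or node["state"] != "sleep" or name in wanted:
                continue
            # Respect the minimum time a node must stay down.
            down_for = now - node["last_state_change_time"]
            if down_for >= config["min_node_down_delay"]:
                wanted.append(name)
    return wanted[: config["max_concurrent_nodes"]]
```

Both functions cap their result at max_concurrent_nodes, mirroring the per-run limit described above.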
  • Interface 7: Log/Error messages.
    • Change Control: Unstable
    • Synopsis: New log/error messages.
    • Details: The new log and error messages introduced by the power ramp rate limiting feature are listed below.

      Scenario 1: Nodes are being ramped down
        In server logs:
          Job;power_ramp_down;launch: /opt/cray/capmc/default/bin/capmc set_sleep_state_limit --nids 24-25 --limit 4
          Job;power_ramp_down;launch: finished
        Log level: LOG_INFO

      Scenario 2: Nodes are being ramped up
        In server logs:
          Job;power_ramp_up;launch: /opt/cray/capmc/default/bin/capmc set_sleep_state_limit --nids 24-25 --limit 0
          Job;power_ramp_up;launch: finished
        Log level: LOG_INFO

      Scenario 3: Server periodic hook output
        In server logs:
          power_ramp_limit: nodes to ramp up: <node_list>
          power_ramp_limit: nodes to ramp down: <node_list>
        Log level: LOG_INFO

      Scenario 4: Nodes are being powered off
        In server logs:
          03/29/2016 02:05:59;0008;Server@sdb;Job;node_power_off;launch: /opt/cray/capmc/default/bin/capmc node_off --nids 24-25
          03/29/2016 02:06:01;0008;Server@sdb;Job;node_power_off;launch: finished
        Log level: LOG_INFO

      Scenario 5: Nodes are being powered on
        In server logs:
          03/29/2016 02:05:59;0008;Server@sdb;Job;node_power_on;launch: /opt/cray/capmc/default/bin/capmc node_on --nids 24-25
          03/29/2016 02:06:01;0008;Server@sdb;Job;node_power_on;launch: finished
        Log level: LOG_INFO
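When checking that these messages appear, a full server log line such as the power-off example can be split on semicolons. The sketch below assumes six semicolon-separated fields as in that example; the field names in `ServerLogRecord` and both function names are labels chosen for this illustration, not names defined by PBS.

```python
import re
from typing import NamedTuple, Optional


class ServerLogRecord(NamedTuple):
    timestamp: str
    event_code: str
    server: str
    object_type: str
    object_name: str
    message: str


def parse_server_log_line(line: str) -> ServerLogRecord:
    """Split one log line into six fields; the message may contain ';'."""
    fields = line.strip().split(";", 5)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return ServerLogRecord(*fields)


def nids_from_message(message: str) -> Optional[str]:
    """Return the value passed to --nids, or None if the option is absent."""
    match = re.search(r"--nids\s+(\S+)", message)
    return match.group(1) if match else None


if __name__ == "__main__":
    line = ("03/29/2016 02:05:59;0008;Server@sdb;Job;node_power_off;"
            "launch: /opt/cray/capmc/default/bin/capmc node_off --nids 24-25")
    record = parse_server_log_line(line)
    print(record.object_name, nids_from_message(record.message))
```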