Forum discussion link: http://community.pbspro.org/t/external-design-document-for-pp-824-cray-ramp-rate-limiting/693
Interface design:
{
"power_ramp_rate_enable": "True",
"power_on_off_enable": "False",
"node_idle_limit": "1000",
"min_node_down_delay": "600",
"max_jobs_analyze_limit": "80"
}
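The sample above stores every value as a string, so a consumer has to convert them before use. The following is a minimal sketch of one way to do that, not the actual PBS_Power hook code: `load_power_config` and `_to_bool` are hypothetical helpers, and the defaults are copied from the parameter table in this document.

```python
import json

# Defaults copied from the parameter table in this document.
DEFAULTS = {
    "power_ramp_rate_enable": False,
    "power_on_off_enable": False,
    "node_idle_limit": 1800,
    "min_node_down_delay": 1800,
    "max_jobs_analyze_limit": 100,
    "max_concurrent_nodes": 5,
}


def _to_bool(value):
    """Accept real booleans or the strings "True"/"False" in any case."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "false"):
        return text == "true"
    raise ValueError("not a boolean: %r" % (value,))


def load_power_config(text):
    """Hypothetical helper: overlay a JSON config string on DEFAULTS."""
    raw = json.loads(text)
    config = dict(DEFAULTS)
    for name, value in raw.items():
        if name not in DEFAULTS:
            raise KeyError("unknown parameter: %s" % name)
        # Check bool first: in Python, bool is a subclass of int.
        if isinstance(DEFAULTS[name], bool):
            config[name] = _to_bool(value)
        else:
            config[name] = int(value)
    return config


EXAMPLE = """
{
"power_ramp_rate_enable": "True",
"power_on_off_enable": "False",
"node_idle_limit": "1000",
"min_node_down_delay": "600",
"max_jobs_analyze_limit": "80"
}
"""

config = load_power_config(EXAMPLE)
print(config)
```

Parameters missing from the file, such as max_concurrent_nodes in the sample, keep their default values in this sketch.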
Parameter Name | Default value | Description |
---|---|---|
power_ramp_rate_enable | False | When enabled, PBS performs ramp rate limiting across a PBS cluster running on a Cray CLE 6.0 platform. Ramped-up nodes are kept at sleep state C1; ramped-down nodes are put into sleep state C6. |
power_on_off_enable | False | When enabled, PBS powers nodes on and off, limited to nodes whose poweroff_eligible attribute is true. |
node_idle_limit | 1800 | How long a node must remain idle before it is considered for powering down or ramping down. |
min_node_down_delay | 1800 | How long a node must stay powered off before it can be considered for powering back on. |
max_jobs_analyze_limit | 100 | Maximum number of jobs analyzed for power-on or ramp-up. Only jobs that have estimated start_time and exec_vnode set on them are considered. For these attributes to be set, strict_ordering must be true and jobs must be submitted with a walltime. |
max_concurrent_nodes | 5 | Maximum number of nodes that can be powered on/off or ramped up/down at a time. For ramp rate limiting, the hook sleeps X seconds (where 1<=X<=10) between each sleep state level while stepping up or down. If a node supports 5 sleep state levels, the hook can wait up to 50 seconds for a single node in the worst case. When increasing this value, also consider increasing the PBS_Power hook frequency and alarm time so that hook instances do not overlap or time out. |
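The worst-case arithmetic in the max_concurrent_nodes description can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, and the batch estimate pessimistically assumes that nodes are stepped one after another, which this document does not specify.

```python
# Upper bound on X (seconds slept between sleep state levels), from the
# max_concurrent_nodes description.
MAX_STEP_SLEEP_SECONDS = 10


def worst_case_node_seconds(sleep_state_levels,
                            step_sleep=MAX_STEP_SLEEP_SECONDS):
    """Longest time the hook may wait while stepping a single node."""
    return sleep_state_levels * step_sleep


def worst_case_batch_seconds(max_concurrent_nodes, sleep_state_levels,
                             step_sleep=MAX_STEP_SLEEP_SECONDS):
    """Pessimistic bound for one hook run.

    Assumption (not stated in this document): nodes are stepped serially,
    so the per-node waits add up.
    """
    per_node = worst_case_node_seconds(sleep_state_levels, step_sleep)
    return max_concurrent_nodes * per_node


def fits_within_alarm(batch_seconds, alarm_seconds):
    """True if the estimated run time stays under the hook alarm time."""
    return batch_seconds < alarm_seconds


single = worst_case_node_seconds(5)      # 5 levels * 10 s = 50 s
batch = worst_case_batch_seconds(5, 5)   # 5 nodes * 50 s = 250 s
print(single, batch, fits_within_alarm(batch, 300))
```

Under these assumptions, doubling max_concurrent_nodes doubles the estimate, which is why the alarm time may need to grow with it.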
Details: The new log and error messages introduced by the power ramp limiting feature are listed below.
# | Scenario | Log/error message (in server logs) | Log level |
---|---|---|---|
1 | Nodes are being ramped down | `Job;power_ramp_down;launch: /opt/cray/capmc/default/bin/capmc set_sleep_state_limit --nids 24-25 --limit 4` `Job;power_ramp_down;launch: finished` | LOG_INFO |
2 | Nodes are being ramped up | `Job;power_ramp_up;launch: /opt/cray/capmc/default/bin/capmc set_sleep_state_limit --nids 24-25 --limit 0` `Job;power_ramp_up;launch: finished` | LOG_INFO |
3 | Server periodic hook output | `power_ramp_limit: nodes to ramp up: <node_list>` `power_ramp_limit: nodes to ramp down: <node_list>` | LOG_INFO |
4 | Nodes are being powered off | `03/29/2016 02:05:59;0008;Server@sdb;Job;node_power_off;launch: /opt/cray/capmc/default/bin/capmc node_off --nids 24-25` `03/29/2016 02:06:01;0008;Server@sdb;Job;node_power_off;launch: finished` | LOG_INFO |
5 | Nodes are being powered on | `03/29/2016 02:05:59;0008;Server@sdb;Job;node_power_on;launch: /opt/cray/capmc/default/bin/capmc node_on --nids 24-25` `03/29/2016 02:06:01;0008;Server@sdb;Job;node_power_on;launch: finished` | LOG_INFO |
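For anyone monitoring these messages, the sketch below shows one way to pull the action and node IDs out of a "launch:" line. It is an illustration only: `parse_launch_line` and `expand_nids` are hypothetical helpers, the pattern is derived solely from the sample messages in the table above, and support for comma-separated nid lists is an assumption because the samples only show a single range.

```python
import re

# Pattern built from the sample messages in this document.
LAUNCH_RE = re.compile(
    r"Job;(?P<action>power_ramp_down|power_ramp_up|node_power_off|node_power_on);"
    r"launch: (?P<command>\S*capmc .*--nids (?P<nids>[\d,\-]+).*)$"
)


def expand_nids(spec):
    """Expand a nid list such as "24-25" into [24, 25].

    Assumption: comma-separated entries like "1,3-5" are also accepted.
    """
    nids = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            nids.extend(range(int(first), int(last) + 1))
        else:
            nids.append(int(part))
    return nids


def parse_launch_line(line):
    """Return action, command, and nids, or None if the line has no command.

    Lines such as "...;launch: finished" carry no capmc command and
    therefore return None.
    """
    match = LAUNCH_RE.search(line.strip())
    if match is None:
        return None
    return {
        "action": match.group("action"),
        "command": match.group("command"),
        "nids": expand_nids(match.group("nids")),
    }


sample = ("03/29/2016 02:05:59;0008;Server@sdb;Job;node_power_off;"
          "launch: /opt/cray/capmc/default/bin/capmc node_off --nids 24-25")
parsed = parse_launch_line(sample)
print(parsed["action"], parsed["nids"])
```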