Technical Term | Description or Definition |
---|---|
PMI | Power Management Infrastructure |
Traceability Matrix
Use Case(s) | Requirement(s) | Interface(s) |
---|---|---|
2.a | 3.a, 3.b, 3.e, 3.f | 2, 10, 11 |
2.b | 3.d, 3.g | 1, 4, 6, 7, 8, 9, 12, 24-38, 40-49 |
2.c | 3.c | 3, 5, 13, 18, 20, 21, 39 |
A. Interface changes
Use qmgr to set the power_provisioning flag to true or false on the server. For example:
qmgr -c "set server power_provisioning = true"
Use qmgr to set the power_provisioning flag to true or false on a vnode. For example:
qmgr -c "set node bigbox power_provisioning = true"
When it is set to True, PMI operations may take place on the vnode. If it is unset or set to False, no PMI operations are allowed to take place on the vnode.
Example:
11/19/2014 14:44:15;0008;pbs_python;Job;165.bigcray;PMI: reset current_eoe |
Example:
11/06/2014 18:35:26;0008;pbs_python;Job;4856.iceberg;SGI HPE: energy 1.456kWh |
Example:
11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;Cray: 167.bigcray launch: /opt/cray/capmc/default/bin/capmc get_node_energy_counter --nids 0 |
Example:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; 21.bigcray;launch: finished in 156 seconds |
Example:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; 21.bigcray;launch stderr: i fell and cannot get up |
For example:
11/19/2014 15:20:05;0008;pbs_python;Job;166.centos1;error: node count 2, should be 1 |
A message will show the energy used by each aprun run by a job and a job tally in Joules. For example:
11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray:RUR: {"apid":34876,"apid_energy":83876J,"job_energy":83876J} |
11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray:RUR: {"apid":34972,"apid_energy":84272J,"job_energy":168148J} |
11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray:RUR: {"apid":35234,"apid_energy":83194J,"job_energy":251342J} |
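The running tally in these messages can be checked with a short script. The sketch below is illustrative only: the regular expression and the helper name `rur_energy` are assumptions derived from the three sample lines above, not part of PBS.

```python
import re

# Sample messages copied from the example above.
LOG_LINES = [
    '11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray:RUR: '
    '{"apid":34876,"apid_energy":83876J,"job_energy":83876J}',
    '11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray:RUR: '
    '{"apid":34972,"apid_energy":84272J,"job_energy":168148J}',
    '11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray:RUR: '
    '{"apid":35234,"apid_energy":83194J,"job_energy":251342J}',
]

# The payload is not strict JSON (the energy values carry a bare "J"
# suffix), so a regular expression is used instead of json.loads().
RUR_RE = re.compile(r'"apid":(\d+),"apid_energy":(\d+)J,"job_energy":(\d+)J')


def rur_energy(lines):
    """Yield (apid, apid_energy, job_energy) in joules for each RUR message."""
    for line in lines:
        match = RUR_RE.search(line)
        if match:
            yield tuple(int(group) for group in match.groups())


records = list(rur_energy(LOG_LINES))
running = 0
for apid, apid_joules, job_joules in records:
    running += apid_joules
    # Each job_energy value should equal the sum of apid_energy so far.
    assert running == job_joules, (apid, running, job_joules)
print("job total: %d J" % running)
```

In the sample, each `job_energy` value equals the sum of the `apid_energy` values logged so far for the job.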
Example:
11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray: no RUR data |
Example:
11/06/2014 18:35:26;0008;pbs_python;Job;156.bigcray;energy usage 554520J |
Example:
11/06/2014 18:35:26;0008;pbs_python;Job;156.bigcray;Cray: get_usage: energy 346342J |
Example:
11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;Cray: init |
11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;Cray: connect |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: get_usage |
Example:
11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;Cray: query |
Example:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python;Cray: 167.centos1 activate 'low' |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: no compute nodes for power setting |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: pcap node 350 |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: pcap accel 250 |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: no power cap to set |
Example:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python;Cray: deactivate 167.centos1 |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: no compute nodes for power setting |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: remove pcap node 350 |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: remove pcap accel 250 |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: no power cap to remove |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: RUR file: /var/spool/PBS/spool/167.centos1.rur should only be writable by root |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: reading RUR file: /var/spool/PBS/spool/167.centos1.rur |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: energy_used not found: unexpected EOF while parsing |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: warning: energy plugin not enabled by RUR |
If no energy value was obtained from capmc:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: energy 4.234kWh |
If the energy value from capmc was smaller than what was obtained from RUR:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: energy 4.234kWh replaces capmc energy 4.1432kWh |
If the energy value from capmc was greater than or equal to what was obtained from RUR:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: energy 4.234kWh last capmc usage 4.2432kWh |
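The three cases above amount to a simple selection rule. The following is a minimal sketch of that rule, not the hook's actual code: the function name `select_energy` is hypothetical, and returning the capmc value in the third case is an assumption, since the messages only show that both values are logged.

```python
def select_energy(rur_kwh, capmc_kwh=None):
    """Return (energy_kwh, message) for the three cases described above."""
    if capmc_kwh is None:
        # No energy value was obtained from capmc: use the RUR value.
        return rur_kwh, "Cray:RUR: energy %skWh" % rur_kwh
    if capmc_kwh < rur_kwh:
        # The capmc value was smaller: the RUR value replaces it.
        return rur_kwh, ("Cray:RUR: energy %skWh replaces capmc energy %skWh"
                         % (rur_kwh, capmc_kwh))
    # Assumption: when capmc >= RUR, the last capmc value is kept.
    return capmc_kwh, ("Cray:RUR: energy %skWh last capmc usage %skWh"
                       % (rur_kwh, capmc_kwh))


for capmc in (None, 4.1432, 4.2432):
    print(select_energy(4.234, capmc)[1])
```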
Example:
11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;SGI HPE: init |
11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;SGI HPE: connect |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;SGI HPE: get_usage |
Example:
11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;SGI HPE: query |
Example:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python;SGI HPE: 167.centos1 activate '450W' |
Example:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python;SGI HPE: deactivate |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1; power functionality is disabled on vnode v12 |
Example:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; Event not serviceable for power provisioning. |
Example:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; Provisioning cannot be enabled on this host. |
Example:
11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1; socket.error: [Errno 111] Connection refused |
Example:
1/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; PMI:activate: set eoe: low,med,high |
Example:
1/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; PMI:activate: set myself offline |
Example:
# qmgr -c "set pbshook power_hook order = -1000"
# qmgr -c "set pbshook power_hook order = 2000"
B. Administrator’s instructions
Set power_provisioning on the server to true, and power_provisioning to true on the desired vnodes. For example:
# qmgr -c "set server power_provisioning = True"
# qmgr -c "set node node1 power_provisioning = True"
# qmgr -c "set node node2 power_provisioning = True"
# qmgr -c "set node node3 power_provisioning = True"
If all vnodes will have power_provisioning set, @default can be used instead of individual vnode names. For example:
# qmgr -c "set node @default power_provisioning = True"
The RUR config file has to be modified to use the PBS output plugin:
/opt/pbs/default/lib/cray/pbs_output.py |
If any host is running PBS with an alternate location for the pbs.conf file, PBS_CONF_FILE must be added to the pbs_environment file on that host. On Linux systems, the default location for the pbs.conf file is /etc/pbs.conf. The pbs.conf file is used by each MOM to check if the server or scheduler is running on the local host. If so, the node will not be automatically configured for power provisioning. For example, if /var/pbs.conf is the active pbs.conf file, the following line must be added to PBS_HOME/pbs_environment:
PBS_CONF_FILE=/var/pbs.conf |
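To make the lookup order concrete, here is a small sketch of how a script might decide which pbs.conf is active from the contents of pbs_environment. It is not PBS code: the function name `conf_file_from_environment` is hypothetical, and treating blank lines and lines starting with `#` as ignorable is an assumption about the file format.

```python
DEFAULT_PBS_CONF = "/etc/pbs.conf"  # default location on Linux, per the text above


def conf_file_from_environment(text):
    """Return the PBS_CONF_FILE value found in pbs_environment text.

    Falls back to the default location when no PBS_CONF_FILE line is present.
    """
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines carry no settings.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == "PBS_CONF_FILE":
            return value.strip()
    return DEFAULT_PBS_CONF


sample = "PATH=/bin:/usr/bin\nPBS_CONF_FILE=/var/pbs.conf\n"
print(conf_file_from_environment(sample))
```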
If the PMI power profile names are obtained from one of the vendors listed in B.4.c, then eoe values must be set manually on the vnodes and a submit hook needs to map the eoe values to the options listed for the PMI vendor. The hook will set the desired job attributes for each possible eoe value. For example:
# for n in node1 node2 node3 ;do
> qmgr -c "set node $n resources_available.eoe='low,med,high'"
> done
# cat map_eoe.py
import pbs
e = pbs.event()
j = e.job
profile = j.Resource_List['eoe']
if profile is None:
    res = j.Resource_List['select']
    if res is not None:
        for s in str(res).split('+')[0].split(':'):
            if s[:4] == 'eoe=':
                profile = s.partition('=')[2]
                break
pbs.logmsg(pbs.LOG_DEBUG, "got profile '%s'" % str(profile))
if profile == "low":
    j.Resource_List["pstate"] = "1900000"
    j.Resource_List["pcap_node"] = 100
    pbs.logmsg(pbs.LOG_DEBUG, "set low")
elif profile == "med":
    j.Resource_List["pstate"] = "220000"
    j.Resource_List["pcap_node"] = 200
    pbs.logmsg(pbs.LOG_DEBUG, "set med")
elif profile == "high":
    j.Resource_List["pstate"] = "240000"
    pbs.logmsg(pbs.LOG_DEBUG, "set high")
else:
    pbs.logmsg(pbs.LOG_DEBUG, "unhandled profile '%s'" % str(profile))
e.accept()
# qmgr <<EOF
create hook power_map event=queuejob
import hook power_map application/x-python default map_eoe.py
set hook power_map enabled=True
EOF
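The select-parsing step of the hook can be tried outside PBS. The sketch below isolates that step as a plain function; the name `eoe_from_select` is hypothetical, and the logic mirrors the hook above, which inspects only the first chunk of the select specification.

```python
def eoe_from_select(select_spec):
    """Return the eoe value from the first chunk of a select spec, or None."""
    # Chunks are separated by '+'; only the first chunk is inspected,
    # matching the hook above.
    first_chunk = str(select_spec).split("+")[0]
    for part in first_chunk.split(":"):
        if part.startswith("eoe="):
            return part.partition("=")[2]
    return None


for spec in ("4:eoe=high:ncpus=8", "2:ncpus=4", "1:ncpus=1+2:eoe=low"):
    print("%s -> %r" % (spec, eoe_from_select(spec)))
```

One consequence of this design is that an eoe request placed only in a later chunk is not seen by the mapping.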
Set power_provisioning on the server to true. For example:
# qmgr -c "set server power_provisioning = True"
Set power_provisioning to true on the desired vnodes. For example:
# qmgr -c "set node node1 power_provisioning = True"
# qmgr -c "set node node2 power_provisioning = True"
# qmgr -c "set node node3 power_provisioning = True"
Set power_provisioning on the server to true and power_provisioning to true on the desired vnodes. For example:
# qmgr -c "set server power_provisioning = True"
# qmgr -c "set node node1 power_provisioning = True"
# qmgr -c "set node node2 power_provisioning = True"
# qmgr -c "set node node3 power_provisioning = True"
C. User’s instructions
Use the provisioning feature and set “eoe” to a power profile name. For example:
qsub -leoe=low -lncpus=20 lackadaisical.sh
qsub -lselect=4:eoe=high:ncpus=8 zoomjob
For example, energy could be included with resources_used for a job 'E' record:
04/14/2014 04:42:03;E;1.x44-mpi.pbspro.com;user=ashisha group=altair project=_pbs_project_default jobname=STDIN queue=workq ctime=1397475718 qtime=1397475718 etime=1397475718 start=1397475718 exec_host=x44-mpi/0 exec_vnode=(x44-mpi:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=pack Resource_List.select=1:ncpus=1 session=4746 end=1397475723 Exit_status=255 resources_used.cpupercent=0 resources_used.cput=00:00:01 resources_used.mem=0kb resources_used.ncpus=1 resources_used.vmem=0kb resources_used.walltime=00:00:05 resources_used.energy=1.67 run_count=1 |
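A site script could pull the energy value out of such a record as sketched below. This is an illustration under stated assumptions, not a PBS tool: the helper name `parse_record` is hypothetical, the sample record is an abbreviated copy of the one above, and the parser assumes that no attribute value contains a space.

```python
# Abbreviated copy of the 'E' record shown above.
RECORD = ("04/14/2014 04:42:03;E;1.x44-mpi.pbspro.com;"
          "user=ashisha group=altair jobname=STDIN queue=workq "
          "exec_vnode=(x44-mpi:ncpus=1) Exit_status=255 "
          "resources_used.walltime=00:00:05 resources_used.energy=1.67 "
          "run_count=1")


def parse_record(record):
    """Split an accounting record into (timestamp, type, job id, attributes)."""
    timestamp, record_type, job_id, body = record.split(";", 3)
    attrs = {}
    # Assumption: attributes are space-separated key=value tokens.
    for token in body.split():
        key, sep, value = token.partition("=")
        if sep:
            attrs[key] = value
    return timestamp, record_type, job_id, attrs


timestamp, record_type, job_id, attrs = parse_record(RECORD)
energy = float(attrs["resources_used.energy"])
print("%s used %.2f" % (job_id, energy))
```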
D. Internal Design Interfaces
Visibility: Public
Change Control: Stable
Synopsis: PBS hook power control module
A new class “pbs.Power” will be made available that will provide power functionality. A hook will be able to access it via python import.
The following methods define the available PMI operations:
activate_profile(self, profile_name, job)
Activate a given power profile on a set of hosts on behalf of a given job. The parameter “profile_name” is a string containing the name of a profile. The parameter “job” is a PBS job object. The hosts will be calculated from the job object. If the job parameter is not specified, the pbs.event().job object will be used.
The return type is bool: True indicates success, and False indicates that the request was made but the PMI gave no indication of whether it succeeded.
If an error occurs where it is appropriate for some or all of the job vnodes to be marked offline, this may be done before an exception is raised.
If an error occurs where it is appropriate for the supported profile names for some or all of the job vnodes to be refreshed, this may be done before an exception is raised.
get_usage(self, job)
Retrieve power usage for a job. The parameter “job” is a PBS job object.
The return will be a float which gives the cumulative energy usage for the job at the time of the call in kilowatt-hours (kWh). If no power usage information is available, None is returned.
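Some of the log messages earlier in this document report energy in joules, while get_usage reports kilowatt-hours. The conversion is fixed (1 kWh = 3,600,000 J); the helper name `joules_to_kwh` below is hypothetical and is shown only to make the relationship between the two units explicit.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def joules_to_kwh(joules):
    """Convert an energy value in joules to kilowatt-hours."""
    return joules / JOULES_PER_KWH


# Value taken from the "energy usage 554520J" sample message.
print("%.4f kWh" % joules_to_kwh(554520))
```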
deactivate_profile(self, job)
Inform the PMI that a job is no longer active. This would be used when a job is suspended or terminated. The parameter “job” is a PBS job object. If it is not specified, the pbs.event().job object will be used.
The return type is bool: True indicates success, and False indicates that the request was made but the PMI gave no indication of whether it succeeded.
query(self, query_type)
Return information that matches a request type. The parameter “query_type” is used to specify what should be returned. The only value for query_type is QUERY_PROFILE, and the return will be a list of strings giving profile names supported by the PMI.
connect(self, endpoint, port)
Connect to the PMI. The parameter “endpoint” defaults to None and is a string which will be meaningful to the PMI. The parameter “port” defaults to None and is an integer. A typical usage would be “endpoint” specifying a hostname and “port” giving a network port for a network service connection.
Currently, connection and disconnection are done per hook invocation rather than by maintaining a long-lasting session.
Nothing is returned; the connection information is maintained in the instance of the Power class.
If the endpoint or port parameters are not specified, the underlying code specific to the PMI will determine the connection details.
disconnect(self)
Disconnect from the PMI. No parameters are needed, since each instance of the Power class is associated with a single backend power management interface.
Exceptions
InternalError - raised when the underlying cause of a failure cannot be determined.
BackendError - the backend PMI call was unsuccessful.
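Because a PMI call can raise an exception, a hook will usually want to make sure the connection is closed on every path. The sketch below shows one way to structure that. It runs outside PBS, so `FakePower` and the local `BackendError` are hypothetical stand-ins that only mimic the method names listed above; in a real hook the object would be a pbs.Power instance.

```python
class BackendError(Exception):
    """Local stand-in for the BackendError exception described above."""


class FakePower(object):
    """Hypothetical stand-in mimicking the method names listed above."""

    def __init__(self, fail=False):
        self.fail = fail
        self.connected = False
        self.calls = []

    def connect(self, endpoint=None, port=None):
        self.connected = True
        self.calls.append("connect")

    def activate_profile(self, profile_name, job=None):
        self.calls.append("activate_profile")
        if self.fail:
            raise BackendError("backend PMI call was unsuccessful")
        return True

    def disconnect(self):
        self.connected = False
        self.calls.append("disconnect")


def activate(power, profile_name, job=None):
    """Connect, activate a profile, and always disconnect afterwards."""
    power.connect()
    try:
        return power.activate_profile(profile_name, job)
    finally:
        # Runs whether activation succeeded or raised.
        power.disconnect()


ok = activate(FakePower(), "low")

failing = FakePower(fail=True)
try:
    activate(failing, "low")
except BackendError as err:
    error = str(err)
```

The exception still propagates to the caller, which can then decide how to handle the failure.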
Power module initialization
A string can optionally be passed to specify the name of the PMI to be used (see D.2). By default, the type of PMI to be used will be determined automatically based on the type of hardware used.
Examples
Activate a profile from a job specific event.
import pbs
p = pbs.Power()
p.connect("power_master")
p.activate_profile("LOW")
p.disconnect()
Get profile name list.
import pbs
p = pbs.Power()
p.connect(port=3564)
pnames = p.query(p.QUERY_PROFILE)
p.disconnect()
Deactivate profile on a specific job.
import pbs
p = pbs.Power()
badjob = pbs.server().job("10")
p.connect()
p.deactivate_profile(job=badjob)
p.disconnect()
The PBS “power” hook can be modified to specify a PMI name in the pbs.Power() instantiation in the init_power function. For example, the code below would cause the new file described in D.2.d.ii to be used by the hook:
power = pbs.Power("ipmitool")
Python code patterned after the file PBS_EXEC/lib/python/altair/pbs/v1/_pmi_none.py must be placed in a file where none is replaced by the PMI name being implemented. For example:
# cd $PBS_EXEC/lib/python/altair/pbs/v1
# cp _pmi_none.py _pmi_ipmitool.py
# vi _pmi_ipmitool.py