Technical Term: Description or Definition

PMI: Power Management Infrastructure

...

  1. Interface #1
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable server power_provisioning flag
    4. Reference to more detail on the interface.
      1. The power_provisioning boolean server attribute is unset by default, visible to all, and changeable by a manager. When it is set to True, PMI operations may take place on a vnode if that vnode's power_provisioning flag allows them (see A.1.9). If it is unset or set to False, no PMI operations will take place on any vnode.
      2. Use qmgr to set the power_provisioning flag true or false. For example:


        qmgr -c "set server power_provisioning = true"


  2. Interface #2
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable energy usage for a job
      1. Add a new attribute for a job: resources_used.energy
    4. Reference to more detail on the interface.
      1. The type will be float.
      2. The units will be kWh.  For example:  resources_used.energy=64.2
      3. The resources_used.energy value is updated only when PMI operations are allowed on the vnodes used by the job. When PMI operations are not allowed on those vnodes, resources_used.energy does not appear in qstat -f output or in the server and accounting logs.
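Several log examples later in this document report energy in Joules, while resources_used.energy is expressed in kWh. The following is a minimal sketch of that unit conversion. The function names and the rounding to three decimals are assumptions made for illustration, not the PMI hook's actual code.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J


def joules_to_kwh(joules):
    """Convert an energy reading in Joules to kilowatt-hours."""
    return joules / JOULES_PER_KWH


def format_energy_resource(joules):
    """Hypothetical helper: render a reading as a resources_used.energy string."""
    # Rounding to three decimals is an assumption for display purposes only.
    return "resources_used.energy=%s" % round(joules_to_kwh(joules), 3)
```

For instance, a reading of 231120000 J corresponds to the 64.2 kWh value used in the example above.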
  3. Interface #3
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable resource “eoe”
      1. A new resource similar to “aoe” is added to both jobs and vnodes to specify the energy operational environment.
    4. Reference to more detail on the interface.
      1. It is added to the scheduler's default resource list in the sched_config file.
      2. It is a non-consumable resource.
      3. It is a string array resource set in the resources_available attribute, e.g. resources_available.eoe="low,med,high".
      4. It contains the list of all power profile names that are available on a vnode. By default, resources_available.eoe is unset.
      5. The list is visible to all but settable only by a manager.
      6. A job requests eoe per chunk in the select statement as Resource_List.eoe, e.g. -l select=1:ncpus=1:eoe=low. This requests one chunk from a vnode that offers low in resources_available.eoe.
      7. Only one eoe value can be active on a vnode at a time.
      8. Resource_List.eoe may be requested in more than one chunk of a select statement, but only one distinct eoe value per job is currently supported, e.g. -l select=1:ncpus=1:eoe=med+1:ncpus=2:eoe=med
      9. If a job requests more than one value for eoe (e.g. -l select=1:eoe=low+1:eoe=high), it will be rejected by qsub with the error “qsub: only one value of eoe is allowed”.
      10. A value for resources_available.eoe will not be automatically set on the system(s) where the PBS server and scheduler are running.
      11. If both an aoe and eoe are set for a job, the aoe setting will be processed first by the scheduler.
      12. The scheduler will not preempt a job with eoe set using suspend or checkpoint.
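The single-value rule for eoe in a select statement can be sketched as follows. This is an illustrative check only: the function names are hypothetical, and the parsing is deliberately simplified (it splits on "+" and ":" and ignores quoting), so it is not how qsub itself parses a select specification.

```python
def distinct_eoe_values(select_spec):
    """Return the set of eoe values requested across all chunks."""
    values = set()
    for chunk in select_spec.split("+"):
        for part in chunk.split(":"):
            key, sep, value = part.partition("=")
            if sep and key == "eoe":
                values.add(value)
    return values


def check_select(select_spec):
    """Return None if the request is acceptable, else the rejection text."""
    if len(distinct_eoe_values(select_spec)) > 1:
        return "qsub: only one value of eoe is allowed"
    return None
```

A request that repeats the same eoe value in several chunks passes the check, while one that mixes low and high yields the error text quoted above.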
  4. Interface #4
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable vnode attribute: current_eoe
    4. Reference to more detail on the interface.
      1. Identifies the eoe active on a vnode. It is of type String. By default, it is unset. It is settable only by manager and visible to all.
      2. A job J1 running with an eoe setting X will cause the value of current_eoe to be set to X on the vnodes assigned to J1 that allow PMI operations.
      3. Manually changing current_eoe is unsupported.
      4. The scheduler can run a job requesting an eoe on vnodes with a current_eoe value that matches the job eoe.
      5. The scheduler can only run a job on a vnode where the current_eoe does not match the job eoe if no jobs are running on the vnode and PMI operations are allowed on the vnode.
      6. When a job ends, the deactivate operation takes place if none of the vnodes used by the job have other jobs running and all of them allow PMI operations. At this point, current_eoe is unset on all the vnodes used by the job.
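The two scheduling rules for current_eoe can be summarized in a small predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, None models an unset current_eoe, and the check for the requested value in resources_available.eoe is left out.

```python
def can_run_on_vnode(job_eoe, current_eoe, running_jobs, pmi_allowed):
    """Decide whether a job requesting job_eoe may run on a vnode.

    current_eoe is None when the attribute is unset on the vnode.
    """
    # Rule 1: the active profile already matches the job's request.
    if current_eoe == job_eoe:
        return True
    # Rule 2: a mismatch is acceptable only on an idle vnode that
    # allows PMI operations, so the profile can be switched.
    return running_jobs == 0 and pmi_allowed
```

Under this reading, a vnode running other jobs with current_eoe=low cannot accept a job requesting eoe=high until it becomes idle.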
  5. Interface #5
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Cray-specific job resource: pstate
    4. Reference to more detail on the interface.
      1. Cray ALPS reservation setting for p-state.  See Basil 1.4 documentation.
      2. It is of type String. By default, it is unset. It is settable and visible to all PBS users.
  6. Interface #6
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Cray-specific job resource: pgov
    4. Reference to more detail on the interface.
      1. Cray ALPS reservation setting for p-governor.  See Basil 1.4 documentation.
      2. It is of type String. By default, it is unset. It is settable and visible to all PBS users.
  7. Interface #7
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Cray-specific job resource: pcap_node
    4. Reference to more detail on the interface.
      1. Cray capmc set_power_cap --node setting.  See capmc documentation.
      2. It is of type Int. By default, it is unset. It is settable and visible to all PBS users.
      3. A negative value will result in a PBSE_BADATVAL error.
  8. Interface #8
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Cray-specific job resource: pcap_accelerator
    4. Reference to more detail on the interface.
      1. Cray capmc set_power_cap --accel setting.  See capmc documentation.
      2. It is of type Int. By default, it is unset. It is settable and visible to all PBS users.
      3. A negative value will result in a PBSE_BADATVAL error.
  9. Interface #9
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable vnode power_provisioning flag
    4. Reference to more detail on the interface.
      1. The power_provisioning boolean vnode attribute is unset by default, visible to all, and changeable by a manager.
      2. Use qmgr to set the power_provisioning flag true or false.  For example:


        qmgr -c "set node bigbox power_provisioning = true"


      3. When it is set to True, PMI operations may take place on the vnode.  If it is unset or set to False, no PMI operations are allowed to take place on the vnode.
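Taken together with the server attribute of the same name, the vnode flag gives a two-level gate on PMI operations. The sketch below illustrates that reading; the function name is hypothetical and None is used to model an unset attribute.

```python
def pmi_allowed(server_flag, vnode_flag):
    """Return True only when both power_provisioning flags are set to True.

    Each argument is True, False, or None (None models "unset").
    """
    # Unset and False behave identically: no PMI operations take place.
    return server_flag is True and vnode_flag is True
```

With the server flag unset, no vnode setting can enable PMI operations; with the server flag True, each vnode still has to opt in individually.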

  10. Interface #10
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: MoM logs a message using logjobmsg when a job ends and the value of current_eoe is unset.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 14:44:15;0008;pbs_python;Job;165.bigcray;PMI: reset current_eoe


  11. Interface #11
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When the energy for a job on an SGI HPE system is obtained, it will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:


        11/06/2014 18:35:26;0008;pbs_python;Job;4856.iceberg;SGI HPE: energy 1.456kWh


  12. Interface #12
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: The Cray capmc command invocations will be logged by MoM using LOG_DEBUG with the keyword “launch”.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;Cray: 167.bigcray launch: /opt/cray/capmc/default/bin/capmc get_node_energy_counter --nids 0


  13. Interface #13
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis:  Following a successful Cray capmc invocation, a message will be logged by MoM using LOG_WARNING if the time used by capmc is greater than 30 seconds.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; 21.bigcray;launch: finished  in 156 seconds
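The 30-second threshold can be illustrated with a small timing wrapper. This is a sketch, not the hook's implementation: the function name, the injectable clock, and the exact warning wording (modeled on the example above, without the job identifier) are assumptions.

```python
import time

SLOW_THRESHOLD_SECONDS = 30  # threshold stated in the synopsis


def timed_call(func, clock=time.monotonic):
    """Run func() and return (result, warning).

    warning is None unless the call took longer than the threshold.
    """
    start = clock()
    result = func()
    elapsed = clock() - start
    warning = None
    if elapsed > SLOW_THRESHOLD_SECONDS:
        warning = "launch: finished in %d seconds" % elapsed
    return result, warning
```

Passing the clock as a parameter keeps the function easy to exercise without waiting for a slow command.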


  14. Interface #14
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis:  If Cray capmc writes anything to stderr, the first line will be logged by MoM using LOG_WARNING after the “launch” message.
    4. Reference to more detail on the interface.
      1. Cray has not documented the possible stderr output from capmc.
      2. Example:


        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; 21.bigcray;launch stderr: i fell and cannot get up


  15. Interface #15
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis:  When Cray capmc is run with the argument “get_node_energy_counter”, the node count is checked and if it is wrong, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. The same command will be run one additional time if an error is seen.  No message will be logged for the first error.  If an error occurs after the second attempt, a message is logged.
      2. For example:


        11/19/2014 15:20:05;0008;pbs_python;Job;166.centos1;error: node count 2, should be 1


      3. The output from capmc should include a node count.  If it does not, the messages will show “not set” instead of a number.
      4. Example:


        11/19/2014 15:20:05;0008;pbs_python;Job;166.centos1;node count not set, should be 1
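The retry behaviour described for get_node_energy_counter can be sketched as below. The names are hypothetical: run_capmc stands in for whatever invokes capmc and is assumed to return a (node_count, energy) pair, with node_count None when the output lacks a count. The message format follows the second example above; the source examples differ in their prefix, so the exact text is an assumption.

```python
def get_counter_with_retry(run_capmc, expected_nodes, log):
    """Run the capmc query up to twice; log only if the second try also fails.

    Returns the energy value on success, or None after two bad node counts.
    """
    count = None
    for _attempt in range(2):
        count, energy = run_capmc()
        if count == expected_nodes:
            return energy
    # Both attempts returned a wrong or missing node count.
    shown = "not set" if count is None else str(count)
    log("node count %s, should be %d" % (shown, expected_nodes))
    return None
```

A first failure followed by a good result produces no log output, which matches the statement that the first error is silent.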


  16. Interface #16
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: If Cray RUR is configured (see B.1.f), log messages will be logged by MoM using logjobmsg when a job ends.
    4. Reference to more detail on the interface.
      1. A message will show the energy used by each aprun run by a job and a job tally in Joules.  For example:


        11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray:RUR: {"apid":34876,"apid_energy":83876J,"job_energy":83876J}

        11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray:RUR: {"apid":34972,"apid_energy":84272J,"job_energy":168148J}

        11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray:RUR: {"apid":35234,"apid_energy":83194J,"job_energy":251342J}
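In the example above, job_energy is a running total of the apid_energy values. The sketch below checks that relationship for a list of message payloads. The regular expressions are written against the example text only (the payload is not strict JSON because of the J suffix), and the function name is hypothetical.

```python
import re

APID_ENERGY = re.compile(r'"apid_energy":(\d+)J')
JOB_ENERGY = re.compile(r'"job_energy":(\d+)J')


def tally_matches(messages):
    """Return True if each job_energy equals the running sum of apid_energy."""
    total = 0
    for msg in messages:
        total += int(APID_ENERGY.search(msg).group(1))
        if total != int(JOB_ENERGY.search(msg).group(1)):
            return False
    return True
```

Applied to the three example messages, the running sums are 83876, 168148, and 251342 Joules, which agree with the logged job_energy values.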


  17. Interface #17
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: If Cray RUR is not configured, a log message will be logged by MoM using logjobmsg when a job ends.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 18:17:16;0008;pbs_python;Job;267.bigcray;Cray: no RUR data


  18. Interface #18
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: At the end of a job on a Cray, the energy reported by capmc for the compute nodes used by the job will be logged  by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:


        11/06/2014 18:35:26;0008;pbs_python;Job;156.bigcray;energy usage 554520J


  19. Interface #19
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: The energy reported by capmc for the compute nodes used by a job on a Cray will be logged by MoM using logjobmsg periodically every 5 minutes as the job runs.
    4. Reference to more detail on the interface.
      1. Example:


        11/06/2014 18:35:26;0008;pbs_python;Job;156.bigcray;Cray: get_usage: energy 346342J


  20. Interface #20
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When the PMI on a Cray is initialized, MoM will log messages at LOG_DEBUG.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;Cray: init

        11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;Cray: connect


  21. Interface #21
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When get_usage() is called for a job on a Cray, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: get_usage


  22. Interface #22
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When query() is called on a Cray, a message will be logged by MoM using LOG_DEBUG.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;Cray: query


  23. Interface #23
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When activate_profile() is called on a Cray, a message will be logged by MoM using LOG_DEBUG.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python;Cray: 167.centos1 activate 'low'


  24. Interface #24
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When activate_profile() is called on a Cray but no compute nodes are allocated to the job, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: no compute nodes for power setting


  25. Interface #25
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When activate_profile() is called on a Cray and the job has pcap_node set, a message will be logged by MoM using logjobmsg showing the pcap_node value.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: pcap node 350


  26. Interface #26
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When activate_profile() is called on a Cray and the job has pcap_accelerator set, a message will be logged by MoM using logjobmsg showing the pcap_accelerator value.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: pcap accel 250


  27. Interface #27
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When activate_profile() is called on a Cray and the job has neither pcap_node nor pcap_accelerator set, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: no power cap to set


  28. Interface #28
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When deactivate_profile() is called on a Cray, a message will be logged by MoM using LOG_DEBUG.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python;Cray: deactivate 167.centos1


  29. Interface #29
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When deactivate_profile() is called on a Cray but no compute nodes are allocated to the job, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: no compute nodes for power setting


  30. Interface #30
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When deactivate_profile() is called on a Cray and the job has pcap_node set, a message will be logged by MoM using logjobmsg showing the pcap_node value.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: remove pcap node 350


  31. Interface #31
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When deactivate_profile() is called on a Cray and the job has pcap_accelerator set, a message will be logged by MoM using logjobmsg showing the pcap_accelerator value.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: remove pcap accel 250


  32. Interface #32
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When deactivate_profile() is called on a Cray and the job has neither pcap_node nor pcap_accelerator set, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: no power cap to remove


  33. Interface #33
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis:  If Cray RUR is configured but the file created by the output plugin has a permission problem, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. The file must be owned by UID 0 (root) and must not be writable by others.
      2. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: RUR file:  /var/spool/PBS/spool/167.centos1.rur should only be writable by root


  34. Interface #34
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis:  If Cray RUR is configured but the file created by the output plugin can be read, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:

        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray: reading RUR file:  /var/spool/PBS/spool/167.centos1.rur


  35. Interface #35
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: If the file created by the RUR output plugin can be read but the energy value cannot be parsed, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. A python exception error string will be output as part of the message.
      2. Example:

        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: energy_used not found: unexpected EOF while parsing


  36. Interface #36
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: If the file created by the RUR output plugin can be read but the Cray energy RUR plugin has not been enabled, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:

        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: warning: energy plugin not enabled by RUR


  37. Interface #37
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When the energy for a job is successfully obtained from RUR, MoM will log one of three possible messages using logjobmsg.
    4. Reference to more detail on the interface.
      1. If no energy value was obtained from capmc:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: energy 4.234kWh


      2. If the energy value from capmc was smaller than what was obtained from RUR:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: energy 4.234kWh replaces capmc energy 4.1432kWh


      3. If the energy value from capmc was greater than or equal to what was obtained from RUR:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;Cray:RUR: energy 4.234kWh last capmc usage 4.2432kWh
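The choice among the three messages can be sketched as a single function. The function name is hypothetical, None models "no energy value was obtained from capmc", and the %g number formatting is an assumption chosen to reproduce the example strings.

```python
def rur_energy_message(rur_kwh, capmc_kwh=None):
    """Build the RUR energy message for the three cases listed above."""
    base = "Cray:RUR: energy %gkWh" % rur_kwh
    if capmc_kwh is None:
        # Case 1: no capmc value to compare against.
        return base
    if capmc_kwh < rur_kwh:
        # Case 2: the RUR value is larger and replaces the capmc value.
        return "%s replaces capmc energy %gkWh" % (base, capmc_kwh)
    # Case 3: the capmc value is greater than or equal to the RUR value.
    return "%s last capmc usage %gkWh" % (base, capmc_kwh)
```

Calling rur_energy_message(4.234, 4.1432) yields the "replaces capmc energy" form, while rur_energy_message(4.234, 4.2432) yields the "last capmc usage" form.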


  38. Interface #38
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When the PMI on an SGI HPE is initialized, MoM will log messages at LOG_DEBUG.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;SGI HPE: init

        11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;SGI HPE: connect


  39. Interface #39
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When get_usage() is called for a job on an SGI HPE, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1;SGI HPE: get_usage


  40. Interface #40
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When query() is called on an SGI HPE, a message will be logged by MoM using LOG_DEBUG.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;SGI HPE: query


  41. Interface #41
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When activate_profile() is called on an SGI HPE, a message will be logged by MoM using LOG_DEBUG.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python;SGI HPE: 167.centos1 activate '450W'


  42. Interface #42
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When deactivate_profile() is called on an SGI HPE, a message will be logged by MoM using LOG_DEBUG.
    4. Reference to more detail on the interface.
      1. Example:


        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python;SGI HPE: deactivate


  43. Interface #43
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: If any PMI operation is attempted for a job with a vnode assigned that does not have power_provisioning=True, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:

        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1; power functionality is disabled on vnode v12


  44. Interface #44
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: If the PMI hook is run with an unexpected event,  MoM will log a message at LOG_WARNING.
    4. Reference to more detail on the interface.
      1. Example:

        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; Event not serviceable for power provisioning.


  45. Interface #45
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: When the PMI hook handles the EXECHOST_STARTUP event and MoM is running on the same host as pbs_server or pbs_sched, MoM will log a message at LOG_DEBUG.
    4. Reference to more detail on the interface.
      1. Example:

        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; Provisioning cannot be enabled on this host.


  46. Interface #46
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: If any PMI operation at job end throws a python exception, a message will be logged by MoM using logjobmsg showing the exception string.
    4. Reference to more detail on the interface.
      1. Example:

        11/19/2014 17:24:21;0008;pbs_python;Job;167.centos1; socket.error: [Errno 111] Connection refused


  47. Interface #47
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: If activate_profile() throws either of the python exceptions defined in D.1.d.vii, a message will be logged by MoM at LOG_WARNING.
    4. Reference to more detail on the interface.
      1. If the exception is BackendError, query() is called to reset the eoe value for the natural vnode for the MoM.
      2. Example:

        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; PMI:activate: set eoe: low,med,high


      3. If the exception is InternalError, the natural vnode for the MoM will be set offline.
      4. Example:

        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; PMI:activate: set myself offline
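The two recovery paths for activate_profile() can be sketched as follows. The exception names come from the text above, but the classes defined here are local stand-ins, and the callables (activate, query, set_offline, log) are hypothetical parameters rather than the hook's real interfaces. query is assumed to return the list of available profile names.

```python
class BackendError(Exception):
    """Stand-in for the PMI BackendError exception."""


class InternalError(Exception):
    """Stand-in for the PMI InternalError exception."""


def activate_with_recovery(activate, query, set_offline, log):
    """Call activate() and apply the documented recovery on failure."""
    try:
        activate()
    except BackendError:
        # Re-query the backend and reset the eoe list for the natural vnode.
        eoe_list = query()
        log("PMI:activate: set eoe: %s" % ",".join(eoe_list))
    except InternalError:
        # The natural vnode for this MoM is taken offline.
        set_offline()
        log("PMI:activate: set myself offline")
```

Keeping the two handlers separate mirrors the text: a backend failure is treated as recoverable by refreshing the profile list, while an internal failure removes the vnode from service.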


  48. Interface #48
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: PBS hook order supports values in the range [-1000, 2000].
    4. Reference to more detail on the interface.
      1. Example:

        # qmgr -c "set pbshook power_hook order = -1000"

        # qmgr -c "set pbshook power_hook order = 2000"


...