  1. Interface #1

    1. Visibility: Public

    2. Change Control: Stable

    3. Synopsis: PBS hook power control module

    4. A new class “pbs.Power” will be made available to provide power management functionality.  A hook will be able to access it via Python import.

    5. Reference to more detail on the interface. The following define the PMI operations available:

    6. pmi_activate_profile

      1. Activate a given power profile on a set of hosts on behalf of a given job.  The parameter “profile_name” is a string containing the name of a profile.  The parameter “hosts” is a list containing strings that specify hostnames.  The parameter “job” is a PBS job object.  If the hosts parameter is not specified, the hosts will be calculated from the job object.  If the job parameter is not specified, the pbs.event().job object will be used.

      2. The return type is bool.  True indicates success.  False indicates that the request was made but the PMI gave no indication of whether it succeeded.

      3. If an error occurs where it is appropriate for some or all of the job vnodes to be marked offline, this may be done before an exception is raised.

      4. If an error occurs where it is appropriate for the supported profile names for some or all of the job vnodes to be refreshed, this may be done before an exception is raised.

    7. pmi_get_usage

      1. Retrieve power usage for a job.  The parameter “job” is a PBS job object.

      2. The return will be a float which gives the cumulative energy usage for the job at the time of the call in kilowatt-hours (kWh).  If no power usage information is available, None is returned.

    8. pmi_deactivate_profile

      1. Inform the PMI that a job is no longer active.  This would be used when a job is suspended or terminated.  The parameter “job” is a PBS job object.  If it is not specified, the pbs.event().job object will be used.

      2. The return type is bool.  True indicates success.  False indicates that the request was made but the PMI gave no indication of whether it succeeded.

    9. pmi_query

      1. Return information that matches a request type.  The parameter “query_type” specifies what should be returned.  The only supported value for query_type is PMI_QUERY_PROFILE, for which the return will be a list of strings giving the profile names supported by the PMI.

    10. pmi_connect

      1. Connect to the PMI.  The parameter “endpoint” defaults to None and is a string which will be meaningful to the PMI.  The parameter “port” defaults to None and is an integer.  A typical usage would be “endpoint” specifying a hostname and “port” giving a network port for a network service connection.

      2. Currently the connection/disconnection will be done per hook instead of creating a long lasting session.

      3. Nothing is returned.  The connection information is maintained in an instantiation of the Power class.

      4. If the endpoint or port parameters are not specified, the underlying code specific to the PMI will determine the connection details.

    11. pmi_disconnect

      1. Disconnect from the PMI.  No parameters are needed since each instance of the Power class is associated with a backend power management interface.

    12. Exceptions

      1. InternalError - raised in cases where the underlying cause of a failure cannot be determined.

      2. BackendError - the backend PMI call was unsuccessful.

    13. Power module initialization

      1. A string can optionally be passed to specify the name of the PMI to be used (see I.1.11).  By default, the type of PMI to be used will be determined automatically based on the type of hardware used.

    14. Examples

      1. Activate a profile from a job specific event.

        import pbs

        p = pbs.Power()
        p.pmi_connect("power_master")    # endpoint name understood by the PMI
        p.pmi_activate_profile("LOW")    # job defaults to pbs.event().job
        p.pmi_disconnect()


      2. Get profile name list.

        import pbs

        p = pbs.Power()
        p.pmi_connect(port=3564)
        pnames = p.pmi_query(p.PMI_QUERY_PROFILE)    # list of profile name strings
        p.pmi_disconnect()


      3. Deactivate profile on a specific job.


        import pbs

        p = pbs.Power()
        badjob = pbs.server().job("10")
        p.pmi_connect()    # connection details chosen by the PMI-specific code
        p.pmi_deactivate_profile(job=badjob)
        p.pmi_disconnect()
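The examples above do not show what happens when a PMI call raises. The sketch below is one way a hook might guarantee the disconnect always runs. FakePower and the local BackendError class are hypothetical stand-ins so the code runs outside a hook; in a real hook the object would come from pbs.Power(), and the helper name activate_with_cleanup is an assumption, not part of the interface.

```python
class BackendError(Exception):
    """Stand-in for the BackendError exception named in this interface."""


class FakePower:
    """Hypothetical stand-in for pbs.Power so the sketch runs outside a hook."""

    def __init__(self):
        self.connected = False
        self.calls = []

    def pmi_connect(self, endpoint=None, port=None):
        self.connected = True
        self.calls.append("connect")

    def pmi_activate_profile(self, profile_name, hosts=None, job=None):
        self.calls.append("activate:" + profile_name)
        if profile_name not in ("LOW", "HIGH"):
            raise BackendError("unknown profile " + profile_name)
        return True

    def pmi_disconnect(self):
        self.connected = False
        self.calls.append("disconnect")


def activate_with_cleanup(power, profile_name):
    """Activate a profile; disconnect even if the PMI call raises."""
    power.pmi_connect()
    try:
        # True means success; False means the PMI gave no indication.
        return power.pmi_activate_profile(profile_name)
    finally:
        # Runs on both the normal and the exception path.
        power.pmi_disconnect()
```

Because the exception is not swallowed, the caller still sees BackendError and can decide whether to reject the event.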


  2. Interface #2
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable server power_provisioning flag
    4. Reference to more detail on the interface.
      1. The power_provisioning boolean server attribute will have a default of unset, be visible to all and changeable by an administrator.  When it is set True, PMI operations may take place if allowed by the power_enable flag (see I.1.10).  If it is unset or set False, no PMI operations will take place on any vnode.
      2. Use qmgr to set the power_provisioning flag true or false.  For example:   qmgr -c "set server power_provisioning = true"
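The interaction between the server-wide power_provisioning flag and the per-vnode power_enable flag (I.1.10) can be sketched as a small predicate. This is only an illustration of the rule as stated; the function name pmi_allowed and the use of None to model an unset attribute are assumptions, not part of PBS.

```python
def pmi_allowed(power_provisioning, power_enable):
    """Return True only when both flags are explicitly set to True.

    None models an unset attribute, which the text treats like False.
    """
    return power_provisioning is True and power_enable is True
```

With this reading, an unset or False value at either level is enough to block PMI operations on the vnode.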

  3. Interface #3
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable energy usage for a job
      1. Add a new attribute for a job: resources_used.energy
    4. Reference to more detail on the interface.
      1. The type will be float.
      2. The units will be kWh.  For example:  resources_used.energy=64.2
      3. The resources_used.energy value will only be updated when PMI operations are allowed on the vnodes used by the job.
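Since resources_used.energy is expressed in kWh, a backend that reports raw energy in joules would need a conversion. The sketch below assumes such a hypothetical backend; the function name is illustrative, and only the conversion factor (1 kWh = 3.6e6 J) is a fixed fact.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def joules_to_kwh(joules):
    """Convert an energy reading in joules to kilowatt-hours."""
    return joules / JOULES_PER_KWH
```

For instance, a reading of 231120000 J corresponds to the 64.2 kWh value used in the example above.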
  4. Interface #4
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable resource “eoe”
      1. A new resource similar to “aoe” is added to both jobs and vnodes to specify the energy operational environment.
    4. Reference to more detail on the interface.
      1. The eoe resource is added to the scheduler's default resource list in the sched_config file.
      2. It is a non-consumable resource.
      3. It is a string array set through the resources_available attribute, e.g. resources_available.eoe="low,med,high".
      4. It contains the list of all power profile names that are available on a vnode. By default, resources_available.eoe is unset.
      5. The list is visible to all but settable only by a manager.
      6. A job requests Resource_List.eoe per chunk in -l select, as in -l select=1:ncpus:eoe=low.
      7. Only one eoe value can be active on a vnode at a time.
      8. A job Resource_List.eoe may be requested in a select statement, but no more than one distinct value for the requested eoe is currently supported. For example: -l select=1:ncpus=1:eoe=med+1:ncpus=2:eoe=med
      9. If a job request is made with more than one value for eoe (e.g. -l select=1:eoe=low+1:eoe=high), it will be rejected by qsub with the error “qsub: only one value of eoe is allowed”.
      10. A value for resources_available.eoe will not be automatically set on the system(s) where the PBS server and scheduler are running.
      11. If both an aoe and eoe are set for a job, the aoe setting will be processed first by the scheduler.
      12. The scheduler will not use suspend or checkpoint to preempt a job that has eoe set.
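The single-value rule for eoe in a select statement can be illustrated with a small check. This is a simplified sketch, not the qsub implementation: the function names are hypothetical, and the parsing assumes chunks separated by '+' and resources separated by ':' with no quoted values containing those characters.

```python
def eoe_values(select_spec):
    """Collect the distinct eoe values requested across all chunks."""
    values = set()
    for chunk in select_spec.split("+"):
        for part in chunk.split(":"):
            key, sep, value = part.partition("=")
            if sep and key == "eoe":
                values.add(value)
    return values


def check_single_eoe(select_spec):
    """Raise ValueError if more than one distinct eoe value is requested."""
    if len(eoe_values(select_spec)) > 1:
        raise ValueError("only one value of eoe is allowed")
```

A request that repeats the same value in several chunks passes the check, while one that mixes low and high is rejected.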
  5. Interface #5
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable vnode attribute: current_eoe
    4. Reference to more detail on the interface.
      1. Identifies the eoe active on a vnode. It is of type String. By default, it is unset. It is settable only by a manager and visible to all.
      2. A job J1 running with an eoe setting X will cause the value of current_eoe to be set to X on the vnodes assigned to J1 that allow PMI operations.
      3. Manually changing current_eoe is unsupported.
      4. The scheduler can run a job requesting an eoe on vnodes with a current_eoe value that matches the job eoe.
      5. The scheduler can only run a job on a vnode where the current_eoe does not match the job eoe if no jobs are running on the vnode and PMI operations are allowed on the vnode.
      6. When a job ends, the pmi_deactivate_profile operation will take place if all the vnodes used by the job have no other jobs running and allow PMI operations.  At this point, current_eoe will be unset on all the vnodes used by the job.
    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Other
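The scheduling rules for current_eoe can be summarized in a short predicate for a job that requests an eoe. This is a sketch of the stated rules only, not scheduler code; the function and parameter names are hypothetical, and None models an unset current_eoe.

```python
def can_run_on_vnode(job_eoe, current_eoe, running_jobs, pmi_allowed):
    """Decide whether a job requesting job_eoe may run on a vnode.

    current_eoe is None when the attribute is unset.
    running_jobs is the number of jobs currently on the vnode.
    """
    if current_eoe == job_eoe:
        # The matching profile is already active on the vnode.
        return True
    # A mismatch requires an idle vnode on which the profile can be switched.
    return running_jobs == 0 and pmi_allowed
```

A busy vnode running with a different eoe is therefore never eligible, which is what keeps a single eoe active per vnode.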
  6. Interface #6
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Cray specific job attribute: pstate
    4. Reference to more detail on the interface.
      1. Cray ALPS reservation setting for p-state.  See Basil 1.4 documentation.
      2. It is of type String. By default, it is unset. It is settable and visible to all PBS users.
    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Other
  7. Interface #7
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Cray specific job attribute: pgov
    4. Reference to more detail on the interface.
      1. Cray ALPS reservation setting for p-governor.  See Basil 1.4 documentation.
      2. It is of type String. By default, it is unset. It is settable and visible to all PBS users.
    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Other
  8. Interface #8
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Cray specific job attribute: pcap_node
    4. Reference to more detail on the interface.
      1. Cray capmc set_power_cap --node setting.  See capmc documentation.
      2. It is of type Int. By default, it is unset. It is settable and visible to all PBS users.
      3. A negative value will result in a PBSE_BADATVAL error.
    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Other
  9. Interface #9
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Cray specific job attribute: pcap_accelerator
    4. Reference to more detail on the interface.
      1. Cray capmc set_power_cap --accel setting.  See capmc documentation.
      2. It is of type Int. By default, it is unset. It is settable and visible to all PBS users.
      3. A negative value will result in a PBSE_BADATVAL error.
    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Other
  10. Interface #10
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: Generically applicable vnode power enable flag
    4. Reference to more detail on the interface.
      1. The power_enable boolean vnode attribute will have a default of unset, be visible to all and changeable by an administrator.
      2. Use qmgr to set the power_enable flag true or false.  For example:

        qmgr -c "set node bigbox power_enable = true"

      3. When it is set True, PMI operations may take place on the vnode.  If it is unset or set False, no PMI operations are allowed to take place on the vnode.
    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Other
  11. Interface #11
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: Expose the hook PMI structure to allow additions to the supported PMI list.
    4. Reference to more detail on the interface.
      1. The PBS “power” hook can be modified to specify a PMI name in the pbs.Power() instantiation in the init_power function.  For example, the code below would cause the new file described in I.1.11.4.2 to be used by the hook:

        power = pbs.Power("ipmitool")

      2. Python code patterned after the file PBS_EXEC/lib/python/altair/pbs/v1/_pmi_none.py must be placed in a file where none is replaced by the PMI name being implemented.  For example:

        # cd $PBS_EXEC/lib/python/altair/pbs/v1
        # cp _pmi_none.py _pmi_ipmitool.py
        # vi _pmi_ipmitool.py

      3. The defined functions must all be present: __init__, _pmi_connect, _pmi_disconnect, _pmi_get_usage, _pmi_query, _pmi_activate_profile, _pmi_deactivate_profile.  These all have the same arguments as those in I.1.1 except the function name has an initial underbar ('_').
    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Other
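A skeleton for a new PMI file might look like the sketch below. It is not the contents of _pmi_none.py: the class name Pmi, the choice to define the functions as methods, the placeholder return values, and the missing_functions helper are all assumptions made for illustration. Only the required function names and their argument lists come from this document.

```python
# Names that this document says must all be present in a PMI file.
REQUIRED = (
    "__init__", "_pmi_connect", "_pmi_disconnect", "_pmi_get_usage",
    "_pmi_query", "_pmi_activate_profile", "_pmi_deactivate_profile",
)


class Pmi:
    """Hypothetical skeleton for a backend such as _pmi_ipmitool.py."""

    def __init__(self):
        self.connected = False

    def _pmi_connect(self, endpoint=None, port=None):
        self.connected = True  # placeholder: open the backend session here

    def _pmi_disconnect(self):
        self.connected = False  # placeholder: close the backend session

    def _pmi_get_usage(self, job):
        return None  # None means no power usage information is available

    def _pmi_query(self, query_type):
        return []  # placeholder: list of supported profile names

    def _pmi_activate_profile(self, profile_name, hosts=None, job=None):
        return False  # False: request made, outcome not confirmed

    def _pmi_deactivate_profile(self, job=None):
        return False  # False: request made, outcome not confirmed


def missing_functions(backend):
    """List required names that the backend does not define as callables."""
    return [name for name in REQUIRED
            if not callable(getattr(backend, name, None))]
```

A check like missing_functions(Pmi) can be run while developing a new backend to confirm that nothing in the required list was dropped during the copy-and-edit step.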
  12. Interface #12
    1. Visibility: Public
    2. Change Control: Stable
    3. Synopsis: MoM logs a message using logjobmsg when a job ends and the value of current_eoe is unset.
    4. Reference to more detail on the interface.
      1. Example:

        11/19/2014 14:44:15;0008;pbs_python;Job;165.bigcray;PMI: reset current_eoe

    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Log message


  13. Interface #13
    1. Visibility: Public
    2. Change Control: Experimental
    3. Synopsis: When the energy for a job on an SGI system is obtained, it will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. Example:

        11/06/2014 18:35:26;0008;pbs_python;Job;4856.iceberg;SGI: energy 1.456kWh

    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Log message
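Because the energy message is a semicolon-delimited MoM log line, a site script could extract the job ID and energy value from it. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it assumes exactly six fields laid out as in the example line above.

```python
def parse_energy_log(line):
    """Return (job_id, energy_kwh) from an SGI energy log line, else None."""
    fields = line.strip().split(";")
    if len(fields) != 6:
        return None
    # Assumed field order, matching the example log line.
    timestamp, event, daemon, obj_type, job_id, message = fields
    prefix, suffix = "SGI: energy ", "kWh"
    if not (message.startswith(prefix) and message.endswith(suffix)):
        return None
    return job_id, float(message[len(prefix):-len(suffix)])
```

Lines that do not carry the energy message, such as the current_eoe reset message, yield None rather than raising.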


  14. Interface #14
    1. Visibility: Public
    2. Change Control: Experimental
    3. Synopsis: The Cray capmc command invocations will be logged by MoM using LOG_DEBUG with the keyword “launch”.
    4. Reference to more detail on the interface.
      1. Example:

        11/19/2014 15:20:58;0006;pbs_python;Hook;pbs_python;Cray: 167.bigcray launch: /opt/cray/capmc/default/bin/capmc get_node_energy_counter --nids 0

    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Log message


  15. Interface #15
    1. Visibility: Public
    2. Change Control: Experimental
    3. Synopsis: Following a successful Cray capmc invocation, a message will be logged by MoM using LOG_WARNING if the time used by capmc is greater than 30 seconds.
    4. Reference to more detail on the interface.
      1. Example:

        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; 21.bigcray;launch: finished  in 156 seconds

    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Log message

  16. Interface #16
    1. Visibility: Public
    2. Change Control: Unstable
    3. Synopsis: If Cray capmc writes anything to stderr, the first line will be logged by MoM using LOG_WARNING after the “launch” message.
    4. Reference to more detail on the interface.
      1. Cray has not documented the possible stderr output from capmc.
      2. Example:

        11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; 21.bigcray;launch stderr: i fell and cannot get up

    5. Comments on the interface
      1. Standing of the interface: new interface
      2. Interface type: Log message
  17. Interface #17
    1. Visibility: Private
    2. Change Control: Unstable
    3. Synopsis: When Cray capmc is run with the argument “get_node_energy_counter”, the node count is checked and, if it is wrong, a message will be logged by MoM using logjobmsg.
    4. Reference to more detail on the interface.
      1. The same command will be run one additional time if an error is seen.  No message will be logged for the first error.  If an error occurs after the second attempt, a message is logged.
      2. For example:

        11/19/2014 15:20:05;0008;pbs_python;Job;166.centos1;error: node count 2, should be 1
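The retry-once behaviour for the node count check can be sketched as follows. This is not the hook's code: read_energy, the fetch callable returning a (node_count, energy) pair, and the log callable are hypothetical names chosen so the example runs without capmc.

```python
def read_energy(fetch, expected_nodes, log):
    """Call fetch() up to twice; log only if the second attempt is also wrong.

    fetch() is assumed to return a (node_count, energy) tuple.
    """
    for _attempt in (1, 2):
        node_count, energy = fetch()
        if node_count == expected_nodes:
            return energy
        # First mismatch: retry silently. Second mismatch: fall through.
    log("error: node count %d, should be %d" % (node_count, expected_nodes))
    return None
```

A transient mismatch on the first call therefore leaves no trace in the log, while two consecutive mismatches produce a single message in the format shown above.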

                                              ...