Objective:

To integrate PBS with Linux cgroup capabilities

Interface 1: cgroup configuration file

  • Visibility: Public
  • Change Control: Experimental
  • Synopsis: JSON hook config file

  • Details: Allows the admin to enable and adjust the behavior of the cgroup hook based on the settings.
  • Notes:
    • All config options inherit the visibility and change control of the cgroup configuration file
    • qmgr will validate this file before importing it. Any error in the JSON syntax will cause qmgr to reject the import
    • Config options in red do not have PTL tests checking them


Cgroup Config File
{
        "cgroup_prefix"         : "pbspro",
        "periodic_resc_update"  : true,
        "exclude_hosts"         : ["n001", "n002"],
        "exclude_vntypes"       : ["no_cgroups", "login_nodes"],
        "run_only_on_hosts"     : [],
        "vnode_per_numa_node"   : false,
        "online_offlined_nodes" : true,
        "cgroup":
        {
                "cpuacct":
                {
                        "enabled"               : true,
                        "exclude_hosts"         : [], 
                        "exclude_vntypes"       : []
                },
                "cpuset":
                {
                        "enabled"               : true,
                        "exclude_hosts"         : [], 
                        "exclude_vntypes"       : ["no_cgroups_cpus"]
                },
                "devices":
                {
                        "enabled"               : false,
                        "exclude_hosts"         : [], 
                        "exclude_vntypes"       : [], 
                        "allow" : ["b *:* rwm","c *:* rwm", ["mic/scif","rwm"],["nvidiactl","rwm", "*"],["nvidia-uvm","rwm"]]
                },
                "hugetlb":
                {
                        "enabled"               : false,
                        "default"               : "0MB",
                        "exclude_hosts"         : [], 
                        "exclude_vntypes"       : []
                },
                "memory":
                {
                        "enabled"               : true,
                        "default"               : "256MB",
                        "reserve_memory"        : "0MB",
                        "exclude_hosts"         : [], 
                        "exclude_vntypes"       : ["no_cgroups_mem"]
                },
                "memsw":
                {
                        "enabled"               : true,
                        "default"               : "256MB",
                        "reserve_memory"        : "2gb",
                        "exclude_hosts"         : [], 
                        "exclude_vntypes"       : []
                }
        }
}
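
Because qmgr rejects a config file that contains JSON syntax errors, it can be convenient to check the file locally before importing it. The following is a minimal sketch of such a check, not the validation qmgr itself performs. The function name check_hook_config is hypothetical, and the requirement that the top-level value be a JSON object is an assumption based on the sample file above.

```python
import json


def check_hook_config(text):
    """Return (ok, message) for the text of a candidate cgroups.json file.

    Only JSON syntax is checked. Key names and value types are not
    validated, so a file that passes here may still be rejected or
    ignored by the hook.
    """
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # Report the position of the first syntax error.
        return False, "line %d, column %d: %s" % (err.lineno, err.colno, err.msg)
    # Assumption: the hook expects a JSON object at the top level,
    # as in the sample config file.
    if not isinstance(config, dict):
        return False, "top-level value must be a JSON object"
    return True, "ok"
```

For example, calling check_hook_config(open("cgroups.json").read()) on a file with a trailing comma returns False together with the line and column of the error.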


  • Config Option: cgroup_prefix allows the admin to name the directory where all of the cgroup directories for PBS jobs will be placed (e.g. /sys/fs/cgroup/cpuset/<cgroup_prefix>).
  • Config Option: periodic_resc_update allows the admin to enable the cgroup hook to update the resources_used values for cput, mem, and vmem. Valid values are true/false. Default value is true.
  • Config Option: exclude_hosts allows the admin to exclude certain hosts from running the cgroups hook. Valid values are any host name, as reported by the hook function pbs.get_local_nodename(), managed by the PBS server.
  • Config Option: exclude_vntypes allows the admin to exclude certain vntypes from running the cgroups hook. Valid values are any string that the admin places on the first line of a file named vntype located in PBS_HOME/mom_priv.
    Note: the file in PBS_HOME/mom_priv is called vntypes, NOT vntype
  • Config Option: run_only_on_hosts allows the admin to restrict the cgroups hook to run only on a certain set of hosts. Valid values are any host name, as reported by the hook function pbs.get_local_nodename(), managed by the PBS server.
  • Config Option: vnode_per_numa_node allows the admin to create an individual vnode per NUMA node. On a two-socket system it creates two additional vnodes and assigns the resources of each NUMA node to the corresponding vnode. It also sets the resources managed by the parent vnode to zero. Valid values are true/false.
    • Note: No automated tests since I only have a single socket test system. I have manually tested it and it works as expected.
  • Config Option: online_offlined_nodes allows the cgroup hook to online nodes that were offlined by the cgroup hook due to orphan cgroups not cleaning up. Valid values are true or false.
  • Config Option: cgroup allows the admin to specify which subsystems PBS will use for the job. Valid subsystem values are cpuacct, cpuset, devices, hugetlb (where supported), memory, and memsw.
    • Valid Config Options for all Subsystems
      • enabled: valid options are true/false
      • exclude_hosts: see above for valid options
      • exclude_vntypes: see above for valid options
    • Config Option: cpuacct subsystem allows the admin to use the cpuacct subsystem, which tracks the cput of all of the pids assigned to the cgroup. Valid keys in the cpuacct subsystem are enabled (valid options true/false), exclude_hosts (see above), exclude_vntypes (see above).
    • Config Option: cpuset subsystem allows the admin to use the cpuset subsystem, which assigns the cores and memory socket(s) for use by all of the pids assigned to the cgroup. Valid keys in the cpuset subsystem are enabled (valid options true/false), exclude_hosts (see above), exclude_vntypes (see above).
    • Config Option: devices subsystem allows the admin to use the devices subsystem, which assigns the devices for use by all of the pids assigned to the cgroup. Valid keys in the devices subsystem are enabled (valid options true/false), exclude_hosts (see above), exclude_vntypes (see above), allow (list of devices to allow access to).
      • Config Option: allow lists the devices that all of the pids assigned to the cgroup are allowed to access. Valid ways to reference allowable devices are as follows:
        • "b *:* rwm" (This exact string will be used in the allowed string)
        • ["mic/scif","rwm"] (This will look up the major and minor numbers of the mic/scif device and set its access to rwm; e.g. if /dev/mic reported "crw-rw-rw- 1 root root 244, 1 Mar 30 14:50 scif" then the line added to the allow file would be "c 244:1 rwm")
        • ["nvidiactl","rwm", "*"] (This will look up the major number of the nvidiactl device and set its access to rwm for every minor number; e.g. if /dev/nvidiactl reported "crw-rw-rw- 1 root root 284, 1 Mar 30 14:50 nvidiactl" then the line added to the allow file would be "c 284:* rwm")
      • Notes: The devices subsystem has not been extensively tested. However, manual tests on systems with GPUs and MICs seem to work as expected if the correct devices have been added to the allow list.
    • Config Option: hugetlb subsystem allows the admin to use the hugetlb subsystem, which allows access to the hugetlb memory. Valid keys in the hugetlb subsystem are enabled (valid options true/false), exclude_hosts (see above), exclude_vntypes (see above).
    • Config Option: memory subsystem allows the admin to use the memory subsystem, which allows the admin to monitor and limit memory used by all of the pids assigned to the cgroup. Valid keys in the memory subsystem are enabled (valid options true/false), default (memory assigned if the job did not request any), reserve_memory (memory to reserve for processes outside of PBS), exclude_hosts (see above), exclude_vntypes (see above).
      • Config Option: default allows the admin to assign memory to a job if it did not request any
      • Config Option: reserve_memory allows the admin to reserve memory for processes outside of PBS jobs
    • Config Option: memsw subsystem allows the admin to monitor and limit swap used by all of the pids assigned to the cgroup.
      • Config Option: default allows the admin to assign memory to a job if it did not request any
      • Config Option: reserve_memory allows the admin to reserve memory for processes outside of PBS jobs
      • Note: To limit swap you must add vmem to the resources line in the sched_config file
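
The translation of the devices "allow" entries described above can be sketched as follows. This is an illustration of the three documented entry forms, not the hook's actual code. The function names device_numbers and allow_line, the lookup and dev_root parameters, and the rule that any third list element replaces the minor number (the text only documents "*") are assumptions.

```python
import os
import stat


def device_numbers(path):
    """Return (type_letter, major, minor) for a device node."""
    info = os.stat(path)
    if stat.S_ISCHR(info.st_mode):
        kind = "c"
    elif stat.S_ISBLK(info.st_mode):
        kind = "b"
    else:
        raise ValueError("%s is not a device node" % path)
    return kind, os.major(info.st_rdev), os.minor(info.st_rdev)


def allow_line(entry, lookup=device_numbers, dev_root="/dev"):
    """Translate one "allow" entry into a line for the devices allow file.

    lookup is injectable so the function can be exercised without
    real device nodes (hypothetical parameter).
    """
    # A plain string such as "b *:* rwm" is used exactly as written.
    if isinstance(entry, str):
        return entry
    # A list is [device name under /dev, access string, optional minor].
    name, access = entry[0], entry[1]
    kind, major, minor = lookup(os.path.join(dev_root, name))
    if len(entry) > 2:
        # Assumption: the third element overrides the minor number,
        # e.g. "*" to match every minor of that major.
        minor = entry[2]
    return "%s %s:%s %s" % (kind, major, minor, access)
```

With a lookup that reports ("c", 244, 1) for /dev/mic/scif, the entry ["mic/scif", "rwm"] yields the "c 244:1 rwm" line from the example above.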

Setup:

  • Run the following commands in qmgr from the directory where the cgroups.py and cgroups.json files are located
    • create hook cgroups
    • set hook cgroups event = "execjob_begin,execjob_launch,execjob_attach,execjob_epilogue,execjob_end,exechost_startup,exechost_periodic"
    • set hook cgroups freq = 120
    • set hook cgroups fail_action = offline_vnodes
    • import hook cgroups application/x-python default cgroups.py
    • import hook cgroups application/x-config default cgroups.json
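
The same setup steps could also be scripted. The sketch below builds the qmgr commands listed above and, unless dry_run is set, passes each one to qmgr -c. It assumes qmgr is on the PATH and that the caller has PBS manager privileges. The names setup_commands and run_setup and their parameters are hypothetical.

```python
import subprocess

# Hook events taken from the setup list above.
HOOK_EVENTS = [
    "execjob_begin", "execjob_launch", "execjob_attach",
    "execjob_epilogue", "execjob_end",
    "exechost_startup", "exechost_periodic",
]


def setup_commands(hook="cgroups", script="cgroups.py",
                   config="cgroups.json", freq=120):
    """Return the qmgr commands for installing the hook, in order."""
    return [
        "create hook %s" % hook,
        'set hook %s event = "%s"' % (hook, ",".join(HOOK_EVENTS)),
        "set hook %s freq = %d" % (hook, freq),
        "set hook %s fail_action = offline_vnodes" % hook,
        "import hook %s application/x-python default %s" % (hook, script),
        "import hook %s application/x-config default %s" % (hook, config),
    ]


def run_setup(workdir=".", dry_run=True):
    """Print the commands, or run them from the directory holding the files."""
    for command in setup_commands():
        if dry_run:
            print(command)
        else:
            # check=True stops at the first command qmgr rejects.
            subprocess.run(["qmgr", "-c", command], cwd=workdir, check=True)
```

Calling run_setup() with the default dry_run=True only prints the commands, which is a safe way to review them before applying them to a live server.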