Overview:

The objective of PP-325 is to provide a hook that restricts resource availability for job processes by utilizing the Linux cgroup infrastructure. An overview of the Linux kernel cgroup implementation may be found here: https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt. It is assumed that the reader is familiar with the various cgroup subsystems and how they may be utilized. When the cgroup hook is enabled, it runs on every node assigned to the job. When a job is started, the hook creates a set of directories for the configured subsystems based on the resource requirements of the job and then places the job process within the cgroup. The kernel then enforces resource restrictions based on the newly created cgroup settings while the job is running. The cgroup hook may be configured to periodically poll the job's cgroup while the job is running and update resource utilization. When the job completes, the final resource utilization measurement is taken and the hook removes the cgroup directories it created when the job was initialized.
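The per-job layout this implies is one directory per enabled subsystem under the configured prefix, with the job's processes attached via the subsystem's tasks file. The sketch below only illustrates that cgroup v1 mechanism and is not the hook's actual code; the /sys/fs/cgroup mount point, the pbspro prefix, and the function names are assumptions made for the example.

Illustrative sketch: confining a job under cgroup v1
import os

CGROUP_ROOT = "/sys/fs/cgroup"   # assumed cgroup v1 mount point
PREFIX = "pbspro"                # matches the default cgroup_prefix described below

def confine_job(jobid, pid, subsystems=("cpuacct", "memory")):
    """Create per-job cgroup directories and attach the job's top process."""
    for subsys in subsystems:
        job_dir = os.path.join(CGROUP_ROOT, subsys, PREFIX, jobid)
        os.makedirs(job_dir, exist_ok=True)
        # Writing the PID to the tasks file moves the process (and any children
        # it forks afterwards) into the job's cgroup, so the kernel enforces
        # whatever limits are written into this directory.
        with open(os.path.join(job_dir, "tasks"), "w") as f:
            f.write(str(pid))
        # Note: a cpuset cgroup would additionally need cpuset.cpus and
        # cpuset.mems populated before any task can be attached to it.

def release_job(jobid, subsystems=("cpuacct", "memory")):
    """Remove the per-job directories once the job has finished."""
    for subsys in subsystems:
        job_dir = os.path.join(CGROUP_ROOT, subsys, PREFIX, jobid)
        if os.path.isdir(job_dir):
            os.rmdir(job_dir)  # only succeeds once no tasks remain in it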

Interface 1: cgroup configuration file

  • Visibility: Public
  • Change Control: Experimental
  • Synopsis: JSON hook config file

  • Details: The configuration file allows the administrator to adjust the behavior of the cgroup hook across their cluster. The file must conform to JSON syntax. A sample configuration file is displayed and described below:
Cgroup Configuration File
{
    "cgroup_prefix"         : "pbspro",
    "exclude_hosts"         : ["node001", "node002"],
    "exclude_vntypes"       : ["disable_cgroups", "login_node"],
    "run_only_on_hosts"     : [],
    "periodic_resc_update"  : true,
    "vnode_per_numa_node"   : false,
    "online_offlined_nodes" : true,
    "use_hyperthreads"      : false,
    "cgroup" : {
        "cpuacct" : {
            "enabled"         : true,
            "exclude_hosts"   : ["node001"],
            "exclude_vntypes" : ["red_node"]
        },
        "cpuset" : {
            "enabled"         : true,
            "exclude_hosts"   : ["node002"],
            "exclude_vntypes" : ["green_node"]
        },
        "devices" : {
            "enabled"         : false,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "allow"           : [
                "b *:* rwm",
                "c *:* rwm",
                ["mic/scif", "rwm"],
                ["nvidiactl", "rwm", "*"],
                ["nvidia-uvm", "rwm"]
            ]
        },
        "hugetlb" : {
            "enabled"         : false,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "default"         : "0MB",
            "reserve_percent" : "0",
            "reserve_amount"  : "0MB"
        },
        "memory" : {
            "enabled"         : true,
            "exclude_hosts"   : [],
            "exclude_vntypes" : ["blue_node"],
            "soft_limit"      : false,
            "default"         : "256MB",
            "reserve_percent" : "0",
            "reserve_amount"  : "1GB"
        },
        "memsw" : {
            "enabled"         : true,
            "exclude_hosts"   : [],
            "exclude_vntypes" : ["grey_node"],
            "default"         : "256MB",
            "reserve_percent" : "0",
            "reserve_amount"  : "1GB"
        }
    }
}
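
Since the file must conform to JSON syntax, a quick parse before importing it with qmgr catches typos early. The snippet below is a minimal sketch; the cgroups.json path is simply wherever the configuration file is kept.

Checking cgroups.json syntax
import json

with open("cgroups.json") as f:
    cfg = json.load(f)        # raises an error if the file is not valid JSON
print(sorted(cfg["cgroup"]))  # e.g. ['cpuacct', 'cpuset', 'devices', 'hugetlb', 'memory', 'memsw']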


Global Parameters:

  • cgroup_prefix (default: "pbspro"): The parent directory under each cgroup subsystem where job cgroups are created. For example, if the memory subsystem is located at /sys/fs/cgroup/memory, then the memory cgroup for job 123.foo would be found in the /sys/fs/cgroup/memory/pbspro/123.foo directory.
  • cgroup_lock_file (default: "/var/spool/pbs/mom_priv/cgroups.lock"): This file is used to ensure reads and writes of the PBS Professional cgroups are mutually exclusive. The filesystem must support file locking.
  • exclude_hosts (default: [ ]): Specifies the list of hosts for which the cgroup hook should be disabled.
  • exclude_vntypes (default: [ ]): Specifies a list of vnode types for which the cgroup hook should be disabled.
  • kill_timeout (default: 10): Specifies the amount of time the cgroup hook spends attempting to kill a process within a cgroup.
  • nvidia-smi (default: /usr/bin/nvidia-smi): The location of the nvidia-smi command on nodes supporting NVIDIA GPU devices.
  • online_offlined_nodes (default: false): When the cgroup hook fails to kill all processes within a cgroup, it offlines the node to prevent oversubscribing resources and periodically attempts to clean up these "orphaned" cgroups. When set to true, the hook returns the node to service automatically; when set to false, the administrator must manually online the node once the problem is resolved.
  • periodic_resc_update (default: false): When set to true, the hook periodically polls the cgroups of a running job and updates the job's resource usage for the cput, mem, and vmem resources. When set to false, MoM periodically polls /proc to obtain resource usage data.
  • placement_type (default: "load_balanced"): When set to "load_balanced", the cgroup hook reorders the sockets of a multi-socket node in an effort to distribute load across them.
  • run_only_on_hosts (default: [ ]): Specifies the list of hosts for which the cgroup hook should be enabled. If the list is not empty, it overrides the settings of exclude_hosts and exclude_vntypes.
  • use_hyperthreads (default: false): When set to true, hyperthreads are treated as though they were physical cores.
  • vnode_per_numa_node (default: false): When set to true, each NUMA node appears as an independent vnode managed by a parent vnode; the parent vnode has no resources associated with it. When set to false, the node appears as a single vnode.
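
When periodic_resc_update is enabled, the cput, mem, and vmem figures correspond to the standard cgroup v1 accounting files. The sketch below shows one plausible way such a poll could be implemented; it is illustrative only, and the paths assume the default cgroup_prefix of "pbspro" with subsystems mounted under /sys/fs/cgroup.

Illustrative sketch: polling a job's cgroup usage
import os

CGROUP_ROOT = "/sys/fs/cgroup"   # assumed cgroup v1 mount point
PREFIX = "pbspro"                # matches the default cgroup_prefix above

def _read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def poll_job_usage(jobid):
    """Return cput (seconds), mem and vmem (bytes) read from the job's cgroups."""
    cpuacct = os.path.join(CGROUP_ROOT, "cpuacct", PREFIX, jobid)
    memory = os.path.join(CGROUP_ROOT, "memory", PREFIX, jobid)
    return {
        # cpuacct.usage reports total CPU time consumed, in nanoseconds
        "cput": _read_int(os.path.join(cpuacct, "cpuacct.usage")) / 1e9,
        # peak resident memory used by the cgroup
        "mem": _read_int(os.path.join(memory, "memory.max_usage_in_bytes")),
        # peak memory plus swap, available when the memsw controller is enabled
        "vmem": _read_int(os.path.join(memory, "memory.memsw.max_usage_in_bytes")),
    }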


  • Config Option: online_offlined_nodes allows the cgroup hook to bring back online nodes that it offlined because orphaned cgroups could not be cleaned up. Valid values are true and false.
  • Config Option: cgroup allows the admin to specify which subsystems PBS will use for the job. Valid subsystem values are cpuacct, cpuset, devices, hugetlb (where supported), memory, and memsw.
    • Valid Config Options for all Subsystems
      • enabled (valid values: true, false)
      • exclude_hosts (see above)
      • exclude_vntypes (see above)
    • Config Option: cpuacct subsystem allows the admin to use the cpuacct subsystem, which tracks the cput of all of the pids assigned to the cgroup. Valid keys in the cpuacct subsystem are enabled (valid options true/false), exclude_hosts (see interface 3), exclude_vntypes (see interface 4).
    • Config Option: cpuset subsystem allows the admin to use the cpuset subsystem, which assigns the cores and memory socket(s) for use by all of the pids assigned to the cgroup. Valid keys in the cpuset subsystem are enabled (valid options true/false), exclude_hosts (see interface 3), exclude_vntypes (see interface 4).
    • Config Option: devices subsystem allows the admin to use the devices subsystem, which controls which devices may be accessed by the pids assigned to the cgroup. Valid keys in the devices subsystem are enabled (valid options true/false), exclude_hosts (see interface 3), exclude_vntypes (see interface 4), and allow (the list of devices to grant access to).
      • Config Option: allow specifies the devices that the pids assigned to the cgroup may access. Entries may take the following forms (a sketch of how these forms might be expanded appears after this list):
        • "b *:* rwm" (this exact string is written to the allow file)
        • ["mic/scif", "rwm"] (this looks up the major and minor numbers of the mic/scif device and grants rwm access; e.g. if /dev/mic reported "crw-rw-rw- 1 root root 244, 1 Mar 30 14:50 scif", the line written to the allow file would be "c 244:1 rwm")
        • ["nvidiactl", "rwm", "*"] (this looks up the major number of the nvidiactl device and grants rwm access to all of its minor numbers; e.g. if /dev/nvidiactl reported "crw-rw-rw- 1 root root 284, 1 Mar 30 14:50 nvidiactl", the line written to the allow file would be "c 284:* rwm")
      • Notes: The devices subsystem has not been extensively tested. However, manual tests on systems with GPUs and MICs appear to work as expected when the correct devices are added to the allow list.
    • Config Option: hugetlb subsystem allows the admin to use the hugetlb subsystem, which controls access to hugetlb memory. Valid keys in the hugetlb subsystem are enabled (valid options true/false), exclude_hosts (see interface 3), exclude_vntypes (see interface 4).
    • Config Option: memory subsystem allows the admin to use the memory subsystem, which monitors and limits the memory used by all of the pids assigned to the cgroup. Valid keys in the memory subsystem are enabled (valid options true/false), default (memory assigned if the job did not request any), soft_limit, reserve_percent and reserve_amount (memory to reserve for processes outside of PBS), exclude_hosts (see interface 3), exclude_vntypes (see interface 4).
      • Config Option: default allows the admin to assign memory to a job if it did not request any
      • Config Option: reserve_percent and reserve_amount allow the admin to reserve memory for processes outside of PBS jobs
    • Config Option: memsw subsystem allows the admin to monitor and limit the swap used by all of the pids assigned to the cgroup.
      • Config Option: default allows the admin to assign memory to a job if it did not request any
      • Config Option: reserve_percent and reserve_amount allow the admin to reserve memory for processes outside of PBS jobs
      • Note: To limit swap, you must add vmem to the resources line in the sched_config file
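
To make the three allow-entry forms above concrete, the sketch below shows how they might be expanded into device-cgroup rules by stat-ing the named device under /dev. It is illustrative only and may differ from the hook's actual implementation.

Illustrative sketch: expanding devices allow entries
import os
import stat

def allow_entry_to_rule(entry):
    """Turn one "allow" list entry into a devices.allow rule string."""
    if isinstance(entry, str):
        return entry                        # e.g. "b *:* rwm" is used verbatim
    name, perms = entry[0], entry[1]
    st = os.stat(os.path.join("/dev", name))
    dev_type = "b" if stat.S_ISBLK(st.st_mode) else "c"
    major = os.major(st.st_rdev)
    # A third element of "*" means "all minor numbers for this major number".
    minor = "*" if len(entry) > 2 and entry[2] == "*" else os.minor(st.st_rdev)
    return "%s %s:%s %s" % (dev_type, major, minor, perms)

# ["mic/scif", "rwm"] would become "c 244:1 rwm" on the node described above,
# and the resulting rule would be written to the job's devices.allow file, e.g.
# /sys/fs/cgroup/devices/pbspro/<jobid>/devices.allow.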

Setup:

  • Run the following commands in qmgr from the directory where the cgroups.py and cgroups.json files are located (an illustrative hook event skeleton follows the list):
    • create hook cgroups
    • set hook cgroups event = "execjob_begin,execjob_launch,execjob_attach,execjob_epilogue,execjob_end,exechost_startup,exechost_periodic"
    • set hook cgroups freq = 120
    • set hook cgroups fail_action = offline_vnodes
    • import hook cgroups application/x-python default cgroups.py
    • import hook cgroups application/x-config default cgroups.json
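
For orientation, the skeleton below shows how a hook script imported this way receives the events configured above. It is illustrative only and is not the shipped cgroups.py; the pbs module attributes and event-type constants are assumed to match the PBS Professional hook API on your installation.

Illustrative hook event skeleton
import pbs

e = pbs.event()

if e.type == pbs.EXECJOB_BEGIN:
    pass    # create the job's cgroup directories and apply its limits
elif e.type in (pbs.EXECJOB_LAUNCH, pbs.EXECJOB_ATTACH):
    pass    # place launched or attached processes into the job's cgroup
elif e.type == pbs.EXECHOST_STARTUP:
    pass    # perform node startup initialization
elif e.type == pbs.EXECHOST_PERIODIC:
    pass    # poll running jobs' cgroups and clean up orphaned cgroups
elif e.type in (pbs.EXECJOB_EPILOGUE, pbs.EXECJOB_END):
    pass    # record final resource usage and remove the job's cgroups

e.accept()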