Overview:
The objective of PP-325 is to provide a hook that restricts resource availability for job processes by utilizing the Linux cgroup infrastructure. An introductory video on cgroups, together with links to other resources, may be found here: https://sysadmincasts.com/episodes/14-introduction-to-linux-control-groups-cgroups The definitive Linux kernel cgroup documentation may be found here: https://www.kernel.org/doc/Documentation/cgroup-v1/ It is assumed that the reader is familiar with the various cgroup subsystems and how they may be utilized.

When the cgroup hook is enabled, it runs on every node assigned to a job. When a job starts, the hook creates a set of directories for the configured subsystems based on the resource requirements of the job and then places the job processes within the cgroup. The kernel then enforces resource restrictions based on the newly created cgroup settings while the job is running. The cgroup hook may be configured to periodically poll the job's cgroup while the job runs and update its resource utilization. When the job completes, a final resource utilization measurement is taken and the hook removes the cgroup directories it created when the job was initialized.
Interface 1: cgroup configuration file
- Visibility: Public
- Change Control: Experimental
Synopsis: JSON hook config file
- Details: The configuration file allows the administrator to adjust the behavior of the cgroup hook across their cluster. The file must conform to JSON syntax. A sample configuration file is displayed and described below:
```json
{
  "cgroup_prefix"         : "pbspro",
  "exclude_hosts"         : ["node001", "node002"],
  "exclude_vntypes"       : ["disable_cgroups", "login_node"],
  "run_only_on_hosts"     : [],
  "periodic_resc_update"  : true,
  "vnode_per_numa_node"   : false,
  "online_offlined_nodes" : true,
  "use_hyperthreads"      : false,
  "cgroup" : {
    "cpuacct" : {
      "enabled"         : true,
      "exclude_hosts"   : ["node001"],
      "exclude_vntypes" : ["red_node"]
    },
    "cpuset" : {
      "enabled"         : true,
      "exclude_hosts"   : ["node002"],
      "exclude_vntypes" : ["green_node"]
    },
    "devices" : {
      "enabled"         : false,
      "exclude_hosts"   : [],
      "exclude_vntypes" : [],
      "allow"           : [
        "b *:* rwm",
        "c *:* rwm",
        ["mic/scif", "rwm"],
        ["nvidiactl", "rwm", "*"],
        ["nvidia-uvm", "rwm"]
      ]
    },
    "hugetlb" : {
      "enabled"         : false,
      "exclude_hosts"   : [],
      "exclude_vntypes" : [],
      "default"         : "0MB",
      "reserve_percent" : "0",
      "reserve_amount"  : "0MB"
    },
    "memory" : {
      "enabled"         : true,
      "exclude_hosts"   : [],
      "exclude_vntypes" : ["blue_node"],
      "soft_limit"      : false,
      "default"         : "256MB",
      "reserve_percent" : "0",
      "reserve_amount"  : "1GB"
    },
    "memsw" : {
      "enabled"         : true,
      "exclude_hosts"   : [],
      "exclude_vntypes" : ["grey_node"],
      "default"         : "256MB",
      "reserve_percent" : "0",
      "reserve_amount"  : "1GB"
    }
  }
}
```
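Since the hook cannot use a configuration file that is not valid JSON, it can be worth checking a file before importing it. A minimal sketch (the `check_cgroup_config` helper and the embedded sample are illustrative, not part of the hook):

```python
import json

# A trimmed-down stand-in for the contents of cgroups.json.
SAMPLE = """
{ "cgroup_prefix": "pbspro",
  "cgroup": { "memory": { "enabled": true, "default": "256MB" },
              "memsw":  { "enabled": true, "default": "256MB" } } }
"""

def check_cgroup_config(text):
    """Parse cgroup hook configuration text and flag obvious problems."""
    config = json.loads(text)  # raises ValueError on malformed JSON
    # Every subsystem section is expected to carry an "enabled" flag.
    for name, subsys in config.get("cgroup", {}).items():
        if "enabled" not in subsys:
            raise ValueError("subsystem %r is missing 'enabled'" % name)
    return config

config = check_cgroup_config(SAMPLE)
```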
Global Parameters:
The global parameters are used to modify the behavior of the cgroup hook across all nodes in the PBS Pro complex. They are not specific to any cgroup subsystem.
Parameter Name | Default Value | Description |
---|---|---|
cgroup_prefix | "pbspro" | The parent directory under each cgroup subsystem where job cgroups will be created. For example, if the memory subsystem is located at /sys/fs/cgroup/memory then the memory cgroup for job 123.foo would be found in the /sys/fs/cgroup/memory/pbspro/123.foo directory. |
cgroup_lock_file | "/var/spool/pbs/mom_priv/cgroups.lock" | This file is used to ensure reads and writes of the PBS Professional cgroups are mutually exclusive. The filesystem must support file locking. |
exclude_hosts | [ ] | Specifies the list of hosts for which the cgroups hook should be disabled. |
exclude_vntypes | [ ] | Specifies a list of vnode types for which the cgroups hook should be disabled. |
kill_timeout | 10 | Specifies the number of seconds the cgroup hook spends attempting to kill processes within a cgroup before giving up. |
nvidia-smi | /usr/bin/nvidia-smi | The location of the nvidia-smi command on nodes supporting NVIDIA GPU devices. |
online_offlined_nodes | false | When the cgroup hook fails to kill all processes within a cgroup, it will offline the node to prevent oversubscribing resources. The cgroup hook will periodically attempt to clean up these "orphaned" cgroups. When set to false, the administrator must manually online the node when the problem is resolved. When set to true, the hook will return the node to service automatically. |
periodic_resc_update | false | When set to true, the hook periodically polls the cgroups of a running job and updates the job's resource usage for the cput, mem, and vmem resources. When set to false, MoM periodically polls /proc to obtain resource usage data. |
placement_type | "load_balanced" | When this parameter is set to "load_balanced" the cgroup hook will reorder the sockets of a multi-socket node in an effort to distribute load across them. |
run_only_on_hosts | [ ] | Specifies the list of hosts for which the cgroup hook should be enabled. If the list is not empty, it overrides the settings of exclude_hosts and exclude_vntypes. |
use_hyperthreads | false | When set to true, hyperthreads are counted as though they were physical cores. When set to false, only physical cores are counted. |
vnode_per_numa_node | false | When set to true, each NUMA node will appear as though it were an independent vnode managed by a parent vnode. The parent vnode will have no resources associated with it. When set to false, the node will appear as a single vnode. |
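The interaction between run_only_on_hosts, exclude_hosts, and exclude_vntypes can be summarized as a single predicate. The function below is an illustrative sketch, not code from the hook:

```python
def hook_enabled(hostname, vntype, config):
    """Decide whether the cgroup hook should run on this host.

    A non-empty run_only_on_hosts list overrides both exclude lists:
    the hook then runs only on the hosts named there.
    """
    run_only = config.get("run_only_on_hosts", [])
    if run_only:
        return hostname in run_only
    if hostname in config.get("exclude_hosts", []):
        return False
    if vntype in config.get("exclude_vntypes", []):
        return False
    return True

cfg = {"exclude_hosts": ["node001"],
       "exclude_vntypes": ["login_node"],
       "run_only_on_hosts": []}
```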
cpuacct Subsystem:
The cpuacct subsystem provides support for measuring CPU usage within a cgroup.
Parameter Name | Default Value | Description |
---|---|---|
enabled | false | When set to true, the hook will update job CPU time using the value from the cpuacct subsystem (e.g. /sys/fs/cgroup/cpuacct/pbspro/123.foo/cpuacct.usage). When set to false, CPU time is accumulated when MoM periodically polls the processes of the job. |
exclude_hosts | [ ] | Specifies the list of hosts for which the use of this subsystem should be disabled. |
exclude_vntypes | [ ] | Specifies the list of vnode types for which the use of this subsystem should be disabled. |
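When enabled, the job's CPU time comes from the subsystem's cpuacct.usage file, which reports cumulative CPU time in nanoseconds. A minimal sketch of the conversion, run against a stand-in directory rather than a live cgroup (the helper name is illustrative):

```python
import os
import tempfile

def read_cput_seconds(cgroup_dir):
    """Return the cgroup's accumulated CPU time in whole seconds.

    cpuacct.usage holds total consumed CPU time in nanoseconds.
    """
    with open(os.path.join(cgroup_dir, "cpuacct.usage")) as fp:
        nanoseconds = int(fp.read().strip())
    return nanoseconds // 1_000_000_000

# Stand-in for /sys/fs/cgroup/cpuacct/pbspro/123.foo
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "cpuacct.usage"), "w") as fp:
    fp.write("12500000000\n")   # 12.5 seconds of CPU time
cput = read_cput_seconds(demo)
```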
cpuset Subsystem:
The cpuset subsystem is used by the Linux kernel to restrict access to both CPU and memory based on available sockets and NUMA nodes.
Parameter Name | Default Value | Description |
---|---|---|
enabled | false | When set to true, the hook will create a cpuset for each job. The hook will configure the cpuset based on the resources requested by the job, taking into account the number of CPUs and memory requirements. This helps to ensure the job uses memory that is local to the CPUs assigned to the job. When set to false, the kernel is free to schedule processes and allocate memory based on the system configured policies. |
exclude_hosts | [ ] | Specifies the list of hosts for which the use of this subsystem should be disabled. |
exclude_vntypes | [ ] | Specifies the list of vnode types for which the use of this subsystem should be disabled. |
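The cpuset created for a job pins its CPUs via cpuset.cpus and the matching NUMA memory nodes via cpuset.mems. A sketch of those writes against a stand-in directory (file names are the standard cgroup-v1 control files; the helper is illustrative):

```python
import os
import tempfile

def assign_cpuset(cgroup_dir, cpus, mems):
    """Bind a job cgroup to specific CPUs and NUMA memory nodes."""
    with open(os.path.join(cgroup_dir, "cpuset.cpus"), "w") as fp:
        fp.write(",".join(str(c) for c in cpus))
    with open(os.path.join(cgroup_dir, "cpuset.mems"), "w") as fp:
        fp.write(",".join(str(m) for m in mems))

# Stand-in for /sys/fs/cgroup/cpuset/pbspro/123.foo
demo = tempfile.mkdtemp()
assign_cpuset(demo, cpus=[0, 1, 2, 3], mems=[0])
```

In a real cgroup, both files must be populated before any task can be added to the cpuset.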
devices Subsystem:
The devices subsystem is used to grant or restrict access to devices on the system. This is most commonly used for accelerator cards such as GPUs, and MICs.
Parameter Name | Default Value | Description |
---|---|---|
enabled | false | When set to true, the hook will configure the devices subsystem based on the number of nmics and ngpus requested by the job. Refer to the allow parameter below for additional information. When set to false, no cgroup will be created for the devices subsystem. |
exclude_hosts | [ ] | Specifies the list of hosts for which the use of this subsystem should be disabled. |
exclude_vntypes | [ ] | Specifies the list of vnode types for which the use of this subsystem should be disabled. |
allow | [ ] | Specifies how access to devices will be controlled. Each entry in the list takes one of two forms: a string in the kernel devices whitelist syntax of type, major:minor, and permissions (e.g. "b *:* rwm"), or a list naming a device and the permissions to grant (e.g. ["nvidia-uvm", "rwm"]). For list entries the hook determines the major number from the named device file; an optional third element overrides the minor field (e.g. ["nvidiactl", "rwm", "*"]). |
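A sketch of how allow entries could be translated into lines for the subsystem's devices.allow file. The list-entry semantics here are an assumption based on the sample configuration: `major_of` stands in for the hook's lookup of a named device's major number, the character-device type and the example major numbers are hypothetical.

```python
def devices_allow_lines(allow, major_of):
    """Translate 'allow' entries into lines for devices.allow.

    String entries are already in the kernel's whitelist syntax and
    pass through unchanged. List entries name a device; its major
    number comes from major_of, and an optional third element gives
    the minor field, which otherwise defaults to '*'.
    """
    lines = []
    for entry in allow:
        if isinstance(entry, str):
            lines.append(entry)
        else:
            name, perms = entry[0], entry[1]
            minor = entry[2] if len(entry) > 2 else "*"
            lines.append("c %d:%s %s" % (major_of(name), minor, perms))
    return lines

# Hypothetical major numbers in place of stat() calls on /dev entries.
majors = {"nvidiactl": 195, "nvidia-uvm": 243}
rules = devices_allow_lines(
    ["b *:* rwm", ["nvidiactl", "rwm", "*"], ["nvidia-uvm", "rwm"]],
    majors.__getitem__)
```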
hugetlb Subsystem:
The hugetlb subsystem restricts the amount of huge page memory that may be used in a cgroup.
Parameter Name | Default Value | Description |
---|---|---|
enabled | false | When set to true, the hook will register a limit that restricts the amount of huge page memory processes may access. When set to false, no limit is registered. |
exclude_hosts | [ ] | Specifies the list of hosts for which the use of this subsystem should be disabled. |
exclude_vntypes | [ ] | Specifies the list of vnode types for which the use of this subsystem should be disabled. |
default | 0MB | The amount of huge page memory assigned to the cgroup when the job does not request hpmem. |
reserve_percent | 0 | The percentage of available huge page memory that is not to be assigned to jobs. This will alter the amount of hpmem that MoM reports to the server. This is added to reserve_amount to obtain the total amount reserved. |
reserve_amount | 0MB | An amount of available huge page memory that is not to be assigned to jobs. This will alter the amount of hpmem that MoM reports to the server. This is added to reserve_percent to obtain the total amount reserved. |
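The two reserve parameters combine additively when MoM reports the available resource. An arithmetic sketch (the function is illustrative; the real hook's rounding behavior is not specified here):

```python
def reported_resource(total_bytes, reserve_percent, reserve_amount_bytes):
    """Amount of a resource MoM reports after subtracting reserves.

    The percentage reserve and the fixed reserve are added together.
    """
    reserved = total_bytes * reserve_percent // 100 + reserve_amount_bytes
    return max(total_bytes - reserved, 0)

GB = 1024 ** 3
# 16 GB of huge pages with reserve_percent "25" and reserve_amount "1GB"
# reserves 4 GB + 1 GB, leaving 11 GB of hpmem reported to the server.
avail = reported_resource(16 * GB, 25, 1 * GB)
```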
memory Subsystem:
The memory subsystem restricts the amount of physical memory that all of the processes in a cgroup may allocate. It works in conjunction with the memsw (memory and swap) subsystem to limit both physical and virtual memory. The memsw subsystem should be enabled whenever the memory subsystem is enabled; otherwise, the virtual memory limit remains unrestricted. If the processes in a cgroup exceed their physical memory limit, the kernel will begin to utilize swap space even if the node has sufficient physical memory available. When a job specifies a mem limit without a vmem limit, the vmem limit is automatically set to the mem limit.
Parameter Name | Default Value | Description |
---|---|---|
enabled | false | The hook will register the physical memory limit for a job when set to true. No limit is registered when set to false. |
exclude_hosts | [ ] | Specifies the list of hosts for which the use of this subsystem should be disabled. |
exclude_vntypes | [ ] | Specifies the list of vnode types for which the use of this subsystem should be disabled. |
soft_limit | false | A soft memory limit specifies the minimum amount of physical memory a job should be allocated before utilizing swap space. This adjusts the behavior of the kernel by allowing the physical memory allocation to exceed the amount specified in the soft limit when memory demand (i.e., memory pressure) is low. The cgroup is ultimately limited to the amount of virtual memory specified in the memsw subsystem. When memory pressure increases, the kernel will begin to page physical memory out to swap space until the soft limit is reached. Soft memory limits allow processes to take advantage of physical memory when it is available, but may lead to longer run times when memory pressure is high. Soft memory limits are used when this parameter is set to true. When set to false, hard memory limits prevent the processes from ever exceeding their specified mem limit. |
default | 0MB | The amount of physical memory available to a cgroup when no mem limit has been specified. |
reserve_percent | 0 | The percentage of available physical memory that is not to be assigned to jobs. This will alter the amount of mem that MoM reports to the server. This is added to reserve_amount to obtain the total amount reserved. |
reserve_amount | 0MB | The amount of available physical memory that is not to be assigned to jobs. This will alter the amount of mem that MoM reports to the server. This is added to reserve_percent to obtain the total amount reserved. |
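The mem and vmem limits described above map onto the standard cgroup-v1 control files. A sketch of the writes, run against a stand-in directory; the helper name is illustrative, the vmem default follows the mem-without-vmem rule above, and the real hook's exact combination of soft and hard limits may differ:

```python
import os
import tempfile

def write_memory_limits(cgroup_dir, mem_bytes, vmem_bytes=None, soft=False):
    """Apply a job's memory limits to a cgroup.

    When no vmem limit is given, it defaults to the mem limit, so the
    physical and virtual limits coincide.
    """
    if vmem_bytes is None:
        vmem_bytes = mem_bytes
    # Hard cap on physical memory, or a soft target when soft=True.
    mem_file = ("memory.soft_limit_in_bytes" if soft
                else "memory.limit_in_bytes")
    with open(os.path.join(cgroup_dir, mem_file), "w") as fp:
        fp.write(str(mem_bytes))
    # memsw limits physical memory plus swap (i.e. vmem).
    with open(os.path.join(cgroup_dir,
                           "memory.memsw.limit_in_bytes"), "w") as fp:
        fp.write(str(vmem_bytes))

# Stand-in for /sys/fs/cgroup/memory/pbspro/123.foo; 256 MB mem request.
demo = tempfile.mkdtemp()
write_memory_limits(demo, mem_bytes=256 * 1024 ** 2)
```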
memsw Subsystem:
The memsw subsystem restricts the combined amount of physical memory and swap space (i.e. virtual memory) that the processes in a cgroup may consume. It is used in conjunction with the memory subsystem described above.
Parameter Name | Default Value | Description |
---|---|---|
enabled | false | The hook will register the virtual memory limit for a job when set to true. No limit is registered when set to false. |
exclude_hosts | [ ] | Specifies the list of hosts for which the use of this subsystem should be disabled. |
exclude_vntypes | [ ] | Specifies the list of vnode types for which the use of this subsystem should be disabled. |
default | 0MB | The amount of virtual memory available to a cgroup when no vmem limit has been specified. |
reserve_percent | 0 | The percentage of available virtual memory that is not to be assigned to jobs. This will alter the amount of vmem that MoM reports to the server. This is added to reserve_amount to obtain the total amount reserved. |
reserve_amount | 0MB | The amount of available virtual memory that is not to be assigned to jobs. This will alter the amount of vmem that MoM reports to the server. This is added to reserve_percent to obtain the total amount reserved. |
Interface 2: nmics
Interface 3: ngpus
Interface 4: hpmem
- Config Option: hugetlb subsystem allows the admin to use the hugetlb subsystem, which limits access to huge page memory. Valid keys in the hugetlb subsystem are enabled (valid options true/false), default, reserve_percent, reserve_amount, exclude_hosts (see interface 3), and exclude_vntypes (see interface 4).
- Config Option: memory subsystem allows the admin to use the memory subsystem, which allows the admin to monitor and limit the memory used by all of the PIDs assigned to the cgroup. Valid keys in the memory subsystem are enabled (valid options true/false), soft_limit, default (memory assigned if the job did not request any), reserve_percent and reserve_amount (memory to reserve for processes outside of PBS), exclude_hosts (see interface 3), and exclude_vntypes (see interface 4).
- Config Option: default allows the admin to assign a default amount of memory to a job that did not request any.
- Config Option: reserve_percent and reserve_amount allow the admin to reserve memory for processes outside of PBS jobs.
- Config Option: memsw subsystem allows the admin to monitor and limit the swap used by all of the PIDs assigned to the cgroup.
- Config Option: default allows the admin to assign a default amount of virtual memory to a job that did not request any.
- Config Option: reserve_percent and reserve_amount allow the admin to reserve swap space for processes outside of PBS jobs.
- Note: To limit swap, you must add vmem to the resources line in the sched_config file.
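For the note above, the vmem resource is appended to the scheduler's existing resources line in PBS_HOME/sched_priv/sched_config; the surrounding entries shown here are the stock defaults and may differ per site:

```
resources: "ncpus, mem, arch, host, vnode, aoe, vmem"
```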
Setup:
- Run the following commands in qmgr from the directory where the cgroups.py and cgroups.json files are located:
- create hook cgroups
- set hook cgroups event = "execjob_begin,execjob_launch,execjob_attach,execjob_epilogue,execjob_end,exechost_startup,exechost_periodic"
- set hook cgroups freq = 120
- set hook cgroups fail_action = offline_vnodes
- import hook cgroups application/x-python default cgroups.py
- import hook cgroups application/x-config default cgroups.json