PP-810: cgroups v2 with systemd

Note: This page updates information from the original cgroup design in PP-325.

Overview

The release of cgroups v2 in the Linux kernel, combined with the adoption of systemd-style service management in most popular Linux distributions, means that the cgroup hook in PBS Pro must be updated to support new capabilities. This document describes the interface changes that will be introduced.

Interface 1:

  • Synopsis: cgroup.cpuset.exclude_cpus
  • Detail: Allows the administrator to exclude cores from being assigned to jobs by adding numeric entries to a JSON list within the cpuset section of the cgroup hook configuration file. When vnode_per_numa_node is set to true, this setting affects the creation of vnodes: the core count (resources_available.ncpus) of each vnode is reduced accordingly. When vnode_per_numa_node is false, the excluded CPUs apply to the node itself, and its core count (resources_available.ncpus) is reduced instead. A sketch of the resulting CPU assignment follows the example below.
  • Default: Empty list, no CPUs excluded
  • Example:

    exclude_cpus
    "cpuset" : {
        "enabled"         : true,
        "exclude_cpus"    : [0, 8],
        "exclude_hosts"   : ["node004"],
        "exclude_vntypes" : ["green_node"]
    },
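
The following is a minimal illustrative sketch, not the hook's actual implementation: it shows how an exclude_cpus list of [0, 8] would shrink the set of assignable cores, and hence resources_available.ncpus, on a hypothetical 16-core node. The function and variable names are invented for this example.

    # Illustrative sketch only; names are hypothetical, not hook internals.
    def assignable_cpus(all_cpus, exclude_cpus):
        """Return the CPUs that remain assignable after exclusions."""
        excluded = set(exclude_cpus)
        return [cpu for cpu in all_cpus if cpu not in excluded]

    usable = assignable_cpus(range(16), [0, 8])
    print(len(usable))  # resources_available.ncpus drops from 16 to 14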

Interface 2:

  • Synopsis: cgroup.cpuset.mem_fences
  • Detail: Allows the administrator to prevent the cgroup hook from binding jobs to NUMA memory nodes. When this setting is false, the hook does not write values to cpuset.mems in the cpuset subsystem. A sketch of the conditional write follows the example below.
  • Default: True; the cgroup hook writes values to cpuset.mems.
  • Example:

    mem_fences
    "cpuset" : {
        "enabled"         : true,
        "mem_fences"      : true,
        "exclude_hosts"   : ["node004"],
        "exclude_vntypes" : ["green_node"]
    },
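
To make the behavior concrete, here is a minimal sketch, assuming a hypothetical helper name and cgroup path argument, of the conditional write: when mem_fences is false, cpuset.mems is simply left untouched.

    # Illustrative sketch only; the function name and path argument are
    # hypothetical, not the hook's actual internals.
    def apply_mem_fence(cgroup_path, numa_nodes, mem_fences):
        """Write cpuset.mems only when memory fencing is enabled."""
        if not mem_fences:
            return  # jobs are not bound to NUMA memory nodes
        with open(cgroup_path + "/cpuset.mems", "w") as fd:
            fd.write(",".join(str(node) for node in numa_nodes))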

Interface 3:

  • Synopsis: cgroup.cpuset.mem_hardwall
  • Detail: Allows the administrator to override the value of cpuset.mem_hardwall, which the Red Hat documentation describes as follows (see also the sketch after the example below):
    cpuset.mem_hardwall
    contains a flag (0 or 1) that specifies whether kernel allocations of memory page and buffer data should be restricted to the memory nodes specified for the cpuset. By default (0), page and buffer data is shared across processes belonging to multiple users. With a hardwall enabled (1), each task's user allocation can be kept separate.
  • Default: False (zero)
  • Example:

    mem_hardwall
    "cpuset" : {
        "enabled"         : true,
        "mem_hardwall"    : false,
        "exclude_hosts"   : ["node004"],
        "exclude_vntypes" : ["green_node"]
    },
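
As an illustration of the override, a short sketch, again with hypothetical names, mapping the configured boolean onto the 0/1 flag the kernel expects:

    # Illustrative sketch only; names are hypothetical.
    def set_mem_hardwall(cgroup_path, enabled):
        """Write the configured value to cpuset.mem_hardwall as 0 or 1."""
        with open(cgroup_path + "/cpuset.mem_hardwall", "w") as fd:
            fd.write("1" if enabled else "0")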

Interface 4:

  • Synopsis: cgroup.cpuset.memory_spread_page
  • Detail: Allows the administrator to override the value of cpuset.memory_spread_page, which the Red Hat documentation describes as follows (see also the sketch after the example below):
    cpuset.memory_spread_page
    contains a flag (0 or 1) that specifies whether file system buffers should be spread evenly across the memory nodes allocated to the cpuset. By default (0), no attempt is made to spread memory pages for these buffers evenly, and buffers are placed on the same node on which the process that created them is running.
  • Default: False (zero)
  • Example:

    memory_spread_page
    "cpuset" : {
        "enabled"            : true,
        "memory_spread_page" : false,
        "exclude_hosts"      : ["node004"],
        "exclude_vntypes"    : ["green_node"]
    },
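
Because mem_hardwall and memory_spread_page are both boolean kernel flags, the per-flag sketch above generalizes naturally; the flag table and names below are again hypothetical, not the hook's actual code.

    # Illustrative sketch only; the flag table and names are hypothetical.
    BOOL_FLAGS = {
        "mem_hardwall":       "cpuset.mem_hardwall",
        "memory_spread_page": "cpuset.memory_spread_page",
    }

    def apply_bool_flags(cgroup_path, cpuset_config):
        """Write each configured boolean flag to its cpuset file as 0 or 1."""
        for key, filename in BOOL_FLAGS.items():
            if key in cpuset_config:
                with open(cgroup_path + "/" + filename, "w") as fd:
                    fd.write("1" if cpuset_config[key] else "0")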

Upgrading PBS Pro with cgroup hook:

  • Migrating from a version of PBS Pro prior to 18.2 on a system that uses systemd will leave behind subdirectories created by the older cgroup hook.
  • The presence of these directories is not harmful to newer versions of the cgroup hook.
  • These directories will no longer be present after a reboot. The cgroup hook creates new directories when the exechost_startup event is handled.


