
Introduction:

Use case:

As clusters grow larger and workloads become more varied, it is critical that jobs are evaluated in as short a time as possible to ensure that the right workload is being run. Using multiple schedulers to address this allows for different scheduling policies and a quicker turnaround for large numbers of jobs or nodes.

Gist of design proposal:

The PBS scheduler in its current form can easily run as multiple instances on the same machine. There are only two major problems that we have to deal with:

  • Managing the schedulers - this includes starting them, configuring them, making the PBS server connect to each one of them, and then making them run on specific events.
  • Making sure that schedulers do not overrun each other's territory - they must run on a clearly partitioned complex in terms of jobs and nodes.


The design proposal below addresses both of these problems.

Forum discussion


  • Interface 1: Extend PBS to support a list of scheduler objects

    • Visibility: Public
    • Change Control: Stable
    • Details:
      • PBS supports a list of scheduler objects that can be created using qmgr, similar to how nodes are created on the server.
      • The qmgr command can be used to create a scheduler object. It must be invoked by a PBS admin/manager.
      • To create a scheduler object and make it run, the following mandatory attributes need to be set by the user:
        • A scheduler name must be given while creating a scheduler object.
          • qmgr -c "c sched multi_sched_1"
            • This will create/set the following attributes for the sched object
              • port - if not defined by the user, the server starts from 15050 and tries to run the scheduler on the next available port number.
              • host (read-only for now, has the same value as the PBS server host)
              • partition = None (default)
              • sched_priv = $PBS_HOME/multi_sched_1_priv (default)
              • sched_log = $PBS_HOME/multi_sched_1_logs (default)
              • scheduling = False (default)
              • scheduler_iteration = 600 (default)
              • sched_user = <pbs_server user> (default)
        • Set the priv directory for the scheduler.
          • The directory must be owned by the sched_user specified while creating the scheduler object and must have its permissions set to "750". By default a sched object has
            its priv directory set to $PBS_HOME/<sched_name>_priv
          • qmgr -c "s sched multi_sched_1 sched_priv=/var/spool/pbs/sched_priv_1"
          • If the priv directory is not accessible and the admin tries to set the "scheduling" attribute to true, the error code is set to "15211" and the following error is thrown to the user:
            "scheduler <scheduler name> cannot access its home or priv directory"
        • Set the log directory for the scheduler.
          • The directory must be owned by the sched_user specified while creating the scheduler object and must have its permissions set to "755". By default a sched object has
            its logs directory set to $PBS_HOME/<sched_name>_logs
          • qmgr -c "s sched multi_sched_1 sched_log=/var/spool/pbs/sched_logs"
          • If the log directory is not accessible and the admin tries to set the "scheduling" attribute to true, the error code is set to "15211" and the following error is thrown to the user:
            "scheduler <scheduler name> cannot access its home or priv directory"
        • To turn on scheduling for one of the newly created scheduler objects, one must use the scheduler's name.
          • By default a multi-sched object has scheduling set to False.
            If no name is specified, the PBS server will enable/disable scheduling on the default scheduler.
          • qmgr -c "s sched <scheduler name> scheduling=1"
      • By default the PBS server will configure a default scheduler which will run out of the box.
        • The name of this default scheduler will be "sched".
        • The sched_priv directory of this default scheduler will be set to $PBS_HOME/sched_priv.
        • The default scheduler will log in the $PBS_HOME/sched_logs directory.
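      • Putting it together, a minimal sketch of creating and starting a second scheduler (the directory paths here are illustrative, not defaults):
        # create the scheduler object; the server assigns the next free port starting at 15050
        qmgr -c "c sched multi_sched_1"
        # pre-create the priv and log directories with the ownership and permissions described above
        mkdir /var/spool/pbs/sched_priv_1 /var/spool/pbs/sched_logs_1
        chmod 750 /var/spool/pbs/sched_priv_1
        chmod 755 /var/spool/pbs/sched_logs_1
        qmgr -c "s sched multi_sched_1 sched_priv=/var/spool/pbs/sched_priv_1"
        qmgr -c "s sched multi_sched_1 sched_log=/var/spool/pbs/sched_logs_1"
        # turn it on; this fails with error 15211 if the directories are not accessible
        qmgr -c "s sched multi_sched_1 scheduling=1"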

  • Interface 2: Changes to PBS scheduler
    • Visibility: Public
    • Change Control: Stable
    • Details:
      • The scheduler now has additional attributes which can be set in order to run it:
        • sched_priv - points to the directory where the scheduler keeps the fairshare usage, resource_group, holidays file and sched_config
        • sched_log - points to the directory where the scheduler writes its logs.
        • policy - a collection of various attributes (as mentioned below) which can be used to configure the scheduler.
        • partitions - list of all the partitions for which this scheduler schedules jobs.
        • host - hostname on which the scheduler runs. For the default scheduler it is set to the PBS server hostname.
        • port - port number on which the scheduler listens.
        • job_accumulation_time - amount of time the server will wait after the submission of a job before starting a new cycle.
        • state - shows the status of the scheduler. This attribute is set only by the PBS server.
      • One can set a partition or a list of partitions on a scheduler object. Once set, the given scheduler object will only schedule jobs from the queues attached to the specified partitions.
        • qmgr -c "s sched multi_sched_1 partitions='part1,part2'"
      • If no partitions are specified for a given scheduler object, that scheduler will not schedule any jobs.
      • By default, all newly created queues will be attached to the default scheduler until they have been assigned to a specific partition.
      • A partition attached to one scheduler cannot be attached to another scheduler. If tried, the following error is thrown:
        • qmgr -c "s sched multi_sched_1 partitions+='part2'"
          Partition part2 is already associated with scheduler <scheduler name>.
      • The scheduler can now accept a set of policies to work with:
        • A policy can be specified using the qmgr -c "s sched <sched_name> policy=<policy object>" command.
      • The scheduler object's "state" attribute will show one of these 3 values - DOWN, IDLE, SCHEDULING.
        • If a scheduler object is created but the scheduler is not running for some reason, the state is shown as "DOWN".
        • If a scheduler is up and running but waiting for a cycle to be triggered, the state is shown as "IDLE".
        • If a scheduler is up and running and currently running a scheduling cycle, the state is shown as "SCHEDULING".
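      • As a short sketch, assuming partitions part1 and part2 and a policy p1 already exist (the names are illustrative):
        # give the scheduler its slice of the complex and a policy to run
        qmgr -c "s sched multi_sched_1 partitions='part1,part2'"
        qmgr -c "s sched multi_sched_1 policy=p1"
        # list the scheduler's attributes, including its state (DOWN, IDLE or SCHEDULING)
        qmgr -c "l sched multi_sched_1"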

  • Interface 3: New policy object
    • Visibility: Public
    • Change Control: Stable
    • Details:
      • Admins will now be allowed to create policy objects and give a name to each policy object.
      • Admins can then assign these policy objects to specific schedulers; one policy object can be assigned to more than one scheduler.
      • A policy object can only be deleted when it is not assigned to any scheduler.
      • Example: 
        qmgr -c "c policy p1"
        qmgr -c "s p p1 by_queue=False, strict_ordering=True"
        qmgr -c "s sched scheduler1 policy=p1"

      • Below is the list of policies that reside in the policy attribute of a scheduler.

        Policy name                  | Type         | Default value                             | Example
        round_robin                  | Boolean      | round_robin=False                         | qmgr -c "s policy p1 round_robin=True"
        by_queue                     | Boolean      | by_queue=True                             | qmgr -c "s policy p1 by_queue=True"
        strict_ordering              | Boolean      | strict_ordering=False                     | qmgr -c "s policy p1 strict_ordering=True"
        help_starving_jobs           | Boolean      | help_starving_jobs=True                   | qmgr -c "s policy p1 help_starving_jobs=True"
        max_starve                   | string       | max_starve="24:00:00"                     | qmgr -c "s policy p1 max_starve=24:00:00"
        node_sort_key                | array_string | node_sort_key="sort_priority HIGH"        | qmgr -c 's policy p1 node_sort_key="sort_priority HIGH, ncpus HIGH"'
        provision_policy             | string       | provision_policy="aggressive_provision"   | qmgr -c 's policy p1 provision_policy="aggressive_provision"'
        exclude_resources            | array_string | NOT SET BY DEFAULT                        | qmgr -c 's policy p1 exclude_resources="vmem, color"'
        load_balancing               | Boolean      | load_balancing=False                      | qmgr -c "s policy p1 load_balancing=True"
        fairshare                    | Boolean      | fairshare=False                           | qmgr -c "s policy p1 fairshare=True"
        fairshare_usage_res          | string       | fairshare_usage_res=cput                  | qmgr -c "s policy p1 fairshare_usage_res=cput"
        fairshare_entity             | string       | fairshare_entity=euser                    | qmgr -c "s policy p1 fairshare_entity=euser"
        fairshare_decay_time         | string       | fairshare_decay_time="24:00:00"           | qmgr -c "s policy p1 fairshare_decay_time=24:00:00"
        fairshare_enforce_no_shares  | Boolean      | fairshare_enforce_no_shares=True          | qmgr -c "s policy p1 fairshare_enforce_no_shares=True"
        preemption                   | Boolean      | preemption=True                           | qmgr -c "s policy p1 preemption=True"
        preempt_queue_prio           | integer      | preempt_queue_prio=150                    | qmgr -c "s policy p1 preempt_queue_prio=190"
        preempt_prio                 | string       | preempt_prio="express_queue, normal_jobs" | qmgr -c 's policy p1 preempt_prio="starving_jobs, normal_jobs, starving_jobs+fairshare"'
        preempt_order                | string       | preempt_order="SCR"                       | qmgr -c 's policy p1 preempt_order="SCR 70 SC 30"'
        preempt_sort                 | string       | preempt_sort="min_time_since_start"       | qmgr -c 's policy p1 preempt_sort="min_time_since_start"'
        peer_queue                   | array_string | NOT SET BY DEFAULT                        | qmgr -c 's policy p1 peer_queue="workq workq@svr1"'
        server_dyn_res               | array_string | NOT SET BY DEFAULT                        | qmgr -c 's policy p1 server_dyn_res="mem !/bin/get_mem"'
        dedicated_queues             | array_string | NOT SET BY DEFAULT                        | qmgr -c 's policy p1 dedicated_queues="queue1,queue2"'
        log_event                    | integer      | log_event=4607                            | qmgr -c "s policy p1 log_event=255"
        job_sort_formula             | string       | NOT SET BY DEFAULT                        | qmgr -c 's policy p1 job_sort_formula="ncpus*walltime"'
        backfill_depth               | integer      | backfill_depth=1                          | qmgr -c 's policy p1 backfill_depth=1'
        job_sort_key                 | array_string | NOT SET BY DEFAULT                        | qmgr -c 's policy p1 job_sort_key="ncpus HIGH, mem LOW"'
        prime_spill                  | string       | NOT SET BY DEFAULT                        | qmgr -c 's policy p1 prime_spill="01:00:00"'
        prime_exempt_anytime_queues  | Boolean      | prime_exempt_anytime_queues=False         | qmgr -c 's policy p1 prime_exempt_anytime_queues=false'
        backfill_prime               | Boolean      | backfill_prime=False                      | qmgr -c 's policy p1 backfill_prime=false'
      • The following configurations are moved/removed:
        • mom_resources - removed (mom periodic hooks can update custom resources)
        • unknown_shares - moved to the resource_group file.
        • smp_cluster_dist - was already deprecated, now removed
        • sort_queues - was already deprecated, now removed
        • nonprimetime_prefix - the new policy object does not differentiate between prime/non-prime time
        • primetime_prefix - the new policy object does not differentiate between prime/non-prime time
        • resources - the new policy object instead lists the resources that need to be excluded from scheduling (exclude_resources). By default all resources are used for scheduling.
        • dedicated_prefix - the new policy object exposes "dedicated_queues", a list of queues associated with dedicated time.
        • preemptive_sched - renamed to "preemption".
        • log_filter - renamed to "log_event" to be in sync with the option the server object exposes.
      • Admins will now be allowed to set different policy objects for prime and non-prime time.
        • If a value of the "policy" scheduler attribute is prefixed with "p:", it is considered the prime-time policy.
        • If a value of the "policy" scheduler attribute is prefixed with "np:", it is considered the non-prime-time policy.
        • A policy name specified without any prefix is used as the all-time policy.
        • Admins will not be allowed to set a prime/non-prime policy unless an all-time policy (without any prefix) is specified. On doing so the following error is thrown:
          • qmgr -c "s sched sched1 policy+='p:p1'"
            Cannot set prime/non-prime time policy without setting an all time policy
        • More than one policy object can be specified at the same time in the policy scheduler attribute.
          • example: qmgr -c "s sched sched1 policy=p:p1,np:p2"
          • A prime-time/non-prime-time/all-time policy cannot be specified more than once while setting the scheduler's policy attribute.
      • During dedicated time, if prime and non-prime time policies are defined, the scheduler will use the prime-time policy to schedule jobs from dedicated queues; otherwise it will apply the all-time policy.
      • If one wants to use the policies mentioned in the old sched config file, one needs to keep a copy of the config file in the directory named by the "sched_priv" attribute.
      • If both a policy object and a sched_config file are present, the sched_config file will be ignored.
      • One can unset all the policies in one shot using qmgr -c 'unset sched <sched_name> policy'; this will make the scheduler read the sched_config file in the next iteration.
      • Any change to a policy object takes effect in the very next cycle its corresponding scheduler runs. Schedulers do not need a SIGHUP for changes in a policy object to take effect.
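      • A short sketch of the prime/non-prime mechanics above (the policy names p_all and p_night are illustrative):
        # an all-time policy must be in place before a prime/non-prime policy can be attached
        qmgr -c "c policy p_all"
        qmgr -c "s policy p_all by_queue=False, strict_ordering=True"
        qmgr -c "c policy p_night"
        qmgr -c "s policy p_night load_balancing=True"
        # p_all applies at all times; p_night takes over during non-prime time
        qmgr -c "s sched sched1 policy='p_all,np:p_night'"
        # revert to the old sched_config file: unset policy; it is read again in the next cycle
        qmgr -c "unset sched sched1 policy"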

  • Interface 4: Changes to PBS server.
    • Visibility: Public
    • Change Control: Stable
    • Details:
      • PBS no longer allows attributes like scheduling and scheduler_iteration to be set on the PBS server object.
      • scheduling and scheduler_iteration now belong to the sched object.
      • backfill_depth will also be an attribute of the scheduler's policy object.
        • If the scheduler is configured to use sched_config instead of a policy object, it will take the value of backfill_depth from the scheduler object. If not set on the scheduler object, it will take what is set on the server object (we should deprecate backfill_depth on the server object).
        • If the scheduler is configured to use a policy object instead of the sched_config file, it will take the value of backfill_depth from the scheduler's policy object.
        • If backfill_depth is set at the per-queue level, that value takes precedence over the value set on the sched object or server object.
      • These attributes now belong to a scheduler object and need to be set on the scheduler object using a scheduler name:
        • qmgr -c "s policy p1 backfill_depth=3"
        • qmgr -c "s sched multi_sched_1 policy = p1"
      • Setting these attributes on the server will result in the following warning:
        • qmgr -c "s s backfill_depth=3"
        • qmgr: Warning: backfill_depth in server is deprecated. Set backfill_depth in a scheduler policy object.
      • The job_sort_formula attribute has been moved from the server to the scheduler's policy object.
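      • A brief sketch of the backfill_depth precedence described above (the queue name workq is illustrative):
        # a per-queue value takes precedence over policy and server values
        qmgr -c "s q workq backfill_depth=5"
        # the policy value is used when the scheduler runs from a policy object
        qmgr -c "s policy p1 backfill_depth=3"
        qmgr -c "s sched multi_sched_1 policy=p1"
        # still accepted on the server, but now draws the deprecation warning
        qmgr -c "s s backfill_depth=3"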

  • Interface 5: Changes to PBS Nodes objects.
    • Visibility: Public
    • Change Control: Stable
    • Details:
      • The node object in PBS will have an additional attribute called "partition" which can be used to associate a node with a particular partition.
        • This attribute will be of type string; it will be settable only by a Manager/operator and viewable by all users.
      • If the "partition" attribute is not set, the node will not belong to any partition and the default scheduler will schedule jobs on this node.
      • A PBS admin/manager can set a node's partition attribute to an existing partition name, and its corresponding scheduler will schedule jobs on this node.
      • When a scheduler object is deleted, all the nodes that were associated with the deleted scheduler's partition will move back to the default scheduler and their "partition" attribute will be unset.
      • If a node is associated with a partition, it cannot be linked to any queue which isn't part of that partition. Trying to set a node to a queue which isn't part of its partition will result in the following error:
        • qmgr -c "s n node1 queue=workq3"
          workq3 is not part of partition <node's partition name>
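      • For example (the node, queue and partition names are illustrative):
        # attach a node to an existing partition; that partition's scheduler now places jobs on it
        qmgr -c "s n node1 partition=part1"
        # linking the node to a queue outside its partition is rejected
        qmgr -c "s n node1 queue=workq3"
        # workq3 is not part of partition part1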

  • Interface 6: Changes to Queues.
    • Visibility: Public
    • Change Control: Stable
    • Details:
      • The queue_type attribute on the queue will be extended to accept 3 more values - "execution_prime", "execution_non_prime" and "dedicated".
        • If queue_type is set to "execution_prime", jobs from this queue will be considered by the scheduler only during prime time.
        • If queue_type is set to "execution_non_prime", jobs from this queue will be considered by the scheduler only during non-prime time.
        • If queue_type is set to "dedicated", jobs from this queue will be considered only during dedicated time.
        • If queue_type is set to "execution", jobs from this queue will be considered to run irrespective of prime/non-prime time.
      • The queue will have a new "partition" attribute which can be used to associate the queue with a particular partition.
        • This attribute will be of type string; it will be settable only by a Manager/operator and viewable by all users.
      • If the "partition" attribute is not set, the queue will not belong to any partition and the default scheduler will schedule jobs from this queue.
      • When a scheduler object is deleted, all the queues that were associated with the deleted scheduler's partition will move back to the default scheduler and their "partition" attribute will be unset.
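      • A small sketch tying the two together (the queue and partition names are illustrative):
        # a queue whose jobs are considered only during non-prime time
        qmgr -c "c queue nightq queue_type=execution_non_prime"
        # place the queue in a partition so that partition's scheduler handles it
        qmgr -c "s q nightq partition=part1"
        qmgr -c "s q nightq enabled=True started=True"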


  • Interface 7: How PBS server runs scheduler.
    • Visibility: Public
    • Change Control: Stable
    • Details:
      • Upon startup the PBS server will start all schedulers which have their scheduling attribute set to "True".
        • The "PBS_START_SCHED" environment variable is now deprecated; its value is overridden by the scheduler's "scheduling" attribute.
      • The PBS server will connect to these schedulers on their respective hostnames and port numbers.
      • If the server is unable to connect to a scheduler, it will check whether the scheduler is running, try to connect 5 times, and finally restart the scheduler.
      • Scheduling cycles for all configured schedulers are started by the PBS server when a job is queued or finished, when the scheduling attribute is set to True, or when scheduler_iteration has elapsed.
        • When a job gets queued or finishes, the server will check its corresponding queue and try to connect to the corresponding scheduler to run a scheduling cycle.
        • If a scheduler is already running a scheduling cycle, the server will just wait for the previous cycle to finish before trying to start another one.
        • If job_accumulation_time is set, the server will wait until that much time has passed after the submission of a job before starting a new cycle.
      • Each scheduler, while querying the server, specifies its scheduler name and then gets only the chunk of the universe which is relevant to it.
        • It gets all the running, queued and exiting jobs from the queues associated with one of its partitions.
        • It gets the list of nodes which are associated with the partitions managed by the scheduler.
        • It gets the list of all the global policies, like run soft/hard limits, set on the server object.
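      • For example, to batch up rapid job submissions before triggering a cycle (assuming the value is given in seconds, which this proposal does not pin down):
        qmgr -c "s sched multi_sched_1 job_accumulation_time=10"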


  • Interface 8: What does not work when multiple scheduler objects are present.
    • Visibility: Public
    • Change Control: Experimental
    • Details:
      • When multiple scheduler objects are configured, the following things might be broken:
        • Run limits set on the server may appear to be broken because a scheduler object may not have a view of the whole universe.
        • Fairshare is now limited to what a specific scheduler sees; it cannot be done complex-wide with multiple schedulers.

