...
The design proposal below is intended to address both of these problems.
Interface 1: Extend PBS to support a list of scheduler objects
- Visibility: Public
- Change Control: Stable
- Details:
- PBS supports a list of scheduler objects, which are created using qmgr in the same way nodes are created on the server.
- The qmgr command can be used to create a scheduler object. It must be invoked by a PBS admin/manager.
- To create a scheduler object and make it run, the user must set the following mandatory attributes:
- The name of the scheduler must be given when creating a scheduler object.
- qmgr -c "c sched multi_sched_1"
- This will create/set the following attributes for the sched object
- port - If not defined by the user, the server starts from 15050 and tries to run the scheduler on the next available port number.
- host (read-only for now; has the same value as the PBS server host)
- queues = None (default)
- sched_priv = $PBS_HOME/multi_sched_1_priv (default)
- sched_log = $PBS_HOME/multi_sched_1_log (default)
- scheduling = False (default)
- scheduler_iteration = 600 (default)
- Set the priv directory for the scheduler.
- The directory must be root owned (Does this still need to be the case? Should it be something pbsadmin or configurable?) and should have permissions set to "750". By default a sched object has its priv directory set to $PBS_HOME/<sched-name>_priv
- qmgr -c "s sched multi_sched_1 sched_priv=/var/spool/pbs/sched_priv_1"
- Set the log directory for the scheduler.
- The directory must be root owned and should have permissions set to "755". By default a sched object has its logs directory set to $PBS_HOME/<sched_name>_logs
- qmgr -c "s sched multi_sched_1 sched_log=/var/spool/pbs/sched_logs"
- To set scheduling on a newly created scheduler object, one must use the scheduler name.
- By default a multi-sched object has scheduling set to False. If no name is specified, the PBS server will enable/disable scheduling on the default scheduler.
- qmgr -c "s sched <scheduler name> scheduling = 1"
- By default the PBS server will configure a default scheduler, which will run out of the box.
- The name of this default scheduler will be "pbs_sched".
- The sched_priv directory of this default scheduler will be set to $PBS_HOME/sched_priv.
- The default scheduler will log to the $PBS_HOME/sched_logs directory.
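The defaults listed above for a newly created scheduler object can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, "next available port" is modelled as a lookup in a set of ports already in use (the real server would presumably probe the ports), and the directory suffixes are taken from the attribute listing for multi_sched_1.

```python
BASE_PORT = 15050  # first port tried when the user does not set one


def next_free_port(ports_in_use, base=BASE_PORT):
    """Return the first port >= base that is not in ports_in_use."""
    port = base
    while port in ports_in_use:
        port += 1
    return port


def default_sched_attrs(name, pbs_home, server_host, ports_in_use=()):
    """Build the attributes of a new scheduler object (hypothetical helper)."""
    return {
        "port": next_free_port(set(ports_in_use)),
        "host": server_host,  # read-only for now, same as the server host
        "queues": None,
        "sched_priv": f"{pbs_home}/{name}_priv",
        "sched_log": f"{pbs_home}/{name}_log",
        "scheduling": False,
        "scheduler_iteration": 600,
    }
```

For example, default_sched_attrs("multi_sched_1", "/var/spool/pbs", "server1", {15050}) would pick port 15051 and leave scheduling set to False until the admin enables it.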
- Interface 2: Changes to PBS scheduler
- Visibility: Public
- Change Control: Stable
- Details:
- Scheduler now has additional attributes which can be set in order to run it.
- sched_priv - points to the directory where the scheduler keeps the fairshare usage, resource_group, holidays, and sched_config files
- sched_log - points to the directory where the scheduler writes its logs.
- policy - collection of various attributes (as mentioned below) which can be used to configure scheduler.
- queues - list of all the queues for which this scheduler is going to schedule jobs.
- host - hostname on which scheduler is running. For default scheduler it is set to pbs server hostname.
- port - port number on which scheduler is listening.
- job_accumulation_time - amount of time server will wait after the submission of a job before starting a new cycle.
- state - This attribute shows the status of the scheduler. It is a parameter that is set only by pbs server.
- One can assign a queue or a list of queues to a scheduler object. Once set, that scheduler object will only schedule jobs from the specified queues.
- qmgr -c "s sched multi_sched_1 queues=hp_queue1,hp_queue2"
- If no queues are specified for a given scheduler object, that scheduler will not schedule any jobs.
- By default, all new queues will be attached to the default scheduler, unless specified otherwise.
- A queue that is attached to one scheduler cannot be attached to another scheduler. An attempt to do so produces the following error:
- qmgr -c "s sched multi_sched_1 queues=workq"
Queue workq is already associated with scheduler <sched_name>.
- A scheduler can now accept a set of policies to work with:
- A policy can be specified using the command qmgr -c "s sched <sched_name> policy=<policy object>".
- The scheduler object's "state" attribute will show one of these 3 values: DOWN, IDLE, SCHEDULING
- If a scheduler object is created but the scheduler is not running for some reason, the state will be shown as "DOWN"
- If a scheduler is up and running but waiting for a cycle to be triggered, the state will be shown as "IDLE"
- If a scheduler is up and running and is also running a scheduling cycle, the state will be shown as "SCHEDULING"
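The queue association rule and the three state values described above can be sketched as follows. This is a minimal model, not the server implementation: the class and function names are hypothetical, and only explicit associations are tracked (how a queue leaves the default scheduler is not modelled).

```python
class QueueAlreadyAssociated(Exception):
    """Raised when a queue already belongs to another scheduler."""


class QueueMap:
    """Tracks which scheduler each queue is attached to (hypothetical)."""

    def __init__(self):
        self.owner = {}  # queue name -> scheduler name

    def set_queues(self, sched, queues):
        # Validate the whole list first so a failure changes nothing.
        for queue in queues:
            current = self.owner.get(queue)
            if current is not None and current != sched:
                raise QueueAlreadyAssociated(
                    f"Queue {queue} is already associated with scheduler {current}."
                )
        for queue in queues:
            self.owner[queue] = sched


def sched_state(is_running, in_cycle):
    """Map the scheduler's condition to the value of its state attribute."""
    if not is_running:
        return "DOWN"
    return "SCHEDULING" if in_cycle else "IDLE"
```

Checking every queue before changing anything is a design choice of this sketch; the proposal does not say whether a partially valid list is applied.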
- Interface 3: New policy object
- Visibility: Public
- Change Control: Stable
- Details:
- Admins will now be allowed to create policy objects and give each policy object a name.
- Admins can then assign these policy objects to specific schedulers. One policy object can be assigned to more than one scheduler.
- A policy object can be deleted only when it is not assigned to any scheduler.
- Example:
qmgr -c "c policy p1"
qmgr -c "s p p1 by_queue=False, strict_ordering=True"
qmgr -c "s sched scheduler1 policy=p1"
- Below is the list of policies that reside in the policy attribute of the scheduler.
| Policy name | Type | Default value | Example |
| --- | --- | --- | --- |
| round_robin | Boolean | round_robin=False | qmgr -c "s sched sched1 policy.round_robin=True" |
| by_queue | Boolean | by_queue=True | qmgr -c "s sched sched1 policy.by_queue=True" |
| strict_ordering | Boolean | strict_ordering=False | qmgr -c "s sched sched1 policy.strict_ordering=True" |
| help_starving_jobs | Boolean | help_starving_jobs=True | qmgr -c "s sched sched1 policy.help_starving_jobs=True" |
| max_starve | string | max_starve="24:00:00" | qmgr -c "s sched sched1 policy.max_starve=24:00:00" |
| node_sort_formula | string | node_sort_formula="sort_priority" | qmgr -c 's sched sched1 policy.node_sort_formula="resources_available.ncpus - resources_assigned.ncpus"' |
| provision_policy | string | provision_policy="aggressive_provision" | qmgr -c 's sched sched1 policy.provision_policy="aggressive_provision"' |
| exclude_resources | array_string | NOT SET BY DEFAULT | qmgr -c 's sched sched1 policy.exclude_resources="vmem, color"' |
| load_balancing | Boolean | load_balancing=False | qmgr -c "s sched sched1 policy.load_balancing=True" |
| fairshare | Boolean | fairshare=False | qmgr -c "s sched sched1 policy.fairshare=True" |
| fairshare_usage_res | string | fairshare_usage_res=cput | qmgr -c "s sched sched1 policy.fairshare_usage_res=cput" |
| fairshare_entity | string | fairshare_entity=euser | qmgr -c "s sched sched1 policy.fairshare_entity=euser" |
| fairshare_decay_time | string | fairshare_decay_time="24:00:00" | qmgr -c "s sched sched1 policy.fairshare_decay_time=24:00:00" |
| fairshare_enforce_no_shares | Boolean | fairshare_enforce_no_shares=True | qmgr -c "s sched sched1 policy.fairshare_enforce_no_shares=True" |
| preemption | Boolean | preemption=True | qmgr -c "s sched sched1 policy.preemption=True" |
| preempt_queue_prio | integer | preempt_queue_prio=150 | qmgr -c "s sched sched1 policy.preempt_queue_prio=190" |
| preempt_prio | string | preempt_prio="express_queue, normal_jobs" | qmgr -c 's sched sched1 policy.preempt_prio="starving_jobs, normal_jobs, starving_jobs+fairshare"' |
| preempt_order | string | preempt_order="SCR" | qmgr -c 's sched sched1 policy.preempt_order="SCR 70 SC 30"' |
| preempt_sort | string | preempt_sort="min_time_since_start" | qmgr -c 's sched sched1 policy.preempt_sort="min_time_since_start"' |
| peer_queue | array_string | NOT SET BY DEFAULT | qmgr -c 's sched sched1 policy.peer_queue="workq workq@svr1"' |
| server_dyn_res | array_string | NOT SET BY DEFAULT | qmgr -c 's sched sched1 policy.server_dyn_res="mem !/bin/get_mem"' |
| dedicated_queues | string | NOT SET BY DEFAULT | qmgr -c 's sched sched1 policy.dedicated_queues="queue1,queue2"' |
| log_event | integer | log_event=3328 | qmgr -c "s sched sched1 policy.log_event=255" |
| job_sort_formula | string | NOT SET BY DEFAULT | qmgr -c 's sched sched1 policy.job_sort_formula="ncpus*walltime"' |
| backfill_depth | integer | backfill_depth=1 | qmgr -c 's sched sched1 policy.backfill_depth=1' |
- Following are the configurations that are moved/removed:
- mom_resources - removed (mom periodic hooks can update custom resources)
- unknown_shares - moved to resource_group file.
- smp_cluster_dist - It was already deprecated, removed now
- sort_queues - It was already deprecated, removed now
- nonprimetime_prefix - New policy object does not differentiate between prime/non-prime time (should we make a queue setting that indicates that it is a prime/non prime queue)
- primetime_prefix - New policy object does not differentiate between prime/non-prime time (should we make a queue setting that indicates that it is a prime/non prime queue)
- job_sort_key - replaced by job_sort_formula.
- node_sort_key - replaced by node_sort_formula.
- prime_spill - New policy object does not differentiate between prime/non-prime time
- prime_exempt_anytime_queues - New policy object does not differentiate between prime/non-prime time
- backfill_prime - New policy object does not differentiate between prime/non-prime time
- resources - New policy object will now list the resources that need to be excluded from scheduling. By default all resources will be used for scheduling.
- dedicated_prefix - New policy object will expose "dedicated_queues" which is a list of queues associated with dedicated time.
- preemptive_sched - This has been renamed to "preemption".
- log_filter - log_filter has been renamed to "log_event" to be in sync with the option server object exposes.
- Following are the configurations that have been moved from being a "PRIME OPTION" to a "NO PRIME OPTION":
- round_robin - Now works irrespective of prime/non-prime time
- by_queue - Now works irrespective of prime/non-prime time
- strict_ordering - Now works irrespective of prime/non-prime time
- help_starving_jobs - Now works irrespective of prime/non-prime time
- load_balancing - Now works irrespective of prime/non-prime time
- fair_share - Now works irrespective of prime/non-prime time
- preemptive_sched - Now works irrespective of prime/non-prime time
- To use policies defined in the old sched_config file, keep a copy of that file in the directory given by the "sched_priv" attribute.
- If both a policy and a sched_config file are present, the sched_config file will be ignored.
- One can unset all the policies in one shot using qmgr -c "unset sched <sched_name> policy", which makes the scheduler read the sched_config file in the next iteration.
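The lifecycle rules for policy objects (shared assignment, deletion only when unassigned, fallback to sched_config after unset) can be sketched as follows. The PolicyStore class and its method names are hypothetical and only model the rules stated in this interface.

```python
class PolicyStore:
    """Hypothetical model of named policy objects and their assignment."""

    def __init__(self):
        self.policies = {}  # policy name -> dict of settings
        self.assigned = {}  # scheduler name -> policy name

    def create(self, name):
        self.policies[name] = {}

    def set(self, name, **settings):
        self.policies[name].update(settings)

    def assign(self, sched, name):
        if name not in self.policies:
            raise KeyError(f"unknown policy: {name}")
        self.assigned[sched] = name  # several schedulers may share one policy

    def unset(self, sched):
        self.assigned.pop(sched, None)

    def delete(self, name):
        users = sorted(s for s, p in self.assigned.items() if p == name)
        if users:
            raise ValueError(f"policy {name} is still assigned to: {', '.join(users)}")
        del self.policies[name]

    def config_source(self, sched):
        # A policy, when present, wins over the sched_config file.
        return "policy" if sched in self.assigned else "sched_config"
```

With this model, creating p1, setting by_queue=False and strict_ordering=True, and assigning p1 to scheduler1 mirrors the qmgr example given earlier in this interface.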
- Interface 4: Changes to PBS server.
- Visibility: Public
- Change Control: Stable
- Details:
- PBS no longer allows attributes such as scheduling and scheduler_iteration to be set on the PBS server object.
- scheduling and scheduler_iteration now belong to the sched object
- backfill_depth will also be an attribute of scheduler's policy object.
- If scheduler is configured to use sched_config instead of policy object, then it will take value of backfill_depth from scheduler object. If not set on scheduler object then it will take what is set on the server object (We should deprecate backfill_depth on the server object).
- If scheduler is configured to use policy object instead of sched_config file, then it will take value of backfill_depth from scheduler's policy object.
- If backfill_depth is set at the per-queue level, that value will take precedence over the value set in the sched object or the server object.
- These attributes now belong to a scheduler object and need to be set on the scheduler object using the scheduler name
- qmgr -c "s policy p1 backfill_depth=3"
- qmgr -c "s sched multi_sched_1 policy = p1"
- Setting these attributes on the server will result in the following warning:
- qmgr -c "s s backfill_depth=3"
- qmgr: Warning: backfill_depth in server is deprecated. Set backfill_depth in a scheduler policy object.
- If no scheduler name is specified, the following error is thrown:
- qmgr -c "s sched policy.backfill_depth=3"
No scheduler specified, nothing done
- Attribute job_sort_formula has been moved from server to scheduler policy attribute.
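The precedence for backfill_depth described in this interface can be sketched as a single lookup. This is an illustration only: the function name and parameters are hypothetical, None stands for "not set", and the fallback of 1 for an unset policy value is taken from the default listed for backfill_depth in the policy table.

```python
def resolve_backfill_depth(uses_policy, queue=None, policy=None,
                           sched=None, server=None):
    """Pick the effective backfill_depth for jobs in one queue."""
    # A per-queue value always takes precedence.
    if queue is not None:
        return queue
    if uses_policy:
        # Policy object in use: take the value from the policy.
        return policy if policy is not None else 1
    # sched_config in use: scheduler object first, then the server object.
    if sched is not None:
        return sched
    return server
```

For example, resolve_backfill_depth(True, policy=3) yields 3, while resolve_backfill_depth(False, sched=None, server=5) falls back to the server value 5.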
- Interface 5: Changes to PBS Nodes objects.
- Visibility: Public
- Change Control: Stable
- Details:
- Each node object in PBS will have an additional attribute called "sched", which can be used to associate the node with a particular scheduler.
- By default this attribute will be set to the default scheduler started by the server (pbs_sched).
- A PBS admin/manager can set a node's sched attribute to the name of an existing scheduler, which will then schedule jobs on this node.
- When a scheduler object is deleted, all the queues/nodes that were associated with it move back to the default scheduler.
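The node association and the fallback on scheduler deletion can be sketched as follows. The SchedAssociations class is hypothetical, and refusing to delete the default scheduler is an assumption of this sketch that the proposal does not state.

```python
DEFAULT_SCHED = "pbs_sched"


class SchedAssociations:
    """Hypothetical model of node/queue to scheduler associations."""

    def __init__(self):
        self.schedulers = {DEFAULT_SCHED}
        self.node_sched = {}   # node name -> scheduler name
        self.queue_sched = {}  # queue name -> scheduler name

    def create_sched(self, name):
        self.schedulers.add(name)

    def add_node(self, node):
        self.node_sched[node] = DEFAULT_SCHED  # default value of "sched"

    def set_node_sched(self, node, sched):
        if sched not in self.schedulers:
            raise ValueError(f"unknown scheduler: {sched}")
        self.node_sched[node] = sched

    def set_queue_sched(self, queue, sched):
        if sched not in self.schedulers:
            raise ValueError(f"unknown scheduler: {sched}")
        self.queue_sched[queue] = sched

    def delete_sched(self, name):
        if name == DEFAULT_SCHED:
            # Assumption: the default scheduler cannot be deleted.
            raise ValueError("the default scheduler is kept in this sketch")
        self.schedulers.discard(name)
        # Everything owned by the deleted scheduler moves back to the default.
        for mapping in (self.node_sched, self.queue_sched):
            for key, owner in mapping.items():
                if owner == name:
                    mapping[key] = DEFAULT_SCHED
```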
- Interface 6: How PBS server runs scheduler.
- Visibility: Public
- Change Control: Stable
- Details:
- Upon startup PBS server will start all schedulers which have their scheduling attribute set to "True"
- If "PBS_START_SCHED" is set to 0 in pbs.conf then server will not start any scheduler.
- PBS server will connect to these schedulers on their respective hostnames and port number.
- If server is unable to connect to these schedulers it will check to see if the scheduler is running, try to connect 5 times, and finally restart the scheduler.
- Scheduling cycles for all configured schedulers are started by the PBS server when a job is queued, when a job finishes, when the scheduling attribute is set to True, or when scheduler_iteration has elapsed.
- When a job is queued or finishes, the server will check its queue and try to connect to that queue's scheduler to run a scheduling cycle.
- If a scheduler is already running a scheduling cycle, the server will wait for the previous cycle to finish before trying to start another one.
- If job_accumulation_time is set then server will wait until that time has passed after the submission of a job before starting a new cycle.
- Each scheduler specifies its scheduler name when querying the server and then gets only the chunk of the universe that is relevant to it.
- It gets all the running, queued, and exiting jobs from the queues it is associated with.
- It gets the list of all nodes which are associated with this scheduler and queues managed by the scheduler.
- It gets the list of all the global policies like run soft/hard limits set on the server object.
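The per-scheduler view of the universe can be sketched as a filter over the server's data. This is a simplified model: the function name and the dictionary layout are hypothetical, job states are spelled out as words, and a node is treated as belonging to a scheduler when its "sched" attribute names it.

```python
VISIBLE_JOB_STATES = {"running", "queued", "exiting"}


def scheduler_view(sched, jobs, nodes, queue_sched, server_limits):
    """Return the part of the universe that one scheduler gets to see."""
    my_queues = {q for q, owner in queue_sched.items() if owner == sched}
    my_jobs = [
        job["id"] for job in jobs
        if job["queue"] in my_queues and job["state"] in VISIBLE_JOB_STATES
    ]
    my_nodes = [node["name"] for node in nodes if node.get("sched") == sched]
    return {
        "queues": sorted(my_queues),
        "jobs": my_jobs,
        "nodes": my_nodes,
        # Global policies (e.g. run limits) are passed to every scheduler.
        "limits": dict(server_limits),
    }
```

Each scheduler receives the same server-wide limits but only its own jobs and nodes, which is the partial view the proposal describes.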
- Interface 7: What does not work when multiple scheduler objects are present.
- Visibility: Public
- Change Control: Experimental
- Details:
- When multiple scheduler objects are configured, the following things might be broken:
- Run limits set on the server may seem to be broken because a scheduler object may not have a view of the whole universe.
- Fairshare is now limited to what a specific scheduler views; it cannot be done complex-wide with multiple schedulers.
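The run limit problem can be illustrated with a small calculation. This is a sketch under a strong assumption: each scheduler enforces the server-wide limit by counting only the running jobs in its own view. The function name and the numbers are made up for illustration.

```python
def jobs_to_start(limit, running_in_view, queued_in_view):
    """Jobs a scheduler would start if it only counts jobs in its own view."""
    return min(queued_in_view, max(0, limit - running_in_view))


def total_running_after_cycle(limit, views):
    """views maps scheduler name -> (running, queued) jobs in its own view."""
    return sum(
        running + jobs_to_start(limit, running, queued)
        for running, queued in views.values()
    )
```

With a server-wide limit of 4 and two schedulers that each see 2 running and 5 queued jobs, each scheduler starts 2 more jobs, so 8 jobs end up running even though the limit was already reached across the whole complex.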
...