Preemption Optimization - Scheduler to send list of jobs to the server.

forum discussion.

Problem

  • When preemption is enabled and the scheduler is unable to run a high priority job, it finds low priority job(s) that can be preempted so that the high priority job can run.
  • The scheduler then sends the shortlisted jobs one-by-one to the server, following the below process -
    • Find the preempt_order to be used for this job and try each method one after another.
      • For example, if preempt_order in sched_config is set to "SCR 80 SC 50 S" and the job has utilized between 100-81% of the requested time:
        • The scheduler will send a signal job request to the server to suspend the job.
        • If suspension fails, the scheduler will send a hold job request to checkpoint the job.
        • If checkpointing fails, the scheduler will send a rerun job request to requeue the job.
  • With the above sequence of events, we have two issues -
    • If the number of jobs to be preempted is large, say 100, the scheduler will send at least 100 separate batch requests to the server in the best case, where every job is preempted by the first method tried. In the worst case, where every job needs all three methods, the number of requests could be as high as 300.
    • On systems like Cray, suspending a job could take 3-5 seconds, and if the scheduler is suspending 100 jobs, scheduling will stop for 300-500 seconds.
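
The per-job sequence and the request counts above can be illustrated with a small Python model. This is only a sketch of the behaviour described in this section, not scheduler code: the function names are hypothetical, and `percent` stands for whatever percentage the preempt_order ranges are compared against.

```python
from typing import List, Set, Tuple


def parse_preempt_order(spec: str) -> List[Tuple[int, str]]:
    """Split e.g. "SCR 80 SC 50 S" into (upper bound %, methods) pairs."""
    tokens = spec.split()
    ranges = []
    upper = 100  # the first ordering applies from 100% downwards
    for i in range(0, len(tokens), 2):
        ranges.append((upper, tokens[i]))
        if i + 1 < len(tokens):
            upper = int(tokens[i + 1])
    return ranges


def methods_for(spec: str, percent: int) -> str:
    """Pick the ordering whose range contains `percent` (e.g. 100-81 -> SCR)."""
    chosen = ""
    for upper, methods in parse_preempt_order(spec):
        if percent <= upper:
            chosen = methods
    return chosen


def requests_for_job(methods: str, working: Set[str]) -> int:
    """Count the requests sent for one job: one per method tried,
    stopping at the first method in `working` (methods that would succeed)."""
    count = 0
    for method in methods:
        count += 1
        if method in working:
            break
    return count


if __name__ == "__main__":
    order = methods_for("SCR 80 SC 50 S", 90)  # "SCR"
    jobs = 100
    best = jobs * requests_for_job(order, {"S"})   # suspend works first time
    worst = jobs * requests_for_job(order, set())  # every method fails
    print(order, best, worst)  # SCR 100 300
    # At 3-5 seconds per suspend, 100 suspends block for 300-500 seconds.
    print(jobs * 3, jobs * 5)  # 300 500
```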

Proposed changes:

  • Preemption related configuration parameters will be removed from the default sched_config.
  • The parameters that will be removed are - preempt_queue_prio, preempt_prio, preempt_order and preempt_sort.
  • These parameters will be set through qmgr as explained in the Interface Changes section.
  • The scheduler will now send a list of jobs to be preempted to the server.
  • For this, a new batch request will be introduced which will have the below fields -
    • <job1><job2>....<jobn>
  • With this change, the server will try to preempt each job in the order sent by the scheduler, thus reducing the number of requests and responses exchanged between the scheduler and the server.
  • The server will send requests to respective MoMs (single MoM for Cray), collect the replies and then send the reply to the scheduler indicating success or failure.
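  • The format of the response will be as follows, where each job is tagged with the method that preempted it (S, C or R) or with 0 if it could not be preempted -
    • Success: <0><job1:S/C/R><job2:S/C/R>....<jobn:S/C/R>
    • Failure: <1><job1:S/C/R><job2:0>....<jobn:S/C/R>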
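
The server-side loop and the reply layout can be sketched as follows. This is a minimal Python model under stated assumptions: `methods_for_job` and `try_method` are hypothetical callbacks standing in for the preempt_order lookup and the request to the MoM, and the string produced by `format_reply` only mirrors the notation used in this document, not the actual wire encoding of the batch reply.

```python
from typing import Callable, List, Tuple

Result = Tuple[str, str]  # (job id, method used or "0")


def preempt_jobs(
    job_ids: List[str],
    methods_for_job: Callable[[str], str],
    try_method: Callable[[str, str], bool],
) -> Tuple[int, List[Result]]:
    """Try each job in the order given, each method in that job's order."""
    results: List[Result] = []
    for job_id in job_ids:
        used = "0"  # stays "0" if no method works
        for method in methods_for_job(job_id):
            if try_method(job_id, method):
                used = method
                break
        results.append((job_id, used))
    # Overall status: 0 if every job was preempted, 1 otherwise.
    status = 0 if all(method != "0" for _, method in results) else 1
    return status, results


def format_reply(status: int, results: List[Result]) -> str:
    """Render the reply in the <status><job:method>... notation."""
    return f"<{status}>" + "".join(f"<{job}:{method}>" for job, method in results)


if __name__ == "__main__":
    # Hypothetical behaviour: job2 cannot be preempted by any method,
    # job3 cannot be suspended but can be checkpointed.
    def try_method(job_id: str, method: str) -> bool:
        if job_id == "job2":
            return False
        if job_id == "job3":
            return method == "C"
        return True

    status, results = preempt_jobs(["job1", "job2", "job3"], lambda _: "SCR", try_method)
    print(format_reply(status, results))  # <1><job1:S><job2:0><job3:C>
```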

Interface Changes

  1. Interface 1: preempt_queue_prio parameter to be set through qmgr.
    1. Visibility: Public
    2. Change control: Stable
    3. Synopsis: preempt_queue_prio parameter will be removed from sched_config and will be set through qmgr
    4. Details:
      1. Type: Integer
      2. Examples:
        1. qmgr: set scheduler <sched_name> preempt_queue_prio=250
      3. default: 150
      4. If this parameter is found in sched_config, a message will be logged stating that it must be set through qmgr and that setting it in the sched_config file has no effect.
      5. Access permissions: Managers will have read and write permission. Operators and users will have read-only permission.
      6. When this parameter is unset, it will take the default value - 150
  2. Interface 2: preempt_prio parameter to be set through qmgr.
    1. Visibility: Public
    2. Change control: Stable
    3. Synopsis: preempt_prio parameter will be removed from sched_config and will be set through qmgr
    4. Details:
      1. Type: String_Array
      2. Examples:
        1. qmgr: set scheduler <sched_name> preempt_prio="express_queue, normal_jobs"
        2. qmgr: set scheduler <sched_name> preempt_prio="starving_jobs, normal_jobs, starving_jobs+fairshare, fairshare"
      3. default: express_queue, normal_jobs
      4. If this parameter is found in sched_config, a message will be logged stating that it must be set through qmgr and that setting it in the sched_config file has no effect.
      5. Access permissions: Managers will have read and write permission. Operators and users will have read-only permission.
      6. When this parameter is unset, it will take the default value - "express_queue, normal_jobs"
  3. Interface 3: preempt_order parameter to be set through qmgr.
    1. Visibility: Public
    2. Change control: Stable
    3. Synopsis: preempt_order parameter will be removed from sched_config and will be set through qmgr
    4. Details:
      1. Type: String
      2. Examples:
        1. qmgr: set scheduler <sched_name> preempt_order="SR"
        2. qmgr: set scheduler <sched_name> preempt_order="SCR 80 SC 50 S"
      3. default: SCR
      4. If this parameter is found in sched_config, a message will be logged stating that it must be set through qmgr and that setting it in the sched_config file has no effect.
      5. Access permissions: Managers will have read and write permission. Operators and users will have read-only permission.
      6. When this parameter is unset, it will take the default value - "SCR"
  4. Interface 4: preempt_sort parameter to be set through qmgr.
    1. Visibility: Public
    2. Change control: Stable
    3. Synopsis: preempt_sort parameter will be removed from sched_config and will be set through qmgr
    4. Details:
      1. Type: String
      2. This parameter will either be set to the default (min_time_since_start) or unset.
      3. default: min_time_since_start
      4. If this parameter is found in sched_config, a message will be logged stating that it must be set through qmgr and that setting it in the sched_config file has no effect.
      5. Access permissions: Managers will have read and write permission. Operators and users will have read-only permission.
      6. When this parameter is unset, the scheduler will not consider this parameter for preemption.
  5. Interface 5: A new PBS IFL API for sending a list of jobs to be preempted.
    1. Visibility: Public
    2. Change control: Stable
    3. Synopsis: This API will be used by the scheduler to send a list of jobs to the server for preemption.
    4. Details:
      1. Signature: preempt_job_info *pbs_preempt_jobs(int conn, char **job_id_list)
        1. conn - server socket
        2. job_id_list - A NULL-terminated list of job_ids.
      2. Return Value:
        1. preempt_job_info*. The preempt_job_info structure is described below.
          1. It has two fields
            1. job_id - char*
            2. preempt_method - char
              1. The value will indicate the status 
                1. 'S' - Indicating that the job was preempted using suspension.
                2. 'C' - Indicating that the job was preempted using checkpointing.
                3. 'R' - Indicating that the job was preempted using re-queue.
                4. '0' - Indicating that the job could not be preempted.
      3. The scheduler uses the preempt_method from the response to determine whether or not to release all the resources the job owned.
  6. Interface 6: qmgr to display the values of preemption parameters.
    1. Visibility: Public
    2. Change control: Stable
    3. Details: 
      1. When these parameters are unset, they will take default values and will continue to be displayed in qmgr print sched.
      2. Qmgr: p sched default

                            #
                            # Create and define scheduler default
                            #
                            create sched default
                            set sched sched_host = d_server
                            set sched sched_cycle_length = 00:20:00
                            set sched sched_priv = /var/spool/pbs/sched_priv
                            set sched sched_log = /var/spool/pbs/sched_logs
                            set sched scheduling = True
                            set sched scheduler_iteration = 600
                            set sched state = scheduling

                            set sched preempt_queue_prio = 150
                            set sched preempt_prio = express_queue, normal_jobs
                            set sched preempt_order = SCR
                            set sched preempt_sort = min_time_since_start
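
For Interface 5, the way the scheduler might consume the returned preempt_job_info entries can be modelled in Python. This is a hedged sketch, not the scheduler's implementation: `PreemptJobInfo` simply mirrors the two fields described above, the job ids are made up, and the rule in `releases_all_resources` (checkpointed and requeued jobs give back everything, suspended jobs may keep some resources) is an assumption made for illustration.

```python
from typing import List, NamedTuple, Tuple


class PreemptJobInfo(NamedTuple):
    """Python stand-in for the preempt_job_info structure."""
    job_id: str
    preempt_method: str  # 'S', 'C', 'R', or '0' if preemption failed


PREEMPTED = {"S", "C", "R"}


def split_results(
    infos: List[PreemptJobInfo],
) -> Tuple[List[PreemptJobInfo], List[PreemptJobInfo]]:
    """Separate the jobs that were preempted from those that were not."""
    preempted = [info for info in infos if info.preempt_method in PREEMPTED]
    failed = [info for info in infos if info.preempt_method not in PREEMPTED]
    return preempted, failed


def releases_all_resources(info: PreemptJobInfo) -> bool:
    """ASSUMPTION for illustration: only checkpoint and requeue free
    everything the job owned; a suspended job may still hold some resources."""
    return info.preempt_method in {"C", "R"}


if __name__ == "__main__":
    reply = [
        PreemptJobInfo("1.server", "S"),
        PreemptJobInfo("2.server", "0"),
        PreemptJobInfo("3.server", "R"),
    ]
    preempted, failed = split_results(reply)
    for info in preempted:
        print(info.job_id, info.preempt_method, releases_all_resources(info))
    print("not preempted:", [info.job_id for info in failed])
```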

Results before change: out1.txt and after changes: out2.txt

Overlay Upgrade considerations:

The preemption related configuration parameters will be removed from the default sched_config available in $PBS_EXEC/etc/pbs_sched_config.

If any of these parameters were modified before the upgrade, the same modifications must be reapplied after the upgrade succeeds, this time through qmgr as explained above.
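
One way to carry the old values over is to read them from the pre-upgrade sched_config and print the matching qmgr commands. The Python helper below is only a sketch: it assumes each setting appears as a `name: value` line with an optional pair of double quotes, it is not a tool shipped with PBS, and the scheduler name `default` is an example.

```python
from typing import Dict, List

PREEMPT_PARAMS = ("preempt_queue_prio", "preempt_prio", "preempt_order", "preempt_sort")


def read_preempt_settings(text: str) -> Dict[str, str]:
    """Collect the preemption parameters from sched_config contents.
    Assumes 'name: value' lines; blank lines and '#' comments are skipped."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in PREEMPT_PARAMS:
            settings[name] = value.strip().strip('"')
    return settings


def qmgr_commands(settings: Dict[str, str], sched_name: str = "default") -> List[str]:
    """Build one 'set scheduler' command per parameter that was found."""
    commands = []
    for name in PREEMPT_PARAMS:
        if name not in settings:
            continue
        value = settings[name]
        # The integer parameter is written bare, the others are quoted.
        rendered = value if name == "preempt_queue_prio" else f'"{value}"'
        commands.append(f"set scheduler {sched_name} {name}={rendered}")
    return commands


if __name__ == "__main__":
    old_config = """
    # preempt_sort: min_time_since_start
    preempt_queue_prio: 250
    preempt_order: "SCR 80 SC 50 S"
    """
    for command in qmgr_commands(read_preempt_settings(old_config)):
        print(command)
    # set scheduler default preempt_queue_prio=250
    # set scheduler default preempt_order="SCR 80 SC 50 S"
```

Each printed line can then be entered at the qmgr prompt by a manager.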

Proposed changes for the future

  • The server will send requests to all the respective MoMs in parallel and each request will have a list of jobs to be preempted on that MoM.
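
The grouping this future change implies can be sketched in Python. This is an illustration only: `job_to_mom` is a hypothetical mapping from each job to the MoM that would receive its preemption request, and `send` is a stand-in for the real request to that MoM.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def group_by_mom(job_ids: List[str], job_to_mom: Dict[str, str]) -> Dict[str, List[str]]:
    """Build one job list per MoM, keeping the scheduler's job order."""
    per_mom: Dict[str, List[str]] = defaultdict(list)
    for job_id in job_ids:
        per_mom[job_to_mom[job_id]].append(job_id)
    return dict(per_mom)


def preempt_in_parallel(
    per_mom: Dict[str, List[str]],
    send: Callable[[str, List[str]], object],
) -> Dict[str, object]:
    """Send one request per MoM concurrently and collect the replies."""
    with ThreadPoolExecutor(max_workers=max(1, len(per_mom))) as pool:
        futures = {mom: pool.submit(send, mom, jobs) for mom, jobs in per_mom.items()}
        return {mom: future.result() for mom, future in futures.items()}


if __name__ == "__main__":
    job_to_mom = {"job1": "momA", "job2": "momB", "job3": "momA"}
    per_mom = group_by_mom(["job1", "job2", "job3"], job_to_mom)
    print(per_mom)  # {'momA': ['job1', 'job3'], 'momB': ['job2']}
    # Stub sender that just reports how many jobs each MoM was asked to preempt.
    print(preempt_in_parallel(per_mom, lambda mom, jobs: len(jobs)))
```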