...

  • When preemption is enabled and the scheduler is unable to run a high priority job, it looks for low priority job(s) that can be preempted so that the high priority job can run.
  • The scheduler then sends the shortlisted jobs one by one to the server, using the following process:
    • Find the preempt_order to be used for this job and try each method one after another.
      • For example, if preempt_order in sched_config is set to "SCR 80 SC 50 S" and the job has utilized between 100% and 81% of the requested time:
        • The scheduler will send a signal job request to the server to suspend the job.
        • If suspension fails, the scheduler will send a hold job request to checkpoint the job.
        • If checkpointing fails, the scheduler will send a rerun job request.
  • This sequence of events has two issues:
    • If the number of jobs to be preempted is large, say 100, the scheduler will send at least 100 separate batch requests to the server in the best case. In the worst case, where all three methods are tried for every job, the number of requests could be as high as 300.
    • On systems like Cray, suspending a job could take 3-5 seconds, so if the scheduler is suspending 100 jobs, scheduling will stop for 300-500 seconds.
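
The method selection and the request counts above can be sketched as follows. This is a minimal illustration, not the scheduler's actual code: the function names `parse_preempt_order`, `methods_for` and `request_bounds` are hypothetical, and the percentage is interpreted the way the example above describes it (100-81 selects the first group, 80-51 the second, 50 and below the third).

```python
def parse_preempt_order(value):
    """Split a preempt_order string into (upper_bound, methods) bands.

    "SCR 80 SC 50 S" -> [(100, "SCR"), (80, "SC"), (50, "S")]
    """
    tokens = value.split()
    bands = [(100, tokens[0])]  # the first group applies from 100% downwards
    for i in range(1, len(tokens) - 1, 2):
        bands.append((int(tokens[i]), tokens[i + 1]))
    return bands


def methods_for(bands, pct):
    """Return the ordered list of methods to try for a job at `pct` percent."""
    chosen = bands[0][1]
    for upper, methods in bands:
        if pct <= upper:  # keep narrowing while pct is inside the band
            chosen = methods
    return list(chosen)


def request_bounds(num_jobs, methods):
    """Best case: one request per job. Worst case: every method is tried."""
    return num_jobs, num_jobs * len(methods)


bands = parse_preempt_order("SCR 80 SC 50 S")
print(methods_for(bands, 90))                        # ['S', 'C', 'R']
print(request_bounds(100, methods_for(bands, 90)))   # (100, 300)
```

With 100 jobs and the "SCR" group, the bounds match the 100 and 300 requests quoted above.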

Proposed changes in this phase:

  • Preemption related configuration parameters will be removed from the default sched_config.
  • The parameters that will be removed are preempt_queue_prio, preempt_prio, preempt_order and preempt_sort.
  • These parameters will be set through qmgr as explained in the Interface Changes section.
  • The scheduler will now send a list of jobs to be preempted to the server.
  • For this, a new batch request will be introduced with the following fields:
    • <job1><job2>....<jobn>
  • With this change, the server will be responsible for trying to preempt each job, in the order sent by the scheduler, thus reducing the number of requests and responses exchanged between the scheduler and the server.
  • The server will send requests to respective MoMs (single MoM for Cray), collect the replies and then send the reply to the scheduler indicating success or failure.
  • The format of the response will be -
    • Success <0><job1:S/C/R><job2:S/C/R>....<jobn:S/C/R>
    • Failure <1><job1:S/C/R><job2:0>....<jobn:S/C/R>
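
The request/response exchange described above could look roughly like the sketch below. It rests on several assumptions that are not stated in this document: the angle-bracket notation is treated as the literal wire format, the letter after the colon is taken to be the method that succeeded, `0` is taken to mark a job that could not be preempted, and the overall status is `1` if any job failed. The names `preempt_jobs`, `parse_reply`, `methods_for_job` and `try_method` are hypothetical.

```python
import re


def preempt_jobs(job_ids, methods_for_job, try_method):
    """Server side: try each method in order for every job, build one reply."""
    results = []
    for job_id in job_ids:
        used = "0"  # assumed marker for "no method worked"
        for method in methods_for_job(job_id):
            if try_method(job_id, method):
                used = method
                break
        results.append((job_id, used))
    status = 0 if all(used != "0" for _, used in results) else 1
    return "<%d>" % status + "".join("<%s:%s>" % r for r in results)


def parse_reply(reply):
    """Scheduler side: return (all_succeeded, {job_id: method or None})."""
    fields = re.findall(r"<([^<>]*)>", reply)
    outcome = {}
    for field in fields[1:]:
        job_id, _, method = field.rpartition(":")
        outcome[job_id] = None if method == "0" else method
    return int(fields[0]) == 0, outcome


# Fake MoM behaviour for illustration only.
def fake_try(job_id, method):
    return (job_id, method) in {("1.svr", "S"), ("3.svr", "C")}


reply = preempt_jobs(["1.svr", "2.svr", "3.svr"], lambda j: "SCR", fake_try)
print(reply)               # <1><1.svr:S><2.svr:0><3.svr:C>
print(parse_reply(reply))
```

Whatever the final encoding, the point is that the scheduler sends one request and receives one reply, regardless of how many jobs and methods are involved.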

...

If any of these parameters were modified before the upgrade, the same modifications should be made again after the upgrade succeeds, this time through qmgr as explained above.

Proposed changes for the future:

  • The server will send requests to all the respective MoMs in parallel and each request will have a list of jobs to be preempted on that MoM.
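
As a rough illustration of this idea, the sketch below groups jobs by MoM and dispatches one request per MoM concurrently. It is only a model of the behaviour, not how the server would be implemented: the helper names `group_by_mom`, `preempt_in_parallel` and `send_to_mom` are hypothetical, each job is assumed to map to a single MoM, and threads stand in for whatever concurrency mechanism the server actually uses.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_mom(job_to_mom):
    """Turn {job_id: mom} into {mom: [job_id, ...]}."""
    per_mom = defaultdict(list)
    for job_id, mom in job_to_mom.items():
        per_mom[mom].append(job_id)
    return dict(per_mom)


def preempt_in_parallel(job_to_mom, send_to_mom):
    """Send one request per MoM, each carrying that MoM's list of jobs."""
    per_mom = group_by_mom(job_to_mom)
    with ThreadPoolExecutor(max_workers=max(1, len(per_mom))) as pool:
        futures = {
            mom: pool.submit(send_to_mom, mom, jobs)
            for mom, jobs in per_mom.items()
        }
        # Collect every MoM's reply before answering the scheduler.
        return {mom: future.result() for mom, future in futures.items()}


# Stand-in for the real MoM request; reports every job as suspended.
def fake_send(mom, jobs):
    return {job_id: "S" for job_id in jobs}


replies = preempt_in_parallel(
    {"1.svr": "momA", "2.svr": "momB", "3.svr": "momA"}, fake_send
)
print(replies)
```

With this arrangement the total wait is bounded by the slowest MoM rather than by the sum of all suspensions.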