Reconfirming degraded reservations that are running

Follow the PBS Pro Design Document Guidelines.

Overview

Currently in PBS, degraded reservations can only be reconfirmed before they start.  If a reservation is still degraded when it starts running, it stays degraded for the lifetime of that occurrence.  PBS will be enhanced to reconfirm reservations that are running.  This applies to degraded reservations and to in-conflict reservations.


Technical Details

If a non-running degraded reservation is reconfirmed, its node solution can completely change.  When a reservation is not running, where it is going to run does not matter.  We only care that the nodes given to the reservation satisfy the select spec.  Reconfirming a running reservation is different.  There might be jobs running on its nodes, so only the unavailable nodes can be replaced.  All other nodes must remain with the reservation.
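As a rough illustration of the constraint above, the following Python sketch keeps every healthy node in place and swaps only the unavailable ones.  The function name, the flat node lists, and the assumption that any spare node can stand in for any lost node are all hypothetical; a real implementation would also have to check each replacement against the select spec.

```python
def replace_unavailable_nodes(current, unavailable, spares):
    """Return a node list that keeps healthy nodes and replaces only unavailable ones.

    Hypothetical helper: nodes are plain names and all spares are treated as
    interchangeable. Returns None when there are not enough spares, meaning
    the reservation stays degraded.
    """
    unavailable = set(unavailable)
    # Only consider spares that are usable and not already in the reservation.
    usable_spares = iter(
        n for n in spares if n not in unavailable and n not in current
    )
    result = []
    for node in current:
        if node not in unavailable:
            # Healthy nodes must remain with the running reservation.
            result.append(node)
            continue
        replacement = next(usable_spares, None)
        if replacement is None:
            return None
        result.append(replacement)
    return result
```

For example, `replace_unavailable_nodes(["n1", "n2", "n3"], {"n2"}, ["n4", "n5"])` yields `["n1", "n4", "n3"]`: only `n2` is swapped out.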

Changes to how degraded reservations are reconfirmed

The following attributes affect how reservations are reconfirmed:

reserve_retry_init (server) - Amount of time after nodes become unavailable before the first reconfirmation attempt is made (default 2hr).

reserve_retry_cutoff (server) - Amount of time before a reservation starts where we stop trying to reconfirm (default 2hr).

reserve_retry (reservation) - Epoch time at which the next reconfirmation attempt will be made.


Current workflow of a degraded or in-conflict reservation

  1. When nodes of a reservation become unavailable, the server waits reserve_retry_init before attempting to reconfirm the reservation.
  2. After each unsuccessful reconfirmation attempt, the server tries again after half the time remaining until the start of the reservation (e.g. if the reservation starts in one day, it waits half a day).
  3. The server stops attempting to reconfirm the reservation at (start time - reserve_retry_cutoff).
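The halving rule in the steps above can be sketched in Python as follows.  This is only an illustration: the function name is hypothetical, times are assumed to be epoch seconds, and dropping an attempt that would land at or after the cutoff is one reading of step 3 rather than confirmed server behavior.

```python
def next_retry_legacy(now, resv_start, retry_cutoff=7200):
    """Time of the next attempt after a failed one at `now`, or None when retries stop.

    Hypothetical sketch of the current (pre-change) workflow; retry_cutoff
    defaults to the 2 hour default of reserve_retry_cutoff.
    """
    cutoff_time = resv_start - retry_cutoff
    if now >= cutoff_time:
        return None
    # Wait half of the time remaining until the reservation starts.
    candidate = now + (resv_start - now) // 2
    # Assumption: an attempt that would fall at or past the cutoff is dropped.
    return candidate if candidate < cutoff_time else None
```

With a reservation starting one day (86400 s) after a failed attempt at time 0, the next attempts fall at 43200, 64800 and 75600; the following candidate (81000) is past the cutoff at 79200, so retries stop.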


New workflow of a degraded or in-conflict reservation

The times between reconfirmation attempts will need to change.  Since we are now reconfirming running reservations, setting the time between attempts to half the time to the start will no longer work, because the start time of a running reservation is already in the past.

  1. reserve_retry_cutoff will no longer be used.
  2. When a node becomes unavailable, PBS will wait reserve_retry_time seconds before making the first reconfirmation attempt.
  3. After an unsuccessful reconfirmation attempt is made, PBS will reset reserve_retry to the next reconfirmation time.
  4. One reconfirmation attempt will be made right before the reservation starts.
  5. Once a reservation starts, we will attempt to reconfirm it every reserve_retry_time seconds.
    1. If a node of a running reservation goes into state offline or maintenance and there is a job running in the reservation on that node, the reservation will not be reconfirmed until those jobs are finished.
    2. Running reservations in state in-conflict will not be reconfirmed.
      1. Standing in-conflict reservations can be reconfirmed once an occurrence ends and before the next occurrence starts.
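The new rules above can be sketched in Python as a fixed retry interval plus an eligibility check.  Both function names and the parameters are hypothetical, and the extra attempt made right before the reservation starts is not modeled, since the text does not say how close to the start it happens.

```python
def next_retry_time(now, retry_time=600):
    """Next reconfirmation time after an attempt at `now` (epoch seconds).

    retry_time stands in for reserve_retry_time (default 600 s).
    """
    return now + retry_time


def can_reconfirm_now(running, in_conflict, jobs_on_down_nodes):
    """Decide whether a reconfirmation attempt may be made right now.

    Hypothetical inputs:
      running            - the reservation (or current occurrence) is running
      in_conflict        - the reservation is in state in-conflict
      jobs_on_down_nodes - number of reservation jobs still running on nodes
                           that went offline or into maintenance
    """
    if running and in_conflict:
        # Running in-conflict reservations are not reconfirmed; a standing
        # reservation becomes eligible again between occurrences.
        return False
    if running and jobs_on_down_nodes > 0:
        # Wait until the jobs on the affected nodes have finished.
        return False
    return True
```

A standing in-conflict reservation between occurrences corresponds to `can_reconfirm_now(False, True, 0)`, which is allowed, while the same reservation during an occurrence is not.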


Changes to the external interface

Server attribute reserve_retry_time

Perms: Manager write / Everyone read

Status: New - An attempt to reconfirm a degraded reservation is made every reserve_retry_time seconds.  The default is 600s.


Server attribute reserve_retry_init

Status: Deprecated - The first attempt to reconfirm a degraded reservation is now made reserve_retry_time seconds after it was originally degraded.


Server attribute reserve_retry_cutoff

Status: Obsolete


Reservation attribute: reserve_retry

Perms: Manager read

Description: Epoch time of the next reconfirmation attempt


Accounting log: 'Y' record

Description: Every time a degraded reservation is reconfirmed (running or not running), the PBS server will log a 'Y' record in the accounting logs.

This 'Y' record will have the following format - Y;<resvID>;requestor=Scheduler@<server> start=<(new/original) start time> end=<(new/original) end time> nodes=(<allotted nodes>)
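To make the layout concrete, here is a small Python sketch that assembles a record body in the format shown above.  The function name and all sample values are hypothetical, the allotted nodes are passed through as an opaque string, and any timestamp prefix the accounting log may add in front of the record is left out.

```python
def format_y_record(resv_id, server, start, end, allotted_nodes):
    """Build the body of a 'Y' accounting record in the documented layout.

    Hypothetical helper: the representation of start/end and the inner
    syntax of the node list are not specified here, so they are inserted
    verbatim.
    """
    return (
        f"Y;{resv_id};requestor=Scheduler@{server} "
        f"start={start} end={end} nodes=({allotted_nodes})"
    )


# Example with made-up values:
print(format_y_record("R123.server1", "server1", 1700000000, 1700003600, "node1"))
```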





