Currently in PBS, degraded reservations can only be reconfirmed before they start. If a reservation is degraded when it starts running, it stays degraded for the lifetime of that occurrence. PBS will be enhanced to reconfirm running reservations. This applies to both degraded and in-conflict reservations.
If a non-running degraded reservation is reconfirmed, its node solution can change completely. While a reservation is not running, where it is going to run does not matter; we only care that the nodes given to the reservation satisfy the select spec. Reconfirming a running reservation is different. Jobs might be running on the reservation's nodes, so only the unavailable nodes can be replaced. All other nodes must remain with the reservation.
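The constraint for running reservations can be illustrated with a minimal sketch. It is not the scheduler's actual algorithm: it treats nodes as interchangeable names and ignores the select spec, and the function name `reconfirm_running` and its parameters are hypothetical.

```python
def reconfirm_running(resv_nodes, unavailable, free_pool):
    """Replace only the unavailable nodes of a running reservation.

    Simplifying assumption: any free node can stand in for any lost node.
    Returns the new node list, or None if there are too few replacements.
    """
    # Nodes that are still available must stay with the reservation.
    kept = [n for n in resv_nodes if n not in unavailable]
    needed = len(resv_nodes) - len(kept)

    # Candidate replacements: free, available, and not already reserved.
    candidates = [n for n in free_pool
                  if n not in unavailable and n not in kept]
    if len(candidates) < needed:
        return None  # reservation stays degraded; try again later

    return kept + candidates[:needed]


print(reconfirm_running(["n1", "n2", "n3"], {"n2"}, ["n4", "n5"]))
```

In this sketch `n1` and `n3` are never moved, even if a different placement would also satisfy the request. That is the difference from reconfirming a reservation that has not started.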
The following attributes affect how reservations are reconfirmed:
reserve_retry_init (server) - Amount of time after nodes become unavailable before the first reconfirmation attempt is made (default 2hr).
reserve_retry_cutoff (server) - Amount of time before a reservation starts where we stop trying to reconfirm (default 2hr).
reserve_retry (reservation) - Epoch time at which the next reconfirmation attempt will be made.
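As a rough sketch of how the two server attributes above bound the retry window today, assuming all times are epoch seconds. The function names are hypothetical, and the exact boundary behaviour at the cutoff is an assumption.

```python
RESERVE_RETRY_INIT = 2 * 60 * 60    # seconds; default listed above
RESERVE_RETRY_CUTOFF = 2 * 60 * 60  # seconds; default listed above


def first_retry_time(degraded_at, retry_init=RESERVE_RETRY_INIT):
    """Epoch time of the first reconfirmation attempt."""
    return degraded_at + retry_init


def retry_allowed(now, resv_start, retry_cutoff=RESERVE_RETRY_CUTOFF):
    """True while the reservation start is more than retry_cutoff away.

    Assumption: attempts stop once the remaining time is <= the cutoff.
    """
    return resv_start - now > retry_cutoff
```

With these rules a reservation that has already started can never be retried, which is the limitation this design removes.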
Current workflow of a degraded or in-conflict reservation
New workflow of a degraded or in-conflict reservation
The time between reconfirmation attempts will need to change. Because running reservations are now reconfirmed, setting the interval to half the time remaining until the start no longer works: once a reservation is running, there is no time remaining until the start.
Changes to the external interface
Server attribute reserve_retry_time
Perms: Manager write / Everyone read
Status: New - PBS attempts to reconfirm a degraded reservation every reserve_retry_time seconds. The default is 600 seconds.
Server attribute reserve_retry_init
Status: Deprecated - The first attempt to reconfirm a degraded reservation is made reserve_retry_time seconds after it was originally degraded.
Server attribute reserve_retry_cutoff
Status: Obsolete
Reservation attribute reserve_retry
Perms: Manager read
Description: Epoch time of the next reconfirmation attempt
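To illustrate why the interval changes, here is a minimal sketch contrasting the halving scheme described earlier in this document with a fixed interval. Times are epoch seconds, the function names are hypothetical, and the halving formula is only an interpretation of "half the time to the start".

```python
RESERVE_RETRY_TIME = 600  # seconds; default stated above


def next_retry_halving(now, resv_start):
    """Old-style interval: half the remaining time until the start."""
    return now + (resv_start - now) // 2


def next_retry_fixed(now, retry_time=RESERVE_RETRY_TIME):
    """New-style interval: a fixed number of seconds from now."""
    return now + retry_time


# Before the start, halving yields a time in the future.
print(next_retry_halving(1000, 3000))
# After the start, halving yields a time that is already in the past.
print(next_retry_halving(4000, 3000))
# A fixed interval works whether or not the reservation is running.
print(next_retry_fixed(4000))
```

Under this reading, the value produced by `next_retry_fixed` is what would be stored in reserve_retry after each failed attempt.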
Accounting log: 'Y' record
Description: Every time a degraded reservation is reconfirmed, whether it is running or not, the PBS server logs a 'Y' record in the accounting log.
The 'Y' record has the following format:
Y;<resvID>;requestor=Scheduler@<server> start=<(new/original) start time> end=<(new/original) end time> nodes=(<allotted nodes>)
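For tooling that consumes these records, a small sketch of building and parsing the format exactly as written above may help. The helper names are hypothetical, the node string is treated as opaque, and the parser assumes the node string contains no spaces.

```python
def format_y_record(resv_id, server, start, end, nodes):
    """Build a 'Y' record string in the format given above."""
    return "Y;{};requestor=Scheduler@{} start={} end={} nodes=({})".format(
        resv_id, server, start, end, nodes)


def parse_y_record(line):
    """Split a 'Y' record into (type, reservation ID, field dict).

    Assumption: fields are space-separated key=value pairs and no value
    contains a space.
    """
    rec_type, resv_id, message = line.split(";", 2)
    # Split on the first '=' only, since the node string may contain '='.
    fields = dict(item.split("=", 1) for item in message.split(" "))
    return rec_type, resv_id, fields


line = format_y_record("R123.svr", "svr", 1700000000, 1700003600,
                       "nodeA:ncpus=1")
print(line)
print(parse_y_record(line))
```

The reservation ID, server name, and node string used here are made-up sample values, not output from a real server.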