"runone" job dependency

Follow the PBS Pro Design Document Guidelines.

Forum Link: http://community.pbspro.org/t/new-runone-job-dependency-type/1988

Motivation:

  • In a complex where resources are scarce, a user may be willing to relax the resources requested by a job so that it starts as early as possible. There should be a way for a user to specify multiple alternative resource requests for running a job.

Overview:

A new job dependency named “runone” will allow users to create a group of dependent jobs out of which only one job is allowed to run. As soon as one job in the group starts running, all other jobs in the group are put in the ‘H’ (held) state. Once the job that ran finishes (or is deleted), all other dependent jobs are deleted.

Technical Details:

Users can submit jobs and specify the “runone” dependency type to create a group of dependent jobs.

  • This new dependency type accepts a ':' separated list of the job ids on which the job being submitted depends.

    • Example:

      % qsub -lncpus=4 -- /bin/sleep 1000
      7.centos
      % qsub -lncpus=4 -Wdepend=runone:7 -- /bin/sleep 1000
      8.centos
      % qsub -lncpus=2 -Wdepend=runone:8 -- /bin/sleep 1000
      9.centos

  • When a user specifies a “runone” dependency on one or more jobs, the PBS server adds a reverse “runone” dependency to all the dependent jobs.

    • Example:

    • In the example below, job 7 was submitted as an independent job, job 8 was submitted with a runone dependency on job 7, and job 9 was submitted with a runone dependency on job 8. In such a scenario the PBS server puts a reverse “runone” dependency on all the interdependent jobs, so each job lists the other two.

      % qstat -f | grep -e "Job Id" -e "depend"
      Job Id: 7.centos
          depend = runone:9.centos@centos:8.centos@centos
      Job Id: 8.centos
          depend = runone:9.centos@centos:7.centos@centos
          Submit_arguments = -lncpus=2 -Wdepend=runone:7 -- /bin/sleep 1000
      Job Id: 9.centos
          depend = runone:7.centos@centos:8.centos@centos
          Submit_arguments = -lncpus=2 -Wdepend=runone:8 -- /bin/sleep 1000

  • When one of the jobs in a “runone” dependency group starts running, the PBS server places a “System” hold on all the other jobs in the group. Only when the running job ends (or is deleted) is its dependency released and the dependent jobs deleted. Deletion is deferred until then because a running job may be requeued (by preemption or at the user's request); in that case PBS reconsiders all the dependent jobs and runs whichever one it can.

  • When the dependent jobs are deleted from the system, the server logs an abort accounting record for each of them stating why the dependency was released.

    • Example:

    • In the following case, job 9 had a “runone” dependency on jobs 7 and 8. When job 9 finished, the server released the dependency on jobs 7 and 8 and logged the following accounting records.

      02/06/2020 17:28:18;A;7.centos;Job deleted as result of dependency on job 9.centos
      02/06/2020 17:28:18;A;8.centos;Job deleted as result of dependency on job 9.centos

  • In a scheduling cycle, if the scheduler considers multiple jobs that have a runone dependency on each other, it marks the remaining jobs in the group as “can not run” as soon as it is able to run one of them.

    • However, the scheduler may add to the calendar a job that is part of a “runone” dependency group and could not run, and then, later in the same cycle, run one of the other jobs in that group.

      • In such cases, the job that could not run but was added to the calendar remains in the calendar for the rest of the cycle. The scheduler corrects itself in the next cycle, because the calendared job then shows up as “Held”.

  • When a running job that belongs to a “runone” dependency group is requeued by the PBS server (because of preemption or qrerun), the system hold is released on all the dependent jobs and those jobs move back to the “queued” state.

  • If a user submits a job to a “runone” dependency group while one of the jobs in that group is already running, the newly submitted job is immediately put under system hold.

  • If a user removes the system hold set on a dependent job that is part of a “runone” dependency group, PBS treats that job as an independent job and tries to run it.
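
The rules above can be summarised as a small state model. The following Python sketch is only an illustration of the behaviour described in this document, not the server implementation: the `Job` and `RunoneModel` names, the single-letter states, and the timestamp-free accounting strings are all assumptions made for the example. Manual release of a system hold and the scheduler's calendar handling are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    state: str = "Q"  # Q = queued, R = running, H = held (system hold)
    runone: set = field(default_factory=set)  # ids of the other group members


class RunoneModel:
    """Toy in-memory model of a "runone" group (hypothetical, for illustration)."""

    def __init__(self):
        self.jobs = {}
        self.accounting = []

    def submit(self, job_id, runone=()):
        job = Job(job_id)
        self.jobs[job_id] = job
        # The group is the new job plus the listed jobs and their existing groups.
        group = {job_id}
        for dep in runone:
            group |= {dep} | self.jobs[dep].runone
        # Reverse dependency: every member ends up listing every other member.
        for member in group:
            self.jobs[member].runone = group - {member}
        # Joining a group that already has a running member means an immediate hold.
        if any(self.jobs[m].state == "R" for m in job.runone):
            job.state = "H"
        return job

    def run(self, job_id):
        job = self.jobs[job_id]
        if job.state != "Q":
            raise ValueError(f"{job_id} is not queued")
        job.state = "R"
        for other in job.runone:
            self.jobs[other].state = "H"  # hold, do not delete yet

    def requeue(self, job_id):
        # Preemption or qrerun: the whole group becomes eligible again.
        job = self.jobs[job_id]
        job.state = "Q"
        for other in job.runone:
            self.jobs[other].state = "Q"

    def end(self, job_id):
        # The job that ran has finished or been deleted: remove the rest of the group.
        job = self.jobs.pop(job_id)
        for other in sorted(job.runone):
            del self.jobs[other]
            self.accounting.append(
                f"A;{other};Job deleted as result of dependency on job {job_id}"
            )


model = RunoneModel()
model.submit("7.centos")
model.submit("8.centos", runone=["7.centos"])
model.submit("9.centos", runone=["8.centos"])
model.run("9.centos")   # jobs 7 and 8 are now held
model.end("9.centos")   # jobs 7 and 8 are deleted
print("\n".join(model.accounting))
```

The demo at the bottom replays the qsub example from this document: jobs 7, 8 and 9 form one group, job 9 runs and ends, and one abort record is produced for each of jobs 7 and 8. Holding the other jobs in `run` and deleting them only in `end` mirrors the ordering described above, which keeps the group intact if the running job is requeued.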