A new job dependency type named “runone” allows users to create a group of dependent jobs of which only one is allowed to run. As soon as one job in the group starts running, all the other jobs are put in the ‘H’ (held) state. Once the job that ran finishes (or is deleted), all the other dependent jobs are deleted.

Technical Details:

Interface 1:

Users can submit jobs and specify the “runone” dependency type to create a group of dependent jobs.

  • This new dependency type accepts a ':' separated list of the job ids on which the job being submitted depends.

    • Example:

      Code Block
      % qsub -lncpus=4 -- /bin/sleep 1000
      7.centos
      % qsub -lncpus=4 -Wdepend=runone:7 -- /bin/sleep 1000
      8.centos
      % qsub -lncpus=2 -Wdepend=runone:7:8 -- /bin/sleep 1000
      9.centos 
      

  • When a user specifies a “runone” dependency on one or more jobs, the PBS server adds a reverse “runone” dependency to each of those jobs.

    • Example:

    • In the example below, job 7 was submitted as an independent job, job 8 was submitted with a runone dependency on job 7, and job 9 was submitted with a runone dependency on jobs 7 and 8. In such a scenario the PBS server puts a reverse “runone” dependency on the interdependent jobs.

      Code Block
      % qstat -f | grep -e "Job Id" -e "depend"
      Job Id: 7.centos
          depend = runone:9.centos@centos:8.centos@centos
      Job Id: 8.centos
          depend = runone:9.centos@centos:7.centos@centos
          Submit_arguments = -lncpus=2 -Wdepend=runone:7 -- /bin/sleep 1000
      Job Id: 9.centos
          depend = runone:7.centos@centos:8.centos@centos
          Submit_arguments = -lncpus=2 -Wdepend=runone:7:8 -- /bin/sleep 1000
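
The reverse-dependency bookkeeping in the qstat output above can be sketched in Python. This is only an illustrative model under stated assumptions, not PBS server code: `parse_runone`, `add_runone` and the `deps` mapping are hypothetical names, and short job ids stand in for the full `7.centos@centos` form.

```python
from collections import defaultdict


def parse_runone(depend_arg):
    """Split a depend value such as 'runone:7:8' into a list of job ids."""
    kind, _, rest = depend_arg.partition(":")
    if kind != "runone" or not rest:
        raise ValueError(f"not a runone dependency: {depend_arg!r}")
    return rest.split(":")


def add_runone(deps, new_job, depend_arg):
    """Record new_job's runone dependencies and add the reverse links."""
    for other in parse_runone(depend_arg):
        deps[new_job].add(other)  # forward link requested by the user
        deps[other].add(new_job)  # reverse link added by the modelled server


# Replay the submissions from the example: job 7 has no dependency,
# job 8 depends on 7, and job 9 depends on 7 and 8.
deps = defaultdict(set)
add_runone(deps, "8", "runone:7")
add_runone(deps, "9", "runone:7:8")

for job in sorted(deps):
    print(job, "depends on", sorted(deps[job]))
```

After the two calls every job lists the other two, which mirrors the three `depend` attributes in the example.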
      
  • When one of the jobs in a “runone” dependency group starts running, the PBS server puts a “System” hold on all the other jobs in the group. Only when the running job ends (or is deleted) is its dependency released and are all the dependent jobs deleted. The hold comes first and the deletion later because a running job may be requeued (by preemption or at a user's request); in those cases PBS reconsiders all the dependent jobs and runs whichever one it can.

  • When the dependent jobs are deleted from the system, the server logs an abort accounting record stating why the dependency was released.

    • Example:

    • In the following case, job 9 had a “runone” dependency on jobs 7 and 8. When job 9 finished, the server released the dependency on jobs 7 and 8 and logged the following accounting records.

      Code Block
      02/06/2020 17:28:18;A;7.centos;Job deleted as result of dependency on job 9.centos
      02/06/2020 17:28:18;A;8.centos;Job deleted as result of dependency on job 9.centos
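
As a rough illustration, records like the two above can be split into fields with a few lines of Python. The sketch assumes only what the sample lines show, namely four ';' separated fields (timestamp, record type, job id, message); the `AccountingRecord` type and `parse_record` helper are hypothetical names, not part of PBS.

```python
from typing import NamedTuple


class AccountingRecord(NamedTuple):
    timestamp: str
    record_type: str
    job_id: str
    message: str


def parse_record(line):
    """Split one accounting line into its four ';' separated fields."""
    # maxsplit=3 keeps any ';' inside the message text intact
    timestamp, record_type, job_id, message = line.rstrip("\n").split(";", 3)
    return AccountingRecord(timestamp, record_type, job_id, message)


lines = [
    "02/06/2020 17:28:18;A;7.centos;Job deleted as result of dependency on job 9.centos",
    "02/06/2020 17:28:18;A;8.centos;Job deleted as result of dependency on job 9.centos",
]
records = [parse_record(line) for line in lines]

# Collect the jobs whose abort record names a dependency as the reason.
deleted = [
    r.job_id
    for r in records
    if r.record_type == "A" and "dependency on job" in r.message
]
print(deleted)
```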
  • In a scheduling cycle, if the scheduler looks at multiple jobs that have a runone dependency on each other, it marks the remaining jobs as “can not run” as soon as it is able to run one of them.

    • There may, however, be a case where the scheduler adds to its calendar a job that is part of a “runone” dependency group but could not run, and one of the other dependent jobs that the scheduler then considers ends up running.

      • In such cases, the job that couldn’t run but was added to the calendar remains in the calendar for the rest of the cycle. The scheduler corrects itself from the next cycle onwards, because from then on the calendared job shows up as “Held”.

  • When a running job that belongs to a “runone” dependency group is requeued by the PBS server (because of preemption or qrerun), the system hold is released on all the dependent jobs and those jobs move to the “queued” state.

  • If a user tries to submit a job to a “runone” dependency group when one of the jobs in that group is already running, the newly submitted job is immediately put under a system hold.

  • If a user removes the system hold set on a dependent job that is part of a “runone” dependency group, PBS treats this job as an independent job and tries to run it.
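
The state changes described in the bullets above (hold on start, release on requeue, delete on finish, immediate hold for late submissions) can be summarised as a toy Python model. It is a minimal sketch under stated assumptions, not the server implementation: the `RunoneGroup` class and its methods are hypothetical, states use the single letters Q, H and R, and scheduler calendaring and manual hold release are not modelled.

```python
class RunoneGroup:
    """Toy model of one runone group; states are Q (queued), H (held), R (running)."""

    def __init__(self, job_ids):
        self.state = {job: "Q" for job in job_ids}

    def start(self, job):
        """A job starts running: every other job gets a system hold."""
        if self.state[job] != "Q":
            raise ValueError(f"job {job} is not queued")
        for other in self.state:
            self.state[other] = "H"
        self.state[job] = "R"

    def requeue(self, job):
        """The running job is requeued: holds are released, all jobs are queued."""
        if self.state[job] != "R":
            raise ValueError(f"job {job} is not running")
        for other in self.state:
            self.state[other] = "Q"

    def finish(self, job):
        """The running job ends or is deleted: the remaining jobs are deleted."""
        if self.state[job] != "R":
            raise ValueError(f"job {job} is not running")
        deleted = [other for other in self.state if other != job]
        self.state = {}
        return deleted

    def submit(self, job):
        """A new job joins the group: held at once if a member is running."""
        running = "R" in self.state.values()
        self.state[job] = "H" if running else "Q"


group = RunoneGroup(["7", "8", "9"])
group.start("9")
after_start = dict(group.state)    # 7 and 8 are held, 9 is running
group.submit("10")
after_submit = dict(group.state)   # 10 is held on arrival
group.requeue("9")
after_requeue = dict(group.state)  # every job is queued again
group.start("7")
deleted = group.finish("7")        # 8, 9 and 10 are deleted
print(after_start, after_submit, after_requeue, deleted)
```

Because `start` only accepts a queued job, a second job cannot start while another member is running, which is the invariant the dependency type is meant to enforce.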