[STALLED] PP-506: To support job submissions with multiple resource requests and the capability to run only one of them.

We have currently moved resources away from this project, so it is stalled for now.

This is a design proposal for PBS to support job submissions with multiple resource requests and the capability to run only one of them.

 

For conditional requests (e.g., allocate resources A or resources B for a job): 

Motivation for use cases #1 and #2: start the job sooner, trading lower performance/efficiency/cost/utilization, etc., for a faster start time

    • Often the goal is to craft a request that makes the job start “now”
    • Visible progress provides a more positive user experience, and starting to run is evidence of progress
    • The underlying motivation is most likely “earliest finish time”, but multiple confounding factors lead users to desire “earliest start time”.  For example, once a job is started, it is unlikely to be delayed by higher priority work entering the system, so there is more assurance that the end time is fixed.

Use Cases:

1.      A user requests a job allocating 64 cores; if the job would start sooner with 32 cores, then run it on 32 cores instead; ditto for 16 cores

a.      Multiple distinct resource request options are provided, and only one is chosen and allocated for the job

      1. The use case has only a single, node-level resource

b.      The resource request options are prioritized first by site-policy, then by the order provided by the user.  If any resource request option can be started (based on site-policy and available resources), the highest priority option is started.

2.      A user requests, or an admin forces, a job to allocate “like nodes”; “like nodes” all have the same value for some property, resource, or attribute, such as (a) all nodes have the same CPU type (e.g., Intel SandyBridge) or (b) all nodes are attached to the same network fabric (e.g., QDR Infiniband).  (Note: so far, this is exactly the behavior of “place=group=X” in PBS Pro.)  Further, if the job would start sooner if it requested like nodes based on a different “like value”, then run it on nodes with that “like value” (e.g., use Intel IvyBridge versus Intel SandyBridge, or use FDR Infiniband versus QDR Infiniband).  Ditto for a third choice of “like value”.

a.      Multiple distinct resource request options are provided, and only one is chosen and allocated for the job

      1. The use case has only a single, non-consumable, node-level resource

b.      The resource request options are prioritized first by site-policy, then by the order provided by the user.  If any resource request option can be started (based on site-policy and available resources), the highest priority option is started.

c.       The use case has only two resource request options, but it is reasonable to assume there may be more than two; the usual number is fewer than 10.

  

link to forum discussion

Interface 1: New Job attribute called “job_set” - qsub option “-W job_set”

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • A new job attribute “job_set” is added to the job.
    • This attribute is of type string; users, operators, and managers have privileges to read and write it.
    • Users can specify the “-W job_set” option when submitting a job. This attribute can only take the job id of an already submitted job as its value.
      • If a user specifies an invalid job id, job submission fails with the following error: “qsub: a nonexistent job_set specified”
      • If a user specifies a legitimate job id that is not a job_set leader, job submission fails with the same error: “qsub: a nonexistent job_set specified”
  • When a job is submitted with a legitimate job id specified in the job_set (“-W job_set” option), the PBS server will accept the job and make it part of the job_set led by the specified job id.
  • If a user wants to modify the job_set of an already existing job, they can do so by issuing the command “qalter -W job_set=<new job_set id> <job id to be modified>”.
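  • Example (an illustrative sketch; the job ids are hypothetical, and a job_set led by 123.server1 is assumed to already exist, e.g. created with the “-L” option of Interface 2):
        qsub -W job_set=123.server1 -l select=4:ncpus=2 job.scr      (the new job is queued as part of the job_set led by 123.server1)
        qalter -W job_set=123.server1 130.server1                    (moves the existing job 130.server1 into the same job_set)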


Interface 2: New qsub option “-L” (resource request).

  • Visibility: Public
  • Change Control: Stable
  • Details: 
    • A new qsub option “-L” can be used to specify more than one resource specification within a single command. This option is of type string and can be used multiple times while submitting a job.
      • For example, users can now submit jobs with multiple resource specifications in the following manner: 
        qsub -q workq2 -L select=2:ncpus=4:mem=20gb,walltime=10000,place=free -L select=4:ncpus=2,mem=20gb,walltime=8000,place=scatter job.scr
    • Implementation-wise, the qsub command internally submits one job after another for each “resource request” specified in the command. After submitting the first job, it uses that job id with the “-W job_set” option for every subsequent “resource request” in the command, making the first job the job_set leader.
    • The “-L” and “-W job_set” options cannot be used together. Using them together makes qsub print the following error on the console: "qsub: -Wjob_set option can not be used with -L option".
  • When a job is requested with multiple select specifications, the PBS server will honor the queued limits set on the server/queue and run the job submission hook on each resource specification. If a resource specification is found to exceed the limits, that resource specification will be ignored.
    • Example: If the server has a limit set with qmgr -c "s s max_queued_res.ncpus=[u:user1=10]" and user "user1" has no jobs queued, then
      submitting a job like "qsub -L select=3:ncpus=2:mem=18gb -L select=4:ncpus=3:mem=12gb -L select=2:ncpus=2:mem=24gb job.scr" will make the server ignore the second resource specification ("select=4:ncpus=3:mem=12gb") because it exceeds the queued limit, and it will continue to accept the third resource specification.
  • There are two ways of creating/adding a job to a job_set.
    • using qsub "-L" option:
      • When a user submits a job with multiple resource specifications in a single qsub request, the server enqueues all the resource requests as individual jobs and makes them part of a job_set. 
      • The first job that is successfully submitted to the server becomes the head of the job_set.
      • Example:
        • using -L option on command line - 
                  qsub -A "abcd" -L select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00 -L select=2:ncpus=8:mem=2gb,nfilter="resources_available['os_ver']!=rhel7 and resources_available['color']!='black'",walltime=01:45:00 job.scr
    • using #PBS directive -

      #PBS -A "abcd"
      #PBS -L select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00
      #PBS -L select=2:ncpus=8:mem=2gb,nfilter="resources_available['os_ver']!=rhel7 and resources_available['color']!='black'",walltime=01:45:00
  • using qsub "-W job_set" option:

    • If the user already knows that a job_set exists on the server, they can submit another job to the same job_set by specifying its name with the "-W job_set" option.
  • Every resource request specified with the “-L” option is queued as a separate job and gets its own job id.
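  • Example (illustrative; the job ids are hypothetical): a submission such as
        qsub -L select=3:ncpus=2 -L select=4:ncpus=1 -L select=2:ncpus=4 job.scr
    would enqueue three separate jobs, e.g. 123.server1, 124.server1 and 125.server1; 123.server1 becomes the job_set leader, and 124.server1 and 125.server1 are submitted internally with “-W job_set=123.server1”.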


Interface 3: New job substate “JOB_SUBSTATE_RUNNING_SET” (95)

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When a job of a job_set starts running, all other jobs of the same job_set are placed in the hold state and their substate is set to 95.
    • Job substate 95 identifies that the held job is part of a job_set which has one job running in it.
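    • Example (an illustrative qstat -f excerpt for a held member of a job_set; the job id is hypothetical):
        Job Id: 124.server1
            job_state = H
            substate = 95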


Interface 4: New error code PBSE_MOVE_JOBSET

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When a job which is part of a job_set is being moved to another complex, the following error code will be returned: “PBSE_MOVE_JOBSET” (15211).
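    • Example (illustrative; the destination and job id are hypothetical): an attempt to move a job_set member with
        qmove workq@otherserver 124.server1
      would be rejected, and the error code “PBSE_MOVE_JOBSET” (15211) would be returned.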


Interface 5: New error code PBSE_NO_JOBSET

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When the PBS server tries to find a job_set with a specified job_set id but is unable to find it, it will use the error code “PBSE_NO_JOBSET” (15212).


Interface 6: New job comment for jobs in substate 95

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When a job of a job_set starts running, all other queued jobs of the same job_set that are in substate 95 will get a new job comment: “Job held, job <job-id> running from this job_set”
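    • Example (illustrative; the job id is hypothetical): if 123.server1 is the running member of the job_set, each held sibling would carry the comment “Job held, job 123.server1 running from this job_set”.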


Interface 7: New qselect option “--job_set”

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • A new qselect command option “--job_set” is added.
    • It accepts a string as an input value. This string must be the job id of the leader of the job_set the user is trying to query.
    • If the server cannot find any such job_set, the qselect command will fail with the following error message: “qselect: a nonexistent job_set specified”
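    • Example (illustrative; the job id is hypothetical): “qselect --job_set 123.server1” lists the job ids of the jobs belonging to the job_set led by 123.server1, one per line.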


Interface 8: Move or peer-scheduling of job_set jobs is not allowed.

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • Jobs that are part of a job_set are not allowed to be peered or moved to another complex.
  • If a peering complex tries to move a job that is part of a job_set from the furnishing complex, the error code “PBSE_MOVE_JOBSET” (15211) will be returned.


Interface 9: When a running job of a job_set is requeued.

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • When a running job of a job_set is requeued, all other held jobs are released and put back into the queued state.
    • The job_comment of all the jobs of the job_set is cleared.
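    • Example (illustrative; the job ids are hypothetical): if 123.server1 is the running member of a job_set and it is requeued with “qrerun 123.server1”, its held siblings (e.g. 124.server1 and 125.server1) return to the queued state and the comments of all jobs in the job_set are cleared.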


Interface 10: When a job of a job_set starts running

  • Visibility: Public
  • Change Control: Experimental
  • Details:
    • When a job of a job_set starts running, all the other jobs of the job_set are also moved to the finished state.


FUTURE ENHANCEMENTS

—————————————


Going forward, the same concept can be interpreted in terms of job arrays as well, and job arrays become just a subset of the job_set case. Job arrays are essentially job_sets, with the difference that the user wants all of the subjobs to run instead of only one.

 

If we expose a way to tell the server whether only one job out of the set should run or all of them (e.g. -R RUN_ONE|RUN_ALL), then the server can internally decide when to delete the job_set.

 

The same syntax can even be used to submit job arrays.

EXAMPLE 1:

qsub -R RUN_ALL -L select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00 -L select=2:ncpus=8:mem=2gb,nfilter="resources_available['os_ver']!=rhel7 and resources_available['color']!='black'",walltime=01:45:00 job.scr

 

Since job arrays mostly consist of the same resource specification, users can also do something like this:

EXAMPLE 2:

qsub -R RUN_ALL -J 0-9 -l select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00 job.scr

 

If the job id received from such a submission (Example 1 or Example 2) is 123.server1, then one can access the first job of the pool as 123.server1 or 123[0].server1, and the second job as 123[1].server1. Internally, the PBS server will map the specified subjob index to the actual job that is part of the same job_set.