This is a design proposal for PBS to support job submissions that carry multiple resource requests (with conditional operators), of which only one is chosen and run.

 

For conditional requests (e.g., allocate resources A or resources B for a job): 

Motivation for #1 & #2:  trade lower performance/efficiency/cost/utilization for a faster start time

Use Cases:

1.      User requests a job allocating 64 cores; if the job would start sooner with 32 cores, run it on 32 cores instead; ditto for 16 cores

a.      Multiple distinct resource request options are provided, and only one is chosen and allocated for the job

      1. The use case has only a single, node-level resource

b.      The resource request options are prioritized first by site-policy, then by the order provided by the user.  If any resource request option can be started (based on site-policy and available resources), the highest priority option is started.
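The selection described in (a) and (b) can be sketched as follows. This is illustrative Python, not PBS source code; the option format and the "can be started" check are deliberate simplifications, and the list is assumed to already be ordered by site policy, then by user order.

```python
def choose_option(options, free_ncpus):
    """Return the first (highest-priority) option that fits, else None.

    options: list of dicts like {"ncpus": 64}, already ordered by
    site policy first, then by the order the user provided.
    """
    for opt in options:
        if opt["ncpus"] <= free_ncpus:  # simplified "can be started" check
            return opt
    return None

# Use case 1: prefer 64 cores, fall back to 32, then 16.
requests = [{"ncpus": 64}, {"ncpus": 32}, {"ncpus": 16}]
print(choose_option(requests, free_ncpus=40))  # {'ncpus': 32}
print(choose_option(requests, free_ncpus=8))   # None: job waits
```

If no option fits, the job stays queued until some option can be started.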

2.      User requests or admin forces job to allocate “like nodes”; “like nodes” all have the same value for some property, resource, or attribute, such as (a) all nodes have the same CPU type (e.g., Intel SandyBridge) or (b) all nodes are attached to the same network fabric (e.g., QDR Infiniband).  (Note: so far, this is exactly the behavior of “place=group=X” in PBS Pro.)   Further, if the job would start sooner by requesting like nodes based on a different “like value”, then run it on nodes with that “like value” (e.g., use Intel IvyBridge versus Intel SandyBridge, or FDR Infiniband versus QDR Infiniband).  Ditto for a third choice of “like value”.

a.      Multiple distinct resource request options are provided, and only one is chosen and allocated for the job

      1. The use case has only a single, non-consumable, node-level resource

b.      The resource request options are prioritized first by site-policy, then by the order provided by the user.  If any resource request option can be started (based on site-policy and available resources), the highest priority option is started.

c.       This use case has only two resource request options, but it is reasonable to assume there may be more; the usual number is fewer than 10.
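Choosing among “like values” can be sketched as follows. This is an illustrative sketch, not PBS code: the node records, the `free` flag, and the preference list are all hypothetical stand-ins for scheduler state and site policy.

```python
from collections import defaultdict

def pick_like_group(nodes, group_resource, needed_nodes, preferred_values):
    """Return the first preferred "like value" with enough free nodes.

    Mirrors place=group=X semantics: every allocated node must share
    one value of the grouping resource.
    """
    free_by_value = defaultdict(int)
    for node in nodes:
        if node["free"]:
            free_by_value[node[group_resource]] += 1
    for value in preferred_values:  # site-policy order, then user order
        if free_by_value[value] >= needed_nodes:
            return value
    return None

nodes = [
    {"cpu_type": "SandyBridge", "free": False},
    {"cpu_type": "SandyBridge", "free": True},
    {"cpu_type": "IvyBridge", "free": True},
    {"cpu_type": "IvyBridge", "free": True},
]
# Prefer SandyBridge, fall back to IvyBridge; job needs 2 like nodes.
print(pick_like_group(nodes, "cpu_type", 2, ["SandyBridge", "IvyBridge"]))
# -> IvyBridge (only one SandyBridge node is free)
```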

 

Motivation for #3:  Some combination of better node utilization and better application performance

3.      User requests or admin forces jobs requesting N cores to be allocated exclusively onto the smallest quantity of “like nodes” (with respect to resources_available.ncpus), where each node is fully allocated (with respect to ncpus).  E.g., a job requests select=64:ncpus=1 and the system has both 16-core nodes and 32-core nodes: either allocate 4 16-core nodes (each with 16 chunks) or allocate 2 32-core nodes (each with 32 chunks)

a.      Resource request is for a total number of cores (ncpus), in PBS Pro a request for N cores corresponds to select=N:ncpus=1

b.      Unknown whether there is a prioritization/preference among different “like values”
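The packing arithmetic in #3 can be sketched as follows (illustrative only; a node size qualifies only when it divides the total core count exactly, since every node must be fully allocated):

```python
def best_packing(total_ncpus, node_sizes):
    """Return (node_size, node_count) minimizing node_count, or None.

    A node size only qualifies if it divides total_ncpus exactly,
    because each allocated node must be fully used.
    """
    candidates = [(size, total_ncpus // size) for size in node_sizes
                  if total_ncpus % size == 0]
    if not candidates:
        return None
    return min(candidates, key=lambda sc: sc[1])

# select=64:ncpus=1 on a system with 16- and 32-core nodes:
print(best_packing(64, [16, 32]))  # (32, 2): prefer 2 x 32-core nodes
```

For the example above, 2 x 32-core nodes beats 4 x 16-core nodes; whether a site would instead prefer some other "like value" is the open question noted in (b).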

 

For filtering nodes (e.g., using ==, !=, <, >):

Motivation:  Resilience – ensure jobs run “correctly” and are unlikely to experience faults due to use of nodes with incompatible properties (with respect to the applications)

Use Cases: 

1.      User requests all allocated nodes will have CPU speed > 2 GHz

2.      User requests none of the allocated nodes will be node X, node Y, node Z, …

3.      User requests none of the allocated nodes will be ARM nor POWER architecture

4.      User requests all of the allocated nodes should be running Linux version 6.5 or higher, but none will be running 6.5.2
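The four filtering use cases can be sketched as a single per-node predicate. This is illustrative Python using plain operators rather than the proposed nfilter string syntax; the node attribute names, host names, and architecture strings are hypothetical examples.

```python
def version_tuple(v):
    """Parse '6.5.2' -> (6, 5, 2) for ordered comparison."""
    return tuple(int(x) for x in v.split("."))

def node_ok(node):
    # Use case 1: CPU speed > 2 GHz
    if not node["cpu_ghz"] > 2.0:
        return False
    # Use case 2: exclude specific hosts
    if node["name"] in {"nodeX", "nodeY", "nodeZ"}:
        return False
    # Use case 3: neither ARM nor POWER architecture
    if node["arch"] in {"aarch64", "ppc64le"}:
        return False
    # Use case 4: Linux version >= 6.5, but never exactly 6.5.2
    v = version_tuple(node["os_ver"])
    if v < (6, 5) or v == (6, 5, 2):
        return False
    return True

node = {"name": "node1", "cpu_ghz": 2.6, "arch": "x86_64", "os_ver": "6.5.7"}
print(node_ok(node))  # True
```

Only nodes passing every comparison would be eligible for allocation; the others are filtered out before placement.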

 

link to forum discussion

Interface 1: New Job attribute called “job_set” - qsub option “-W job_set”


Interface 2: New qsub option “-L” (resource request).

                  qsub -A "abcd" -L select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00 -L select=2:ncpus=8:mem=2gb,nfilter="resources_available['os_ver']!=rhel7 and resources_available['color']!='black'",walltime=01:45:00 job.scr


Interface 3: Extend PBS to allow users to submit jobs with a node-filter (nfilter) resource.


Interface 4: New job substate “JOB_SUBSTATE_RUNNING_SET” (95)


Interface 5: New error code PBSE_MOVE_JOBSET


Interface 6: New error code PBSE_NO_JOBSET


Interface 7: New job comment for jobs in substate 95


Interface 8: New qselect option “--job_set”


Interface 9: Move or peer-scheduling of job_set jobs is not allowed.


Interface 10: When a running job of a job_set is requeued.


Interface 11: When a job of a job_set starts running


FUTURE ENHANCEMENTS

—————————————


Going forward, the same concept can be applied to job arrays as well, making job arrays a subset of the job_set case. A job array is essentially a job_set, with the difference that the user wants all of its subjobs to run instead of only one.

 

If we expose a way to tell the server whether only one job out of the set should run or all of them (e.g., -R RUN_ONE|RUN_ALL), then the server can internally decide when to delete the job_set.
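The server-side decision could be sketched as follows. This is an illustrative sketch of the proposed RUN_ONE/RUN_ALL semantics, not existing PBS behavior; the state strings are hypothetical.

```python
def jobset_done(mode, states):
    """Decide whether the server may delete the job_set.

    mode: "RUN_ONE" or "RUN_ALL" (the proposed -R flag).
    states: current state per member job, e.g. "queued", "running",
    "finished" (illustrative state names).
    """
    if mode == "RUN_ONE":
        # One member ran to completion; the rest can be discarded.
        return any(s == "finished" for s in states)
    if mode == "RUN_ALL":
        # Behaves like a job array: keep the set until every member ran.
        return all(s == "finished" for s in states)
    raise ValueError("unknown mode: " + mode)

print(jobset_done("RUN_ONE", ["queued", "finished"]))  # True
print(jobset_done("RUN_ALL", ["queued", "finished"]))  # False
```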

 

The same syntax can even be used to submit job arrays.

EXAMPLE 1:

qsub -R RUN_ALL -L select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00 -L select=2:ncpus=8:mem=2gb,nfilter="resources_available['os_ver']!=rhel7 and resources_available['color']!='black'",walltime=01:45:00 job.scr

 

Since job arrays mostly consist of the same resource specification, users can also do something like this:

EXAMPLE 2:

qsub -R RUN_ALL -J 0-9 -l select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00 job.scr

 

If the job ID returned from the submission (Example 1 or Example 2) is 123.server1, then one can access the first job of the set as 123.server1 or 123[0].server1, and the second job as 123[1].server1. Internally, the PBS server will map the specified subjob index to an actual job that is part of the same job_set.
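The index-to-job mapping could be sketched as follows. This is an illustrative sketch: the ID format follows the examples above, but the mapping structure and member IDs are assumptions, not the server's actual data model.

```python
import re

def resolve(job_id, job_sets):
    """Map 123.server1 / 123[i].server1 to a member of the job_set.

    job_sets: dict mapping a set's parent ID ("123.server1") to an
    ordered list of member job IDs (hypothetical structure).
    """
    m = re.fullmatch(r"(\d+)(?:\[(\d+)\])?\.(\S+)", job_id)
    if not m:
        raise ValueError("bad job ID: " + job_id)
    seq, index, server = m.groups()
    members = job_sets[seq + "." + server]
    return members[int(index or 0)]  # no index means the first member

# Hypothetical job_set whose members are 123.server1 and 124.server1:
sets = {"123.server1": ["123.server1", "124.server1"]}
print(resolve("123.server1", sets))     # 123.server1 (first member)
print(resolve("123[1].server1", sets))  # 124.server1
```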