
This is a design proposal for PBS to support job submissions with multiple resource requests (with conditional operators) and the capability to run only one of them.

Gist of proposed changes:

There are basically two requirements that we are trying to fulfill here.

The first requirement is for the user/admin to be able to specify their node filter criteria. This requirement can be met if we have support to specify some kind of node filter with the job. This filter could be a Python expression consisting of conditional operators over the non-consumable resources present on the nodes, for now. The filter concept can also be extended to be used for queue/server limits or while trying to find a preemption candidate among running jobs, etc.

The second requirement is for the user/admin to be able to provide multiple resource specifications and make PBS use one as soon as it knows that it can start the job with that resource specification.

The PBS scheduler shall look at each of the resource specifications in the order they get sorted according to scheduling policies, and may choose to run the job with a specification as soon as it knows that it can. This falls in line with the PBS scheduler's way of finding a node solution based on the "first fit" algorithm.

In case the scheduler finds that it cannot run such a job because of resource unavailability and tries to calendar the job so that resources can be reserved for it in the future, it will use only the first resource specification that it encounters in its sorted list of jobs to calendar the job.

If a running job which was initially submitted with multiple resource specifications gets requeued for any reason (like qrerun, node_fail_requeue, or preemption by requeue), the job will be reevaluated to run by looking at each of the multiple resource specifications it was initially submitted with.
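To make the selection order concrete, below is a minimal sketch (in Python) of the walk described above. It assumes a hypothetical can_run(spec) predicate supplied by the scheduler's node-search code; all names are illustrative, not the actual scheduler code.

    def pick_spec(sorted_specs, can_run):
        # Visit the resource specifications in the order the scheduling
        # policy sorted them and commit to the first one that fits
        # ("first fit").
        for spec in sorted_specs:
            if can_run(spec):
                return spec
        return None

    def spec_for_calendar(sorted_specs):
        # When nothing fits now, only the first specification in the
        # sorted order is used to calendar (reserve resources for) the job.
        return sorted_specs[0] if sorted_specs else None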


Link to forum discussion

Interface 1: New job attribute called "job_set" - qsub option "-s"

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • A new job attribute "job_set" is added to the job.
    • This attribute is of type string and user/operator/manager have privileges to read/write this attribute.
    • Users can submit jobs specifying the "-s" option during submission. This attribute can only take an already submitted job-id as a value.
      • If a user specifies an invalid job id then job submission will fail with the following error - "qsub: a nonexistent job_set specified"
    • When a job is submitted with a legitimate job-id specified in the job_set ("-s" option), PBS server will submit this job and make it part of the job_set which is led by the specified job-id.
    • If a user wants to modify the job_set of an already existing job, they can do so by issuing the "qalter -s <new job_set id> <job id to be modified>" command.
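    • Example: assuming a job 123.server1 has already been submitted (the job ids here are illustrative), a new job can be added to its job_set at submission time, and an existing job's job_set can be changed later -
      qsub -s 123.server1 job.scr
      qalter -s 123.server1 124.server1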


Interface 2: New qsub option “-L” (resource request).

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • A new qsub option "-L" can be used to specify more than one resource specification within one single command. This option is of type string and can be used multiple times while submitting a job.
      • For example, users can now submit jobs with multiple resource specifications in the following manner:
        qsub -q workq2 -L select=2:ncpus=4:mem=20gb,walltime=10000,place=free -L select=4:ncpus=2:mem=20gb,walltime=8000,place=scatter job.scr
    • Implementation wise, the qsub command internally will submit jobs one after another for each "resource request" specified in the command. It will also make sure that after submitting the first job it uses that job-id with the "-s" option (making the first job the job_set leader) for every subsequent "resource request" specified with the command (see the sketch after this interface's details).
    • The "-L" and "-s" options can not be used together. Using them together will make qsub throw the following error on the console: "qsub: -s option can not be used with -L option".
    • When a job is requested with multiple select specifications, PBS server will honor the queued limits set on the server/queue and run the job submission hook on each of the resource specifications. If one of the resource specifications is found to be exceeding the limits then that resource specification will be ignored.
      • Example: If the server has a limit set as qmgr -c "s s max_queued_res.ncpus=[u:user1=10]" and user "user1" has no jobs queued, then submitting a job like this - "qsub -Lselect=3:ncpus=2:mem=18gb -Lselect=4:ncpus=3:mem=12gb -Lselect=2:ncpus=2:mem=24gb job.scr" - will make the server ignore the second resource specification (which is "select=4:ncpus=3:mem=12gb") because it exceeds the queued limits, and it will continue to accept the third resource specification.
    • There are two ways of creating/adding a job in a job_set.
      • Using the qsub "-L" option:
        • When a user submits a job with multiple resource specifications in a single qsub request, the server enqueues all the resource requests as individual jobs and makes them part of a job_set.
        • The first job which is successfully submitted in the server becomes the head of the job_set.
        • Example, using the -L option on the command line -
          qsub -A "abcd" -L select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00 -L select=2:ncpus=8:mem=2gb,nfilter="resources_available['os_ver']!=rhel7 and resources_available['color']!='black'",walltime=01:45:00 job.scr
        • Example, using #PBS directives -
          #PBS -A "abcd"
          #PBS -L select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00
          #PBS -L select=2:ncpus=8:mem=2gb,nfilter="resources_available['os_ver']!=rhel7 and resources_available['color']!='black'",walltime=01:45:00
      • Using the qsub "-s" option:
        • If a user already knows that a job_set exists in the server then he/she can submit another job to the same job_set by specifying its name using the "-s" option.
    • Every resource request specified by the "-L" option will get queued as a separate job and will get its own job id.
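The following is a minimal sketch (in Python) of the client-side expansion described above, with a stub submit_job() standing in for the real PBS submission call; the function names and job ids are illustrative only.

    _next_id = 0

    def submit_job(resource_request, job_set=None):
        # Stand-in for the real submission call; returns a fake job id.
        global _next_id
        _next_id += 1
        job_id = "%d.server1" % _next_id
        print("submitted %s request=%r job_set=%s" % (job_id, resource_request, job_set))
        return job_id

    def submit_with_multiple_requests(resource_requests):
        # qsub -L expansion: one job per resource request; the first
        # successfully submitted job becomes the job_set leader.
        leader = submit_job(resource_requests[0])
        for request in resource_requests[1:]:
            submit_job(request, job_set=leader)   # like qsub -s <leader>
        return leader

    submit_with_multiple_requests([
        "select=2:ncpus=4:mem=20gb,walltime=10000,place=free",
        "select=4:ncpus=2:mem=20gb,walltime=8000,place=scatter",
    ])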


Interface 3: Extend PBS to allow users to submit jobs with a node-filter (nfilter) resource.

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • Users can now specify a node filter with each of their jobs, and this filter will help the scheduler select the nodes that the job is allowed to run on.
    • A new resource "nfilter" is created. This resource is of type string. Users/operator/manager have privileges to read/write this resource.

    • nfilter is evaluated as an expression by the PBS scheduler to filter the nodes that can be used to run the job in hand.
    • Users can specify a node filter over node resources using conditional operators like "<, >, <=, >=, !=".
      • Example: qsub -Lselect=3:ncpus=2:mem=18gb,nfilter="resources_available['ncpus']>=4 and resources_available['color']!='green'",walltime=10000 -Lselect=2:ncpus=2:mem=24gb,nfilter="resources_available['ncpus']>16 and resources_available['color']=='blue'",walltime=8000 job.scr
    • nfilter can make use of the resources available on the nodes using the "resources_available." prefix with the resource name, and it can use the resources already assigned on the nodes using the "resources_assigned." prefix. These are the only two inputs it can use to filter the nodes.
    • To access a specific resource out of the resources_available and resources_assigned inputs, users must enclose each resource name within square brackets "[ ]", like this - "resources_available['ncpus']". A sketch of how such an expression could be evaluated follows.
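Below is a minimal sketch (in Python) of how an nfilter expression could be evaluated per node, assuming node resources are available as plain dictionaries. The restricted namespace mirrors the rule that only the resources_available and resources_assigned inputs may appear in the expression; the node data and helper are illustrative only, not the actual scheduler code.

    nodes = {
        "node1": {"resources_available": {"ncpus": 32, "color": "blue"},
                  "resources_assigned":  {"ncpus": 4}},
        "node2": {"resources_available": {"ncpus": 8,  "color": "green"},
                  "resources_assigned":  {"ncpus": 0}},
    }

    nfilter = "resources_available['ncpus']>16 and resources_available['color']=='blue'"

    def filter_nodes(nodes, nfilter):
        # Return the names of the nodes for which the expression is true.
        matched = []
        for name, info in nodes.items():
            namespace = {"resources_available": info["resources_available"],
                         "resources_assigned": info["resources_assigned"]}
            # eval() with empty builtins limits the expression to the
            # two permitted inputs.
            if eval(nfilter, {"__builtins__": {}}, namespace):
                matched.append(name)
        return matched

    print(filter_nodes(nodes, nfilter))   # ['node1']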



Interface 4: New job substate “JOB_SUBSTATE_RUNNING_SET” (95)

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When a job of a job_set starts running then all other jobs of the same job_set will be marked in hold state and their substate will be set to 95.
    • Job substate 95 identifies that this held job is part of a job_set which has one job running in it.
    • If the scheduler finds that it cannot run any job of a job_set, it also logs an INFORMATION log stating that none of the select specifications could be satisfied:
      "None of the select specification could be satisfied"


Interface 5: New error code PBSE_MOVE_JOBSET

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When a job which is part of a job_set is being moved to another complex, the following error code will be returned: "PBSE_MOVE_JOBSET" (15211)


Interface 6: New error code PBSE_NO_JOBSET

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When the PBS server tries to find a job_set with a specified job_set id but is unable to find it, it will use the error code "PBSE_NO_JOBSET" (15212)


Interface 7: New job comment for jobs in substate 95

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When a job of a job_set starts running then all other queued jobs of the same job_set that are in substate 95 will get a new job comment: "Job held, job <job-id> running from this job_set"




Interface 8: New qselect option "--job_set"

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • A new qselect command option "--job_set" is added.
    • It accepts a string as an input value. This string must be the job_id of the leader of the job_set the user is trying to query.
    • If the server can not find any such job_set then the qselect command will fail with the following error message - "qselect: a nonexistent job_set specified"
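    • Example: assuming 123.server1 is the leader of an existing job_set (the job id is illustrative), the jobs belonging to it can be listed like this -
      qselect --job_set 123.server1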


Interface 9: Move or peer-scheduling of job_set jobs is not allowed.

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • Jobs that are part of a job_set are not allowed to be peered or moved to another complex.
    • If a peering complex tries to move a job that is part of a job_set from the furnishing complex, the following error code will be returned: "PBSE_MOVE_JOBSET" (15211)


Interface 10: When a running job of a job_set is requeued.

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • When a running job of a job_set is requeued, all other held jobs of the job_set are released and put back in the queued state.
    • The job_comment of all the jobs of the job_set is cleared.


Interface 11: When a job of a job_set ends

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • When a running job of a job_set finishes, all the held jobs of the job_set are also moved to the finished state.




FUTURE ENHANCEMENTS

—————————————


Going forward the same concept can be interpreted in terms of job arrays as well, and job arrays just become a subset of the job_set case. Job arrays are essentially job_sets, with the difference that in this case the user wants all of the subjobs to run instead of running only one.

 

If we expose a way to tell the server whether we need only one job to run out of the set or all the jobs (like -R RUN_ONE|RUN_ALL), then the server can internally take a decision on when to delete the job_set.

 

The same syntax can even be used to submit job arrays.

EXAMPLE 1:

qsub -R RUN_ALL -L select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00 -L select=2:ncpus=8:mem=2gb,nfilter="resources_available['os_ver']!=rhel7 and resources_available['color']!='black'",walltime=01:45:00 job.scr

 

Since job arrays mostly consist of the same resource specification, users can also do something like this -

EXAMPLE 2:

qsub -R RUN_ALL -J 0-9 -l select=1:ncpus=16:mem=2gb,nfilter="resources_available['os_ver']>=rhel6 and resources_available['color']=='blue'",walltime=02:00:00 job.scr

 

If the job id received from this submission (Example 1 or Example 2) is 123.server1, then one can access the first job of the set as 123.server1 or 123[0].server1, and to access the second job they can use 123[1].server1. Internally the PBS server will map the specified subjob index to the actual job which is part of the same job_set.
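A minimal sketch (in Python) of this index mapping, assuming the server keeps each job_set as an ordered list of job ids; the data structure and helper are illustrative only, not the real server code.

    import re

    job_sets = {"123.server1": ["123.server1", "124.server1", "125.server1"]}

    def resolve(job_id):
        # Map "123[1].server1" to the second job of the job_set led by
        # 123.server1; plain ids are returned unchanged.
        m = re.fullmatch(r"(\d+)\[(\d+)\]\.(.+)", job_id)
        if not m:
            return job_id
        leader = "%s.%s" % (m.group(1), m.group(3))
        return job_sets[leader][int(m.group(2))]

    print(resolve("123.server1"))     # 123.server1
    print(resolve("123[1].server1"))  # 124.server1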