...

c.       The use case has only two resource request options, but it is reasonable to assume there may be more than two; the usual number is fewer than 10.

 

Motivation for #3:  Some combination of better node utilization and better application performance

3.      User requests or admin forces jobs requesting N cores to be allocated exclusively onto the smallest quantity of “like nodes” (with respect to resources_available.ncpus), where each node is fully allocated (with respect to ncpus).  E.g., job requests select=64:ncpus=1 and system has both 16-core nodes and 32-core nodes – either allocate 4 16-core nodes (each with 16 chunks) or allocate 2 32-core nodes (each with 32 chunks)

a.      The resource request is for a total number of cores (ncpus); in PBS Pro, a request for N cores corresponds to select=N:ncpus=1

b.      Unknown whether there is a prioritization/preference among different “like values”
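
The arithmetic behind use case 3 can be sketched in a few lines of Python. This is only an illustration, not PBS code: the function names are hypothetical, node availability is ignored, and preferring the option with the fewest nodes is an assumption, since item b leaves the preference among "like values" open.

```python
def like_node_options(total_cores, node_sizes):
    """Return {ncpus_per_node: node_count} for every node size that can
    hold total_cores with each node fully allocated."""
    options = {}
    for ncpus in sorted(set(node_sizes)):
        # A node size qualifies only if the request fills every node exactly.
        if ncpus > 0 and total_cores % ncpus == 0:
            options[ncpus] = total_cores // ncpus
    return options


def smallest_allocation(total_cores, node_sizes):
    """Pick the node size that needs the fewest nodes (assumed policy)."""
    options = like_node_options(total_cores, node_sizes)
    if not options:
        return None
    ncpus = min(options, key=options.get)
    return ncpus, options[ncpus]


# select=64:ncpus=1 on a system with 16-core and 32-core nodes
print(like_node_options(64, [16, 32]))    # {16: 4, 32: 2}
print(smallest_allocation(64, [16, 32]))  # (32, 2)
```

The first call reproduces the two choices from the example above: 4 16-core nodes or 2 32-core nodes.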

 

For filtering nodes (e.g., using ==, !=, <, >):

Motivation:  Resilience – ensure jobs run “correctly” and are unlikely to experience faults due to use of nodes with incompatible properties (with respect to the applications)

Use Cases: 

1.      User requests that all allocated nodes have CPU speed > 2 GHz

2.      User requests that none of the allocated nodes be node X, node Y, node Z, …

3.      User requests that none of the allocated nodes be of ARM or POWER architecture

4.      User requests that all allocated nodes run Linux version 6.5 or higher, but that none run 6.5.2
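
The four filtering use cases can be expressed as one predicate over node properties. The Python sketch below is illustrative only: the node fields (`name`, `cpu_ghz`, `arch`, `os_version`) and the sample nodes are hypothetical and do not correspond to actual PBS node attributes.

```python
def parse_version(text):
    """Turn '6.5.2' into (6, 5, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def acceptable(node, excluded_names=frozenset()):
    """Apply use cases 1 to 4 to a single node description."""
    version = parse_version(node["os_version"])
    return (
        node["cpu_ghz"] > 2.0                     # use case 1
        and node["name"] not in excluded_names    # use case 2
        and node["arch"] not in {"ARM", "POWER"}  # use case 3
        and version >= (6, 5)                     # use case 4: 6.5 or higher
        and version != (6, 5, 2)                  # use case 4: but not 6.5.2
    )


# Hypothetical node inventory for illustration.
nodes = [
    {"name": "n1", "cpu_ghz": 2.4, "arch": "x86_64", "os_version": "6.5.1"},
    {"name": "n2", "cpu_ghz": 1.8, "arch": "x86_64", "os_version": "6.6"},
    {"name": "n3", "cpu_ghz": 3.0, "arch": "ARM", "os_version": "6.6"},
    {"name": "n4", "cpu_ghz": 2.6, "arch": "x86_64", "os_version": "6.5.2"},
    {"name": "n5", "cpu_ghz": 2.8, "arch": "x86_64", "os_version": "6.4.9"},
    {"name": "n6", "cpu_ghz": 2.5, "arch": "x86_64", "os_version": "7.0"},
]

allowed = [n["name"] for n in nodes if acceptable(n, excluded_names={"n6"})]
print(allowed)  # ['n1']
```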

 

link to forum discussion

Interface 1: New Job attribute called “job_set” - qsub option “-W job_set”

...

  • using qsub "-W job_set" option:

    • If the user already knows that a job_set exists in the server, they can submit another job to the same job_set by specifying its name with the "-W job_set" option.
  • Every resource request specified by the “-L” option is queued as a separate job and gets its own job ID.


Interface 3: Extend PBS to allow users to submit jobs with a node-filter (nfilter) resource.

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • Users can now specify a node filter with each of their jobs. The scheduler uses this filter to select the nodes that the job is allowed to run on.
    • A new resource “nfilter” is created. This resource is of type string. Users, operators, and managers have privileges to read and write this resource.
    • nfilter is evaluated as an expression by the PBS scheduler to determine which nodes can be used to run the job.
    • Users can specify a node filter on node resources using comparison operators such as "<", ">", "<=", ">=", "==", and "!=".
      • Example: qsub -Lselect=3:ncpus=2:mem=18gb,nfilter="resources_available['ncpus']>=4 and resources_available['color'] != 'green'",walltime=10000 -Lselect=2:ncpus=2:mem=24gb,nfilter="resources_available['ncpus']>16 and resources_available['color']=='blue'",walltime=8000 job.scr
    • nfilter can refer to the resources available on a node through the “resources_available” input and to the resources assigned on a node through the “resources_assigned” input. These are the only two inputs it can use to filter nodes.
    • To access a specific resource within the resources_available or resources_assigned inputs, users must enclose the resource name in square brackets “[ ]”, like this: “resources_available['ncpus']”
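
The evaluation described above can be sketched in Python. This assumes the filter is a Python-style boolean expression, which the bracket syntax suggests but the design does not state outright; `node_matches` and the sample nodes are hypothetical, and treating a missing resource as a non-match is also an assumption.

```python
def node_matches(nfilter, node):
    """Evaluate an nfilter expression against one node description."""
    # The only two inputs an nfilter may reference.
    inputs = {
        "resources_available": node.get("resources_available", {}),
        "resources_assigned": node.get("resources_assigned", {}),
    }
    try:
        # Empty __builtins__ keeps the expression from calling built-ins.
        # This is not a full sandbox; a real scheduler would need stricter
        # validation of user-supplied expressions.
        return bool(eval(nfilter, {"__builtins__": {}}, inputs))
    except KeyError:
        # Assumption: a node that lacks a referenced resource does not match.
        return False


# Hypothetical nodes for illustration.
nodes = {
    "a": {"resources_available": {"ncpus": 8, "color": "blue"}},
    "b": {"resources_available": {"ncpus": 2, "color": "blue"}},
    "c": {"resources_available": {"ncpus": 16, "color": "green"}},
    "d": {"resources_available": {"ncpus": 32}},
}

nfilter = "resources_available['ncpus']>=4 and resources_available['color'] != 'green'"
eligible = [name for name, node in nodes.items() if node_matches(nfilter, node)]
print(eligible)  # ['a']
```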

Interface 4: New job substate “JOB_SUBSTATE_RUNNING_SET” (95)

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When a job of a job_set starts running, all other jobs of the same job_set are placed in the held state and their substate is set to 95.
    • Job substate 95 identifies a held job as part of a job_set that has one job running in it.


Interface 5: New error code PBSE_MOVE_JOBSET

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When a job that is part of a job_set is being moved to another complex, the following error code will be returned: “PBSE_MOVE_JOBSET” (15211)


Interface 6: New error code PBSE_NO_JOBSET

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When the PBS server looks for a job_set with a specified job_set ID but cannot find it, it uses the error code “PBSE_NO_JOBSET” (15212)


Interface 7: New job comment for jobs in substate 95

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • When a job of a job_set starts running, all other jobs of the same job_set that are in substate 95 get a new job comment: “Job held, job <job-id> running from this job_set”


Interface 8: New qselect option “--job_set”

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • A new qselect command option “--job_set” is added.
    • It accepts a string as its input value. This string must be the job ID of the leader of the job_set that the user is querying.
    • If the server cannot find such a job_set, qselect fails with the following error message: “qselect: a nonexistent job_set specified”


Interface 9: Moving or peer scheduling of job_set jobs is not allowed.

  • Visibility: Private
  • Change Control: Stable
  • Details:
    • Jobs that are part of a job_set are not allowed to be peered or moved to another complex.
    • If a peering complex tries to move a job that is part of a job_set from the furnishing complex, the following error code will be returned: “PBSE_MOVE_JOBSET” (15211)


Interface 10: When a running job of a job_set is requeued.

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • When a running job of a job_set is requeued, all other held jobs are released and put back in the queued state.
    • The job comment of every job in the job_set is cleared.
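
The hold and release behavior of a job_set can be modeled with a small Python sketch. This is not server code: the `Job` and `JobSet` classes are hypothetical, the state letters follow the usual PBS convention (Q, H, R), and representing "no special substate" as `None` is a simplification.

```python
JOB_SUBSTATE_RUNNING_SET = 95  # substate value given in this design


class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = "Q"      # Q = queued, H = held, R = running
        self.substate = None  # simplification: only substate 95 is modeled
        self.comment = ""


class JobSet:
    def __init__(self, jobs):
        self.jobs = list(jobs)

    def start(self, job_id):
        """Run one job; hold its queued siblings in substate 95."""
        for job in self.jobs:
            if job.job_id == job_id:
                job.state = "R"
            elif job.state == "Q":
                job.state = "H"
                job.substate = JOB_SUBSTATE_RUNNING_SET
                job.comment = f"Job held, job {job_id} running from this job_set"

    def requeue(self, job_id):
        """Requeue the running job, release held siblings, clear comments."""
        for job in self.jobs:
            if job.job_id == job_id or job.substate == JOB_SUBSTATE_RUNNING_SET:
                job.state = "Q"
                job.substate = None
            job.comment = ""


job_set = JobSet([Job("1.svr"), Job("2.svr"), Job("3.svr")])
job_set.start("2.svr")
print([(j.job_id, j.state, j.substate) for j in job_set.jobs])
job_set.requeue("2.svr")
print([(j.job_id, j.state, j.comment) for j in job_set.jobs])
```

Only jobs held with substate 95 are released on requeue, so a job held for another reason would stay held in this model.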


Interface 11: When a job of a job_set starts running

...