Overview

Node placement is difficult in the general case.  There are many variables that have to be taken into account in order to place jobs on nodes.  If some of these variables can be removed, a simpler and much faster node placement algorithm can be created. 

The normal node allocation algorithm maps the chunks the job requested onto the individual nodes of the cluster.  If we instead map chunks onto "buckets" of similar nodes, we drastically decrease the number of objects that need to be mapped.  Each bucket keeps a count of how many free nodes it contains.  If we need N chunks, we map the chunk onto the bucket once and decrement the counter, instead of checking N individual nodes.  If a bucket does not match a chunk, we do not check any of the individual nodes in that bucket; we just move on to the next bucket.


There are very limited changes to the external design.  The main visible effect is that the scheduler will place certain types of jobs faster than before.

Technical Details

The scheduler's server structure now contains an array of nodes whose order does not change during the scheduling cycle.  A bucket is defined by a resource signature, four bitmaps, and optionally a queue.  The resource signature is made up of the resources on the "resources" line in the sched_config file.  In each bitmap, every set bit is an index into the new node array.  The first bitmap contains all the nodes of the bucket.  The second bitmap contains nodes that are free and have no calendared events on them.  The third bitmap contains nodes that are free now but busy later.  The last bitmap contains the nodes that are busy now.  The queue field is set if the bucket's nodes are assigned to a queue.


If placement sets are in use (either node_group_key or place=group), each placement set has its own group of buckets based on the nodes in that placement set.  The bucket algorithm is run on each placement set's group of node buckets.


We first determine whether to use the new bucket code path or the normal node code path.  The following must be true to use the bucket code path:




The part of the scheduler this algorithm replaces is the node placement algorithm.  The standard node placement algorithm takes in a job and nodes and returns an internal representation of the exec_vnode (nspec).  Our new algorithm needs to do the same; it just allocates nodes differently.


The algorithm goes like this:

First we create a mapping (cb_map) between the chunks (the pieces of the select statement separated by plus signs) and the buckets.  We compare each bucket's resource signature to the chunk.  If the bucket matches, we add it to the cb_map along with the number of chunks each node in this bucket can satisfy.  If scatter/vscatter is requested, the number of satisfiable chunks is set to 1.  We also keep a running total of the number of nodes in the matching buckets.  If, at the end, the total is less than the requested number of chunks, we know the job can never run.  We now have a mapping between all the chunks and the buckets that can satisfy them.  Note that a bucket can appear multiple times in the cb_map if it can satisfy multiple chunks.


It's now time to create the final mapping between the chunks and the nodes.  To do this, we create working copies of the free, busy_later, and busy bitmaps.  These are used to make sure we don't allocate the same node to multiple chunks.  We also create a bitmap that will contain the final node mapping.

We then take each chunk and collect nodes for it.  Walking the cb_map, we look at each chunk and each bucket mapped to that chunk.  We start with the busy-later nodes: nodes with time constraints should be allocated before nodes that have none.  From the busy-later bitmap we allocate every node on which the job will complete before the node becomes busy.  If we still need more nodes, we take them from the free bitmap.  If we still need more nodes for this chunk, we move on to the next bucket.

After we have gone over the chunks and buckets in our cb_map, we will know if the job can run now or not.


We store the nodes on which the job will run in a set of allocation bitmaps (one per chunk), kept in the cb_map.  This is because the final exec_vnode needs to match 1:1 with the chunks in the select statement.  If we had only one allocation bitmap, the exec_vnode would be ordered by node index in the bitmap, not by the select statement.


Finally, we turn the cb_map allocation bitmaps into an nspec array.  We go over each chunk in the cb_map and allocate its nodes.  For each set bit in the bitmap, we allocate the node N times to the exec_vnode, where N is the number of times the bucket can be allocated to the chunk.  In the end we have an nspec array of all the allocated nodes.


A new bitmap library has been implemented for arbitrary-length bitmaps.

Interactions with other features

External changes

Whether the bucket code path is used is entirely the scheduler's decision; there are no external controls for this feature.  That being said, there are external behavior changes.