Overview

Node placement is difficult in the general case.  There are many variables that have to be taken into account in order to place jobs on nodes.  If some of these variables can be removed, a simpler and much faster node placement algorithm can be created. 

The normal node allocation algorithm maps the chunks the job requested onto the individual nodes of the cluster.  If we instead map chunks onto "buckets" of similar nodes, we drastically decrease the number of objects that need to be mapped.  Each bucket keeps a count of how many free nodes it contains.  If we need N chunks, we map the chunk onto the bucket once and decrement the counter, instead of checking N individual nodes.  If a bucket does not match a chunk, we do not check any of the individual nodes in that bucket; we just move on to the next bucket.


There are very limited changes to the external design.  The main visible effect is that the scheduler will place certain types of jobs faster than before.

Technical Details

The scheduler's server structure now contains an array of nodes whose order does not change during the scheduling cycle.  A bucket is defined by a resource signature, four bitmaps, and optionally a queue.  The resource signature is made up of the resources on the "resources" line in the sched_config file.  In each bitmap, every set bit is an index into the new node array.  The first bitmap contains all the nodes of the bucket.  The second bitmap contains nodes that are free and have no calendared events on them.  The third bitmap contains nodes that are free now but busy later.  The last bitmap contains the nodes that are busy now.  The queue field is set if the bucket's nodes are assigned to a queue.


If placement sets are in use (either node_group_key or place=group), each placement set has its own group of buckets based on the nodes in that placement set.  The bucket algorithm is run on each placement set's group of node buckets.


We first determine whether to use the new bucket code path or the normal node code path.  The following must be true to use the bucket code path:




The part of the scheduler this algorithm replaces is the node placement algorithm.  The standard node placement algorithm takes in a job and nodes and returns an internal representation of the exec_vnode (nspec).  Our new algorithm needs to do the same; it just allocates nodes differently.


The algorithm goes like this:

First we create a mapping (cb_map) between the chunks (the pieces of the select statement separated by plus signs) and the buckets.  We compare each bucket's resource signature to the chunk.  If the bucket matches, we add it to the cb_map along with the number of chunks each node in this bucket can satisfy.  If scatter/vscatter is requested, the number of satisfiable chunks is set to 1.  We also keep a running total of the number of nodes in the matching buckets.  If, at the end, the total is less than the requested number of chunks, we know the job can never run.  We now have a mapping between all the chunks and the buckets that can satisfy them.  Note that a bucket can appear multiple times in the cb_map if it can satisfy multiple chunks.


It's now time to create the final mapping between the chunks and the nodes.  To do this, we create working copies of the free, busy_later, and busy bitmaps.  These are used to make sure we don't allocate the same node to multiple chunks.  We also create a bitmap that will contain the final node mapping.

We then take each chunk and collect nodes for it.  Walking the cb_map, we look at each chunk and each bucket mapped to that chunk.  We start with the busy-later nodes: nodes with time constraints should be allocated before nodes that have none.  From the busy-later bitmap we allocate every node on which the job will complete before the node becomes busy.  If we still need more nodes, we take them from the free bitmap.  If we still need more nodes for this chunk, we move on to the next bucket.

After we have gone over the chunks and buckets in our cb_map, we will know if the job can run now or not.


We store the nodes on which the job will run in a set of allocation bitmaps (one per chunk), kept in the cb_map.  This is because the final exec_vnode needs to match 1:1 with the chunks in the select statement.  If we had only one allocation bitmap, the exec_vnode would be ordered by node index in the bitmap, not by the select statement.


Finally, we turn the cb_map allocation bitmaps into an nspec array.  We go over each chunk in the cb_map and allocate its nodes.  For each set bit in the bitmap, we allocate the node N times to the exec_vnode, where N is the number of times the bucket can be allocated to the chunk.  In the end we have an nspec array of all the allocated nodes.


A new bitmap library has been implemented for arbitrary-length bitmaps.

Interactions with other features

External changes

Whether the bucket code path is used is entirely the scheduler's decision; there are no external controls for this feature.  That being said, there are external behavior changes.