KNL Analysis

A brief overview of KNL.

Intel® Xeon Phi™ processors, codenamed “Knights Landing,” are based on Intel’s
Many Integrated Core (MIC) architecture, offering an alternative performance/power configuration
to Intel Xeon processor products.

The “Knights Landing” processor family consists of self-booting host processors with 64 to 68 cores
(each core supports 4 threads with hyperthreading) that can be embedded in Cray XC compute blade configurations. This new many-core
device supports wider vector units and more threads per core to deliver in excess of 3 TF per device. One of the keys to
scaling parallelism is localizing and optimizing data movement.

Each of the new XC series compute nodes implements an Intel Xeon Phi processor socket with a pipeline to local DDR memory, as well
as PCIe Gen 3 x16 access to the high-performance Cray-developed Aries interconnect. This processor family implements innovative
new onboard high-bandwidth DRAM memory (HBM configurations up to 16 GB), which is tightly coupled with the host compute die.
Users can see a significant performance boost from identifying high-bandwidth data and placing that data in the on-chip memory for
their HPC applications.

The high-performance memory can be configured on the flexible XC series compute nodes at boot time (job launch) to be used as local
cache or as directly accessible fast memory. This device is binary compatible with Xeon processors, so whether programmers write all their
own code or users load pre-existing ISV applications, it is easy to support different use modes.


The following are the NUMA modes for KNL processors:

Mode   Name                    Description
a2a    All-to-All              Addresses are uniformly hashed across distributed directories
quad   Quadrant                Addresses are hashed to a directory in the same quadrant as the memory
hemi   Hemisphere              Addresses are hashed to a directory in the same hemisphere as the memory
snc4   (4) Sub-NUMA Clusters   Tiles are divided into four sub-NUMA clusters; each cluster is one NUMA node
snc2   (2) Sub-NUMA Clusters   Tiles are divided into two sub-NUMA clusters; each cluster is one NUMA node


The following are the MCDRAM modes for KNL processors:

Mode    Name           Description
cache   Cache          MCDRAM is used as a cache between the processor and DDR4 memory
flat    Flat           MCDRAM is physically addressable, in a separate NUMA node
equal   Hybrid Equal   50% of MCDRAM is Flat, and 50% of MCDRAM is Cache
split   Hybrid Split   75% of MCDRAM is Flat, and 25% of MCDRAM is Cache


The KNL processor and MCDRAM can be configured in twenty different combinations of NUMA and MCDRAM
modes. For example, the quadrant NUMA mode can be combined with the cache MCDRAM mode, and this
combination is known as quad/cache. For more detail on this, please check here.
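
The count of twenty follows from pairing each of the five NUMA modes with each of the four MCDRAM modes. A minimal Python sketch of that enumeration is shown below; the list names and the function name are illustrative only and are not part of PBS or ALPS.

```python
from itertools import product

# Mode keywords taken from the two tables above.
NUMA_MODES = ["a2a", "quad", "hemi", "snc4", "snc2"]
MCDRAM_MODES = ["cache", "flat", "equal", "split"]


def mode_combinations():
    """Return every NUMA/MCDRAM pairing as a 'numa/mcdram' string."""
    return [f"{numa}/{mcdram}" for numa, mcdram in product(NUMA_MODES, MCDRAM_MODES)]


if __name__ == "__main__":
    combos = mode_combinations()
    print(len(combos))             # 5 NUMA modes x 4 MCDRAM modes
    print("quad/cache" in combos)  # the example combination named above
```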



Requirements & Analysis

Requirements:

  1. Create a vnode per KNL node.

Analysis:

Currently the smallest theoretical node entry in the BASIL QUERY(INVENTORY) response requires 21 lines of XML,
and actual node entries are much larger. Thus the size of a full inventory for a system of thousands of nodes is
extremely large. Much of the information in a full system inventory is redundant.

From BASIL 1.5 onwards, a new BASIL query type (SYSTEM) returns simpler, smaller portions of the Cray system's inventory
information. As far as possible, the XML node subelements in these new query responses are converted into attributes
containing unique values or counts. This allows node information to be grouped with multiple nodes per record. These
node lists are specified in rangelist format.
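
To illustrate how a grouped record can be turned back into individual node IDs, here is a minimal Python sketch of rangelist expansion. It assumes the rangelist is a comma-separated list of single IDs and inclusive `first-last` ranges (for example `40-47`); the exact grammar is defined by the BASIL specification, and the function name `expand_rangelist` is hypothetical.

```python
def expand_rangelist(text):
    """Expand a rangelist such as '40-47,50' into a sorted list of node IDs.

    Assumption: entries are separated by commas and each entry is either a
    single integer or an inclusive 'first-last' range.
    """
    node_ids = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty entries and surrounding whitespace
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            node_ids.update(range(start, end + 1))
        else:
            node_ids.add(int(part))
    return sorted(node_ids)


if __name__ == "__main__":
    print(expand_rangelist("40-47"))
```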

BASIL 1.7 was introduced to support KNL nodes. Cray considered changing the INVENTORY and SUMMARY queries, but decided
that the changes to the SYSTEM query fully describe the current configuration of KNL nodes.

We considered using the new SYSTEM query to generate both KNL and non-KNL nodes. However, certain elements/attributes
are exclusive to one query type or the other, so using only the SYSTEM query is not viable.

These differences are as follows:

ELEMENT/ATTRIBUTES   INVENTORY_QUERY                          SYSTEM_QUERY
Architecture         Available (the value is usually XT,      NA
                     but it seems other architectures
                     are supported)
Segment              Available                                NA (but can be computed)
PBScrayorder         Available (easily computed)              NA
LabelArray           Available                                NA
numa_cfg             NA                                       Available
hbm_size_mb          NA                                       Available
hbm_cache_pct        NA                                       Available

NA: Not Available

Solution:

PBS will make two BASIL queries: a SYSTEM query to get information about KNL nodes, and an INVENTORY query for non-KNL nodes.
It will then combine the results to create vnodes.

The existing implementation already handles non-KNL node information through the BASIL INVENTORY query. An implementation for handling the SYSTEM query
response is required.



Sample ALPS SYSTEM query response:
<Nodes role="interactive" state="up" speed="1200" numa_nodes="1" dies="1" compute_units="68" cpus_per_cu="4" page_size_kb="4" page_count="25165874"
numa_cfg="quad" hbm_size_mb="16584" hbm_cache_pct="100">
40-47
</Nodes>
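
A record like the one above could be reduced to per-node values roughly as follows. This is a minimal Python sketch, not the PBS implementation (which is written in C). It makes several assumptions: the attribute carrying the MCDRAM size is named `hbm_size_mb`, as listed in the comparison table; `hbm_cache_pct` is the percentage of MCDRAM used as cache; the element text is a rangelist of node IDs; and the function name and returned dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample record; the attribute name hbm_size_mb is assumed from the table above.
SAMPLE_NODES = """<Nodes role="interactive" state="up" speed="1200" numa_nodes="1"
 dies="1" compute_units="68" cpus_per_cu="4" page_size_kb="4" page_count="25165874"
 numa_cfg="quad" hbm_size_mb="16584" hbm_cache_pct="100">
40-47
</Nodes>"""


def parse_system_nodes(xml_text):
    """Summarize one <Nodes> record from a SYSTEM query response."""
    elem = ET.fromstring(xml_text)
    attrs = elem.attrib

    # Expand the rangelist in the element text (assumed 'a-b' or 'a', comma-separated).
    node_ids = []
    for part in (elem.text or "").strip().split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        node_ids.extend(range(int(first), int(last or first) + 1))

    # Treat missing or empty HBM attributes as zero.
    hbm_mb = int(attrs.get("hbm_size_mb") or 0)
    cache_pct = int(attrs.get("hbm_cache_pct") or 0)

    return {
        "node_ids": node_ids,
        "ncpus": int(attrs["compute_units"]) * int(attrs["cpus_per_cu"]),
        "mem_kb": int(attrs["page_size_kb"]) * int(attrs["page_count"]),
        "numa_cfg": attrs.get("numa_cfg", ""),
        # Portion of MCDRAM left as directly addressable (flat) memory.
        "hbm_flat_mb": hbm_mb * (100 - cache_pct) // 100,
    }


if __name__ == "__main__":
    print(parse_system_nodes(SAMPLE_NODES))
```

Because every node in the rangelist shares the same attributes, one parsed record describes all eight nodes (40 through 47) in this sample.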
Sample ALPS INVENTORY query response for a single node:
<Node node_id="28" name="xxxx" architecture="XT" role="BATCH" state="UP">
     <SocketArray>
      <Socket ordinal="0" architecture="x86_64" clock_mhz="2100">
       <SegmentArray>
        <Segment ordinal="0">
         <ComputeUnitArray>
          <ComputeUnit ordinal="0">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="1">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="2">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="3">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
         </ComputeUnitArray>
         <MemoryArray>
          <Memory type="OS" page_size_kb="4" page_count="4194304"/>
         </MemoryArray>
         <LabelArray/>
        </Segment>
        <Segment ordinal="1">
         <ComputeUnitArray>
          <ComputeUnit ordinal="0">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="1">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="2">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
          <ComputeUnit ordinal="3">
           <ProcessorArray>
            <Processor ordinal="0"/>
            <Processor ordinal="1"/>
           </ProcessorArray>
          </ComputeUnit>
         </ComputeUnitArray>
         <MemoryArray>
          <Memory type="OS" page_size_kb="4" page_count="4194304"/>
         </MemoryArray>
         <LabelArray/>
        </Segment>
       </SegmentArray>
      </Socket>
     </SocketArray>
     <AcceleratorArray>
      <Accelerator ordinal="0" type="GPU" state="UP" family="Tesla_K20X" memory_mb="6144" clock_mhz="732"/>
     </AcceleratorArray>
</Node>
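
For comparison with the attribute-based SYSTEM record, the values that the SYSTEM query reports as counts have to be derived by walking the element tree of an INVENTORY node entry. The Python sketch below shows one way to do that on a shortened version of the sample above (one segment, one compute unit). The function name and dictionary keys are hypothetical, and this is not the existing PBS code.

```python
import xml.etree.ElementTree as ET

# Shortened INVENTORY node entry: one segment with a single compute unit.
SAMPLE_NODE = """<Node node_id="28" name="xxxx" architecture="XT" role="BATCH" state="UP">
 <SocketArray>
  <Socket ordinal="0" architecture="x86_64" clock_mhz="2100">
   <SegmentArray>
    <Segment ordinal="0">
     <ComputeUnitArray>
      <ComputeUnit ordinal="0">
       <ProcessorArray>
        <Processor ordinal="0"/>
        <Processor ordinal="1"/>
       </ProcessorArray>
      </ComputeUnit>
     </ComputeUnitArray>
     <MemoryArray>
      <Memory type="OS" page_size_kb="4" page_count="4194304"/>
     </MemoryArray>
     <LabelArray/>
    </Segment>
   </SegmentArray>
  </Socket>
 </SocketArray>
</Node>"""


def summarize_inventory_node(xml_text):
    """Count segments, compute units, and CPUs, and total the memory of one <Node>."""
    node = ET.fromstring(xml_text)
    segments = node.findall("./SocketArray/Socket/SegmentArray/Segment")
    mem_kb = sum(
        int(mem.get("page_size_kb")) * int(mem.get("page_count"))
        for mem in node.iter("Memory")
    )
    return {
        "node_id": int(node.get("node_id")),
        "architecture": node.get("architecture"),
        "segments": len(segments),
        "compute_units": sum(1 for _ in node.iter("ComputeUnit")),
        "ncpus": sum(1 for _ in node.iter("Processor")),
        "mem_kb": mem_kb,
    }


if __name__ == "__main__":
    print(summarize_inventory_node(SAMPLE_NODE))
```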