This section describes how to obtain the best NUMA performance from VMware Cloud on AWS.

Note:

A different feature, Virtual NUMA (vNUMA), which allows the creation of virtual machines that expose a NUMA topology to the guest operating system, is described in Guest Operating System CPU Considerations.

More information about using NUMA systems can be found in the Using NUMA Systems with ESXi section of vSphere Resource Management. Though that section does not specifically address VMware Cloud on AWS, most of the mechanisms it details are nevertheless relevant.

Manual NUMA Configuration

The intelligent, adaptive NUMA scheduling and memory placement policies in VMware Cloud on AWS can manage all virtual machines transparently, so that administrators don’t need to deal with the complexity of balancing virtual machines between nodes by hand. Manual controls are available to override this default behavior, however, and advanced administrators might prefer to manually set NUMA placement (through the numa.nodeAffinity advanced option).
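
As a concrete illustration, the following minimal sketch uses the open-source pyVmomi SDK to apply an advanced option such as numa.nodeAffinity to a virtual machine. The vCenter address, credentials, virtual machine name, and the node value "0" are placeholders, and set_extra_config() and find_vm_by_name() are helpers defined here only for illustration; this is a sketch rather than an officially documented procedure, and such changes typically take effect only after the virtual machine is power-cycled.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    def set_extra_config(vm, key, value):
        """Apply a single advanced (extraConfig) option to a virtual machine."""
        spec = vim.vm.ConfigSpec(
            extraConfig=[vim.option.OptionValue(key=key, value=value)]
        )
        WaitForTask(vm.ReconfigVM_Task(spec=spec))

    def find_vm_by_name(si, name):
        """Return the first virtual machine in the inventory matching name."""
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True
        )
        return next(vm for vm in view.view if vm.name == name)

    # Placeholder endpoint and credentials; the unverified SSL context is for
    # lab use only.
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    try:
        vm = find_vm_by_name(si, "app-vm-01")
        # Constrain this VM's NUMA scheduling and memory placement to
        # physical NUMA node 0.
        set_extra_config(vm, "numa.nodeAffinity", "0")
    finally:
        Disconnect(si)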

Virtual machines can be separated into the following two categories:

  • Virtual machines with a number of vCPUs equal to or less than the number of cores in each physical NUMA node.

    These virtual machines will be assigned to cores all within a single NUMA node and will be preferentially allocated memory local to that NUMA node. This means that, subject to memory availability, all their memory accesses will be local to that NUMA node, resulting in the lowest memory access latencies.

  • Virtual machines with more vCPUs than the number of cores in each physical NUMA node (called “wide virtual machines”).

    These virtual machines will be assigned to two NUMA nodes and will be preferentially allocated memory local to those NUMA nodes. Because vCPUs in these wide virtual machines might sometimes need to access memory outside their own NUMA node, they might experience higher average memory access latencies than virtual machines that fit entirely within a NUMA node.

    Note:

    This potential increase in average memory access latencies can be mitigated by appropriately configuring Virtual NUMA (described in Guest Operating System CPU Considerations), thus allowing the guest operating system to take on part of the memory-locality management task.

Because of this difference, in some environments there can be a slight performance advantage to configuring virtual machines with no more vCPUs than the number of cores in each physical NUMA node.
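
As a rough illustration of this sizing rule, the snippet below classifies a virtual machine as "wide" or not; the figure of 24 cores per NUMA node is only an assumption for the example and should be replaced with the actual per-node core count of the host type in use.

    def is_wide_vm(vcpus, cores_per_numa_node):
        """A VM is 'wide' when its vCPU count exceeds the cores in one NUMA node."""
        return vcpus > cores_per_numa_node

    cores_per_node = 24                      # assumed cores per NUMA node
    print(is_wide_vm(16, cores_per_node))    # False: fits in one node
    print(is_wide_vm(32, cores_per_node))    # True: spans two nodes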

Conversely, some memory-bandwidth-bottlenecked workloads can benefit from the increased aggregate memory bandwidth available when a virtual machine that would fit within one NUMA node is nevertheless split across multiple NUMA nodes. This split can be accomplished by using the maxPerMachineNode option to limit the number of vCPUs that can be placed per NUMA node (do also consider the impact on vNUMA, however, which is described in Guest Operating System CPU Considerations).
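
Continuing the earlier pyVmomi sketch (and reusing its illustrative set_extra_config() and find_vm_by_name() helpers), the snippet below shows how such a per-node vCPU cap might be applied; the full advanced-option key is assumed to be numa.vcpu.maxPerMachineNode, and the virtual machine name and cap of 8 are example values only.

    vm = find_vm_by_name(si, "bandwidth-heavy-vm")
    # Cap each physical NUMA node at 8 of this VM's vCPUs, so a 16-vCPU VM
    # that would otherwise fit in one node is spread across two nodes.
    set_extra_config(vm, "numa.vcpu.maxPerMachineNode", "8")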

As described in Hyper-Threading, i3en.metal instances have Hyper-Threading enabled. Therefore, on these systems, virtual machines with a number of vCPUs greater than the number of cores in a NUMA node but lower than the number of logical processors in each physical NUMA node might benefit from using logical processors with local memory instead of full cores with remote memory. This behavior can be configured for a specific virtual machine with the numa.vcpu.preferHT flag.
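
A similar sketch, again reusing the illustrative helpers from the earlier example, shows how this flag might be set for a single virtual machine; the virtual machine name is a placeholder, and the change typically takes effect at the next power cycle.

    vm = find_vm_by_name(si, "latency-sensitive-vm")
    # Let the NUMA scheduler count logical processors (hyper-threads) when
    # sizing this VM's NUMA client, keeping it within one node's local memory.
    set_extra_config(vm, "numa.vcpu.preferHT", "TRUE")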