This section explains Service Engine (SE) group features such as TCP Segmentation Offload (TSO), Generic Receive Offload (GRO), Receive Side Scaling (RSS), and multiple dispatchers and queues.

TCP Segmentation Offload (TSO)

TCP segmentation offload reduces the CPU overhead of TCP/IP processing on fast networks. A host with TSO-enabled hardware sends TCP data to the Network Interface Card (NIC) without segmenting the data in software. This type of offload relies on the NIC to segment the data and add the TCP, IP, and data link layer headers to each segment.
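
As an illustration of the concept, the sketch below checks whether TSO is currently enabled on a NIC of a generic Linux host by parsing ethtool -k output. It applies to the standard Linux network stack rather than the SE's DPDK datapath, and the interface name eth0 is a placeholder.

    # Illustrative only: check TSO status on a generic Linux NIC.
    # This inspects the regular kernel stack, not the SE DPDK datapath;
    # the interface name "eth0" is a placeholder.
    import subprocess

    def tso_enabled(interface: str = "eth0") -> bool:
        # "ethtool -k" lists the offload features currently active on the NIC.
        output = subprocess.run(
            ["ethtool", "-k", interface], capture_output=True, text=True, check=True
        ).stdout
        for line in output.splitlines():
            if line.strip().startswith("tcp-segmentation-offload:"):
                return "on" in line.split(":", 1)[1]
        return False

    if __name__ == "__main__":
        print("TSO enabled:", tso_enabled("eth0"))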

TSO Support in Routing

Without TSO support in routing, the Generic Receive Offload (GRO) feature cannot be used for routed traffic: routing is stateless, and the SE cannot segment a large GRO-coalesced packet when the packets are not allowed to be IP fragmented. With this feature, GRO can be utilized for routed traffic because the SE can segment the larger packets back into smaller TCP segments, either through TSO, if the interface supports it, or in the routing layer of the SE.

During the three-way handshake, the client and server advertise their respective Maximum Segment Size (MSS) values so that neither peer sends TCP segments larger than the other's MSS. TSO support in routing is enabled by default.
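
As a worked example of the advertised MSS (assuming IPv4 and TCP headers without options, 20 bytes each), the MSS is simply the interface MTU minus the IP and TCP header sizes:

    # Worked example: MSS advertised during the three-way handshake.
    # Assumes IPv4 and TCP headers without options (20 bytes each).
    IP_HEADER = 20
    TCP_HEADER = 20

    def mss_for_mtu(mtu: int) -> int:
        return mtu - IP_HEADER - TCP_HEADER

    print(mss_for_mtu(1500))   # standard Ethernet MTU -> 1460
    print(mss_for_mtu(9000))   # jumbo frames          -> 8960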

Generic Receive Offload (GRO)

GRO is a software technique for increasing the inbound throughput of high-bandwidth network connections by reducing CPU overhead. It works by aggregating multiple incoming packets from a single flow into a larger packet chain before they are passed up the networking stack, thereby reducing the number of packets that have to be processed.

Note:

GRO is effective when multiple packets for the same flow are received within a short period. If the incoming packets belong to different flows, enabling GRO might not yield a noticeable benefit.

The following are the two modes of GRO operation in the DPDK library:

Static mode:

Packets received in a single burst are subjected to the GRO layer. This was the mode of operation in NSX Advanced Load Balancer releases prior to 22.1.1.

Timer mode:

The packets received are subjected to the GRO layer for a configured timeout. Special packets such as SYN, PSH, RST, and so on are not subjected to this timeout.

GRO now operates in timer mode. Timer mode GRO can deliver better performance and lower CPU utilization than static mode GRO.

The timeout value can be configured in the ServiceEngineGroup under dpdk_gro_timeout_interval. The default GRO timeout value is 50 us for new Service Engine Groups, and the knob can range from 0 to 900 us. When dpdk_gro_timeout_interval is configured to zero, the SEs in the Service Engine Group revert to the static mode of GRO. For SEs upgraded to 22.1.1, dpdk_gro_timeout_interval will be zero, implying static (burst) mode GRO, which is the legacy mode of operation.
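
A minimal sketch of changing this knob over the Controller REST API is shown below. The controller address, credentials, SE group name, and API version are placeholders, and the read-modify-write pattern (fetch the ServiceEngineGroup object, change dpdk_gro_timeout_interval, write it back) is one common way to apply such a change.

    # Sketch: set dpdk_gro_timeout_interval on a Service Engine Group through
    # the Controller REST API. Controller address, credentials, group name and
    # the X-Avi-Version value are placeholders for your environment.
    import requests

    CONTROLLER = "https://controller.example.com"

    session = requests.Session()
    session.auth = ("admin", "password")          # placeholder credentials
    session.headers.update({"X-Avi-Version": "22.1.1"})
    session.verify = False                        # lab only; use proper certificates in production

    # Read-modify-write: fetch the SE group, change the knob, write it back.
    seg = session.get(f"{CONTROLLER}/api/serviceenginegroup?name=Default-Group").json()["results"][0]
    seg["dpdk_gro_timeout_interval"] = 100        # microseconds; 0 reverts to static mode GRO
    session.put(f"{CONTROLLER}/api/serviceenginegroup/{seg['uuid']}", json=seg).raise_for_status()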

Starting with NSX Advanced Load Balancer version 22.1.2, GRO is enabled if the SE group has SEs with eight or more vCPUs.

Multi-Queue Support

The dispatcher on NSX Advanced Load Balancer is responsible for fetching incoming packets from a NIC, sending them to the appropriate core for proxy work, and sending outgoing packets back to the NIC. A 40G NIC, or even a 10G NIC receiving traffic at a high packets-per-second (PPS) rate (for example, small UDP packets), might not be efficiently processed by a single-core dispatcher.

This problem can be solved by distributing traffic from a single physical NIC across multiple queues where each queue gets processed by a dispatcher on a different core. Receive Side Scaling (RSS) enables the use of multiple queues on a single physical NIC.

Large Receive Offload (LRO)

Large Receive Offload (LRO) is a hardware technique for increasing the inbound throughput of high-bandwidth network connections by reducing CPU overhead. Incoming packets are merged at reception time so that the packet processing unit sees far fewer of them. This merging can be done either in the hardware or in the driver; even LRO emulation in the driver has performance benefits.

LRO is much more aggressive than GRO and can combine packets in a lossy fashion (discarding important header data), whereas GRO is more restrictive. In particular, LRO is known to be problematic in environments with routing and/or forwarding, which are common in virtualization setups.

LRO is supported only in vCenter and NSX-T environments. LRO is validated on NSX-T (ENS mode).

For more details on the LRO routing use case and LRO configuration, see the Configuring TSO, LRO, GRO, and RSS section in this guide.

Receive Side Scaling (RSS)

When RSS is enabled on NSX Advanced Load Balancer, NICs use multiple queues in the receive path. The NIC pins flows to queues and places packets belonging to the same flow in the same queue. This helps the driver spread packet processing across multiple CPUs, thereby improving efficiency.

On the NSX Advanced Load Balancer SE, the multi-queue feature is also enabled on the transmit side, that is, different flows are pinned to different queues (packets belonging to the same flow stay in the same queue) to distribute the packet processing among CPUs.
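
The simplified sketch below illustrates the idea of pinning flows to queues: packets are hashed on their 5-tuple, so every packet of a flow lands in the same queue. Real NICs typically use a Toeplitz hash with an indirection table; this is only a conceptual model, not the SE's actual implementation.

    # Conceptual model of RSS flow pinning: hash the 5-tuple so that every
    # packet of a flow is steered to the same queue. Real NICs use a Toeplitz
    # hash and an indirection table; this is an illustration only.
    from dataclasses import dataclass

    NUM_QUEUES = 4

    @dataclass(frozen=True)
    class Flow:
        src_ip: str
        dst_ip: str
        src_port: int
        dst_port: int
        proto: str

    def queue_for_flow(flow: Flow, num_queues: int = NUM_QUEUES) -> int:
        # A frozen dataclass is hashable, so the whole 5-tuple feeds the hash.
        return hash(flow) % num_queues

    f1 = Flow("10.0.0.1", "10.0.0.2", 40001, 443, "TCP")
    f2 = Flow("10.0.0.3", "10.0.0.2", 51515, 443, "TCP")
    print(queue_for_flow(f1), queue_for_flow(f2))   # a given flow always maps to the same queue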

Note:

The multi-queue feature (RSS) is not supported with IPv6 addresses. If RSS is enabled, IPv6 addresses cannot be configured on NSX Advanced Load Balancer Service Engine interfaces. Similarly, if an IPv6 address is already configured on NSX Advanced Load Balancer Service Engine interfaces, the multi-queue feature (RSS) cannot be enabled on those interfaces.

Multiple Dispatcher and Queues per NIC

Depending on the traffic processed by the Service Engine, the dispatcher can be configured with one or more cores. Systems with a high PPS load are configured with a higher number of dispatchers, whereas proxy-heavy loads such as SSL workloads might not need a high number of dispatchers.

In addition, the number of queues per NIC can be set for each dispatcher core for better performance. The Service Engine tries to detect the best settings for each environment automatically.
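
As with the GRO timeout sketch above, these settings live on the ServiceEngineGroup object. The sketch below shows how the num_dispatcher_cores and max_queues_per_vnics knobs referenced in the Auto RSS section could be set explicitly; the controller details and values are placeholders, and updating these knobs requires an SE reboot.

    # Sketch: explicitly set dispatcher cores and queues per vNIC on a Service
    # Engine Group. Controller address, credentials, group name and values are
    # placeholders; updating these knobs requires an SE reboot to take effect.
    import requests

    CONTROLLER = "https://controller.example.com"

    session = requests.Session()
    session.auth = ("admin", "password")          # placeholder credentials
    session.headers.update({"X-Avi-Version": "22.1.1"})
    session.verify = False                        # lab only

    seg = session.get(f"{CONTROLLER}/api/serviceenginegroup?name=Default-Group").json()["results"][0]
    seg["num_dispatcher_cores"] = 2               # dispatcher cores per SE (0 = auto, per the Auto RSS section)
    seg["max_queues_per_vnics"] = 4               # queues per NIC (0 = auto, per the Auto RSS section)
    session.put(f"{CONTROLLER}/api/serviceenginegroup/{seg['uuid']}", json=seg).raise_for_status()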

Service Engine Datapath Isolation mode

NSX Advanced Load Balancer Service Engines can dedicate one or more SE cores to non se-dp tasks. This configuration particularly helps when Service Engines host latency-sensitive applications. However, it carries a penalty on overall Service Engine performance, as one or more cores are dedicated to non se-dp tasks.
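
A configuration sketch is shown below. The property names se_dp_isolation and se_dp_isolation_num_non_dp_cpus are assumptions to verify against the ServiceEngineGroup object model for your release, and the controller details are placeholders.

    # Sketch: reserve SE cores for non-datapath (non se-dp) tasks. The property
    # names se_dp_isolation and se_dp_isolation_num_non_dp_cpus are assumptions;
    # verify them against the ServiceEngineGroup object model for your release.
    import requests

    CONTROLLER = "https://controller.example.com"

    session = requests.Session()
    session.auth = ("admin", "password")          # placeholder credentials
    session.headers.update({"X-Avi-Version": "22.1.1"})
    session.verify = False                        # lab only

    seg = session.get(f"{CONTROLLER}/api/serviceenginegroup?name=Default-Group").json()["results"][0]
    seg["se_dp_isolation"] = True                 # assumed knob: enable datapath isolation
    seg["se_dp_isolation_num_non_dp_cpus"] = 1    # assumed knob: cores reserved for non se-dp work
    session.put(f"{CONTROLLER}/api/serviceenginegroup/{seg['uuid']}", json=seg).raise_for_status()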

Hybrid RSS Mode

The SE hybrid RSS mode works only in DPDK mode with RSS configured. It allows each SE vCPU to function as an independent unit, with every core handling both the dispatch and proxy work, and it disallows cross-core punting of packets. For example, on a 2-core Service Engine with the cores tagged as (dispatcher-0, proxy-0) and (dispatcher-1, proxy-1) on vCPU0 and vCPU1 respectively, any ingress flow on dispatcher-0 egresses through proxy-0 and is not punted to proxy-1, and vice versa.

Hybrid mode is a configurable property and aims at achieving higher performance on low-core SEs, especially 1-core and 2-core SEs on vCenter/NSX-T clouds.
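
If this property is exposed on the ServiceEngineGroup object as hybrid_rss_mode, which is an assumption to verify against the object model for your release, it could be toggled with the same read-modify-write pattern used above (controller details are placeholders):

    # Sketch: toggle hybrid RSS mode on a Service Engine Group. The property
    # name hybrid_rss_mode is an assumption; confirm it in the
    # ServiceEngineGroup object model for your release.
    import requests

    CONTROLLER = "https://controller.example.com"

    session = requests.Session()
    session.auth = ("admin", "password")          # placeholder credentials
    session.headers.update({"X-Avi-Version": "22.1.2"})
    session.verify = False                        # lab only

    seg = session.get(f"{CONTROLLER}/api/serviceenginegroup?name=Default-Group").json()["results"][0]
    seg["hybrid_rss_mode"] = True                 # assumed knob: per-core dispatch+proxy, no cross-core punting
    session.put(f"{CONTROLLER}/api/serviceenginegroup/{seg['uuid']}", json=seg).raise_for_status()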

Auto RSS for Public Cloud

The network bandwidth (capacity) provisioned for virtual machines on public clouds depends on the instance type, not on the aggregate network bandwidth of the attached network interfaces. The SE determines the published network capacity of its instance type and configures RSS, that is, max_queues_per_vnics and num_dispatcher_cores, appropriately in auto mode. The administrator can also configure these values manually. Updating these knobs requires a reboot.

Depending on the traffic profile, the dedicated dispatcher mode can also be enabled. This is a runtime property and can be toggled through the dedicated_dispatcher_core (boolean) knob.
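
Because dedicated_dispatcher_core is a runtime property, toggling it follows the same pattern as the other knobs but does not require a reboot (controller details are placeholders):

    # Sketch: toggle the dedicated dispatcher mode. dedicated_dispatcher_core
    # is a runtime boolean property of the Service Engine Group and does not
    # require a reboot. Controller details are placeholders.
    import requests

    CONTROLLER = "https://controller.example.com"

    session = requests.Session()
    session.auth = ("admin", "password")          # placeholder credentials
    session.headers.update({"X-Avi-Version": "22.1.2"})
    session.verify = False                        # lab only

    seg = session.get(f"{CONTROLLER}/api/serviceenginegroup?name=Default-Group").json()["results"][0]
    seg["dedicated_dispatcher_core"] = True       # pin the dispatcher to its own core(s)
    session.put(f"{CONTROLLER}/api/serviceenginegroup/{seg['uuid']}", json=seg).raise_for_status()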

Upgrade Considerations

In future releases, you can configure the Auto-RSS feature to auto (0) for new SE groups.

Note:

Previous configurations will be preserved after upgrade.