vSphere 6.5 and later releases support remote direct memory access (RDMA) communication between virtual machines that have paravirtualized RDMA (PVRDMA) network adapters.

Overview of RDMA

RDMA allows direct memory access from the memory of one computer to the memory of another computer without involving the operating system or CPU. The memory transfer is offloaded to RDMA-capable Host Channel Adapters (HCAs). A PVRDMA network adapter provides remote direct memory access in a virtual environment.
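
In a Linux guest, RDMA-capable devices, including the PVRDMA adapter, are typically exposed to applications through the verbs API of the rdma-core (libibverbs) user-space library. The following minimal C sketch, which assumes that libibverbs is installed in the guest, enumerates the visible RDMA devices and prints a few of their attributes. The device name reported for a PVRDMA adapter is usually similar to vmw_pvrdma0, but can vary.

  /* list_rdma_devices.c: enumerate the RDMA devices visible to the guest.
     Build with: gcc list_rdma_devices.c -o list_rdma_devices -libverbs */
  #include <stdio.h>
  #include <infiniband/verbs.h>

  int main(void)
  {
      int num_devices = 0;
      struct ibv_device **devices = ibv_get_device_list(&num_devices);
      if (devices == NULL || num_devices == 0) {
          fprintf(stderr, "No RDMA devices found\n");
          return 1;
      }

      for (int i = 0; i < num_devices; ++i) {
          /* A PVRDMA adapter typically shows up with a name such as vmw_pvrdma0. */
          struct ibv_context *ctx = ibv_open_device(devices[i]);
          if (ctx == NULL)
              continue;

          struct ibv_device_attr attr;
          if (ibv_query_device(ctx, &attr) == 0)
              printf("%s: max_qp=%d max_cq=%d\n",
                     ibv_get_device_name(devices[i]), attr.max_qp, attr.max_cq);

          ibv_close_device(ctx);
      }

      ibv_free_device_list(devices);
      return 0;
  }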

Using RDMA in vSphere

In vSphere, a virtual machine can use a PVRDMA network adapter to communicate with other virtual machines that have PVRDMA devices. The virtual machines must be connected to the same vSphere Distributed Switch.

The PVRDMA device automatically selects the method of communication between the virtual machines. For virtual machines that run on the same ESXi host, with or without a physical RDMA device, the data transfer is a memcpy between the two virtual machines. The physical RDMA hardware is not used in this case.

For virtual machines that reside on different ESXi hosts and that have a physical RDMA connection, the physical RDMA devices must be uplinks on the distributed switch. In this case, the communication between the virtual machines by way of PVRDMA uses the underlying physical RDMA devices.

When two virtual machines run on different ESXi hosts and at least one of the hosts does not have a physical RDMA device, the communication falls back to a TCP-based channel and performance is reduced.

Assign a PVRDMA Adapter to a Virtual Machine

To enable a virtual machine to exchange data by using RDMA, you must associate the virtual machine with a PVRDMA network adapter.

In vSphere 7.0.2 and later, you can add up to ten PVRDMA network adapters to a virtual machine.

Prerequisites

  • Verify that the host on which the virtual machine is running is configured for RDMA. See Configure an ESXi Host for PVRDMA.
  • Verify that the host is connected to a vSphere Distributed Switch.
  • Verify that the virtual machine uses virtual hardware version 13 or later.
  • Verify that the guest operating system is a Linux 64-bit distribution.

Procedure

  1. Locate the virtual machine in the vSphere Client.
    1. Select a data center, folder, cluster, resource pool, or host and click the VMs tab.
    2. Click Virtual Machines and click the virtual machine from the list.
  2. Power off the virtual machine.
  3. From the Actions menu, select Edit Settings.
  4. Select the Virtual Hardware tab in the dialog box displaying the settings.
  5. From the Add new device drop-down menu, select Network Adapter.
    The New Network section is added to the list in the Virtual Hardware tab.
  6. Expand the New Network section and connect the virtual machine to a distributed port group.
  7. From the Adapter type drop-down menu, select PVRDMA.
  8. Expand the Memory section, select Reserve all guest memory (All locked), and click OK.
  9. Power on the virtual machine.
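
After you complete this procedure, the virtual machine's .vmx configuration file contains entries similar to the following sketch. The adapter index (1 in this example) and the exact set of keys depend on your configuration; ethernet1.virtualDev selects the PVRDMA adapter type, and sched.mem.pin reflects the Reserve all guest memory (All locked) setting. Treat the sketch as illustrative and verify the entries in your own environment.

  ethernet1.virtualDev = "pvrdma"
  sched.mem.pin = "TRUE"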

Configure a Virtual Machine to Use PVRDMA Native Endpoints

PVRDMA native endpoints are available as an advanced virtual machine configuration.

PVRDMA native endpoints are supported on virtual machine hardware version 18 and later, beginning with vSphere 7.0 Update 1. To use PVRDMA native endpoints, you must enable PVRDMA namespaces. To learn how to enable PVRDMA namespaces on your environment's specific hardware, refer to the vendor documentation.

You can configure native endpoints by using the vSphere Client or by editing the virtual machine's VMX file. If you edit the VMX file directly, add the parameter vrdmax.nativeEndpointSupport = "TRUE", where x is the index of the PVRDMA adapter. The following procedure uses the vSphere Client to configure native endpoints.
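
For example, to enable native endpoint support on the first PVRDMA adapter (index 0), the VMX entry looks like the following.

  vrdma0.nativeEndpointSupport = "TRUE"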

Prerequisites

Verify that your environment supports PVRDMA. See PVRDMA Support.

Procedure

  1. Browse to the virtual machine in the vSphere Client.
    1. To find a virtual machine, select a data center, folder, cluster, resource pool, or host.
    2. Click the VMs tab.
  2. Right-click the virtual machine and select Edit Settings.
  3. Click VM Options.
  4. Expand Advanced.
  5. Under Configuration Parameters, click the Edit Configuration button.
  6. In the dialog box that appears, click Add Row to enter a new parameter and its value.
  7. Enter the parameter vrdmax.nativeEndpointSupport, where x is the index of the PVRDMA adapter, and set the value to TRUE.
    The index x is the number of the PVRDMA adapter minus 1. For example, if the PVRDMA adapter for which you want to enable native endpoints is labeled "Network Adapter 2," then the index is 1.

Configure a Virtual Machine to Use PVRDMA Asynchronous Mode

Learn how to configure a virtual machine to use PVRDMA asynchronous mode, which is available as an advanced virtual machine configuration.

PVRDMA asynchronous mode is available for virtual machines running on vSphere 8.0 and later. Asynchronous mode might improve throughput and latency for RDMA workloads running in the virtual machine, but it might also increase CPU use on the host. When asynchronous mode is in use, it is recommended to configure the virtual machine for high latency sensitivity.

Prerequisites

Verify that your environment supports PVRDMA. See PVRDMA Support.

Procedure

  1. Locate the virtual machine in the vSphere Client.
    1. Select a data center, folder, cluster, resource pool, or host and click the VMs tab.
    2. Click Virtual Machines and click the virtual machine from the list.
  2. Right-click the virtual machine and select Edit Settings.
  3. Click VM Options.
  4. Expand Advanced.
  5. Under Configuration Parameters, click the Edit Configuration button.
  6. In the dialog box that appears, click Add Row to enter a new parameter and its value.
  7. Enter the parameter vrdma.asyncMode, and set the value to TRUE.
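
The equivalent VMX entry is shown in the following sketch. The second line is an optional assumption illustrating one common way to configure the virtual machine for high latency sensitivity, as recommended above; it is not added by this procedure.

  vrdma.asyncMode = "TRUE"
  sched.cpu.latencySensitivity = "high"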

Network Requirements for RDMA over Converged Ethernet

RDMA over Converged Ethernet (RoCE) ensures low-latency, lightweight, and high-throughput RDMA communication over an Ethernet network. RoCE requires a network that is configured for lossless traffic at layer 2 alone or at both layer 2 and layer 3.

RoCE is a network protocol that uses RDMA to provide faster data transfer for network-intensive applications. RoCE allows direct memory transfer between hosts without involving the hosts' CPUs.

There are two versions of the RoCE protocol. RoCE v1 operates at the link layer (layer 2), and RoCE v2 operates at the internet layer (layer 3). Both versions require a lossless network configuration: RoCE v1 requires a lossless layer 2 network, and RoCE v2 requires that both layer 2 and layer 3 are configured for lossless operation.

Lossless Layer 2 Network

To ensure a lossless layer 2 environment, you must be able to control the traffic flows. Flow control is achieved by enabling global pause across the network or by using the Priority Flow Control (PFC) protocol defined by the Data Center Bridging (DCB) group. PFC is a layer 2 protocol that uses the class of service field of the 802.1Q VLAN tag to set individual traffic priorities. It pauses the transfer of packets toward a receiver according to the individual class of service priorities. This way, a single link carries both lossless RoCE traffic and other lossy, best-effort traffic. During traffic flow congestion, important lossy traffic can be affected. To isolate different flows from one another, use RoCE in a PFC priority-enabled VLAN.

Lossless Layer 3 Network

RoCE v2 requires that lossless data transfer is preserved at layer 3 routing devices. To enable the transfer of layer 2 PFC lossless priorities across layer 3 routers, configure the router to map the received priority setting of a packet to the corresponding Differentiated Services Code Point (DSCP) QoS setting that operates at layer 3. The transferred RDMA packets are marked with a layer 3 DSCP value, a layer 2 Priority Code Point (PCP) value, or both. To extract priority information from the packet, routers use either DSCP or PCP. If PCP is used, the packet must be VLAN tagged, and the router must copy the PCP bits of the tag and forward them to the next network. If the packet is marked with DSCP, the router must keep the DSCP bits unchanged.

Like RoCE v1, RoCE v2 must run on a PFC priority-enabled VLAN.

Note: Do not team RoCE NICs if you intend to use RDMA on those NICs.

For vendor-specific configuration information, refer to the official documentation of the respective device or the switch vendor.