vSphere Distributed Services Engine® is a core vSphere capability that enables customers to use DPUs with vSphere and VMware Cloud Foundation.

vSphere 8.0 enables breakthrough workload performance to meet ever-increasing throughput and latency needs. With vSphere Distributed Services Engine, infrastructure services are distributed across the different compute resources available on the ESXi host, with networking functions offloaded to the DPU. This capability works well for modern applications, which are built on a microservices architecture that breaks the application down into multiple independent but cooperating services. This increased complexity places new demands on the CPU. For example, processing storage requests or shuttling network traffic for these microservices leaves fewer CPU cycles for the actual workload. In this context, purpose-built accelerators such as DPUs can take on the new compute burden and help you improve the performance and efficiency of your infrastructure.

With vSphere Distributed Services Engine, DPUs can accelerate the performance of your network and increase data throughput without adding the operational burden of managing the DPU lifecycle, because the existing Day-0, Day-1, and Day-2 vSphere experience does not change. vSphere Distributed Services Engine is supported by DPUs from NVIDIA and AMD, and server designs from Dell, HPE, Lenovo, and Fujitsu. vSphere Distributed Services Engine is available on servers with pre-installed DPUs.

Starting with vSphere 8.0, you can offload functionality that runs on the core CPU onto the DPU to significantly improve the network and security performance. As illustrated in the Evolving vSphere Architecture diagram, DPUs can also handle additional capabilities such as storage offload and bare metal management, but these additional capabilities are currently not supported.

Figure 1. Evolving vSphere Architecture.
VMware moves functionality that runs on the core CPU complex to the DPU CPU complex to enable network acceleration.

vSphere Distributed Services Engine offloads and accelerates infrastructure functions on the DPU by introducing a VMware vSphere Distributed Switch on the DPU and VMware NSX Networking and Observability, which allows you to proactively monitor, identify, and mitigate network infrastructure bottlenecks without complex network taps. The DPU becomes a new control point to scale infrastructure functions and enables security controls that are agentless and decoupled from the workload domain.

With vSphere Distributed Services Engine, you can:

vSphere Distributed Services Engine does not require a separate ESXi license. An internal network that is isolated from other networks connects the DPUs with ESXi hosts. ESXi 8.0 server builds are unified images, which contain both x86 and DPU content. In your vSphere system, you see DPUs as new objects during installation and upgrade, and in networking, storage, and host profile workflows.

High Availability with VMware vSphere Distributed Services Engine

With ESXi 8.0 Update 3, you can opt for a VMware vSphere Distributed Services Engine installation with 2 data processing units (DPUs) to achieve high availability.

In vSphere systems with a single DPU, the device might become the single point of failure for workloads offloaded to the DPU, such as networking functions, and impact data and productivity. With ESXi 8.0 Update 3, vSphere Distributed Services Engine is also available on servers with 2 pre-installed DPUs, which provides hardware redundancy and resiliency.

You can use the two DPUs in Active/Standby mode to provide high availability. Such a configuration provides redundancy if one of the DPUs fails. In the high availability configuration, both DPUs are assigned to the same NSX-backed vSphere Distributed Switch. For example, DPU-1 is attached to vmnic0 and vmnic1 of the vSphere Distributed Switch and DPU-2 is attached to vmnic2 and vmnic3 of the same vSphere Distributed Switch.
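
The Active/Standby layout in the example above can be pictured with a small model. The following Python sketch is illustrative only: the Dpu class, the active_uplinks helper, and the healthy flag are hypothetical names and are not part of any vSphere or NSX API. Only the vmnic-to-DPU mapping is taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class Dpu:
    """Hypothetical model of one DPU and the uplinks it backs."""
    name: str
    uplinks: tuple[str, ...]
    healthy: bool = True


def active_uplinks(dpus: list[Dpu]) -> tuple[str, ...]:
    """Return the uplinks of the first healthy DPU (Active/Standby).

    The list order is treated as the failover order: the first entry
    is the preferred active DPU, later entries are standbys.
    """
    for dpu in dpus:
        if dpu.healthy:
            return dpu.uplinks
    return ()  # no healthy DPU left to carry offloaded traffic


# Mapping from the example above: both DPUs sit on the same switch.
dpu1 = Dpu("DPU-1", ("vmnic0", "vmnic1"))
dpu2 = Dpu("DPU-2", ("vmnic2", "vmnic3"))

print(active_uplinks([dpu1, dpu2]))  # uplinks of DPU-1, the active device
dpu1.healthy = False                 # simulate a failure of DPU-1
print(active_uplinks([dpu1, dpu2]))  # uplinks of DPU-2, the former standby
```

The sketch deliberately models a single switch with an ordered list of DPUs, which mirrors the requirement that both DPUs are assigned to the same vSphere Distributed Switch in the high availability configuration.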

You can also use the two DPUs as independent devices to increase offload capacity per ESXi host. In such a configuration, each DPU is attached to a separate vSphere Distributed Switch and there is no failover between the DPUs.

Dual-DPU systems can use NVIDIA or Pensando devices. In ESXi 8.0 Update 3, dual-DPU systems are supported by Lenovo server designs. The DPU devices on a dual-DPU server must be identical in all aspects: same vendor, same hardware version, and same firmware. For a list of current vendors and server designs for VMware vSphere Distributed Services Engine, see the VMware Compatibility Guide.
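
The "identical in all aspects" requirement can be expressed as a simple pre-installation check. This is a minimal sketch under stated assumptions: the DpuInfo record, the mismatched_fields helper, and the sample values are hypothetical and made up for illustration. How you obtain the vendor, hardware version, and firmware of each device depends on your server tooling and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DpuInfo:
    """Hypothetical inventory record for one DPU."""
    vendor: str
    hardware_version: str
    firmware: str


def mismatched_fields(a: DpuInfo, b: DpuInfo) -> list[str]:
    """List the attributes in which the two DPUs differ.

    An empty list means the pair satisfies the requirement of the same
    vendor, the same hardware version, and the same firmware.
    """
    fields = ("vendor", "hardware_version", "firmware")
    return [name for name in fields if getattr(a, name) != getattr(b, name)]


# Sample values are placeholders, not real product identifiers.
first = DpuInfo(vendor="VendorA", hardware_version="rev-1", firmware="fw-1.0")
second = DpuInfo(vendor="VendorA", hardware_version="rev-1", firmware="fw-1.1")

problems = mismatched_fields(first, second)
if problems:
    print("DPUs differ in:", ", ".join(problems))
else:
    print("DPUs are identical")
```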

Installation of VMware vSphere Distributed Services Engine with 2 DPUs

vSphere Distributed Services Engine does not require a separate ESXi license. ESXi 8.0 Update 3 server builds are unified images, which contain both x86 and DPU content, and you cannot install x86 and DPU content separately. The installation procedure, either interactive or scripted, runs on both DPUs in parallel, and you see minimal performance loss as compared to a single-DPU system.

With vSphere 8.0 Update 3, you can get a pre-installed server configuration with 2 DPUs from Lenovo, or add a second DPU to a single-DPU system on the supported dual-DPU servers from Lenovo.
Note: In either case, you need to run a complete, fresh ESXi 8.0 Update 3 installation on the entire system, not only on the newly added DPU.

For more information on the installation, see Install ESXi Interactively and Installation and Upgrade Scripts Used for ESXi Installation.

Error Handling, Failover, and Rollback for VMware vSphere Distributed Services Engine

Before installing VMware vSphere Distributed Services Engine, see the error handling, failover, and rollback options.

Error Handling

An installation failure of either x86 or DPU content on an ESXi host marks the entire installation procedure as failed.

While the expectation is that the software state of DPUs remains identical at all times, in the unlikely case of an error during a lifecycle operation, such as installation or upgrade of a Component, the operation might pass on one DPU but fail on the other. Since each lifecycle operation occurs within the boundaries of each DPU, errors do not affect the state of the other DPU, but the overall result of the installation is still marked as a failure.

During interactive install, in vSphere Lifecycle Manager workflows, and when you use ESXCLI, you receive information about the DPU on which the operation failed.
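
The all-or-nothing rule described above can be summarized in a few lines. The following Python sketch is an illustration, not how ESXi implements it: the overall_result function and the target names "x86", "DPU-1", and "DPU-2" are hypothetical. It shows that each target succeeds or fails independently, that any single failure marks the whole operation as failed, and that the failed target can still be reported.

```python
def overall_result(results: dict[str, bool]) -> tuple[bool, list[str]]:
    """Combine per-target outcomes of one lifecycle operation.

    `results` maps a target name to True (operation passed) or False
    (operation failed). The operation as a whole passes only if every
    target passed. The second element lists the targets that failed.
    """
    failed = [target for target, passed in results.items() if not passed]
    return (not failed, failed)


# The operation passes on x86 and DPU-1 but fails on DPU-2.
ok, failed_targets = overall_result(
    {"x86": True, "DPU-1": True, "DPU-2": False}
)
print("Installation succeeded" if ok else f"Installation failed on: {failed_targets}")
```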

After a successful installation, in case of DPU errors, the recommended action is to restart the affected ESXi host. If the DPU is still accessible from the host, the general log bundle collection is sufficient for troubleshooting. If the DPU is not accessible from the host, logging in to the DPU from a BMC, iLO, or iDRAC interface can provide troubleshooting logs.

Failover

Failover support in vSphere 8.0 Update 3 is limited to one of the DPUs becoming non-functional due to software errors within the DPU or a physical disconnect of one of the DPUs, such as cable disconnect. Failover due to Peripheral Component Interconnect (PCI) level errors is not supported.

Rollback

Rollback is a best-effort mechanism to restore the system to a previous working state in case of a failure before the jumpstart phase of the ESXi boot. Rollback on both x86 servers and the attached supported DPUs is automatic in case of an error during booting. You can also opt for a manual rollback to a previous good state by pressing Shift+R when the boot loader starts.

Any failure after the jumpstart phase starts does not result in a rollback.
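
The timing rule for rollback can be captured as a single decision. This is a minimal sketch, assuming a two-value view of the boot sequence: the BootPhase enum and the rollback_attempted function are hypothetical names used only to illustrate the rule, and the sketch does not model the best-effort nature of the rollback itself.

```python
from enum import Enum


class BootPhase(Enum):
    """Hypothetical split of the ESXi boot at the jumpstart boundary."""
    BEFORE_JUMPSTART = "before jumpstart"
    JUMPSTART_OR_LATER = "jumpstart or later"


def rollback_attempted(failure_phase: BootPhase) -> bool:
    """Return True if a failure in this phase triggers a rollback.

    Only failures before the jumpstart phase lead to a rollback.
    Failures once the jumpstart phase has started do not.
    """
    return failure_phase is BootPhase.BEFORE_JUMPSTART


for phase in BootPhase:
    action = "rollback" if rollback_attempted(phase) else "no rollback"
    print(f"Failure {phase.value}: {action}")
```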

Table 1. Rollback scenarios for VMware vSphere Distributed Services Engine installation
Scenario | Number of reboots required
Both DPUs boot correctly. ESXi does not boot correctly. | 2
Both DPUs do not boot correctly. ESXi boots correctly. | 1
One of the DPUs boots with an earlier version than the other DPU and ESXi. | 2
One of the DPUs boots with an earlier version than the other DPU and ESXi does not boot correctly. | 2