vSphere Distributed Services Engine® is a core vSphere capability that enables customers to use data processing units (DPUs) with vSphere and VMware Cloud Foundation.
vSphere 8.0 enables breakthrough workload performance to meet ever-increasing throughput and latency needs. With vSphere Distributed Services Engine, infrastructure services are distributed across the different compute resources available on the ESXi host, with networking functions offloaded to the DPU. Such a capability works well for modern applications, which are developed using a microservices architecture approach that breaks down the application into multiple independent but cooperating services. This increased complexity places new demands on the CPU. For example, processing storage requests or shuttling network traffic for these microservices leaves fewer CPU cycles for the actual workload. In this context, purpose-built accelerators such as DPUs can take on the new compute burden and help you improve the performance and efficiency of your infrastructure.
With vSphere Distributed Services Engine, DPUs can accelerate the performance of your network and increase data throughput, without adding the operational burden of managing the DPU lifecycle, because the existing Day-0, Day-1, and Day-2 vSphere experience does not change. vSphere Distributed Services Engine is supported by DPUs from NVIDIA and AMD, and by server designs from Dell, HPE, Lenovo, and Fujitsu. vSphere Distributed Services Engine is available on servers with pre-installed DPUs.
Starting with vSphere 8.0, you can offload functionality that runs on the core CPU onto the DPU to significantly improve network and security performance. As illustrated in the Evolving vSphere Architecture diagram, DPUs can also handle additional capabilities such as storage offload and bare metal management, but these additional capabilities are currently not supported.
vSphere Distributed Services Engine offloads and accelerates infrastructure functions on the DPU by introducing a VMware vSphere Distributed Switch on the DPU, together with VMware NSX Networking and Observability, which allows you to proactively monitor, identify, and mitigate network infrastructure bottlenecks without complex network taps. The DPU becomes a new control point for scaling infrastructure functions and enables security controls that are agentless and decoupled from the workload domain.
With vSphere Distributed Services Engine, you can:
- Install and update ESXi images simultaneously on the x86 server and the attached supported DPU to reduce operational overhead of DPU lifecycle management with integrated vSphere workflows. For more information, see Using vSphere Lifecycle Manager With VMware vSphere Distributed Services Engine.
- Set alarms for DPU hardware alerts and monitor performance metrics on core, memory, and network throughput from the familiar vCenter interfaces, without the need for new tools. For more information, see CPU (DPU) and Memory (DPU).
- Accelerate the vSphere Distributed Switch on the DPU to improve network performance and free up CPU cycles to achieve higher workload consolidation per ESXi host. For more information, see What is Network Offloads Capability and Create a vSphere Distributed Switch.
- Get vSphere DRS and vSphere vMotion support for VMs running on hosts with attached DPUs, gaining the benefits of passthrough without sacrificing VM portability. For more information, see Homogenous clusters for DPUs.
- Improve the security of infrastructure with zero-trust security. For more information, see vSphere Distributed Services Engine Security Best Practices.
vSphere Distributed Services Engine does not require a separate ESXi license. An internal network, isolated from other networks, connects the DPUs with the ESXi hosts. ESXi 8.0 server builds are unified images that contain both x86 and DPU content. In your vSphere system, DPUs appear as new objects during installation and upgrade, and in networking, storage, and host profile workflows.
High Availability with VMware vSphere Distributed Services Engine
With ESXi 8.0 Update 3, you can opt for a VMware vSphere Distributed Services Engine installation with 2 data processing units (DPUs) to achieve high availability.
In vSphere systems with a single DPU, the device can become a single point of failure for workloads offloaded to the DPU, such as networking functions, and impact data availability and productivity. With ESXi 8.0 Update 3, vSphere Distributed Services Engine is also available on servers with 2 pre-installed DPUs, which provides hardware redundancy and resiliency.
You can utilize the two DPUs in Active/Standby mode to provide high availability. Such a configuration provides redundancy if one of the DPUs fails. In the high availability configuration, both DPUs are assigned to the same NSX-backed vSphere Distributed Switch. For example, DPU-1 is attached to vmnic0 and vmnic1 of the vSphere Distributed Switch, and DPU-2 is attached to vmnic2 and vmnic3 of the same switch.
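The uplink layout in the example above can be sanity-checked with a short script. This is an illustrative sketch, not a VMware tool: the vmnic-to-DPU mapping is hard-coded to mirror the example, whereas on a live host you would obtain it from your server vendor's inventory or the vSphere Client.

```shell
# Illustrative check that the uplinks of the NSX-backed vSphere
# Distributed Switch span both DPUs, mirroring the example above.
# The mapping below is hypothetical sample data, not live host output.
mapping="vmnic0:DPU-1
vmnic1:DPU-1
vmnic2:DPU-2
vmnic3:DPU-2"

# Count the distinct DPUs that back the switch uplinks.
dpus=$(printf '%s\n' "$mapping" | cut -d: -f2 | sort -u | wc -l | tr -d ' ')

if [ "$dpus" -ge 2 ]; then
  echo "HA possible: uplinks span $dpus DPUs (Active/Standby)"
else
  echo "Single point of failure: all uplinks on one DPU"
fi
```

If all uplinks resolved to a single DPU, the switch would have no standby path and the device would remain a single point of failure.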
You can also utilize the two DPUs as independent devices to increase offload capacity per ESXi host. In such a configuration, each DPU is attached to a separate vSphere Distributed Switch and there is no failover between DPUs.
Dual-DPU systems can use NVIDIA or AMD Pensando devices. In ESXi 8.0 Update 3, dual-DPU systems are supported on Lenovo server designs. The DPU devices on a dual-DPU server must be identical in all aspects: same vendor, same hardware version, and same firmware version. For a list of current vendors and server designs for VMware vSphere Distributed Services Engine, see the VMware Compatibility Guide.
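The identical-devices requirement can be expressed as a simple equality check over each DPU's vendor, hardware version, and firmware version. The sketch below uses hypothetical inventory strings, not real esxcli output; it only illustrates the comparison.

```shell
# Illustrative pre-check that a dual-DPU host meets the "identical
# devices" requirement: same vendor, hardware version, and firmware.
# The inventory strings are hypothetical sample data.
dpu1="NVIDIA BlueField-2 fw-24.35"
dpu2="NVIDIA BlueField-2 fw-24.35"

if [ "$dpu1" = "$dpu2" ]; then
  result="supported: DPUs are identical"
else
  result="unsupported: DPU vendor, hardware, or firmware mismatch"
fi
echo "$result"
```

Any mismatch in any of the three attributes makes the pair unsupported, which is why a single whole-string comparison is sufficient here.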
Installation of VMware vSphere Distributed Services Engine with 2 DPUs
vSphere Distributed Services Engine does not require a separate ESXi license. ESXi 8.0 Update 3 server builds are unified images that contain both x86 and DPU content, and you cannot install x86 and DPU content separately. The installation procedure on both DPUs, whether interactive or scripted, runs in parallel, with minimal performance impact compared to a single-DPU system.
For more information on the installation, see Install ESXi Interactively and Installation and Upgrade Scripts Used for ESXi Installation.
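Because the image is unified, the scripted path uses the standard ESXi kickstart format with no DPU-specific directives; the same script drives both the x86 and DPU installation. A minimal sketch follows, assuming common defaults; the password, disk selection, and network values are placeholders you must adapt.

```
# Minimal ESXi kickstart sketch. The unified 8.0 Update 3 image installs
# x86 and DPU content from this one script; no DPU-specific directives.
vmaccepteula
# Install to the first detected disk, overwriting any existing VMFS.
install --firstdisk --overwritevmfs
# Placeholder root password; replace before use.
rootpw VMware1!
# Placeholder management uplink; replace vmnic0 as appropriate.
network --bootproto=dhcp --device=vmnic0
reboot
```

The script is a configuration fragment, not an executable program; it is consumed by the ESXi installer at boot time as described in Installation and Upgrade Scripts Used for ESXi Installation.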
Error Handling, Failover, and Rollback for VMware vSphere Distributed Services Engine
Before installing VMware vSphere Distributed Services Engine, review the error handling, failover, and rollback options.
Error Handling
An installation failure of either x86 or DPU content on an ESXi host marks the entire installation procedure as failed.
While the expectation is that the software state of the DPUs remains identical at all times, in the unlikely case of an error during a lifecycle operation, such as installation or upgrade of a component, the operation might pass on one DPU but fail on the other. Because each lifecycle operation occurs within the boundaries of each DPU, errors do not affect the state of the other DPU, but the overall result of the installation is still marked as a failure.
During an interactive installation, in vSphere Lifecycle Manager workflows, and when you use ESXCLI, you receive information about the DPU on which the operation failed.
After a successful installation, if DPU errors occur, the recommended action is to restart the affected ESXi host. If the DPU is still accessible from the host, the general log bundle collection is sufficient for troubleshooting. If the DPU is not accessible from the host, logging in to the DPU from a BMC, iLO, or iDRAC interface can provide troubleshooting logs.
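The troubleshooting decision above can be sketched as a small script. The reachability probe is stubbed out here and is not a real VMware command; on an actual host, the log bundle is produced by the standard vm-support tool, and the BMC, iLO, or iDRAC path is a manual step.

```shell
# Illustrative decision logic for collecting DPU troubleshooting logs.
# dpu_reachable is a stub standing in for a real host-to-DPU health probe.
dpu_reachable() {
  return 0  # assume the DPU is reachable for this sketch
}

if dpu_reachable; then
  action="collect general log bundle on the host (vm-support)"
else
  action="log in to the DPU via BMC/iLO/iDRAC and collect logs there"
fi
echo "$action"
```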
Failover
Failover support in vSphere 8.0 Update 3 is limited to cases where one of the DPUs becomes non-functional due to software errors within the DPU or a physical disconnect, such as a cable disconnect. Failover due to Peripheral Component Interconnect (PCI) level errors is not supported.
Rollback
Rollback is a best-effort mechanism to restore the system to a previous working state in case of a failure before the jumpstart phase of the ESXi boot. Rollback on both the x86 server and the attached supported DPUs is automatic in case of an error during booting. You can also opt for a manual rollback by pressing Shift+R before the bootloader starts to return to a previous good state.
Any failure after the jumpstart phase starts does not result in a rollback.
| Scenario | Number of reboots required |
|---|---|
| Both DPUs boot correctly. ESXi does not boot correctly. | 2 |
| Both DPUs do not boot correctly. ESXi boots correctly. | 1 |
| One of the DPUs boots with an earlier version than the other DPU and ESXi. | 2 |
| One of the DPUs boots with an earlier version than the other DPU, and ESXi does not boot correctly. | 2 |