You deploy the components of VMware Private AI Foundation with NVIDIA in a VI workload domain of your VMware Cloud Foundation environment, where certain NVIDIA components must be installed.
Required VMware Software Versions
See VMware Components in VMware Private AI Foundation with NVIDIA.
Supported NVIDIA GPU Devices
Before you start using VMware Private AI Foundation with NVIDIA, make sure that the GPUs on your ESXi hosts are supported by VMware by Broadcom:
NVIDIA Component | Supported Options |
---|---|
NVIDIA GPUs | GPU devices that support the latest NVIDIA AI Enterprise (NVAIE) vGPU profiles. See the NVIDIA Virtual GPU Software Supported GPUs documentation for guidance. |
GPU sharing mode | Time slicing or Multi-Instance GPU (MIG) |
Required NVIDIA Software
- NVIDIA vGPU host driver (including the VIB file for ESXi hosts) that is compatible with your VMware Cloud Foundation version. See Virtual GPU Software for VMware vSphere Release Notes.
- NVIDIA GPU Operator that is compatible with the Kubernetes version of the deployed TKG clusters. See NVIDIA GPU Operator Release Notes and VMware Tanzu Kubernetes releases Release Notes.
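As a sketch of the GPU Operator requirement above, an installation on a TKG cluster might look like the following Helm commands. The namespace name and the use of the public NVIDIA Helm repository are assumptions for illustration; with NVAIE you typically pull the operator from the licensed NGC registry instead, and you should pin `--version` to a release that matches the cluster's Kubernetes version per the NVIDIA GPU Operator Release Notes.

```shell
# Add the public NVIDIA Helm repository and refresh the index.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator into its own namespace on the TKG cluster.
# Namespace name is an assumption; pin --version to a release compatible
# with the cluster's Kubernetes version.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --wait
```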
Required VMware Cloud Foundation Setup
Before you deploy VMware Private AI Foundation with NVIDIA, the following configuration must be in place in VMware Cloud Foundation:
- A VMware Cloud Foundation license.
- A VMware Private AI Foundation with NVIDIA add-on license.
- Licensed NVIDIA vGPU product including the host driver VIB file for ESXi hosts and the guest OS drivers. See the NVIDIA Virtual GPU Software Supported GPUs documentation for guidance.
- The VIB file of the NVIDIA vGPU host driver, downloaded from https://nvid.nvidia.com/.
- A vSphere Lifecycle Manager image with the VIB file of the vGPU host manager driver available in SDDC Manager. See Managing vSphere Lifecycle Manager Images in VMware Cloud Foundation.
- A VI workload domain with at least three GPU-enabled ESXi hosts, based on the vSphere Lifecycle Manager image that contains the host manager driver VIB file. See Deploy a VI Workload Domain Using the SDDC Manager UI and Managing vSphere Lifecycle Manager Images in VMware Cloud Foundation.
- NVIDIA vGPU host driver installed and vGPU configured on each ESXi host in the cluster for AI workloads.
- On each ESXi host, enable SR-IOV in the BIOS and Shared Direct on the graphics devices for AI operations.
For information about configuring SR-IOV, see the documentation from your hardware vendor. For information about configuring Shared Direct on graphics devices, see Configure Virtual Graphics on vSphere.
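On the ESXi side, the Shared Direct setting above can also be applied from the ESXi Shell with `esxcli`, as in the following sketch. SR-IOV itself is enabled in the server BIOS/firmware and cannot be scripted here; the X.Org restart step applies on ESXi versions where the graphics service must be restarted for the change to take effect.

```shell
# List GPU devices and their current graphics mode.
esxcli graphics device list

# Set the host default graphics type to Shared Direct (SharedPassthru),
# which is required for vGPU.
esxcli graphics host set --default-type SharedPassthru

# On some ESXi versions, restart the X.Org service for the change to apply.
/etc/init.d/xorg restart
```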
- Install the NVIDIA vGPU host manager driver on each ESXi host in one of the following ways:
- Install the driver manually on each host, and add the VIB file of the driver to the vSphere Lifecycle Manager image for the cluster.
- Add the VIB file of the driver to the vSphere Lifecycle Manager image for the cluster, and remediate the hosts against that image.
- If you want to use Multi-Instance GPU (MIG) sharing, enable it on each ESXi host in the cluster.
- On the vCenter Server instance for the VI workload domain, set the vgpu.hotmigrate.enabled advanced setting to true so that virtual machines with vGPU can be migrated by using vSphere vMotion.
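Taken together, the host- and vCenter-side steps in the list above might be scripted roughly as follows. The depot path, the GPU index, and the govc connection details are placeholders and assumptions, not values from this guide; run the first two commands in the ESXi Shell of each GPU host and the last one against vCenter Server with the govc CLI.

```shell
# 1) Install the vGPU host manager driver VIB from the NVAIE component
#    depot (depot path below is a placeholder).
esxcli software vib install -d /vmfs/volumes/datastore1/NVD-AIE-depot.zip

# 2) Optionally enable MIG mode on a GPU (index 0 here); reboot the host
#    or reset the GPU for the mode change to take effect.
nvidia-smi -i 0 -mig 1

# 3) Allow vMotion of vGPU VMs by setting the vCenter advanced setting,
#    here via govc (GOVC_URL and credentials assumed to be exported).
govc option.set vgpu.hotmigrate.enabled true
```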