You deploy the components of VMware Private AI Foundation with NVIDIA in a VI workload domain of your VMware Cloud Foundation environment, where certain NVIDIA components must be installed.
Required VMware Software Versions
See VMware Components in VMware Private AI Foundation with NVIDIA.
Supported NVIDIA GPU Devices
Before you start using VMware Private AI Foundation with NVIDIA, make sure that the GPUs on your ESXi hosts are supported by VMware by Broadcom:
NVIDIA Component | Supported Options |
---|---|
NVIDIA GPUs | GPU devices that support the latest NVIDIA AI Enterprise (NVAIE) vGPU profiles. See the NVIDIA Virtual GPU Software Supported GPUs documentation for guidance. |
GPU sharing mode | Time slicing or Multi-Instance GPU (MIG) |
Required NVIDIA Software
- NVIDIA vGPU host driver (including the VIB file for ESXi hosts) that is compatible with your VMware Cloud Foundation version. See Virtual GPU Software for VMware vSphere Release Notes.
- NVIDIA GPU Operator that is compatible with the Kubernetes version of the deployed TKG clusters. See NVIDIA GPU Operator Release Notes and VMware Tanzu Kubernetes releases Release Notes.
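As a sketch of the GPU Operator requirement above, an installation on a TKG cluster might look like the following Helm commands. The namespace name and the use of the public NVIDIA Helm repository are assumptions for illustration; with NVAIE you typically pull the operator from the licensed NGC registry instead, and you should pin `--version` to a release that matches the cluster's Kubernetes version per the NVIDIA GPU Operator Release Notes.

```shell
# Add the public NVIDIA Helm repository and refresh the index.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator into its own namespace on the TKG cluster.
# Namespace name is an assumption; pin --version to a release compatible
# with the cluster's Kubernetes version.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --wait
```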
Required VMware Cloud Foundation Setup
Before you deploy VMware Private AI Foundation with NVIDIA, the following configuration must be in place in VMware Cloud Foundation:
- A VMware Cloud Foundation license.
- A VMware Private AI Foundation with NVIDIA add-on license.
- Licensed NVIDIA vGPU product including the host driver VIB file for ESXi hosts and the guest OS drivers. See the NVIDIA Virtual GPU Software Supported GPUs documentation for guidance.
- The VIB file of the NVIDIA vGPU host driver, downloaded from https://nvid.nvidia.com/.
- A vSphere Lifecycle Manager image with the VIB file of the vGPU host manager driver available in SDDC Manager. See Managing vSphere Lifecycle Manager Images in VMware Cloud Foundation.
- A VI workload domain with at least three GPU-enabled ESXi hosts, based on the vSphere Lifecycle Manager image that contains the host manager driver VIB file. See Deploy a VI Workload Domain Using the SDDC Manager UI and Managing vSphere Lifecycle Manager Images in VMware Cloud Foundation.
- NVIDIA vGPU host driver installed and vGPU configured on each ESXi host in the cluster for AI workloads.
- On each ESXi host, enable SR-IOV in the BIOS and Shared Direct on the graphics devices for AI operations.
For information about configuring SR-IOV, see the documentation from your hardware vendor. For information about configuring Shared Direct on graphics devices, see Configure Virtual Graphics on vSphere.
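On the ESXi side, the Shared Direct setting above can also be applied from the ESXi Shell with `esxcli`, as in the following sketch. SR-IOV itself is enabled in the server BIOS/firmware and cannot be scripted here; the X.Org restart step applies on ESXi versions where the graphics service must be restarted for the change to take effect.

```shell
# List GPU devices and their current graphics mode.
esxcli graphics device list

# Set the host default graphics type to Shared Direct (SharedPassthru),
# which is required for vGPU.
esxcli graphics host set --default-type SharedPassthru

# On some ESXi versions, restart the X.Org service for the change to apply.
/etc/init.d/xorg restart
```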
- Install the NVIDIA vGPU host manager driver on each ESXi host in one of the following ways:
- Install the driver manually on each host, and add the VIB file of the driver to the vSphere Lifecycle Manager image for the cluster.
- Add the VIB file of the driver to the vSphere Lifecycle Manager image for the cluster, and remediate the hosts against that image.
- If you want to use Multi-Instance GPU (MIG) sharing, enable it on each ESXi host in the cluster.
- On the vCenter Server instance for the VI workload domain, set the vgpu.hotmigrate.enabled advanced setting to true so that virtual machines with vGPU can be migrated by using vSphere vMotion.
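Taken together, the host- and vCenter-side steps in the list above might be scripted roughly as follows. The depot path, the GPU index, and the govc connection details are placeholders and assumptions, not values from this guide; run the first two commands in the ESXi Shell of each GPU host and the last one against vCenter Server with the govc CLI.

```shell
# 1) Install the vGPU host manager driver VIB from the NVAIE component
#    depot (depot path below is a placeholder).
esxcli software vib install -d /vmfs/volumes/datastore1/NVD-AIE-depot.zip

# 2) Optionally enable MIG mode on a GPU (index 0 here); reboot the host
#    or reset the GPU for the mode change to take effect.
nvidia-smi -i 0 -mig 1

# 3) Allow vMotion of vGPU VMs by setting the vCenter advanced setting,
#    here via govc (GOVC_URL and credentials assumed to be exported).
govc option.set vgpu.hotmigrate.enabled true
```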