Requirements for Deploying VMware Private AI Foundation with NVIDIA

You deploy the components of VMware Private AI Foundation with NVIDIA in a VI workload domain of your VMware Cloud Foundation environment. Certain NVIDIA components must be installed in that domain.

Required VMware Software Versions

See VMware Components in VMware Private AI Foundation with NVIDIA.

Supported NVIDIA GPU Devices

Before you start using VMware Private AI Foundation with NVIDIA, make sure that the GPUs on your ESXi hosts are supported by VMware by Broadcom:

Table 1. Supported NVIDIA Components for VMware Private AI Foundation with NVIDIA
NVIDIA Component	Supported Options
NVIDIA GPUs	NVIDIA A100, NVIDIA L40S, NVIDIA H100
GPU sharing mode	Time slicing, Multi-Instance GPU (MIG)
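
As a pre-flight aid, the contents of Table 1 can be encoded as a small lookup. The following is a minimal sketch, not part of the product: the function name `unsupported_reasons` and the lowercase mode identifiers are hypothetical, and the values only mirror Table 1. The two rows are checked independently, so the sketch does not verify whether a particular GPU model supports MIG; consult the NVIDIA documentation for that.

```python
# Values copied from Table 1. The identifiers and the function are illustrative.
SUPPORTED_GPUS = {"NVIDIA A100", "NVIDIA L40S", "NVIDIA H100"}
SUPPORTED_SHARING_MODES = {"time slicing", "mig"}


def unsupported_reasons(gpu_model: str, sharing_mode: str) -> list[str]:
    """Return the reasons a GPU/sharing-mode pair falls outside Table 1.

    An empty list means both values appear in the table. The two rows are
    checked independently; per-model MIG capability is NOT verified here.
    """
    reasons = []
    # Normalize whitespace and case so "nvidia  h100" matches "NVIDIA H100".
    model = " ".join(gpu_model.split()).upper()
    if model not in SUPPORTED_GPUS:
        reasons.append(f"GPU model not listed in Table 1: {gpu_model!r}")
    mode = " ".join(sharing_mode.split()).lower()
    if mode not in SUPPORTED_SHARING_MODES:
        reasons.append(f"Sharing mode not listed in Table 1: {sharing_mode!r}")
    return reasons
```

For example, `unsupported_reasons("NVIDIA H100", "MIG")` returns an empty list, while a model that is absent from the table yields one reason string.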

Required NVIDIA Software

The GPU device must support the latest NVIDIA AI Enterprise (NVAIE) vGPU profiles. See the NVIDIA Virtual GPU Software Supported GPUs documentation for guidance.

NVIDIA vGPU host driver, including the VIB for ESXi hosts, that is compatible with your VMware Cloud Foundation version. See Virtual GPU Software for VMware vSphere Release Notes.
NVIDIA GPU Operator that is compatible with the Kubernetes version of the deployed TKG clusters. See NVIDIA GPU Operator Release Notes and VMware Tanzu Kubernetes releases Release Notes.

Required VMware Cloud Foundation Setup

Before you deploy VMware Private AI Foundation with NVIDIA, make sure that the following configuration is in place in VMware Cloud Foundation.

VMware Cloud Foundation on vSAN ReadyNodes™.
A VMware Cloud Foundation license.
A VMware Private AI Foundation with NVIDIA add-on license.
You need the VMware Private AI Foundation with NVIDIA add-on license to access the following functionality:
- Private AI setup in VMware Aria Automation for catalog items for easy provisioning of GPU-accelerated deep learning virtual machines and TKG clusters.
- Provisioning of PostgreSQL databases with the pgvector extension with enterprise support.
- Deploying and using the deep learning virtual machine image delivered by VMware by Broadcom.
- Guided deployment workflow in the vSphere Client.
Under the VMware Cloud Foundation license, you can deploy AI workloads, with or without a Supervisor enabled, and use the GPU metrics in vCenter Server and VMware Aria Operations.
You add your VMware Private AI Foundation with NVIDIA license as a solution license to the license management system in the management vCenter Server. You can add the license in one of the following ways:
- When using the guided deployment workflow in the vSphere Client for the first time.
- By using the license management UI in the vSphere Client. See Managing vSphere Licenses.
A licensed NVIDIA vGPU product, including the host driver VIB file for ESXi hosts and the guest OS drivers. See the NVIDIA Virtual GPU Software Supported GPUs documentation for guidance.
The VIB file of the NVIDIA vGPU host driver, downloaded from https://nvid.nvidia.com/.
A vSphere Lifecycle Manager image with the VIB file of the vGPU host manager driver available in SDDC Manager. See Managing vSphere Lifecycle Manager Images in VMware Cloud Foundation.
At least 3 GPU-enabled ESXi hosts to include in the default cluster of a VI workload domain.
NVIDIA vGPU host driver installed and vGPU configured on each ESXi host in the cluster for AI workloads.
1. On each ESXi host, enable SR-IOV in the BIOS and Shared Direct on the graphics devices for AI operations.
  For information about configuring SR-IOV, see the documentation from your hardware vendor. For information about configuring Shared Direct on graphics devices, see Configure Virtual Graphics on vSphere.
2. Install the NVIDIA vGPU host driver on each ESXi host in one of the following ways:
  - Install the driver on each host and add the VIB file of the driver to the vSphere Lifecycle Manager image for the cluster.
    See NVIDIA Virtual GPU Software Quick Start Guide.
  - Add the VIB file of the driver to the vSphere Lifecycle Manager image for the cluster and remediate the hosts.
3. If you want to use Multi-Instance GPU (MIG) sharing, enable it on each ESXi host in the cluster.
  See NVIDIA MIG User Guide.
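
The host-level requirements above (at least 3 GPU-enabled hosts, SR-IOV and Shared Direct enabled, the vGPU host driver installed, and optionally MIG) can be tracked as a simple checklist. The following is a minimal sketch under stated assumptions: `HostStatus` and `check_cluster` are hypothetical names, and the record is filled in by hand or from your own inventory tooling. It does not query vCenter Server or the ESXi hosts.

```python
from dataclasses import dataclass

MIN_GPU_HOSTS = 3  # from the requirement "At least 3 GPU-enabled ESXi hosts"


@dataclass
class HostStatus:
    """Hypothetical per-host record; populate it from your own inventory."""
    name: str
    has_gpu: bool
    sriov_enabled: bool          # step 1: SR-IOV enabled in the BIOS
    shared_direct: bool          # step 1: Shared Direct on the graphics devices
    vgpu_driver_installed: bool  # step 2: NVIDIA vGPU host driver VIB installed
    mig_enabled: bool = False    # step 3: only relevant when MIG sharing is used


def check_cluster(hosts: list[HostStatus], want_mig: bool = False) -> list[str]:
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    gpu_hosts = [h for h in hosts if h.has_gpu]
    if len(gpu_hosts) < MIN_GPU_HOSTS:
        problems.append(
            f"{len(gpu_hosts)} GPU-enabled host(s) found; "
            f"at least {MIN_GPU_HOSTS} required"
        )
    for h in gpu_hosts:
        if not h.sriov_enabled:
            problems.append(f"{h.name}: SR-IOV is not enabled")
        if not h.shared_direct:
            problems.append(f"{h.name}: Shared Direct is not enabled")
        if not h.vgpu_driver_installed:
            problems.append(f"{h.name}: vGPU host driver is not installed")
        if want_mig and not h.mig_enabled:
            problems.append(f"{h.name}: MIG is not enabled")
    return problems
```

Passing `want_mig=True` adds the MIG check from step 3; leave it at the default when the cluster uses time slicing only.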