To enable DevOps engineers and data scientists to deploy deep learning virtual machines or TKG clusters with AI container workloads, you must deploy a Supervisor on a GPU-enabled cluster in a VI workload domain and create vGPU-enabled VM classes.

Prerequisites

See Requirements for Deploying VMware Private AI Foundation with NVIDIA.

Procedure

  1. Deploy an NSX Edge Cluster in the VI workload domain by using SDDC Manager.
    SDDC Manager also deploys a Tier-0 gateway, which you later specify during Supervisor deployment. The Tier-0 gateway is in active-active high availability mode.
  2. Configure a storage policy for the Supervisor.
  3. Deploy a Supervisor on a cluster of GPU-enabled ESXi hosts in the VI workload domain.
    Use static IP address assignment for the management network, and assign the Supervisor VM management network to the vSphere Distributed Switch for the cluster.

    Configure the workload network in the following way:

    • Use the vSphere Distributed Switch for the cluster or create one specifically for AI workloads.
    • Configure the Supervisor with the NSX Edge cluster and Tier-0 gateway that you deployed by using SDDC Manager.
    • Set the rest of the values according to your design.

    Use the storage policy you created.

    For more information on deploying a Supervisor on a single cluster, see Deploy a One-Zone Supervisor with NSX Networking.
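
    To confirm that the Supervisor control plane is reachable after deployment, you can log in with the vSphere Plugin for kubectl. The following is a minimal sketch; the server address is a placeholder for the Supervisor control plane IP address in your environment.

      # Log in to the Supervisor control plane (vSphere Plugin for kubectl)
      kubectl vsphere login --server=<supervisor-cp-ip> --vsphere-username administrator@vsphere.local

      # Switch to the Supervisor context and verify the control plane nodes
      kubectl config use-context <supervisor-cp-ip>
      kubectl get nodes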

  4. Configure vGPU-based VM classes for AI workloads.
    In these VM classes, you set the compute requirements and a vGPU profile for an NVIDIA GRID vGPU device according to the vGPU devices configured on the ESXi hosts in the Supervisor cluster.
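
    To check which vGPU profiles the hosts actually expose before you create the classes, you can query the NVIDIA host driver on an ESXi host. The following is a minimal sketch, assuming SSH access to the host and that the NVIDIA vGPU host manager is installed; see the NVIDIA vGPU documentation for the exact options in your driver release.

      # List the vGPU types that the GPUs in this host support
      nvidia-smi vgpu -s

      # List the vGPU types that can currently be created on this host
      nvidia-smi vgpu -c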

    For the VM class for deploying deep learning VMs with NVIDIA RAG workloads, set the following additional settings in the VM class dialog box (a YAML sketch of an equivalent class follows this list):

    • Select the full-sized vGPU profile for time-slicing mode or a MIG profile. For example, for an NVIDIA A100 40GB card in vGPU time-slicing mode, select nvidia_a100-40c.
    • On the Virtual Hardware tab, allocate more than 16 virtual CPU cores and 64 GB of virtual memory.
    • On the Advanced Parameters tab, set the pciPassthru<vgpu-id>.cfg.enable_uvm parameter to 1.

      where <vgpu-id> identifies the vGPU device assigned to the virtual machine. For example, if two vGPU devices are assigned to the virtual machine, you set pciPassthru0.cfg.enable_uvm = 1 and pciPassthru1.cfg.enable_uvm = 1.
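
    As an alternative to the vSphere Client dialog box, such a VM class can also be expressed as a VM Operator object and applied with kubectl. The following is a minimal sketch only, assuming the vmoperator.vmware.com/v1alpha1 API and the configSpec serialization used by VM Service; the class name is hypothetical, and the exact schema may differ in your vSphere release.

      apiVersion: vmoperator.vmware.com/v1alpha1
      kind: VirtualMachineClass
      metadata:
        name: vm-class-a100-rag            # hypothetical class name
      spec:
        configSpec:
          _typeName: VirtualMachineConfigSpec
          numCPUs: 24                      # more than 16 vCPUs for RAG workloads
          memoryMB: 65536                  # 64 GB of virtual memory
          deviceChange:
            - _typeName: VirtualDeviceConfigSpec
              operation: add
              device:
                _typeName: VirtualPCIPassthrough
                key: -32
                backing:
                  _typeName: VirtualPCIPassthroughVmiopBackingInfo
                  vgpu: nvidia_a100-40c    # full-sized time-slicing profile
          extraConfig:
            - _typeName: OptionValue
              key: pciPassthru0.cfg.enable_uvm   # one entry per assigned vGPU
              value:
                _typeName: string
                _value: "1"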

  5. If you plan to use the kubectl command-line tool to deploy a deep learning VM or a GPU-accelerated TKG cluster on a Supervisor, create and configure a vSphere namespace: set resource limits and a storage policy, grant permissions to DevOps engineers, and associate the vGPU-based VM classes with the namespace.
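
    After the namespace is configured, DevOps engineers can verify their access and the resources associated with the namespace. The following is a minimal sketch; the user, namespace, image, and storage class names are placeholders for your environment, and the exact resource names can vary by vSphere version.

      # Log in to the Supervisor and switch to the vSphere namespace
      kubectl vsphere login --server=<supervisor-cp-ip> --vsphere-username devops-user@vsphere.local
      kubectl config use-context <namespace>

      # Verify the VM classes and VM images associated with the namespace
      kubectl get virtualmachineclassbindings -n <namespace>
      kubectl get virtualmachineimages -n <namespace>

    A deep learning VM can then be requested as a VirtualMachine object, sketched below under the same assumptions:

      apiVersion: vmoperator.vmware.com/v1alpha1
      kind: VirtualMachine
      metadata:
        name: dl-vm-example                  # hypothetical VM name
        namespace: <namespace>
      spec:
        className: vm-class-a100-rag         # vGPU-based class from the previous step
        imageName: <deep-learning-vm-image>
        storageClass: <storage-class-from-policy>
        powerState: poweredOn

    A deep learning VM typically also requires cloud-init or OVF metadata, which is omitted from this sketch.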