To enable DevOps engineers and data scientists to deploy deep learning virtual machines or Tanzu Kubernetes Grid (TKG) clusters with AI container workloads, you must deploy a Supervisor on a GPU-enabled cluster in a VI workload domain and create vGPU-enabled VM classes.
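For orientation, a vGPU-enabled VM class ultimately surfaces on the Supervisor as a `VirtualMachineClass` resource of the VM Service. The following is a minimal sketch only; the class name, CPU and memory sizing, and the NVIDIA vGPU profile name are illustrative placeholders, not values prescribed by this guide:

```yaml
# Hypothetical vGPU-enabled VM class as it appears to the VM Service
# (vm-operator v1alpha1 API). All names and sizes below are examples.
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: vm-class-a100            # example class name
spec:
  hardware:
    cpus: 8                      # example vCPU count
    memory: 64Gi                 # example memory size
    devices:
      vgpuDevices:
        - profileName: grid_a100-8c   # example NVIDIA vGPU profile
```

In practice you create vGPU-enabled VM classes in the vSphere Client and assign them to a vSphere Namespace; the sketch above only shows how such a class is represented on the Supervisor side.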

Note: This documentation is based on VMware Cloud Foundation 5.2.1. For information on the VMware Private AI Foundation with NVIDIA functionality in VMware Cloud Foundation 5.2, see VMware Private AI Foundation with NVIDIA Guide for VMware Cloud Foundation 5.2.

Prerequisites

Procedure

  1. For a VMware Cloud Foundation 5.2.1 instance, log in to the vCenter Server instance for the management domain at https://<vcenter_server_fqdn>/ui as administrator@vsphere.local.
  2. In the vSphere Client side panel, click Private AI Foundation.
  3. In the Private AI Foundation workflow, click the Set Up a Workload Domain section.
  4. Deploy an NSX Edge cluster in the VI workload domain.
    See Deploy an NSX Edge Cluster. The wizard in the guided deployment workflow has the same options as the analogous wizard in the SDDC Manager UI.
    SDDC Manager also deploys a Tier-0 gateway, which you later specify during Supervisor deployment. The Tier-0 gateway is in active-active high availability mode.
  5. In the Private AI Foundation workflow, click the Set Up Workload Management section.
  6. Configure a storage policy for the Supervisor.
    See Create Storage Policies for vSphere with Tanzu. The wizard for creating a VM storage policy in the guided deployment workflow is the same as the analogous wizard in the Policies and Profiles area of the vSphere Client.
  7. Enable workload management by deploying a Supervisor on the default cluster of GPU-enabled ESXi hosts in the VI workload domain.
    Use static IP address assignment for the management network, and assign the Supervisor VM management network to the vSphere Distributed Switch for the cluster.

    Configure the workload network in the following way:

    • Use the vSphere Distributed Switch for the cluster or create one specifically for AI workloads.
    • Configure the Supervisor with the NSX Edge cluster and Tier-0 gateway that you deployed by using SDDC Manager.
    • Set the rest of the values according to your design.

    Use the storage policy you created.

    For more information on deploying a Supervisor on a single cluster, see Enable Workload Management and Deploy a One-Zone Supervisor with NSX Networking. The wizard in the guided deployment workflow is the same as the analogous wizard in the Workload Management area of the vSphere Client.
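Once the Supervisor is running, DevOps engineers can consume the vGPU-enabled VM classes and the VM storage policy through the VM Service in a vSphere Namespace. As a rough illustration, assuming a hypothetical namespace, class, image, and storage class (none of these names come from this guide), a deep learning VM request might look like:

```yaml
# Hypothetical VM Service request (vm-operator v1alpha1 API).
# All names below are placeholders for values defined in your environment.
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachine
metadata:
  name: dl-vm-01                 # example VM name
  namespace: ai-workloads        # example vSphere Namespace
spec:
  className: vm-class-a100       # a vGPU-enabled VM class assigned to the namespace
  imageName: <deep-learning-vm-image>   # a VM image available in the namespace
  storageClass: ai-storage-policy       # storage class backed by the VM storage policy from step 6
  powerState: poweredOn
```

The `storageClass` corresponds to the VM storage policy you configured for the Supervisor, and `className` must be one of the VM classes assigned to the namespace.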