vSphere Administrator Workflow for Deploying AI/ML Workloads on TKGS Clusters (vGPU)

The high-level workflow for vSphere Administrators to enable the deployment of AI/ML workloads on TKGS clusters is listed in the table. Detailed instructions for each step follow.

Step	Action	Link
0	Review system requirements.	See Admin Step 0: Review System Requirements.
1	Install Supported NVIDIA GPU Device on ESXi Hosts.	See Admin Step 1: Install Supported NVIDIA GPU Device on ESXi Hosts.
2	Configure ESXi device graphics settings for vGPU operations.	See Admin Step 2: Configure Each ESXi Host for vGPU Operations.
3	Install the NVIDIA vGPU Manager (VIB) onto each ESXi host.	See Admin Step 3: Install the NVIDIA Host Manager Driver on Each ESXi Host.
4	Verify NVIDIA driver operation and GPU virtualization mode.	See Admin Step 4: Verify ESXi Hosts Are Ready for NVIDIA vGPU Operations.
5	Enable Workload Management on the GPU-configured cluster. The result is a Supervisor Cluster that is running on vGPU-enabled ESXi hosts.	See Admin Step 5: Enable Workload Management on the vGPU-configured vCenter Cluster.
6	Create* or update a Content Library for Tanzu Kubernetes releases and populate the library with the supported Ubuntu OVA that is required for vGPU workloads.	See Admin Step 6: Create or Update a Content Library with the Tanzu Kubernetes Ubuntu Release. Note: *If necessary. If you already have a content library for TKGS cluster Photon images, do not create a new content library for Ubuntu images.
7	Create a custom VM Class with a certain vGPU profile selected.	See Admin Step 7: Create a Custom VM Class with the vGPU Profile
8	Create and configure a vSphere Namespace for TKGS GPU clusters: add a user with Edit permissions and storage for persistent volumes.	See Admin Step 8: Create and Configure a vSphere Namespace for the TKGS GPU Cluster
9	Associate the Content Library with the Ubuntu OVA and the custom VM Class for vGPU with the vSphere Namespace you created for TKGS.	See Admin Step 9: Associate the Content Library and VM Class with the vSphere Namespace
10	Verify that the Supervisor Cluster is provisioned and accessible for the Cluster Operator.	See Admin Step 10: Verify that the Supervisor Cluster Is Accessible

Admin Step 0: Review System Requirements

Refer to the following system requirements to set up the environment for deploying AI/ML workloads on TKGS clusters.

Requirement	Description
vSphere infrastructure	vSphere 7 Update 3 Monthly Patch 1. ESXi build `18778458` or later. vCenter Server build `18644231` or later.
Workload Management	vSphere Namespace version `0.0.11-18610518` or later
Supervisor Cluster	Supervisor Cluster version `v1.21.0+vmware.1-vsc0.0.11-18610518` or later
TKR Ubuntu OVA	Tanzu Kubernetes release Ubuntu `ob-18691651-tkgs-ova-ubuntu-2004-v1.20.8---vmware.1-tkg.2`
NVIDIA vGPU Host Driver	Download the VIB from the NGC web site. For more information, see the vGPU Software Driver documentation. For example: `NVIDIA-AIE_ESXi_7.0.2_Driver_470.51-1OEM.702.0.0.17630552.vib`
NVIDIA License Server for vGPU	FQDN provided by your organization
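
The ESXi build requirement in the table can be checked from the ESXi shell. The following is a minimal sketch, assuming that `vmware -v` prints a version string containing `build-<number>`. The helper names `extract_build` and `build_at_least` are hypothetical and are not part of any VMware tool.

```shell
#!/bin/sh
# Minimum ESXi build, taken from the requirements table.
MIN_ESXI_BUILD=18778458

# Print the numeric build found in a version string such as the one
# printed by `vmware -v` (assumed to contain "build-<digits>").
extract_build() {
    printf '%s\n' "$1" | sed -n 's/.*build-\([0-9][0-9]*\).*/\1/p'
}

# Succeed when build $1 is non-empty and at least the minimum build $2.
build_at_least() {
    [ -n "$1" ] && [ "$1" -ge "$2" ]
}

# Example usage on an ESXi host (uncomment to run):
# build=$(extract_build "$(vmware -v)")
# if build_at_least "$build" "$MIN_ESXI_BUILD"; then
#     echo "ESXi build $build meets the requirement"
# else
#     echo "ESXi build $build is older than $MIN_ESXI_BUILD"
# fi
```

The same comparison applies to the vCenter Server build by substituting `18644231` as the minimum.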

Admin Step 1: Install Supported NVIDIA GPU Device on ESXi Hosts

To deploy AI/ML workloads on TKGS, you install one or more supported NVIDIA GPU devices on each ESXi host comprising the vCenter Cluster where Workload Management will be enabled.

To view compatible NVIDIA GPU devices, refer to the VMware Compatibility Guide.

The VMware Compatibility Guide lists the compatible NVIDIA GPU devices. Click a GPU device model to view more details and to subscribe to RSS feeds.

The NVIDIA GPU device should support the latest NVIDIA AI Enterprise (NVAIE) vGPU profiles. Refer to the NVIDIA Virtual GPU Software Supported GPUs documentation for guidance.

For example, the following ESXi host has two NVIDIA GPU A100 devices installed on it.

The Graphics Devices tab in the vSphere Client lists the NVIDIA GPU A100 devices.

Admin Step 2: Configure Each ESXi Host for vGPU Operations

Configure each ESXi host for vGPU by enabling Shared Direct and SR-IOV.

Enable Shared Direct on Each ESXi Host

To unlock NVIDIA vGPU functionality, enable Shared Direct mode on each ESXi host comprising the vCenter Cluster where Workload Management will be enabled.

To enable Shared Direct, complete the following steps. For additional guidance, see Configuring Graphics Devices in the vSphere documentation.

Log on to the vCenter Server using the vSphere Client.
Select an ESXi host in the vCenter Cluster.
Select Configure > Hardware > Graphics.
Select the NVIDIA GPU accelerator device.
Edit the Graphics Device settings.
Select Shared Direct.
Select Restart X.Org server.
Click OK to save the configuration.
Right-click the ESXi host and put it into maintenance mode.
Reboot the host.
When the host is running again, take it out of maintenance mode.
Repeat this process for each ESXi host in the vCenter cluster where Workload Management will be enabled.
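
The graphics setting in the steps above can also be applied from the ESXi shell. The following is a sketch only, assuming that the `SharedPassthru` graphics type in `esxcli graphics host` corresponds to the Shared Direct option in the vSphere Client. The `run` and `enable_shared_direct` helpers are hypothetical names. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN is 1 (the default, for safety).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Set the host default graphics type, restart X.Org, and show the result.
enable_shared_direct() {
    # Assumption: SharedPassthru is the CLI name for Shared Direct.
    run esxcli graphics host set --default-type SharedPassthru
    run /etc/init.d/xorg restart
    run esxcli graphics host get
}

# Example usage on an ESXi host:
#   enable_shared_direct              # print the commands only
#   DRY_RUN=0 enable_shared_direct    # apply the setting
```

This sketch does not replace the maintenance mode and reboot steps listed above.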

The Edit Graphics Device Settings page with the Shared Direct and Restart X.Org server options selected.

The Graphics Devices tab in the vSphere Client lists the NVIDIA GPU A100 devices with Shared Direct mode enabled.

Turn On SR-IOV BIOS for NVIDIA GPU A30 and A100 Devices

If you are using NVIDIA A30 or A100 GPU devices, which are required for Multi-Instance GPU (MIG) mode, you must enable SR-IOV on the ESXi host. If SR-IOV is not enabled, Tanzu Kubernetes cluster node VMs cannot start. If this occurs, you see the following error message in the Recent Tasks pane of the vCenter Server where Workload Management is enabled.

Could not initialize plugin libnvidia-vgx.so for vGPU nvidia_aXXX-xx. Failed to start the virtual machine. Module DevicePowerOn power on failed.

To enable SR-IOV, log in to the ESXi host using the web console. Select Manage > Hardware. Select the NVIDIA GPU device and click Configure SR-IOV. From here you can turn on SR-IOV. For additional guidance, see Single Root I/O Virtualization (SR-IOV) in the vSphere documentation.

Note: If you are using vGPU with NIC Passthrough, refer to the following topic for an additional ESXi configuration step: vSphere Administrator Addendum for Deploying AI/ML Workloads on TKGS Clusters (vGPU and Dynamic DirectPath IO).

Admin Step 3: Install the NVIDIA Host Manager Driver on Each ESXi Host

To run Tanzu Kubernetes cluster node VMs with NVIDIA vGPU graphics acceleration, you install the NVIDIA host manager driver on each ESXi host comprising the vCenter Cluster where Workload Management will be enabled.

The NVIDIA vGPU host manager driver components are packaged in a vSphere installation bundle (VIB). The NVAIE VIB is provided to you by your organization through its NVIDIA GRID licensing program. VMware does not provide NVAIE VIBs or make them available for download. As part of the NVIDIA licensing program your organization sets up a licensing server. Refer to the NVIDIA Virtual GPU Software Quick Start Guide for more information.

Once the NVIDIA environment is set up, run the following commands on each ESXi host, replacing the server address and the NVAIE VIB version with the appropriate values for your environment. For additional guidance, see Installing and configuring the NVIDIA VIB on ESXi at the VMware Support Knowledge Base.

Note: The NVAIE VIB version installed on ESXi hosts must match the vGPU software version installed on the node VMs. The version below is only an example.

esxcli system maintenanceMode set --enable true
esxcli software vib install -v ftp://server.domain.example.com/nvidia/signed/NVIDIA_bootbank_NVIDIA-VMware_ESXi_7.0_Host_Driver_460.73.02-1OEM.700.0.0.15525992.vib
esxcli system maintenanceMode set --enable false
/etc/init.d/xorg restart
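
Because the host VIB version must match the vGPU software version on the node VMs, it can help to read the driver version out of the VIB file name before installing. The following is a minimal sketch, assuming the file name follows the `..._Driver_<version>-...` pattern seen in the two example VIB names in this topic. The helper name `vib_driver_version` is hypothetical.

```shell
#!/bin/sh
# Print the driver version embedded in an NVIDIA VIB file name or URL.
# Assumes the name contains "Driver_<version>-", as in the examples here.
vib_driver_version() {
    basename "$1" | sed -n 's/.*Driver_\([0-9][0-9.]*\)-.*/\1/p'
}

# Example usage:
# vib_driver_version NVIDIA-AIE_ESXi_7.0.2_Driver_470.51-1OEM.702.0.0.17630552.vib
# prints 470.51
```

Compare the printed value against the vGPU software release planned for the node VMs. The NVIDIA release notes remain the authority on which host and guest versions belong together.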

Admin Step 4: Verify ESXi Hosts Are Ready for NVIDIA vGPU Operations

To verify that each ESXi host is ready for NVIDIA vGPU operations, perform the following checks on each ESXi host in the vCenter Cluster where Workload Management will be enabled:

SSH into the ESXi host, enter shell mode and run the command nvidia-smi. The NVIDIA System Management Interface is a command line utility provided by the NVIDIA vGPU host manager. Running this command returns the GPUs and drivers on the host.
Run the following command to verify that the NVIDIA driver is properly installed: esxcli software vib list | grep NVIDIA.
Verify that the host is configured for Shared Direct graphics and that SR-IOV is turned on (if you are using NVIDIA A30 or A100 devices).
Using the vSphere Client, on the ESXi host that is configured for GPU, create a new virtual machine with a PCI device included. The NVIDIA vGPU profile should appear and be selectable.
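
The command-line checks above can be collected into one script that reports a pass or fail line per check. This is a sketch under stated assumptions: it assumes `esxcli graphics host get` reports `SharedPassthru` when Shared Direct is enabled, and the helper names (`check`, `nvidia_vib_installed`, `shared_direct_enabled`, `verify_vgpu_host`) are hypothetical. It does not cover the SR-IOV setting or the test virtual machine, which still need to be verified in the vSphere Client.

```shell
#!/bin/sh
# Report PASS or FAIL for one check. $1 is a label, the rest is the command.
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# Succeeds when an NVIDIA VIB appears in the installed VIB list.
nvidia_vib_installed() {
    esxcli software vib list | grep -qi nvidia
}

# Succeeds when the host graphics settings mention SharedPassthru (assumption).
shared_direct_enabled() {
    esxcli graphics host get | grep -q SharedPassthru
}

# Run all checks and return non-zero if any of them failed.
verify_vgpu_host() {
    rc=0
    check "nvidia-smi runs" nvidia-smi || rc=1
    check "NVIDIA VIB is installed" nvidia_vib_installed || rc=1
    check "Shared Direct is the default graphics type" shared_direct_enabled || rc=1
    return $rc
}

# Example usage on an ESXi host (uncomment to run):
# verify_vgpu_host
```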

Admin Step 5: Enable Workload Management on the vGPU-configured vCenter Cluster

Now that the ESXi hosts are configured to support NVIDIA vGPU, create a vCenter Cluster comprising these hosts. To support Workload Management, the vCenter Cluster must meet specific requirements, including shared storage, high availability, and fully automated DRS.

Enabling Workload Management also requires the selection of a networking stack, either native vSphere vDS networking or NSX-T Data Center networking. If you use vDS networking, you need to install a load balancer, either NSX Advanced Load Balancer or HAProxy.

The result of enabling Workload Management is a Supervisor Cluster that is running on vGPU-enabled ESXi hosts. Refer to the following tasks and documentation to enable Workload Management.

Note: Skip this step if you already have a vCenter Cluster with Workload Management enabled, assuming that cluster is using the ESXi hosts you have configured for vGPU.

Task	Instructions
Create a vCenter Cluster that meets the requirements for enabling Workload Management	Prerequisites for Configuring vSphere with Tanzu on a vSphere Cluster
Configure the networking for the Supervisor Cluster, either NSX-T or vDS with a load balancer.	Configuring NSX for vSphere with Tanzu. Configuring vSphere Networking and NSX Advanced Load Balancer for vSphere with Tanzu. Configuring vSphere Networking and HAProxy Load Balancer for vSphere with Tanzu.
Enable Workload Management	Enable Workload Management with NSX Networking. Enable Workload Management with vSphere Networking.

Admin Step 6: Create or Update a Content Library with the Tanzu Kubernetes Ubuntu Release

Once Workload Management is enabled on a GPU-configured vCenter Cluster, the next step is to create a Content Library for the Tanzu Kubernetes release OVA image.

Warning: If you already have a Content Library with Tanzu Kubernetes releases consisting of Photon images, you only have to synchronize the existing content library with the required Ubuntu image(s). Do not create a second content library for TKGS clusters. Doing so can cause system instability.

NVIDIA vGPU requires the Ubuntu operating system. VMware provides an Ubuntu OVA for this purpose. You cannot use the Photon OS Tanzu Kubernetes release for vGPU clusters.

To import this image into your vSphere with Tanzu environment, choose one of the methods listed in the table and follow the corresponding instructions.

Content Library Type	Description
Create a Subscribed Content Library and automatically synchronize the Ubuntu OVA with your environment.	Create, Secure, and Synchronize a Subscribed Content Library for Tanzu Kubernetes releases
Create a Local Content Library and manually upload the Ubuntu OVA to your environment.	Create, Secure, and Synchronize a Local Content Library for Tanzu Kubernetes releases

When you have completed this task, you should see the Ubuntu OVA available in your content library.

The OVF & OVA Templates page of the content library displays the available Ubuntu OVA.

Admin Step 7: Create a Custom VM Class with the vGPU Profile

The next step is to create a custom VM Class with a vGPU profile. The system will use this class definition when it creates the Tanzu Kubernetes cluster nodes.

Follow the instructions below to create a custom VM Class with a vGPU profile. For additional guidance, see Add PCI Devices to a VM Class in vSphere with Tanzu.

Note: If you are using vGPU with NIC Passthrough, refer to the following topic for an additional step: vSphere Administrator Addendum for Deploying AI/ML Workloads on TKGS Clusters (vGPU and Dynamic DirectPath IO).

Log on to the vCenter Server using the vSphere Client.
Select Workload Management.
Select Services.
Select VM Classes.
Click Create VM Class.

At the Configuration tab, configure the custom VM Class.

Configuration Field	Description
Name	Enter a self-descriptive name for the custom VM class, such as `vmclass-vgpu-1`.
vCPU Count	`2`
CPU Resource Reservation	Optional, OK to leave blank
Memory	`80` GB, for example
Memory Resource Reservation	100% (mandatory when PCI devices are configured in a VM Class)
PCI Devices	Yes Note: Selecting Yes for PCI Devices tells the system you are using a GPU device and changes the VM Class configuration to support vGPU configuration.

Click Next.
At the PCI Devices tab, select the Add PCI Device > NVIDIA vGPU option.

Configure the NVIDIA vGPU model.

NVIDIA vGPU Field	Description
Model	Select the NVIDIA GPU hardware device model from those available in the NVIDIA vGPU > Model menu. If the system does not show any profiles, none of the hosts in the cluster have supported PCI devices.
GPU Sharing	This setting defines how the GPU device is shared across GPU-enabled VMs. There are two types of vGPU implementations: Time Sharing and Multi-Instance GPU Sharing. In Time Sharing mode, the vGPU scheduler instructs the GPU to perform the work for each vGPU-enabled VM serially for a duration of time with the best effort goal of balancing performance across vGPUs. MIG mode allows multiple vGPU-enabled VMs to run in parallel on a single GPU device. MIG mode is based on a newer GPU architecture and is only supported on NVIDIA A100 and A30 devices. If you do not see the MIG option, the PCI device you selected does not support it.
GPU Mode	Compute
GPU Memory	`8` GB, for example
Number of vGPUs	1, for example

For example, here is an NVIDIA vGPU profile configured in Time Sharing mode:

The PCI Devices tab with the NVIDIA vGPU profile you configured in Time Sharing mode.

For example, here is an NVIDIA vGPU profile configured in MIG mode with a supported GPU device:

The PCI Devices tab with the NVIDIA vGPU profile you configured in Multi-Instance GPU Sharing mode.

Click Next.
Review and confirm your selections.
Click Finish.
Verify that the new custom VM Class is available in the list of VM Classes.

Admin Step 8: Create and Configure a vSphere Namespace for the TKGS GPU Cluster

Create a vSphere Namespace for each TKGS GPU cluster you plan to provision. Configure the namespace by adding a vSphere SSO user with Edit permissions, and attach a storage policy for persistent volumes.

To do this, see Create and Configure a vSphere Namespace.

Admin Step 9: Associate the Content Library and VM Class with the vSphere Namespace

After you have created and configured the vSphere Namespace, associate two objects with it: the Content Library that includes the Ubuntu OVA, and the custom VM Class that has the vGPU profile.

Task	Description
Associate the Content Library with the Ubuntu OVA for vGPU with the vSphere Namespace where you will provision the TKGS cluster.	See Configure a vSphere Namespace for Tanzu Kubernetes releases.
Associate the custom VM Class with the vGPU profile with the vSphere Namespace where you will provision the TKGS cluster.	See Associate a VM Class with a Namespace in vSphere with Tanzu.

The following example shows a configured vSphere Namespace with an associated Content Library and custom VM Class for use with vGPU clusters.

Admin Step 10: Verify that the Supervisor Cluster Is Accessible

The last administration task is to verify that the Supervisor Cluster is provisioned and available for use by the Cluster Operator to provision a TKGS cluster for AI/ML workloads.

Download and install the Kubernetes CLI Tools for vSphere.
See Download and Install the Kubernetes CLI Tools for vSphere.
Connect to the Supervisor Cluster.
See Connect to the Supervisor Cluster as a vCenter Single Sign-On User.
Provide the Cluster Operator with the link to download the Kubernetes CLI Tools for vSphere and the name of the vSphere Namespace.
See Cluster Operator Workflow for Deploying AI/ML Workloads on TKGS Clusters.
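
The connection check above can be sketched as a short shell function that logs in with the vSphere Plugin for kubectl and then lists the objects associated with the vSphere Namespace. This is an illustration under stated assumptions: the server address, user name, and namespace name in the usage example are placeholders, and `run` and `verify_supervisor_access` are hypothetical helper names. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN is 1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# $1 = Supervisor Cluster address, $2 = vCenter SSO user, $3 = vSphere Namespace
verify_supervisor_access() {
    server=$1
    user=$2
    namespace=$3
    # Log in to the Supervisor Cluster as a vCenter Single Sign-On user.
    run kubectl vsphere login --server="$server" --vsphere-username "$user"
    # Switch to the vSphere Namespace created for the TKGS GPU cluster.
    run kubectl config use-context "$namespace"
    # List the Tanzu Kubernetes releases synchronized from the content library.
    run kubectl get tanzukubernetesreleases
    # List the VM classes bound to the namespace.
    run kubectl get virtualmachineclassbindings
}

# Example usage (placeholder values):
#   verify_supervisor_access 192.0.2.10 administrator@vsphere.local tkgs-gpu-ns
#   DRY_RUN=0 verify_supervisor_access 192.0.2.10 administrator@vsphere.local tkgs-gpu-ns
```

If the Ubuntu release and the custom vGPU VM class appear in the output, the associations made in the previous step are visible to the Cluster Operator.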