As a vSphere Administrator, you set up the vSphere with Tanzu environment to support NVIDIA GPU hardware so that developers can deploy AI/ML workloads on TKGS clusters.
vSphere Administrator Workflow for Deploying AI/ML Workloads on TKGS Clusters
Step | Action | Link |
---|---|---|
0 | Review system requirements. | See Admin Step 0: Review System Requirements. |
1 | Install a supported NVIDIA GPU device on each ESXi host. | See Admin Step 1: Install Supported NVIDIA GPU Device on ESXi Hosts. |
2 | Configure ESXi device graphics settings for vGPU operations. | See Admin Step 2: Configure Each ESXi Host for vGPU Operations. |
3 | Install the NVIDIA vGPU Manager (VIB) on each ESXi host. | See Admin Step 3: Install the NVIDIA Host Manager Driver on Each ESXi Host. |
4 | Verify NVIDIA driver operation and the GPU virtualization mode. | See Admin Step 4: Verify ESXi Hosts Are Ready for NVIDIA vGPU Operations. |
5 | Enable Workload Management on the GPU-configured cluster. The result is a Supervisor Cluster running on vGPU-enabled ESXi hosts. | See Admin Step 5: Enable Workload Management on the vGPU-configured vCenter Cluster. |
6 | Create or update a Content Library for Tanzu Kubernetes releases and populate it with the supported Ubuntu OVA that is required for vGPU workloads. Note: If you already have a content library with Photon images for TKGS clusters, do not create a new content library; add the Ubuntu images to the existing library. | See Admin Step 6: Create or Update a Content Library with the Tanzu Kubernetes Ubuntu Release. |
7 | Create a custom VM Class with a vGPU profile selected. | See Admin Step 7: Create a Custom VM Class with the vGPU Profile. |
8 | Create and configure a vSphere Namespace for the TKGS GPU cluster: add a user with Edit permissions and a storage policy for persistent volumes. | See Admin Step 8: Create and Configure a vSphere Namespace for the TKGS GPU Cluster. |
9 | Associate the Content Library containing the Ubuntu OVA and the custom vGPU VM Class with the vSphere Namespace you created for TKGS. | See Admin Step 9: Associate the Content Library and VM Class with the vSphere Namespace. |
10 | Verify that the Supervisor Cluster is provisioned and accessible to the Cluster Operator. | See Admin Step 10: Verify that the Supervisor Cluster Is Accessible. |
Admin Step 0: Review System Requirements
Requirement | Description |
---|---|
vSphere infrastructure | vSphere 7 Update 3 Monthly Patch 1 (matching ESXi and vCenter Server builds) |
Workload Management | vSphere Namespace version |
Supervisor Cluster | Supervisor Cluster version |
TKR Ubuntu OVA | Tanzu Kubernetes release Ubuntu |
NVIDIA vGPU Host Driver | Download the VIB from the NGC web site. For more information, see the vGPU Software Driver documentation. |
NVIDIA License Server for vGPU | FQDN provided by your organization |
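If you want to confirm the build levels from the command line, the ESXi version, update level, and build number are printed by the vmware utility on each host. This is a quick sketch only, assuming SSH access to the host; the vCenter Server build is visible in the vSphere Client (typically under the Help menu).

```
# Print the ESXi version, update level, and build number on the host,
# and compare it with the build required by this workflow.
vmware -vl
```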
Admin Step 1: Install Supported NVIDIA GPU Device on ESXi Hosts
To deploy AI/ML workloads on TKGS, you install one or more supported NVIDIA GPU devices on each ESXi host comprising the vCenter Cluster where Workload Management will be enabled.
To view compatible NVIDIA GPU devices, refer to the VMware Compatibility Guide.
The NVIDIA GPU device should support the latest NVIDIA AI Enterprise (NVAIE) vGPU profiles. Refer to the NVIDIA Virtual GPU Software Supported GPUs documentation for guidance.
For example, an ESXi host might have two NVIDIA A100 GPU devices installed.
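As a quick sanity check after the hardware is installed, you can confirm that the host detects the NVIDIA devices from the ESXi shell. This is a minimal sketch, assuming SSH access to the host is enabled.

```
# List PCI devices on the ESXi host and keep the NVIDIA entries.
# Each installed GPU, such as an A100, should appear in the output.
lspci | grep -i nvidia
```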
Admin Step 2: Configure Each ESXi Host for vGPU Operations
Configure each ESXi host for vGPU by enabling Shared Direct and SR-IOV.
Enable Shared Direct on Each ESXi Host
To unlock NVIDIA vGPU functionality, enable Shared Direct mode on each ESXi host comprising the vCenter Cluster where Workload Management will be enabled. A command-line alternative is sketched after this list.
- Log on to the vCenter Server using the vSphere Client.
- Select an ESXi host in the vCenter Cluster.
- Select Configure > Hardware > Graphics to open the graphics device settings.
- Select the NVIDIA GPU accelerator device.
- Edit the Graphics Device settings.
- Select Shared Direct.
- Select Restart X.Org server.
- Click OK to save the configuration.
- Right-click the ESXi host and put it into maintenance mode.
- Reboot the host.
- When the host is running again, take it out of maintenance mode.
- Repeat this process for each ESXi host in the vCenter cluster where Workload Management will be enabled.
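As an alternative to the vSphere Client steps above, the same setting can be applied from the ESXi shell. The following is a sketch only, assuming SSH access to the host; the Shared Direct option in the client corresponds to the SharedPassthru graphics type in esxcli.

```
# Set the host default graphics type to Shared Direct (SharedPassthru).
esxcli graphics host set --default-type SharedPassthru

# Restart the X.Org server so the new graphics type takes effect.
/etc/init.d/xorg restart
```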
Turn On SR-IOV BIOS for NVIDIA GPU A30 and A100 Devices
If you are using the NVIDIA A30 or A100 GPU devices, which are required for Multi-Instance GPU (MIG) mode, you must enable SR-IOV on the ESXi host. If SR-IOV is not enabled, Tanzu Kubernetes cluster node VMs cannot start. If this occurs, you see the following error message in the Recent Tasks pane of the vCenter Server where Workload Management is enabled.
Could not initialize plugin libnvidia-vgx.so for vGPU nvidia_aXXX-xx. Failed to start the virtual machine. Module DevicePowerOn power on failed.
To enable SR-IOV, log in to the ESXi host using the web console. Navigate to the host's PCI devices, select the NVIDIA GPU device, and click Configure SR-IOV. From here you can turn on SR-IOV. For additional guidance, see Single Root I/O Virtualization (SR-IOV) in the vSphere documentation.
Admin Step 3: Install the NVIDIA Host Manager Driver on Each ESXi Host
To run Tanzu Kubernetes cluster node VMs with NVIDIA vGPU graphics acceleration, you install the NVIDIA host manager driver on each ESXi host comprising the vCenter Cluster where Workload Management will be enabled.
The NVIDIA vGPU host manager driver components are packaged in a vSphere installation bundle (VIB). The NVAIE VIB is provided to you by your organization through its NVIDIA GRID licensing program. VMware does not provide NVAIE VIBs or make them available for download. As part of the NVIDIA licensing program your organization sets up a licensing server. Refer to the NVIDIA Virtual GPU Software Quick Start Guide for more information.
For example, the following commands place the host in maintenance mode, install the VIB from an FTP server, exit maintenance mode, and restart the X.Org server:
esxcli system maintenanceMode set --enable true
esxcli software vib install -v ftp://server.domain.example.com/nvidia/signed/NVIDIA_bootbank_NVIDIA-VMware_ESXi_7.0_Host_Driver_460.73.02-1OEM.700.0.0.15525992.vib
esxcli system maintenanceMode set --enable false
/etc/init.d/xorg restart
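If the VIB has been copied to local storage on the host instead of being served over FTP, the install command also accepts an absolute local path. This is a sketch; the datastore path below is a placeholder, and the maintenance mode and X.Org restart steps shown above still apply.

```
# Install the NVIDIA host manager driver from a local path (placeholder path).
esxcli software vib install -v /vmfs/volumes/datastore1/NVIDIA_bootbank_NVIDIA-VMware_ESXi_7.0_Host_Driver_460.73.02-1OEM.700.0.0.15525992.vib
```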
Admin Step 4: Verify ESXi Hosts Are Ready for NVIDIA vGPU Operations
- SSH into the ESXi host, enter shell mode, and run the command nvidia-smi. The NVIDIA System Management Interface is a command-line utility provided by the NVIDIA vGPU host manager. Running this command returns the GPUs and drivers on the host.
- Run the following command to verify that the NVIDIA driver is properly installed: esxcli software vib list | grep NVIDIA.
- Verify that the host is configured with GPU Shared Direct and that SR-IOV is turned on (if you are using NVIDIA A30 or A100 devices). A command-line sketch for the Shared Direct check follows this list.
- Using the vSphere Client, on the ESXi host that is configured for GPU, create a new virtual machine with a PCI device included. The NVIDIA vGPU profile should appear and be selectable.
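To confirm the Shared Direct setting from the command line instead of the vSphere Client, you can query the esxcli graphics namespace. This is a minimal sketch, assuming SSH access to the host; exact output fields vary by ESXi build.

```
# Show the host default graphics type; Shared Direct is reported as SharedPassthru.
esxcli graphics host get

# List the graphics devices detected on the host and their configured type.
esxcli graphics device list
```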
Admin Step 5: Enable Workload Management on the vGPU-configured vCenter Cluster
Now that the ESXi hosts are configured to support NVIDIA vGPU, create a vCenter Cluster comprising these hosts. To support Workload Management, the vCenter Cluster must meet specific requirements, including shared storage, high availability, and fully automated DRS.
Enabling Workload Management also requires the selection of a networking stack, either native vSphere vDS networking or NSX-T Data Center networking. If you use vDS networking, you need to install a load balancer, either the NSX Advanced Load Balancer or HAProxy.
Task | Instructions |
---|---|
Create a vCenter Cluster that meets the requirements for enabling Workload Management. | Prerequisites for Configuring vSphere with Tanzu on a vSphere Cluster |
Configure the networking for the Supervisor Cluster, either NSX-T or vDS with a load balancer. | Configuring NSX for vSphere with Tanzu; Configuring vSphere Networking and NSX Advanced Load Balancer for vSphere with Tanzu; Configuring vSphere Networking and HAProxy Load Balancer for vSphere with Tanzu |
Enable Workload Management. | |
Admin Step 6: Create or Update a Content Library with the Tanzu Kubernetes Ubuntu Release
NVIDIA vGPU requires the Ubuntu operating system. VMware provides an Ubuntu OVA for such purposes. You cannot use the PhotonOS Tanzu Kubernetes release for vGPU clusters.
Content Library Type | Description |
---|---|
Create a Subscribed Content Library and automatically synchronize the Ubuntu OVA with your environment. | Create, Secure, and Synchronize a Subscribed Content Library for Tanzu Kubernetes releases |
Create a Local Content Library and manually upload the Ubuntu OVA to your environment. | Create, Secure, and Synchronize a Local Content Library for Tanzu Kubernetes releases |
Admin Step 7: Create a Custom VM Class with the vGPU Profile
The next step is to create a custom VM Class with a vGPU profile. The system uses this class definition when it creates the Tanzu Kubernetes cluster nodes.
- Log on to the vCenter Server using the vSphere Client.
- Select Workload Management.
- Select Services.
- Select VM Classes.
- Click Create VM Class.
- At the Configuration tab, configure the custom VM Class.
Configuration Field | Description |
---|---|
Name | Enter a self-descriptive name for the custom VM class, such as vmclass-vgpu-1. |
vCPU Count | 2 |
CPU Resource Reservation | Optional; OK to leave blank |
Memory | 80 GB, for example |
Memory Resource Reservation | 100% (mandatory when PCI devices are configured in a VM Class) |
PCI Devices | Yes. Note: Selecting Yes for PCI Devices tells the system you are using a GPU device and changes the VM Class configuration to support vGPU configuration. |
- Click Next.
- At the PCI Devices tab, select the option to add an NVIDIA vGPU device.
- Configure the NVIDIA vGPU model.
NVIDIA vGPU Field | Description |
---|---|
Model | Select the NVIDIA GPU hardware device model from those available in the menu. If the system does not show any profiles, none of the hosts in the cluster have supported PCI devices. |
GPU Sharing | This setting defines how the GPU device is shared across GPU-enabled VMs. There are two types of vGPU implementations: Time Sharing and Multi-Instance GPU (MIG) Sharing. In Time Sharing mode, the vGPU scheduler instructs the GPU to perform the work for each vGPU-enabled VM serially for a duration of time, with the best-effort goal of balancing performance across vGPUs. MIG mode allows multiple vGPU-enabled VMs to run in parallel on a single GPU device. MIG mode is based on a newer GPU architecture and is supported only on NVIDIA A100 and A30 devices. If you do not see the MIG option, the PCI device you selected does not support it. |
GPU Mode | Compute |
GPU Memory | 8 GB, for example |
Number of vGPUs | 1, for example |
For example, here is an NVIDIA vGPU profile configured in Time Sharing mode:
For example, here is an NVIDIA vGPU profile configured in MIG mode with a supported GPU device:
- Click Next.
- Review and confirm your selections.
- Click Finish.
- Verify that the new custom VM Class is available in the list of VM Classes.
Admin Step 8: Create and Configure a vSphere Namespace for the TKGS GPU Cluster
Create a vSphere Namespace for each TKGS GPU cluster you plan to provision. Configure the namespace by adding a vSphere SSO user with Edit permissions, and attach a storage policy for persistent volumes.
To do this, see Create and Configure a vSphere Namespace.
Admin Step 9: Associate the Content Library and VM Class with the vSphere Namespace
Task | Description |
---|---|
Associate the Content Library with the Ubuntu OVA for vGPU with the vSphere Namespace where you will provision the TKGS cluster. | See Configure a vSphere Namespace for Tanzu Kubernetes releases. |
Associate the custom VM Class with the vGPU profile with the vSphere Namespace where you will provision the TKGS cluster. | See Associate a VM Class with a Namespace in vSphere with Tanzu. |
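After both associations are in place, they can be confirmed with kubectl once you are logged in to the Supervisor Cluster (see Admin Step 10). This is a sketch only; the namespace name tkgs-gpu-namespace is a placeholder for the vSphere Namespace you created in Admin Step 8.

```
# List the VM Classes bound to the namespace; the custom vGPU class should appear.
kubectl get virtualmachineclassbindings -n tkgs-gpu-namespace

# List the Tanzu Kubernetes releases available from the associated Content Library;
# the Ubuntu release required for vGPU should be listed.
kubectl get tanzukubernetesreleases
```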
Admin Step 10: Verify that the Supervisor Cluster Is Accessible
The last administration task is to verify that the Supervisor Cluster is provisioned and available for use by the Cluster Operator to provision a TKGS cluster for AI/ML workloads.
- Download and install the Kubernetes CLI Tools for vSphere.
See Download and Install the Kubernetes CLI Tools for vSphere.
- Connect to the Supervisor Cluster.
See Connect to the Supervisor Cluster as a vCenter Single Sign-On User. A command-line sketch follows this list.
- Provide the Cluster Operator with the link to download the Kubernetes CLI Tools for vSphere and the name of the vSphere Namespace.
See Cluster Operator Workflow for Deploying AI/ML Workloads on TKGS Clusters.
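For reference, a minimal login and context check looks like the following. This is a sketch; the Supervisor Cluster address, user name, and namespace are placeholders for your environment.

```
# Authenticate to the Supervisor Cluster with the vSphere Plugin for kubectl.
kubectl vsphere login --server=10.0.0.1 --vsphere-username administrator@vsphere.local

# Confirm that the vSphere Namespace created for the TKGS GPU cluster is an available context.
kubectl config get-contexts

# Switch to the namespace context that the Cluster Operator will use.
kubectl config use-context tkgs-gpu-namespace
```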