Before you activate vSphere with Tanzu, create storage policies for the Supervisor and vSphere Namespaces. The policies represent the datastores available in the vSphere environment and control the storage placement of objects such as control plane VMs, pod ephemeral disks, container images, and persistent storage volumes. When you use VMware Tanzu™ Kubernetes Grid™ Service, the storage policies also dictate how the Tanzu Kubernetes cluster nodes are deployed.

Enable vSphere vMotion for vGPU-Enabled Virtual Machines for Private AI Ready Infrastructure for VMware Cloud Foundation

To migrate virtual machines that use NVIDIA vGPUs without data loss, you must explicitly turn on vSphere vMotion for vGPU-enabled virtual machines in the vCenter Server instance for the VI workload domain.

During vMotion, the VM is stunned, that is, temporarily inaccessible to users. Once the migration is completed, access to the VM resumes and all applications continue from their previous state.

The expected VM stun time varies with the amount of GPU memory that the VM is actively consuming. For information on the frame buffer size in each vGPU profile, refer to the NVIDIA Virtual GPU documentation. vGPU profiles with larger frame buffers increase the stun time, while high-speed vMotion networking (25 GbE, 100 GbE, and so on) reduces it.
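As a rough illustration only, not an official sizing formula, you can approximate the stun time by dividing the consumed frame buffer by the effective vMotion throughput. The following Python sketch uses an assumed link efficiency and placeholder numbers:

  # Back-of-the-envelope stun-time model: frame buffer size divided by
  # effective vMotion throughput. Real stun times also depend on other
  # factors, so treat the output as an order of magnitude only.
  def estimate_stun_seconds(framebuffer_gb, link_gbit, efficiency=0.7):
      throughput_gb_per_s = (link_gbit / 8) * efficiency  # Gbit/s -> GB/s
      return framebuffer_gb / throughput_gb_per_s

  # Example: a vGPU profile with a 40 GB frame buffer.
  for link_gbit in (25, 100):
      print(f"{link_gbit} GbE: ~{estimate_stun_seconds(40, link_gbit):.0f} s")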

Starting with vSphere 8.0 Update 2, DRS can estimate the stun time for a given vGPU VM configuration. When the DRS cluster advanced options are set and the estimated VM devices stun time for a VM is lower than the VM devices vMotion stun time limit, DRS automates VM migrations, overriding the default 100-second limit if required. For more information on how to set up this DRS advanced configuration, refer to VMware KB article 88271, vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster.
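If you prefer to script this configuration, a pyVmomi sketch along the following lines sets the cluster-level DRS advanced options. The option names (PassthroughDrsAutomation, VmDevicesStunTimeTolerated) are assumed here from KB 88271; confirm them against the KB for your vSphere version, and treat the FQDN, cluster name, and credentials as placeholders.

  from pyVim.connect import SmartConnect, Disconnect
  from pyVmomi import vim

  # Placeholders: replace with your vCenter Server FQDN and credentials.
  si = SmartConnect(host="vcenter.example.com",
                    user="administrator@vsphere.local",
                    pwd="***", disableSslCertValidation=True)
  try:
      # Find the GPU-enabled cluster by name (placeholder name).
      view = si.content.viewManager.CreateContainerView(
          si.content.rootFolder, [vim.ClusterComputeResource], True)
      cluster = next(c for c in view.view if c.name == "wld01-cluster01")
      view.Destroy()

      # DRS advanced options per KB 88271 (names assumed, verify first):
      # let DRS automate vGPU VM migrations and tolerate up to 300 s of
      # device stun time instead of the 100 s default.
      spec = vim.cluster.ConfigSpecEx(
          drsConfig=vim.cluster.DrsConfigInfo(option=[
              vim.option.OptionValue(key="PassthroughDrsAutomation", value="1"),
              vim.option.OptionValue(key="VmDevicesStunTimeTolerated", value="300"),
          ]))
      cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
  finally:
      Disconnect(si)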

Procedure

  1. Log in to the VI workload domain vCenter Server at https://<vcenter_server_fqdn>/ui as [email protected].
  2. In the Hosts and clusters inventory, select the vCenter Server for the VI workload domain.
  3. On the Configure tab, select Settings > Advanced Settings and click Edit settings.
  4. In the Edit vCenter Server Advanced Settings dialog box, find the vgpu.hotmigrate.enabled property and set it to Enabled.
    If the vgpu.hotmigrate.enabled property is not available in the advanced settings table, add it, setting its value to true.
  5. Click Save.
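The same change can be applied programmatically. The following pyVmomi sketch mirrors steps 2 through 5 above through the vCenter Server OptionManager; the FQDN and credentials are placeholders.

  from pyVim.connect import SmartConnect, Disconnect
  from pyVmomi import vim

  si = SmartConnect(host="vcenter.example.com",
                    user="administrator@vsphere.local",
                    pwd="***", disableSslCertValidation=True)
  try:
      # si.content.setting exposes the vCenter Server advanced settings.
      opt_mgr = si.content.setting
      opt_mgr.UpdateOptions(changedValue=[
          vim.option.OptionValue(key="vgpu.hotmigrate.enabled", value=True)])
      # Read the value back to confirm the change.
      print(opt_mgr.QueryOptions("vgpu.hotmigrate.enabled")[0].value)
  finally:
      Disconnect(si)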

Install the Vendor GPU Driver on the ESXi Hosts for Private AI Ready Infrastructure for VMware Cloud Foundation

You upload the driver to the vSphere Lifecycle image for the default cluster of the GPU-enabled workload domain and remediate the hosts.

Procedure

  1. Download the driver version for your GPU according to the VMware Compatibility Guide.
    1. Go to the Shared Passthrough Graphics and AI/ML section of the VMware Compatibility Guide, click your GPU model, and make a note of the supported driver versions and other add-ons, such as GPU monitoring daemons.
    2. Download the drivers directly from the vendor Web site.
      Vendor    GPU Driver Download Location
      NVIDIA    NVIDIA Application Hub
  2. Log in to the management domain vCenter Server at https://<vcenter_server_fqdn>/ui as [email protected].
  3. Import the GPU driver for the ESXi version compatible with this validated solution into the vSphere Lifecycle Manager depot.
    1. From the vSphere Client Menu, select Lifecycle Manager.
    2. On the Lifecycle Manager page, click Actions > Import Updates.
    3. In the Import Updates dialog box, click Browse, locate the GPU driver ZIP file, and click Open.
    4. Click Import.
    The driver appears in the Components table on the Lifecycle Manager page.
  4. Add the GPU driver to the image for the default cluster of the workload domain.
    1. In the Hosts and clusters inventory, select the cluster.
    2. On the Updates tab, select Hosts > Image.
    3. In the Image pane, click Edit.
    4. In the Edit Image pane, next to Components, click the Show details link.
    5. Click Add components above the component table that appears.
    6. Select the driver component and version that you plan to use on the GPU-enabled ESXi hosts in the workload domain and click Select.
    7. Click Validate.
    8. Click Save.
    A warning message that the ESXi hosts in the cluster are non-compliant appears.
  5. Remediate the default cluster with the cluster image containing the GPU driver.
    1. On the Updates tab for the cluster, select Hosts > Image.
    2. In the Image Compliance pane, click Remediate All.
    3. In the Review Remediation Impact dialog box, review the impact summary, the applicable remediation settings, and the EULA.
    4. Accept the EULA.
    5. Click Start remediation.
  6. After the remediation process is complete, place each host in maintenance mode and restart it.
  7. Log in to SDDC Manager at https://<sddc_manager_fqdn> with a user assigned the Admin role.
  8. Navigate to Lifecycle Management > Image Management.
  9. Extract the vSphere Lifecycle Manager image with the GPU driver component.
    1. On the Import Image tab, under the Option 1 section, select the management domain and the empty cluster.
    2. Click Extract cluster image.
    The extracted cluster image appears on the Available Images tab. It can be used for a new VI workload domain or a new cluster in a VI workload domain enabled for vSphere Lifecycle Manager images.
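To spot-check the result, you can read each host's graphics configuration through the vSphere API. A minimal pyVmomi sketch, with placeholder connection details, that lists the GPU devices and the vGPU profiles each host advertises (an empty profile list usually indicates the NVIDIA host driver is not loaded):

  from pyVim.connect import SmartConnect, Disconnect
  from pyVmomi import vim

  si = SmartConnect(host="vcenter.example.com",
                    user="administrator@vsphere.local",
                    pwd="***", disableSslCertValidation=True)
  try:
      view = si.content.viewManager.CreateContainerView(
          si.content.rootFolder, [vim.HostSystem], True)
      for host in view.view:
          # Physical GPU devices and their configured graphics mode.
          for dev in host.config.graphicsInfo or []:
              print(host.name, dev.deviceName, dev.graphicsType)
          # vGPU profiles offered by the host driver, e.g. grid_a100-4c.
          print(host.name, "profiles:", host.config.sharedPassthruGpuTypes)
      view.Destroy()
  finally:
      Disconnect(si)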

What to do next

For information on MIG (Multi-Instance GPU) and how to enable it, see the NVIDIA Multi-Instance GPU User Guide.
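As a starting point, MIG mode is toggled per GPU with the nvidia-smi utility on the ESXi host; the -mig flag is documented in the NVIDIA Multi-Instance GPU User Guide. A sketch that runs it over SSH, assuming SSH is enabled on the host and using placeholder connection details (the GPU typically requires a reset or host reboot for the mode change to take effect):

  import paramiko

  # Placeholders: replace with your ESXi host FQDN and credentials.
  client = paramiko.SSHClient()
  client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
  client.connect("esxi01.example.com", username="root", password="***")
  try:
      # Enable MIG mode on GPU 0 (nvidia-smi -i <gpu_index> -mig 1).
      _, stdout, stderr = client.exec_command("nvidia-smi -i 0 -mig 1")
      print(stdout.read().decode(), stderr.read().decode())
  finally:
      client.close()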