Before you activate vSphere with Tanzu, create storage policies for the Supervisor and vSphere namespaces. The policies represent available datastores in the vSphere environment. They control the storage placement of such objects as control plane VMs, pod ephemeral disks, container images, and persistent storage volumes. When you use VMware Tanzu™ Kubernetes Grid™ Service, the storage policies also dictate how the Tanzu Kubernetes Cluster nodes are deployed.
Enable vSphere vMotion for vGPU-Enabled Virtual Machines for Private AI Ready Infrastructure for VMware Cloud Foundation
You must explicitly turn on vSphere vMotion for virtual machines that use NVIDIA vGPUs. Once turned on, vMotion migrates these virtual machines without causing data loss.
During the stun time, you are unable to access the VM. Once the migration is completed, access to the VM resumes and all applications continue from their previous state.
The expected VM stun time (the time when the VM is inaccessible to users during vMotion) can vary depending on the amount of GPU memory that the VM is currently consuming. For information on frame buffer size in vGPU profiles, refer to the NVIDIA Virtual GPU documentation. GPUs with more memory, or vSphere infrastructure that uses high-speed networking (25 GbE, 100 GbE, and so on), could have a direct impact on the stun time.
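To get a feel for why GPU memory size and network speed both matter, the following back-of-the-envelope sketch computes the idealized wire time needed to move a given amount of GPU memory over a link of a given speed. This is not the estimator that vSphere or DRS uses; the function name `ideal_transfer_seconds`, the 24 GiB example size, and the assumption of a fully utilized link with no protocol overhead are all hypothetical. Actual stun times depend on additional factors and will differ.

```python
def ideal_transfer_seconds(gpu_memory_gib: float, link_gbps: float) -> float:
    """Idealized time to move `gpu_memory_gib` GiB over a `link_gbps` Gbit/s link.

    Assumption (not from VMware): the link is fully utilized and there is
    no protocol or processing overhead, so this is only a rough reference
    figure, not a predicted stun time.
    """
    if gpu_memory_gib < 0 or link_gbps <= 0:
        raise ValueError("memory must be >= 0 and link speed must be > 0")
    bits_to_move = gpu_memory_gib * 1024**3 * 8   # GiB -> bits
    return bits_to_move / (link_gbps * 1e9)       # Gbit/s -> bit/s


if __name__ == "__main__":
    # Hypothetical example: 24 GiB of consumed GPU memory on different links.
    for link in (10, 25, 100):
        seconds = ideal_transfer_seconds(24, link)
        print(f"{link:>3} GbE: {seconds:.2f} s")
```

The comparison across link speeds is the useful part: for the same amount of consumed GPU memory, the idealized figure shrinks in proportion to the available bandwidth.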
Starting with vSphere 8.0 U2, DRS can estimate the stun time for a given vGPU VM configuration. When the DRS cluster advanced options are set and the Estimated VM Devices Stun Time for a VM is lower than the VM Devices vMotion Stun Time limit, DRS automates VM migrations, potentially overriding the default of 100 seconds if required. For more information on how to set up this DRS advanced configuration, refer to the article vGPU Virtual Machine automated migration for Host Maintenance Mode in a DRS Cluster (88271).
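The decision rule described above can be sketched as a simple comparison. This is an illustration only, not DRS code: the `VgpuVm` type, the `drs_may_automate` function, and the example values are hypothetical, and the 100-second default is taken from the text above.

```python
from dataclasses import dataclass

# Default stun time limit mentioned in the text above, in seconds.
DEFAULT_STUN_LIMIT_S = 100.0


@dataclass(frozen=True)
class VgpuVm:
    """Hypothetical record for a vGPU VM and its estimated stun time."""
    name: str
    estimated_stun_s: float


def drs_may_automate(vm: VgpuVm,
                     advanced_options_set: bool,
                     stun_limit_s: float = DEFAULT_STUN_LIMIT_S) -> bool:
    """Return True if the rule in the text would allow automated migration.

    The rule: the DRS cluster advanced options are set AND the estimated
    stun time is strictly lower than the configured stun time limit.
    """
    return advanced_options_set and vm.estimated_stun_s < stun_limit_s


if __name__ == "__main__":
    vms = [VgpuVm("vgpu-vm-a", 40.0), VgpuVm("vgpu-vm-b", 130.0)]
    for vm in vms:
        allowed = drs_may_automate(vm, advanced_options_set=True)
        print(f"{vm.name}: automated migration allowed = {allowed}")
```

Passing a larger `stun_limit_s` models the case where the configured limit overrides the default, which lets VMs with longer estimated stun times qualify.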
Procedure
Install the Vendor GPU Driver on the ESXi Hosts for Private AI Ready Infrastructure for VMware Cloud Foundation
You upload the driver to the vSphere Lifecycle image for the default cluster of the GPU-enabled workload domain and remediate the hosts.
Prerequisites
Procedure
What to do next
For information on MIG and how to enable it, see the NVIDIA Multi-Instance GPU User Guide.