You can use vMotion to perform a live migration of NVIDIA vGPU-powered virtual machines without causing data loss.

To enable vMotion for vGPU virtual machines, set the vCenter Server advanced setting vgpu.hotmigrate.enabled to true. For information about configuring vCenter Server advanced settings, see Configure Advanced Settings in the vCenter Server Configuration documentation.

In vSphere 6.7 Update 1 and vSphere 6.7 Update 2, if you migrate vGPU virtual machines with vMotion and the stun time exceeds 100 seconds, the migration might fail for vGPU profiles with a frame buffer size of 24 GB or larger. To avoid the vMotion timeout, upgrade to vSphere 6.7 Update 3 or later.
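
The condition above can be written as a small pre-migration check. This is an illustrative sketch only: the function name and the release labels ("6.7 U1", "6.7 U2") are hypothetical conventions chosen for the example, not part of any VMware or NVIDIA API.

```python
# Releases described above as affected by the vMotion stun-time timeout.
# The label strings are an assumption made for this example.
AFFECTED_RELEASES = {"6.7 U1", "6.7 U2"}

# Frame buffer size (GB) at or above which the migration might fail.
LARGE_PROFILE_GB = 24


def may_hit_stun_timeout(release: str, profile_frame_buffer_gb: int) -> bool:
    """Return True if the release/profile pair falls in the at-risk range."""
    return (
        release in AFFECTED_RELEASES
        and profile_frame_buffer_gb >= LARGE_PROFILE_GB
    )


if __name__ == "__main__":
    print(may_hit_stun_timeout("6.7 U2", 32))  # affected release, large profile
    print(may_hit_stun_timeout("6.7 U3", 32))  # release with the fix
```

A result of True only indicates that the combination is in the range where a failure might occur. It does not predict that a particular migration will fail.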

During the stun time, you cannot access the VM, desktop, or application. After the migration completes, access to the VM resumes and all applications continue from their previous state. For information about frame buffer sizes in vGPU profiles, refer to the NVIDIA Virtual GPU documentation.

The following tables list the expected VM stun times (the time during which the VM is inaccessible to users during vMotion) and the estimated worst-case stun times. The expected stun times were measured over a 10 Gbps network with NVIDIA Tesla V100 PCIe 32 GB GPUs.

Table 1. Expected Stun Times for vMotion of vGPU VMs

Used vGPU Frame Buffer (GB)    VM Stun Time (sec)
1                              2
2                              4
4                              6
8                              12
16                             22
32                             39

Table 2. Estimated Worst-Case Stun Times (sec)

vGPU Memory    VM Memory 4 GB    VM Memory 8 GB    VM Memory 16 GB    VM Memory 32 GB
1 GB           5                 6                 8                  12
2 GB           7                 9                 11                 15
4 GB           13                14                16                 21
8 GB           24                25                28                 32
16 GB          47                48                50                 54
32 GB          91                92                95                 99

Note: When you consider the expected and estimated worst-case stun times, keep in mind the following points:
  • The configured vGPU profile represents an upper bound to the used vGPU frame buffer. In many use cases, the amount of vGPU frame buffer memory used by the VM at any given time is below the assigned vGPU memory in the profile.
  • Both the expected and the estimated worst-case stun times are valid only when you migrate a single virtual machine. If you migrate multiple virtual machines concurrently, for example during a manual vSphere remediation process, the stun times are adversely affected.
  • The above estimates assume sufficient CPU, memory, PCIe, and network capacity to achieve 10 Gbps migration throughput.
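
For planning purposes, the estimated worst-case values in Table 2 can be kept in a small lookup structure. The following is a minimal sketch, assuming you only need the exact combinations listed in the table. The names WORST_CASE_STUN_SEC, worst_case_stun_sec, and fits_stun_budget are hypothetical and are not part of any VMware or NVIDIA tool. The sketch does not interpolate between entries, and the values apply to a single-VM migration under the conditions stated in the note.

```python
# Estimated worst-case stun times in seconds, copied from Table 2.
# Outer key: vGPU memory (GB). Inner key: VM memory (GB).
WORST_CASE_STUN_SEC = {
    1:  {4: 5,  8: 6,  16: 8,  32: 12},
    2:  {4: 7,  8: 9,  16: 11, 32: 15},
    4:  {4: 13, 8: 14, 16: 16, 32: 21},
    8:  {4: 24, 8: 25, 16: 28, 32: 32},
    16: {4: 47, 8: 48, 16: 50, 32: 54},
    32: {4: 91, 8: 92, 16: 95, 32: 99},
}


def worst_case_stun_sec(vgpu_gb: int, vm_gb: int) -> int:
    """Return the Table 2 estimate for an exact (vGPU, VM) memory pair."""
    try:
        return WORST_CASE_STUN_SEC[vgpu_gb][vm_gb]
    except KeyError:
        # Sizes not listed in the table are rejected rather than guessed.
        raise ValueError(
            f"no table entry for vGPU {vgpu_gb} GB / VM {vm_gb} GB"
        ) from None


def fits_stun_budget(vgpu_gb: int, vm_gb: int, budget_sec: int) -> bool:
    """Check whether the worst-case estimate is within a chosen budget."""
    return worst_case_stun_sec(vgpu_gb, vm_gb) <= budget_sec


if __name__ == "__main__":
    print(worst_case_stun_sec(16, 8))     # 48
    print(fits_stun_budget(32, 32, 100))  # True: 99 <= 100
```

Because the configured vGPU profile is only an upper bound on the used frame buffer, a lookup by profile size gives a conservative figure.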

In vSphere 6.7 Update 1 and later, DRS supports initial placement of vGPU VMs but does not provide load balancing for them.

VMware vSphere vMotion is supported only with and between compatible NVIDIA GPU device models and NVIDIA GRID host driver versions as defined and supported by NVIDIA. For compatibility information, refer to the NVIDIA Virtual GPU User Guide.

To check compatibility between NVIDIA vGPU host drivers, vSphere, and Horizon, refer to the VMware Compatibility Matrix.