The NVIDIA vGPU software includes two components: the Virtual GPU Manager, installed on the vSphere host, and the NVIDIA vGPU driver, installed in the guest OS. This section walks through installing both components on the vSphere hosts and enabling vGPU for virtual machines.

In Enterprise Edge deployments running on vSphere, NVIDIA vGPU is a method for sharing a single GPU device between multiple virtual machines. Because the use cases at the Enterprise Edge mainly involve computer vision, machine learning, and AI workloads, we recommend the NVIDIA Virtual Compute Server (vCS) license for the vGPU software.

The NVIDIA vGPU software consists of two separate parts: the NVIDIA Virtual GPU Manager, which is loaded as a VMware Installation Bundle (VIB) on the vSphere host, and the vGPU driver, which is installed within the guest operating system of each virtual machine that uses the GPU. Deploying GPUs in vSphere with vGPU allows administrators to either dedicate all the GPU resources to one virtual machine or share the GPU between multiple virtual machines. The figure below shows how ESXi and the virtual machines work together with the vGPU software to deliver GPU sharing.

Figure 1. vGPU architecture in ESXi

The instructions to set up NVIDIA vGPU on vSphere are detailed in this blog post and can be summarized in the following steps:

Prerequisites

  • Access to the NVIDIA licensing portal and NVIDIA AI Enterprise software

  • vCenter access with read and write privileges for Host Graphics Settings

  • SSH access to the vSphere hosts in the cluster

Procedure

  1. Get access to the NVIDIA licensing portal and download the NVAIE software package matching your vSphere version. The software package will include the VMware Installation Bundle (VIB) as well as the vGPU driver for different Linux operating systems.
    Figure 2. Downloading NVAIE software package
  2. Configure Host Graphics Settings to “Shared Direct”, which is required for vGPU. You can reach this setting through vCenter or the vSphere Client by selecting the host and navigating to “Configure -> Hardware -> Graphics -> Host Graphics tab -> Edit”.
    Figure 3. Configure Host Graphics Settings
  3. Place the ESXi host in maintenance mode and install the NVIDIA vGPU Manager VIB. This step requires SSH access to the host to run the installation commands from the command line (see the host-side command sketch after this procedure).
  4. Verify the NVIDIA Driver setup and the GPU virtualization mode with the following commands:
    # nvidia-smi
    # nvidia-smi -q | grep -i virtualization
  5. Create the virtual machine that will use the GPU and set EFI as the boot firmware under Boot Options; the default boot option is BIOS.
  6. Add a new PCI device to the virtual machine and assign the desired NVIDIA vGPU profile to it. You can dedicate the whole GPU to the virtual machine or assign only a partition of the GPU so that multiple virtual machines can share the GPU device (the VM configuration sketch after this procedure shows what these settings typically look like in the .vmx file).
    Figure 4. Add vGPU profile to virtual machine
  7. Install the vGPU driver in the guest operating system, making sure it comes from the same NVIDIA vGPU software release as the vGPU Manager VIB installed on the ESXi host. You can use the “.run” file from the previously downloaded NVIDIA AI Enterprise software package (see the guest-side sketch after this procedure).
  8. Configure an NVIDIA client license on the virtual machine by obtaining a client configuration token from the NVIDIA licensing portal and applying it to the virtual machine. You can follow the steps provided in this AI Enterprise documentation. Make sure to copy the token file itself into the appropriate directory rather than pasting its contents into a new file; otherwise, licensing will not work.
  9. Power on the virtual machine and deploy applications that use the assigned vGPU.
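
For reference, the host-side portion of steps 2 and 3 can also be performed from the ESXi shell over SSH. The following is a minimal sketch: the datastore path and bundle file name are placeholders for the NVAIE package you downloaded, and the exact install command (VIB versus software component) can differ between releases, so follow the blog post and the NVIDIA documentation for your version.

    # esxcli graphics host set --default-type SharedPassthru
    # esxcli graphics host get
    # esxcli system maintenanceMode set --enable true
    # esxcli software vib install -d /vmfs/volumes/<datastore>/<NVD-AIE-bundle>.zip
    # esxcli system maintenanceMode set --enable false

The esxcli graphics commands mirror the Host Graphics setting from step 2. If the installer output reports that a reboot is required, or the graphics mode change does not take effect, reboot the host before running the verification commands from step 4.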
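
Steps 5 and 6 are performed in the vSphere Client, but it can help to know roughly what they translate to in the virtual machine's .vmx file. The entries below are illustrative only; the vGPU profile name is a placeholder that depends on your GPU model and on how much of the device you assign (compute-oriented “c” profiles are typical for vCS workloads).

    firmware = "efi"
    pciPassthru0.vgpu = "<vgpu-profile-name>"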
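
Inside a Linux guest, steps 7 and 8 roughly correspond to the commands below. This is a sketch only: the “.run” file name is a placeholder for the guest driver shipped in your NVAIE package, the installer builds a kernel module (so the guest needs kernel headers and build tools), and the token directory shown is the default ClientConfigToken location used by the NVIDIA Linux licensing client; check the AI Enterprise documentation for your release.

    # chmod +x NVIDIA-Linux-x86_64-<version>-grid.run
    # ./NVIDIA-Linux-x86_64-<version>-grid.run
    # cp client_configuration_token_<date>.tok /etc/nvidia/ClientConfigToken/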

Results

Applications running in multiple virtual machines can now share a GPU. The vGPU technology offers the choice between dedicating a full GPU device to one virtual machine and allowing multiple virtual machines to share a GPU device, which makes it a useful option when applications do not need the full power of a GPU or when only a limited number of GPU devices is available.

Note:

At the time of writing this guide, Tanzu Kubernetes Grid (TKG) versions up to 1.6 do not support deploying GPU-enabled workload clusters using the vGPU method; only PCI passthrough is supported for TKG workloads. The ECS Enterprise Edge aims to support vGPU-enabled clusters leveraging the NVIDIA GPU Operator in the future.

What to do next

  • Restart the nvidia-gridd service in the virtual machine after vGPU driver installation
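
On a systemd-based Linux guest, the restart and a quick license check might look like the following; once the nvidia-gridd service picks up the client configuration token, the license status reported by nvidia-smi should show the vGPU as licensed.

    # systemctl restart nvidia-gridd
    # nvidia-smi -q | grep -i license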