GPUs are used for computer vision and machine learning at the edge, and can be deployed using PCI Passthrough or GPU Sharing through the ESXi hypervisor. NVIDIA supports GPU sharing through time slicing or multi-instance GPU mode. To use NVIDIA GPUs as a resource in TKG workload clusters, the NVIDIA GPU Operator is required, and GPU-enabled clusters only support PCI Passthrough as of TKG version 1.6.
GPUs at the edge are typically deployed for computer vision solutions and compute workloads such as machine learning, deep learning, and video inferencing. Virtual machines and containers on Edge Compute Stack can use GPUs in one of two ways: PCI Passthrough or GPU Sharing through the ESXi hypervisor.
PCI Passthrough bypasses the ESXi hypervisor and assigns the entire GPU card to a virtual machine, providing it with all the GPU resources. PCI Passthrough allows the GPU to map to only a single virtual machine and does not support HA, failover, and other advanced vSphere cluster features. GPU Sharing allows the GPU card to be shared between multiple virtual machines and containers through native integration between vSphere and GPU vendors such as NVIDIA, so that slices of the GPU resource can be assigned to different virtual machines. This reference architecture focuses on NVIDIA, given the product maturity of NVIDIA GPU support on vSphere. NVIDIA allows GPU sharing through time slicing (vGPU) or NVIDIA Multi-Instance GPU (MIG) mode, in which memory and the computational cores are statically partitioned. MIG is supported only on GPUs based on the Ampere architecture or later.
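The constraints in the paragraph above can be sketched as a small selection helper. This is a minimal illustration, not a VMware or NVIDIA tool: the function name, the mode labels, and the architecture list are hypothetical, and the list is not exhaustive. It encodes only what the text states: passthrough serves one virtual machine, and MIG needs Ampere or later.

```python
from typing import List

# Hypothetical, non-exhaustive ordering of NVIDIA architectures, oldest first.
ARCHITECTURES = ["pascal", "volta", "turing", "ampere", "hopper"]


def candidate_gpu_modes(architecture: str, vm_count: int) -> List[str]:
    """Return the GPU deployment modes consistent with the text above."""
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    arch = architecture.lower()
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {architecture}")

    modes = []
    # PCI Passthrough maps the whole card to exactly one virtual machine.
    if vm_count == 1:
        modes.append("pci-passthrough")
    # The text names no architecture restriction for time-sliced vGPU sharing.
    modes.append("vgpu")
    # MIG requires Ampere or later; individual GPU models may still lack it.
    if ARCHITECTURES.index(arch) >= ARCHITECTURES.index("ampere"):
        modes.append("mig")
    return modes
```

For example, candidate_gpu_modes("ampere", 4) excludes passthrough because four virtual machines cannot share a passed-through card. Whether a specific GPU model supports vGPU or MIG still has to be confirmed against the vendor's support matrix.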
For TKG workload clusters to use NVIDIA GPUs as a cluster resource, the NVIDIA GPU Operator is required. It manages the GPU resources in the Kubernetes clusters and automates the bootstrapping of worker nodes with those resources. As of TKG version 1.6, a GPU-enabled workload cluster supports GPUs only in PCI Passthrough mode.
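One way to confirm that worker nodes expose GPUs after the operator has bootstrapped them is to inspect each node's allocatable resources. The sketch below assumes the GPUs are advertised under the nvidia.com/gpu resource name used by the NVIDIA device plugin, and that the input is the JSON printed by kubectl get nodes -o json. The function name is hypothetical.

```python
import json
from typing import Dict

# Resource name under which the NVIDIA device plugin advertises GPUs.
GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_node(node_list_json: str) -> Dict[str, int]:
    """Map each node name to its allocatable GPU count (0 if none)."""
    nodes = json.loads(node_list_json).get("items", [])
    counts = {}
    for node in nodes:
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "1".
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts
```

The output of kubectl can be saved to a file and passed in as a string. A worker node that reports 0 has no GPU visible to the Kubernetes scheduler.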
The deployments described in the subsequent sections were performed in the following test environment:
vSphere 7.0 Update 3, the minimum release required, with the following builds:
  VMware vCenter Server 7.0 Update 3 | ISO Build 18700403
  VMware ESXi 7.0 Update 3c | ISO Build 19193900
Tanzu Kubernetes Grid 1.6
NVIDIA AI Enterprise Release 2.2
  NVIDIA vGPU Software v14.2
    Virtual GPU Manager: 510.85.03
    Graphics Driver for Linux: 510.85.02
NVIDIA GPU Operator 1.11.1
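The vSphere builds listed above can be checked against an existing environment with a small comparison. This is a sketch under two assumptions: a higher build number indicates a newer build within the 7.0 line, and the installed build numbers have been collected separately. The dictionary keys and the function name are hypothetical.

```python
from typing import Dict, List

# Build numbers from the test environment listed above.
MINIMUM_BUILDS = {"vcenter": 18700403, "esxi": 19193900}


def builds_below_minimum(installed: Dict[str, int]) -> List[str]:
    """Return components whose build is missing or lower than the tested build."""
    return sorted(
        name
        for name, minimum in MINIMUM_BUILDS.items()
        # A component absent from the input is treated as build 0.
        if installed.get(name, 0) < minimum
    )
```

An empty result means both components are at or above the tested builds. Any name returned points to a component to upgrade or verify before proceeding.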