You can deploy AI/ML workloads on TKG clusters on Supervisor using NVIDIA vGPU technology.

TKG Support for AI/ML Workloads

You can deploy compute-intensive workloads to TKG clusters on Supervisor. In this context, a compute-intensive workload is an artificial intelligence (AI) or machine learning (ML) application that requires the use of a GPU accelerator device.

To facilitate running AI/ML workloads in a Kubernetes environment, VMware has partnered with NVIDIA to support the NVIDIA GPU Cloud (NGC) platform on vSphere with Tanzu. This means that you can deploy container images from the NGC Catalog on TKG clusters on Supervisor.
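For illustration, once a GPU-enabled TKG cluster is running the NVIDIA device plugin (for example, as installed by the NVIDIA GPU Operator), an NGC Catalog image is scheduled like any other container by requesting a GPU resource. The sketch below is a minimal example, not a specific NGC release: the image tag is a placeholder, and the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is active on the cluster.

```yaml
# Hypothetical sketch: run an NGC Catalog image on a GPU-enabled TKG cluster.
# Assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-sample
spec:
  containers:
  - name: cuda-sample
    image: nvcr.io/nvidia/cuda:<tag>   # placeholder: use an image/tag from the NGC Catalog
    resources:
      limits:
        nvidia.com/gpu: 1              # request one vGPU-backed GPU device
```

The Pod is scheduled only onto a cluster node that can satisfy the GPU resource request.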

For more information about vGPU support in vSphere 8, see the vGPU article on VMware Tech Zone.
Note: The vSphere Distributed Resource Scheduler (DRS) distributes vGPU VMs in a breadth-first manner across the hosts comprising a vSphere cluster. For more information, see DRS Placement of vGPU VMs in the vSphere Resource Management guide.

Supported vGPU Modes

Deploying AI/ML workloads on TKG requires the use of the Ubuntu OVA that is available through the vSphere with Tanzu content delivery network. TKG supports two modes of GPU operation: vGPU, and vGPU combined with Dynamic DirectPath IO.

Mode Configuration: NVIDIA + TKG + Ubuntu + vGPU

Description: The GPU device is virtualized by the NVIDIA Host Manager Driver installed on each ESXi host and is shared across multiple NVIDIA virtual GPUs (vGPUs).

Each NVIDIA vGPU is defined by the amount of GPU memory it is allocated. For example, if the GPU device has 32 GB of memory in total, you can create 8 vGPUs with 4 GB of memory each.

Mode Configuration: NVIDIA + TKG + Ubuntu + vGPU and Dynamic DirectPath IO (NIC Passthrough)

Description: In the same VM class where you configure the NVIDIA vGPU profile, you include support for a passthrough networking device using Dynamic DirectPath IO. In this case, vSphere DRS determines VM placement.
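The two modes above differ only in the devices attached to the VM class. As a rough sketch of how such a class might look through the VM Operator API (vGPU-enabled VM classes are typically created through the vSphere Client; the field names below assume the `vmoperator.vmware.com/v1alpha1` VirtualMachineClass schema, and the profile name, vendor ID, and device ID are placeholders):

```yaml
# Hypothetical sketch of a GPU-enabled VM class combining an NVIDIA vGPU
# profile with a Dynamic DirectPath IO passthrough NIC. Field names assume
# the vmoperator.vmware.com/v1alpha1 VirtualMachineClass schema; the vGPU
# profile name and PCI device ID are placeholders.
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: vmclass-vgpu-nicpt
spec:
  hardware:
    cpus: 8
    memory: 64Gi
    devices:
      vgpuDevices:
      - profileName: grid_a100-4c      # placeholder vGPU profile (4 GB of GPU memory)
      dynamicDirectPathIODevices:
      - vendorID: 4318                 # NVIDIA PCI vendor ID (0x10DE)
        deviceID: 7864                 # placeholder PCI device ID
```

Because the class includes a Dynamic DirectPath IO device, vSphere DRS, not the administrator, selects a host that can satisfy both the vGPU profile and the passthrough device at power-on.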