You can deploy AI/ML workloads on TKG clusters on Supervisor using NVIDIA vGPU technology.
TKG Support for AI/ML Workloads
You can deploy compute-intensive workloads to TKG clusters on Supervisor. In this context, a compute-intensive workload is an artificial intelligence (AI) or machine learning (ML) application that requires a GPU accelerator device.
To facilitate running AI/ML workloads in a Kubernetes environment, VMware has partnered with NVIDIA to support the NVIDIA GPU Cloud (NGC) platform on vSphere with Tanzu. This means that you can deploy container images from the NGC Catalog on TKG clusters on Supervisor.
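For example, once a cluster is provisioned with a vGPU-enabled VM class, an NGC container image can be scheduled like any other GPU workload. The following is a minimal sketch, assuming the NVIDIA GPU Operator is installed on the cluster and exposes the nvidia.com/gpu resource; the pod name and image tag are illustrative.

```yaml
# Minimal sketch: run a CUDA sample image from the NGC Catalog on a
# GPU-enabled TKG cluster. Assumes the NVIDIA GPU Operator advertises
# the nvidia.com/gpu resource; the image tag is an assumption.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1   # request one vGPU-backed GPU device
```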
Supported vGPU Modes
Deploying AI/ML workloads on TKG requires the Ubuntu OVA that is available through the vSphere with Tanzu content delivery network. TKG supports two modes of GPU operation: NVIDIA vGPU, and NVIDIA vGPU with Dynamic DirectPath IO.
Mode | Configuration | Description |
---|---|---|
NVIDIA vGPU | NVIDIA + TKG + Ubuntu + vGPU | The GPU device is virtualized by the NVIDIA Host Manager Driver installed on each ESXi host and is shared across multiple NVIDIA virtual GPUs (vGPUs). Each vGPU is defined by the amount of memory it is allocated from the GPU device. For example, if the GPU device has 32 GB of memory in total, you can create eight vGPUs with 4 GB of memory each (see the first sketch after this table). |
NVIDIA vGPU and Dynamic DirectPath IO | NVIDIA + TKG + Ubuntu + vGPU + NIC Passthrough | In the same VM class where you configure the NVIDIA vGPU profile, you include support for a passthrough networking device using Dynamic DirectPath IO. In this case, vSphere DRS determines VM placement (see the second sketch after this table). |
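To make the first mode concrete, here is a minimal sketch of a VM class that allocates an NVIDIA vGPU profile, expressed through the VirtualMachineClass API (in practice you typically define vGPU VM classes in the vSphere Client). The class name and the grid_a100-4c profile name are assumptions; available profile names depend on the GPU device and the NVIDIA Host Manager Driver version installed on your hosts.

```yaml
# Sketch of a vGPU-enabled VM class. The class name and vGPU profile
# name are assumptions; use the profiles exposed by your hosts.
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: vmclass-vgpu          # hypothetical class name
spec:
  hardware:
    cpus: 4
    memory: 64Gi
    devices:
      vgpuDevices:
      # The profile name encodes the vGPU memory slice carved from
      # the physical GPU, for example a 4 GB slice of an A100.
      - profileName: grid_a100-4c
```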
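For the second mode, the same class can additionally declare a passthrough NIC with Dynamic DirectPath IO. This sketch uses a hypothetical class name and placeholder PCI vendor and device IDs; substitute the decimal PCI IDs of the NIC in your hosts.

```yaml
# Sketch of a VM class combining a vGPU profile with a passthrough NIC.
# Vendor/device IDs below are placeholders for the NIC's PCI IDs.
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: vmclass-vgpu-nic      # hypothetical class name
spec:
  hardware:
    cpus: 4
    memory: 64Gi
    devices:
      vgpuDevices:
      - profileName: grid_a100-4c   # assumed vGPU profile, as above
      dynamicDirectPathIODevices:
      - vendorID: 5555              # placeholder decimal PCI vendor ID
        deviceID: 4124              # placeholder decimal PCI device ID
        customLabel: vgpu-nic
```

Because the passthrough device is bound to specific hosts, vSphere DRS uses these IDs to place VMs built from this class on a host where both the vGPU profile and a matching NIC are available.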