You can deploy AI/ML workloads on TKGS clusters using vSphere with Tanzu and NVIDIA vGPU technology.
Announcing TKGS Support for AI/ML Workloads
Beginning with the release of vSphere with Tanzu Version 7 Update 3 Monthly Patch 1, you can deploy compute-intensive workloads to Tanzu Kubernetes clusters provisioned by the Tanzu Kubernetes Grid Service. In this context, a compute-intensive workload is an artificial intelligence (AI) or machine learning (ML) application that requires the use of a GPU accelerator device.
To facilitate running AI/ML workloads in a Kubernetes environment, VMware has partnered with NVIDIA to support the NVIDIA GPU Cloud platform on vSphere with Tanzu. This means that you can deploy container images from the NGC Catalog on Tanzu Kubernetes clusters provisioned by the Tanzu Kubernetes Grid Service.
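As a minimal sketch of what deploying an NGC Catalog image looks like, the following Pod manifest pulls a container from the NGC registry and requests a GPU. It assumes the NVIDIA GPU Operator (or the NVIDIA device plugin) is installed on the cluster so that the `nvidia.com/gpu` resource is advertised; the image tag is illustrative and should be replaced with a current tag from the NGC Catalog.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ngc-gpu-test
spec:
  containers:
  - name: tensorflow
    # Illustrative NGC image reference; pick a tag from the NGC Catalog
    image: nvcr.io/nvidia/tensorflow:21.07-tf2-py3
    # Print the visible GPU as a smoke test
    command: ["nvidia-smi"]
    resources:
      limits:
        # Resource name exposed by the NVIDIA device plugin
        nvidia.com/gpu: 1
  restartPolicy: OnFailure
```

If the vGPU is correctly configured, the Pod log shows the `nvidia-smi` output listing the vGPU profile assigned to the worker node.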
To learn more about the joint NVIDIA and VMware architecture for the AI-Ready Enterprise, see Accelerating Workloads on vSphere 7 with Tanzu - A Technical Preview of Kubernetes Clusters with GPUs.
Supported vGPU Modes
Deploying AI/ML workloads on TKGS requires the Ubuntu OVA that is available through the vSphere with Tanzu content delivery network. TKGS supports two modes of GPU operation: vGPU, and vGPU combined with NIC passthrough. The following table describes each mode in more detail.
| Mode | Configuration | Description |
|---|---|---|
| NVIDIA + TKGS + Ubuntu + vGPU | NVIDIA vGPU | The GPU device is virtualized by the NVIDIA Host Manager Driver installed on each ESXi host and is shared across multiple NVIDIA virtual GPUs (vGPUs). Each NVIDIA vGPU is defined by the amount of memory it is allocated from the GPU device. For example, if the GPU device has 32 GB of RAM in total, you can create 8 vGPUs with approximately 4 GB of memory each. |
| NVIDIA + TKGS + Ubuntu + vGPU + NIC Passthrough | NVIDIA vGPU and Dynamic DirectPath IO | In the same VM class where you configure the NVIDIA vGPU profile, you include support for a passthrough networking device using Dynamic DirectPath IO. In this case, vSphere DRS determines VM placement. |
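To illustrate how a GPU-enabled VM class is consumed, the following `TanzuKubernetesCluster` manifest sketches a cluster whose worker node pool references a VM class configured with an NVIDIA vGPU profile. The VM class name (`vmclass-vgpu`), storage class, namespace, and Tanzu Kubernetes release name are all hypothetical placeholders; substitute the names defined in your vSphere environment.

```yaml
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  name: tkgs-gpu-cluster        # hypothetical cluster name
  namespace: tkgs-gpu-ns        # hypothetical vSphere Namespace
spec:
  topology:
    controlPlane:
      replicas: 3
      vmClass: guaranteed-medium          # standard (non-GPU) class for control plane
      storageClass: tkgs-storage-policy   # hypothetical storage policy name
      tkr:
        reference:
          name: v1.22.9---vmware.1-tkg.1  # placeholder Tanzu Kubernetes release
    nodePools:
    - name: gpu-nodepool
      replicas: 2
      vmClass: vmclass-vgpu               # hypothetical VM class with a vGPU profile
      storageClass: tkgs-storage-policy
```

The key point is that the vGPU (and, in the passthrough mode, the Dynamic DirectPath IO device) is attached through the VM class, so only the node pool that references that class receives GPU-backed worker nodes.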
Getting Started
If you are using vGPU with NIC Passthrough, also refer to the following topic: vSphere Administrator Addendum for Deploying AI/ML Workloads on TKGS Clusters (vGPU and Dynamic DirectPath IO).
If you are using the NVIDIA Delegated Licensing Server (DLS) for your NVAIE account, also refer to the following topic: Cluster Operator Addendum for Deploying AI/ML Workloads on TKGS Clusters (DLS).