You can deploy AI/ML workloads on TKG Service clusters using NVIDIA GPU technology.
TKGS Support for AI/ML Workloads
You can deploy compute-intensive workloads to TKG Service clusters. In this context, a compute-intensive workload is an artificial intelligence (AI) or machine learning (ML) application that requires a GPU accelerator device.
To facilitate running AI/ML workloads in a Kubernetes environment, VMware has partnered with NVIDIA to support the NVIDIA GPU Cloud (NGC) platform. This means that you can deploy container images from the NGC Catalog on TKGS clusters. For more information about NVIDIA GPU support in vSphere 8, see the vGPU article on VMware Tech Zone.
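As a sketch of what such a deployment looks like, the pod below runs an NGC Catalog image and requests a single GPU through the standard nvidia.com/gpu extended resource. This assumes the NVIDIA device plugin (or GPU Operator) is installed on the cluster and advertises the nvidia.com/gpu resource; the pod name and the image tag are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # placeholder name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    # NGC Catalog image; the tag is a placeholder, pick one from nvcr.io.
    image: nvcr.io/nvidia/pytorch:23.10-py3
    command: ["nvidia-smi"]       # prints the GPU visible inside the container
    resources:
      limits:
        nvidia.com/gpu: 1         # extended resource advertised by the NVIDIA device plugin
```

The scheduler places the pod only on a worker node that advertises an available nvidia.com/gpu, which in a TKGS cluster is a node backed by a GPU-enabled VM class.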
Supported GPU Modes
GPU Mode | OS | TKr | vSphere with Tanzu | Description |
---|---|---|---|---|
NVIDIA vGPU | Ubuntu 20.04 LTS | 1.22 - 1.2x (latest up through 1.28) | 7.0 U3c; 8.0 U2+ | The GPU device is virtualized by the NVIDIA Host Manager Driver installed on each ESXi host and shared across multiple NVIDIA virtual GPUs (vGPUs). Each vGPU is defined by the amount of memory it is allocated from the GPU device. For example, if the GPU device has 32 GB of RAM, you can create 8 vGPUs with 4 GB of memory each. |
NVIDIA vGPU with Dynamic DirectPath IO | Ubuntu 20.04 LTS | 1.22 - 1.2x (latest up through 1.28) | 7.0 U3c; 8.0 U2+ | In the same VM class where you configure the NVIDIA vGPU profile, you include support for a passthrough networking device using Dynamic DirectPath IO. In this case, vSphere DRS determines VM placement. |

Note: The vSphere Distributed Resource Scheduler (DRS) distributes vGPU VMs in a breadth-first manner across the hosts comprising a vSphere cluster. For more information, see DRS Placement of vGPU VMs in the vSphere Resource Management guide.
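To illustrate the second mode: GPU-enabled VM classes are typically created in the vSphere Client, but the resulting object is surfaced through the VM Service VirtualMachineClass API. The following is a minimal sketch only, assuming the vmoperator.vmware.com/v1alpha1 API; the class name, vGPU profile name, and PCI vendor/device IDs are placeholder values, and the exact schema can vary by vSphere release.

```yaml
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: vmclass-vgpu-directpath      # placeholder name
spec:
  hardware:
    cpus: 8
    memory: 64Gi
    devices:
      # NVIDIA vGPU profile; real profile names depend on the GPU and driver.
      vgpuDevices:
      - profileName: grid_a100-4c    # placeholder profile
      # Passthrough networking device added via Dynamic DirectPath IO.
      # IDs are decimal PCI vendor/device IDs (placeholders shown).
      dynamicDirectPathIODevices:
      - vendorID: 5555
        deviceID: 4126
```

As the table notes, because the class includes a passthrough device, vSphere DRS determines the placement of VMs created from this class.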