You can deploy AI/ML workloads on TKG clusters on Supervisor using NVIDIA vGPU technology.
TKG Support for AI/ML Workloads
You can deploy compute-intensive workloads to TKG clusters on Supervisor. In this context, a compute-intensive workload is an artificial intelligence (AI) or machine learning (ML) application that requires a GPU accelerator device.
To facilitate running AI/ML workloads in a Kubernetes environment, VMware has partnered with NVIDIA to support the NVIDIA GPU Cloud (NGC) platform on vSphere with Tanzu. This means that you can deploy container images from the NGC Catalog on TKG clusters on Supervisor.
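For example, once a cluster is provisioned with a vGPU-enabled VM class, an NGC container image can be scheduled like any other GPU workload. The following is a minimal sketch, assuming the NVIDIA GPU Operator is installed on the cluster and exposes the nvidia.com/gpu resource; the pod name and image tag are illustrative.

```yaml
# Minimal sketch: run a CUDA sample image from the NGC Catalog on a
# GPU-enabled TKG cluster. Assumes the NVIDIA GPU Operator advertises
# the nvidia.com/gpu resource; the image tag is an assumption.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1   # request one vGPU-backed GPU device
```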
Supported vGPU Modes
Deploying AI/ML workloads on TKG requires the Ubuntu OVA that is available through the vSphere with Tanzu content delivery network. TKG supports two modes of GPU operation: NVIDIA vGPU, and NVIDIA vGPU with Dynamic DirectPath IO.
Mode | Configuration | Description |
---|---|---|
NVIDIA vGPU | NVIDIA + TKG + Ubuntu + vGPU | The GPU device is virtualized by the NVIDIA Host Manager Driver installed on each ESXi host and is shared across multiple NVIDIA virtual GPUs (vGPUs). Each vGPU is defined by the amount of memory it is allocated from the GPU device. For example, if the GPU device has 32 GB of memory in total, you can create eight vGPUs with 4 GB of memory each (see the first sketch after this table). |
NVIDIA vGPU and Dynamic DirectPath IO | NVIDIA + TKG + Ubuntu + vGPU + NIC Passthrough | In the same VM class where you configure the NVIDIA vGPU profile, you include support for a passthrough networking device using Dynamic DirectPath IO. In this case, vSphere DRS determines VM placement (see the second sketch after this table). |
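To make the first mode concrete, here is a minimal sketch of a VM class that allocates an NVIDIA vGPU profile, expressed through the VirtualMachineClass API (in practice you typically define vGPU VM classes in the vSphere Client). The class name and the grid_a100-4c profile name are assumptions; available profile names depend on the GPU device and the NVIDIA Host Manager Driver version installed on your hosts.

```yaml
# Sketch of a vGPU-enabled VM class. The class name and vGPU profile
# name are assumptions; use the profiles exposed by your hosts.
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: vmclass-vgpu          # hypothetical class name
spec:
  hardware:
    cpus: 4
    memory: 64Gi
    devices:
      vgpuDevices:
      # The profile name encodes the vGPU memory slice carved from
      # the physical GPU, for example a 4 GB slice of an A100.
      - profileName: grid_a100-4c
```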
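For the second mode, the same class can additionally declare a passthrough NIC with Dynamic DirectPath IO. This sketch uses a hypothetical class name and placeholder PCI vendor and device IDs; substitute the decimal PCI IDs of the NIC in your hosts.

```yaml
# Sketch of a VM class combining a vGPU profile with a passthrough NIC.
# Vendor/device IDs below are placeholders for the NIC's PCI IDs.
apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: vmclass-vgpu-nic      # hypothetical class name
spec:
  hardware:
    cpus: 4
    memory: 64Gi
    devices:
      vgpuDevices:
      - profileName: grid_a100-4c   # assumed vGPU profile, as above
      dynamicDirectPathIODevices:
      - vendorID: 5555              # placeholder decimal PCI vendor ID
        deviceID: 4124              # placeholder decimal PCI device ID
        customLabel: vgpu-nic
```

Because the passthrough device is bound to specific hosts, vSphere DRS uses these IDs to place VMs built from this class on a host where both the vGPU profile and a matching NIC are available.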