Configuring vGPU on Tanzu Kubernetes Grid Clusters to allow Artificial Intelligence and Machine Learning Workloads

You can deploy artificial intelligence and machine learning workloads on clusters provisioned by Tanzu Kubernetes Grid. Deploying these workloads requires initial setup by service providers, and some configuration by organization administrators and tenant users in the cluster creation workflow.

To prepare a VMware Cloud Director environment to provision clusters that can handle artificial intelligence and machine learning workloads, service providers must create a vGPU policy and add that policy to an organization VDC. For instructions on how to perform these tasks, refer to Creating and Managing vGPU Policies. After service providers complete these steps, tenant users can deploy artificial intelligence and machine learning workloads to their Tanzu Kubernetes Grid clusters.

To create Tanzu Kubernetes Grid clusters with vGPU functionality, see Create a Tanzu Kubernetes Grid Cluster. If you are using a Tanzu Kubernetes Grid version 2.1 or later that is interoperable with VMware Cloud Director Container Service Extension, the following sections do not apply and you can proceed to the cluster creation workflow.

Note: The following sections apply to Tanzu Kubernetes Grid 1.6.1 only, which is no longer supported by VMware. To use the vGPU functionality, choose a Tanzu Kubernetes Grid version 2.1 or later that is interoperable with VMware Cloud Director Container Service Extension.

BIOS Firmware Limitations

VMware Cloud Director Container Service Extension Tanzu Kubernetes Grid templates are built with BIOS firmware, and it is not possible to change this firmware configuration. The BAR1 memory on this firmware cannot exceed 256 MB. NVIDIA GRID cards with more than 256 MB of BAR1 memory require EFI firmware. For more information on firmware limitations, refer to VMware vSphere: NVIDIA Virtual GPU Software Documentation.
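The 256 MB rule above can be expressed as a small check. The following Python sketch is illustrative only: the function name `required_firmware` is hypothetical, and you must obtain the BAR1 size of your card from its NVIDIA documentation.

```python
BIOS_BAR1_LIMIT_MB = 256  # BAR1 limit for BIOS firmware, as stated above


def required_firmware(bar1_mb: int) -> str:
    """Return "efi" if the BAR1 size exceeds the BIOS limit, else "bios"."""
    if bar1_mb <= 0:
        raise ValueError("BAR1 size must be a positive number of MB")
    # Cards with more than 256 MB of BAR1 memory require EFI firmware.
    return "efi" if bar1_mb > BIOS_BAR1_LIMIT_MB else "bios"
```

A result of "efi" means that the default BIOS-based template is not sufficient and that a custom image with EFI firmware is needed.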

Create a Custom Image with EFI Firmware

To overcome the BIOS firmware limitations of Tanzu Kubernetes Grid templates, you can create a custom image with EFI firmware in vSphere. For instructions, refer to the Linux Custom Machine Images sections in the archived Tanzu Kubernetes Grid 1.6 documentation. To access the archived documentation, see VMware Tanzu Kubernetes Grid Documentation > Unsupported Releases.

To build a Linux custom machine image for a GPU template with Tanzu Kubernetes Grid 1.6, you must also provide the following inputs when you build the custom image:

Inputs Description

customizations.json

To build an image for a vGPU-enabled cluster for vSphere, create a file named customizations.json, and add the following:

{
  "vmx_version": "17"
}

metadata.json

VERSION must exactly match the version of an existing Tanzu Kubernetes Grid template, because the Kubernetes Container Clusters UI plug-in does not recognize the OVA file if its version number differs from that of the template.

The following example outlines the recommended file naming convention:


Template and Version	Metadata
Kubernetes template for TKG 1.6	ubuntu-2004-kube-v1.23.10+vmware.1-tkg.2-b53d41690f8742e7388f2c553fd9a181.ova
Version	v1.23.10+vmware.1-tkg.2-b53d41690f8742e7388f2c553fd9a181

build-node-ova-vsphere-ubuntu-2004-efi

Use this command to run the image builder for vGPU-enabled clusters. It builds the custom image with EFI firmware.
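As an illustration of the two JSON inputs described above, the following Python sketch derives VERSION from an OVA file name that follows the recommended naming convention and writes both files. It is a minimal sketch and not part of the image builder: the function names are hypothetical, and the layout of metadata.json (a single "VERSION" key) is an assumption that you should verify against the archived Tanzu Kubernetes Grid 1.6 documentation.

```python
import json
from pathlib import Path

# Prefix and suffix taken from the example OVA file name above.
OVA_PREFIX = "ubuntu-2004-kube-"
OVA_SUFFIX = ".ova"


def version_from_ova_name(ova_name: str) -> str:
    """Return the version string embedded in an OVA file name."""
    if not (ova_name.startswith(OVA_PREFIX) and ova_name.endswith(OVA_SUFFIX)):
        raise ValueError(f"unexpected OVA file name: {ova_name}")
    return ova_name[len(OVA_PREFIX):-len(OVA_SUFFIX)]


def write_image_builder_inputs(ova_name: str, out_dir: Path) -> str:
    """Write customizations.json and metadata.json; return the version used."""
    version = version_from_ova_name(ova_name)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hardware version setting for vGPU-enabled images, as shown above.
    customizations = {"vmx_version": "17"}
    (out_dir / "customizations.json").write_text(
        json.dumps(customizations, indent=2) + "\n"
    )

    # Assumption: metadata.json stores the version under a "VERSION" key.
    metadata = {"VERSION": version}
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2) + "\n")
    return version
```

Deriving VERSION from the file name of the existing template, rather than typing it by hand, reduces the risk of a mismatch that would prevent the Kubernetes Container Clusters UI plug-in from recognizing the OVA file.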

Service providers must set up a new catalog in VMware Cloud Director for vGPU templates, and upload the templates to this catalog. When users create a vGPU-enabled cluster, they can select one of these templates in the cluster creation workflow so that the cluster can use vGPUs. For more information, see Create Catalogs and Upload OVA Files.