DevOps engineers and developers can use VMware Aria Automation to provision GPU-accelerated TKG clusters for hosting container AI workloads on the Supervisor instance in a VI workload domain.

The workflow for deploying a GPU-accelerated TKG cluster has two parts:

  • As a cloud administrator, you add private AI self-service catalog items for a new namespace on the Supervisor to Automation Service Broker.
  • As a data scientist or DevOps engineer, you use an AI Kubernetes cluster catalog item to deploy a GPU-accelerated TKG cluster on a new namespace on the Supervisor.

Create AI Self-Service Catalog Items in VMware Aria Automation

As a cloud administrator, you can use the catalog setup wizard for private AI in VMware Aria Automation to quickly add catalog items for deploying deep learning virtual machines or GPU-accelerated TKG clusters in a VI workload domain.

Data scientists can use the deep learning catalog items to deploy deep learning VMs, and DevOps engineers can use the cluster catalog items to provision AI-ready TKG clusters.

Every time you run it, the catalog setup wizard for private AI adds two catalog items to the Service Broker catalog: one for a deep learning virtual machine and one for a TKG cluster. Run the wizard again whenever you need to do the following:

  • Enable provisioning of AI workloads on another Supervisor.
  • Accommodate a change in your NVIDIA AI Enterprise license, including the client configuration .tok file and license server, or the download URL for the vGPU guest drivers for a disconnected environment.
  • Accommodate a deep learning VM image change.
  • Use different vGPU or non-GPU VM classes, a different storage policy, or a different container registry.
  • Create catalog items in a new project.

Prerequisites

Procedure

  1. Navigate to the VMware Aria Automation home page and click Quickstart.
  2. Run the catalog setup wizard for Private AI Automation Services.

    See Add Private AI items to the Automation Service Broker catalog in the VMware Aria Automation Product Documentation.

Provision a GPU-Accelerated TKG Cluster by Using a Self-Service Catalog in VMware Aria Automation

In VMware Private AI Foundation with NVIDIA, as a DevOps engineer, you can provision a TKG cluster accelerated with NVIDIA GPUs from VMware Aria Automation by using an AI Kubernetes Cluster self-service catalog item in Automation Service Broker. Then, you can deploy AI container images from NVIDIA NGC on the cluster.

Note: VMware Aria Automation creates a namespace every time you provision a GPU-accelerated TKG cluster.

Procedure

  1. In a connected environment, in Automation Service Broker, deploy an AI Kubernetes Cluster catalog item on the Supervisor instance configured by the cloud administrator.
  2. In a disconnected environment, upload the components of the NVIDIA GPU Operator to internal locations and modify the AI Kubernetes Cluster catalog item for the Supervisor instance configured by the cloud administrator.
    1. Provide a local Ubuntu package repository and upload the container images in the NVIDIA GPU Operator package to the Harbor Registry for the Supervisor.
    2. Provide a local Helm chart repository with NVIDIA GPU Operator chart definitions.
    3. Update the Helm chart definitions of the NVIDIA GPU Operator to use the local Ubuntu package repository and private Harbor Registry.
    4. On the Design > Cloud Templates page of Automation Assembler, modify the AI Kubernetes Cluster cloud template directly, or clone the cloud template and modify the clone.
      1. Add a ConfigMap for using the local Ubuntu repository to the NVIDIA GPU Operator.
      2. Update the Helm chart repository URL.
      3. Deploy the cloud template.
    5. Deploy the modified or cloned AI Kubernetes Cluster catalog item on the Supervisor instance.
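As a rough sketch of the ConfigMap and Helm override from the steps above, the artifacts might look like the following. The hostnames (repo.example.internal, harbor.example.internal), file names, and Ubuntu release are placeholder assumptions for illustration, not values from the product; the Helm keys follow the common NVIDIA GPU Operator chart layout, which you should confirm against your chart version.

```shell
# Sketch only: hostnames and file names below are hypothetical placeholders.

# ConfigMap that points the GPU Operator driver container at a local
# Ubuntu package repository instead of the public archive.
cat > repo-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: repo-config
  namespace: gpu-operator
data:
  custom-repo.list: |
    deb [trusted=yes] http://repo.example.internal/ubuntu jammy main universe
EOF

# Helm values override that pulls images from the private Harbor registry
# and references the ConfigMap above.
cat > gpu-operator-values.yaml <<'EOF'
driver:
  repository: harbor.example.internal/nvidia
  repoConfig:
    configMapName: repo-config
EOF

echo "Generated repo-config.yaml and gpu-operator-values.yaml"
```

You would then apply the ConfigMap with kubectl and install the operator from your local Helm chart repository, passing the values file, for example with `helm install --values gpu-operator-values.yaml`.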

What to do next

  1. For details on how to access the TKG cluster by using kubectl, in Automation Service Broker, navigate to Consume > Deployments > Deployments.
  2. Deploy an AI container image from the NVIDIA NGC catalog.

    In a disconnected environment, you must upload the AI container images to a private container registry. See Setting Up a Private Harbor Registry in VMware Private AI Foundation with NVIDIA.
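A minimal sketch of deploying an AI container image on the cluster might look like the following. The registry host, image path, and tag are placeholder assumptions; in a disconnected environment the image comes from your private Harbor registry, while in a connected environment you would reference nvcr.io directly.

```shell
# Sketch only: registry host, image path, and tag are hypothetical placeholders.
REGISTRY=harbor.example.internal   # private Harbor registry in a disconnected setup

# Deployment manifest that requests one GPU so the pod is scheduled
# on a GPU-enabled node of the TKG cluster.
cat > pytorch-workload.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-test
  template:
    metadata:
      labels:
        app: pytorch-test
    spec:
      containers:
      - name: pytorch
        image: ${REGISTRY}/nvidia/pytorch:24.01-py3
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF

echo "Wrote pytorch-workload.yaml"
```

After you retrieve the kubeconfig for the cluster from the deployment details in Automation Service Broker, you might apply the manifest with `kubectl apply -f pytorch-workload.yaml`.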