As a DevOps engineer, you can provision a Tanzu Kubernetes Grid cluster accelerated with NVIDIA GPUs by using the AI Kubernetes Cluster catalog item in the self-service Automation Service Broker catalog. Then, you can deploy AI container images from NVIDIA NGC on the cluster.
The TKG cluster includes the NVIDIA GPU Operator, a Kubernetes operator that installs the appropriate NVIDIA driver for the GPU hardware on the TKG cluster nodes. The deployed cluster is ready for AI/ML workloads without any additional GPU-related setup.
The deployment contains a supervisor namespace, a TKG cluster with three worker nodes, multiple resources inside the TKG cluster, and a Carvel application that deploys the GPU Operator application.
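Once the GPU Operator has finished installing the driver, the worker nodes advertise an allocatable `nvidia.com/gpu` resource. One way to confirm the cluster is GPU-ready is to schedule a short-lived pod that runs `nvidia-smi`. The manifest below is a sketch; the pod name and CUDA base image tag are example values, not defaults provided by the catalog item:

```yaml
# Sketch: one-off pod that verifies GPU access on a worker node.
# The image tag is an example; use any CUDA base image available to your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # schedules only on a node with an available vGPU
```

If the pod completes and its logs list the vGPU device, the driver and device plugin set up by the GPU Operator are working.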
For a RAG-based Tanzu Kubernetes Grid cluster, use the AI Kubernetes RAG Cluster catalog item. See Deploy a RAG Workload on a TKG Cluster Using a Self-Service Catalog Item in VMware Aria Automation.
Prerequisites
- Verify that your cloud administrator has configured Private AI Automation Services for your project.
- Verify that you have permissions to request AI catalog items.
Procedure
- On the Catalog page in Automation Service Broker, locate the AI Kubernetes Cluster card and click Request.
- Select a project.
- Enter a name and description for your deployment.
- Select the number of control plane nodes.
  Setting     Sample value
  Node count  1
  VM class    best-effort-4xlarge - 16 CPUs and 128 GB Memory

  The class selection defines the resources available within the virtual machine.
- Select the number of worker nodes.
  Setting                Sample value
  Node count             3
  VM class               best-effort-4xlarge-a100-40c - 1 vGPU (40 GB), 16 CPUs and 120 GB Memory
  Time-slicing replicas  1

  Time-slicing defines a set of replicas for a GPU that is shared between workloads.
- Provide the NVIDIA AI Enterprise API key.
- Click Submit.
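Behind the scenes, the time-slicing replica count you enter in the request form maps to the GPU Operator's device-plugin sharing configuration. The ConfigMap below is a sketch of that configuration's general shape; the ConfigMap name, namespace, and a replica count of 2 are illustrative assumptions, and the catalog item applies the equivalent settings for you:

```yaml
# Sketch: GPU Operator time-slicing configuration (illustrative names and values).
# With replicas: 2, each physical GPU is advertised as 2 nvidia.com/gpu resources.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
```

Increasing the replica count lets more pods share one GPU, at the cost of each workload contending for the same underlying device.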
What to do next
Run an AI container image. In a connected environment, use the NVIDIA NGC catalog. In a disconnected environment, use the Harbor Registry on the Supervisor.
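Pulling images from the NVIDIA NGC catalog requires registry credentials for `nvcr.io`. The deployment below is a sketch of running an NGC image on the cluster; the deployment name, image tag, and the pull secret name (which you would create from your NGC API key) are illustrative assumptions:

```yaml
# Sketch: run an NGC container image on the GPU-enabled TKG cluster.
# The secret "ngc-registry-secret" is assumed to already hold nvcr.io credentials.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      imagePullSecrets:
      - name: ngc-registry-secret   # created from your NGC API key
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3   # example tag
        resources:
          limits:
            nvidia.com/gpu: 1
```

In a disconnected environment, replace the `nvcr.io` image reference with the corresponding image path in the Harbor Registry on the Supervisor.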