As a DevOps engineer, you can use the self-service Automation Service Broker catalog to request a GPU-accelerated Tanzu Kubernetes Grid (TKG) cluster whose worker nodes can run AI/ML workloads.

Note: This documentation is based on VMware Aria Automation 8.18. For information about the VMware Private AI Foundation functionality in VMware Aria Automation 8.18.1, see Deploy a GPU-Accelerated TKG Cluster by Using a Self-Service Catalog Item in VMware Aria Automation in the VMware Private AI Foundation with NVIDIA documentation.

The TKG cluster contains the NVIDIA GPU Operator, a Kubernetes operator that installs and manages the NVIDIA driver appropriate for the GPU hardware on the TKG cluster nodes. The deployed cluster is ready to run AI/ML workloads without additional GPU-related setup.
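Because the GPU Operator is preinstalled, a quick sanity check on the deployed cluster is usually all that is needed. A minimal sketch, assuming you have a kubeconfig for the TKG cluster and that the operator runs in its default `gpu-operator` namespace (the namespace used by your deployment may differ):

```shell
# Confirm that the GPU Operator pods are running.
# (The gpu-operator namespace is the operator's default; adjust if needed.)
kubectl get pods -n gpu-operator

# Confirm that the driver has registered the GPU resource on the worker nodes.
kubectl describe nodes | grep nvidia.com/gpu
```

If the nodes report a nonzero `nvidia.com/gpu` capacity, workloads can request GPUs through the standard Kubernetes resource mechanism.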

The deployment contains a supervisor namespace, a TKG cluster with three worker nodes, multiple resources inside the TKG cluster, and a Carvel application that deploys the GPU Operator application.

Procedure

  1. On the Catalog page in Automation Service Broker, locate the AI Kubernetes Cluster card and click Request.
  2. Select a project.
  3. Enter a name and description for your deployment.
  4. Select the number of control plane nodes.
    Setting Sample value
    Node count 1
    VM class best-effort-4xlarge - 16 CPUs and 128 GB Memory

    The class selection defines the resources available within the virtual machine.

  5. Select the number of worker nodes.
    Setting Sample value
    Node count 3
    VM class best-effort-4xlarge-a100-40c - 1 vGPU (40 GB), 16 CPUs and 120 GB Memory
    Time-slicing replicas 1

    Time-slicing defines the number of replicas into which a GPU is divided so that it can be shared between workloads.

  6. Provide the NVIDIA AI Enterprise API key.
  7. Click Submit.
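The time-slicing replica count you select in step 5 maps to the NVIDIA GPU Operator's time-slicing feature. As a rough sketch of how that setting takes effect, the operator consumes a ConfigMap of the following shape (this example follows the NVIDIA GPU Operator documentation; the ConfigMap name, namespace, and key used by this particular deployment are assumptions):

```yaml
# Hypothetical example of a GPU Operator time-slicing configuration.
# The catalog item generates equivalent settings from the
# "Time-slicing replicas" value entered in step 5.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config     # name is illustrative
  namespace: gpu-operator       # the operator's default namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 1         # the requested replica count
```

With `replicas` set to N, each GPU is advertised to Kubernetes as N schedulable `nvidia.com/gpu` resources, so up to N pods can share the same physical or virtual GPU.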