To enable developers to deploy AI/ML workloads on TKGS clusters, as a Cluster Operator you configure the Kubernetes environment to support NVIDIA vGPU operations.

Cluster Operator Workflow for Deploying AI/ML Workloads on TKGS Clusters

The high-level steps to deploy AI/ML workloads on TKGS clusters are as follows:

Step 0: Review system requirements. See Operator Step 0: Review System Requirements.
Step 1: Download kubectl and the vSphere Plugin for kubectl to your local workstation. See Operator Step 1: Install the Kubernetes CLI Tools for vSphere on Your Workstation.
Step 2: Use kubectl to log in to the Supervisor Cluster, which populates .kube/config with the context for the Supervisor Cluster. See Operator Step 2: Log in to the Supervisor Cluster.
Step 3: Use kubectl to switch context to the vSphere Namespace. See Operator Step 3: Switch Context to the vSphere Namespace.
Step 4: Use kubectl to list VM classes and verify that the NVIDIA vGPU-enabled class is present. See Operator Step 4: Get the Custom VM Class for vGPU Workloads.
Step 5: Use kubectl to list the available Tanzu Kubernetes releases and verify that the Ubuntu image is present. See Operator Step 5: Get the Ubuntu Tanzu Kubernetes Release for GPU Nodes.
Step 6: Craft the YAML specification for provisioning the GPU-enabled TKGS cluster; specify the TKR version and the VM class. See Operator Step 6: Craft the YAML for Provisioning the vGPU-enabled TKGS Cluster.
Step 7: Provision the TKGS cluster. See Operator Step 7: Provision the TKGS Cluster.
Step 8: Log in to the cluster and verify provisioning. See Operator Step 8: Log In to the TKGS Cluster and Verify Provisioning.
Step 9: Prepare to install the NVAIE GPU Operator by creating prerequisite objects in the TKGS cluster, including a namespace, role bindings, an image secret, and a license configmap. See Operator Step 9: Prepare to Install the NVAIE GPU Operator.
Step 10: Install the NVAIE GPU Operator in the cluster. See Operator Step 10: Install the NVIDIA GPU Operator in the Cluster.
Step 11: Deploy AI/ML workloads to the vGPU-enabled TKGS cluster. See Operator Step 11: Deploy an AI/ML Workload.

Operator Step 0: Review System Requirements

Refer to the following system requirements to set up the environment for deploying AI/ML workloads on TKGS clusters.

NVIDIA vGPU environment: The vSphere administrator has set up the environment for NVIDIA vGPU. See vSphere Administrator Workflow for Deploying AI/ML Workloads on TKGS Clusters (vGPU).

TKR Ubuntu OVA: Tanzu Kubernetes release Ubuntu, ob-18691651-tkgs-ova-ubuntu-2004-v1.20.8---vmware.1-tkg.2.

TKG cluster provisioner: Tanzu Kubernetes Grid Service, API version run.tanzu.vmware.com/v1alpha2.

NVIDIA GPU Operator: GPU Operator v1.8.0.

NVIDIA GPU driver container: nvcr.io/nvstaging/cnt-ea/driver:470.51-ubuntu20.04

Operator Step 1: Install the Kubernetes CLI Tools for vSphere on Your Workstation

Download and install the Kubernetes CLI Tools for vSphere.

If you are using Linux, you can run the following commands to download and install the tools. Replace SC_IP with the IP address of the Supervisor Cluster.

# Download the vSphere plugin bundle from the Supervisor Cluster endpoint
curl -LOk https://${SC_IP}/wcp/plugin/linux-amd64/vsphere-plugin.zip
unzip vsphere-plugin.zip
# Install kubectl and kubectl-vsphere into the PATH
mv -v bin/* /usr/local/bin/

For additional guidance, see Download and Install the Kubernetes CLI Tools for vSphere.

Operator Step 2: Log in to the Supervisor Cluster

Using the vSphere Plugin for kubectl, authenticate with the Supervisor Cluster.
kubectl vsphere login --server=IP-ADDRESS --vsphere-username USERNAME

Operator Step 3: Switch Context to the vSphere Namespace

Using kubectl, switch context to the vSphere Namespace that the vSphere administrator created for the TKGS GPU cluster.
kubectl config get-contexts
kubectl config use-context TKGS-GPU-CLUSTER-NAMESPACE

Operator Step 4: Get the Custom VM Class for vGPU Workloads

Verify that the custom VM Class with the vGPU profile that the vSphere Administrator created is available in the target vSphere Namespace.
kubectl get virtualmachineclassbindings
Note: The VM class must be bound to the target vSphere Namespace. If you do not see the custom VM class for vGPU workloads, check with the vSphere Administrator.
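Once the binding is listed, you can inspect the class itself to confirm the vGPU profile it carries. A minimal check, assuming a class named class-vgpu-a100 (a hypothetical name; substitute the vGPU class bound in your vSphere Namespace):

```shell
# Hypothetical vGPU class name -- substitute the class bound in your vSphere Namespace
VGPU_CLASS=class-vgpu-a100

# Confirm the binding exists in the current namespace
kubectl get virtualmachineclassbindings

# Inspect the class definition to verify the vGPU profile it carries
kubectl describe virtualmachineclass "$VGPU_CLASS"
```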

Operator Step 5: Get the Ubuntu Tanzu Kubernetes Release for GPU Nodes

Verify that the required Ubuntu Tanzu Kubernetes release that the vSphere Administrator synchronized from the Content Library is available in the vSphere Namespace.
kubectl get tanzukubernetesreleases
Or, using the shortcut:
kubectl get tkr

Operator Step 6: Craft the YAML for Provisioning the vGPU-enabled TKGS Cluster

Construct the YAML file for provisioning a Tanzu Kubernetes cluster.

Start with one of the examples below. Use the information you gleaned from the output of the preceding commands to customize the cluster specification. Refer to the full list of configuration parameters: Configuration Parameters for Provisioning Tanzu Kubernetes Clusters Using the Tanzu Kubernetes Grid Service v1alpha2 API.

Example 1 specifies two worker node pools.
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
   #cluster name
   name: tkgs-cluster-gpu-a100
   #target vsphere namespace
   namespace: tkgs-gpu-operator
spec:
   topology:
     controlPlane:
       replicas: 3
       #storage class for control plane nodes
       #use `kubectl describe storageclasses`
       #to list available storage classes
       storageClass: vwt-storage-policy
       vmClass: guaranteed-medium
       #TKR NAME for Ubuntu ova supporting GPU
       tkr:
         reference:
           name: 1.20.8---vmware.1-tkg.1
     nodePools:
     - name: nodepool-a100-primary
       replicas: 3
       storageClass: vwt-storage-policy
       #custom VM class for vGPU
       vmClass: class-vgpu-a100
       #TKR NAME for Ubuntu ova supporting GPU 
       tkr:
         reference:
           name: 1.20.8---vmware.1-tkg.1
     - name: nodepool-a100-secondary
       replicas: 3
       vmClass: class-vgpu-a100
       storageClass: vwt-storage-policy
       #TKR NAME for Ubuntu ova supporting GPU
       tkr:
         reference:
           name: 1.20.8---vmware.1-tkg.1
   settings:
     storage:
       defaultClass: vwt-storage-policy
     network:
       cni:
        name: antrea
       services:
        cidrBlocks: ["198.51.100.0/12"]
       pods:
        cidrBlocks: ["192.0.2.0/16"]
       serviceDomain: managedcluster.local
Example 2 specifies a separate volume on worker nodes for the containerd runtime, with a capacity of 50 GiB. This setting is configurable. Providing a separate volume of sufficient size is recommended for container-based AI/ML workloads.
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  name: tkc
  namespace: tkg-ns-auto
spec:
  distribution:
    fullVersion: v1.20.8+vmware.1-tkg.1
  topology:
    controlPlane:
      replicas: 3
      storageClass: vwt-storage-policy
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.1
      vmClass: best-effort-medium
    nodePools:
    - name: workers
      replicas: 3
      storageClass: k8s-storage-policy
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.1
      vmClass: vmclass-vgpu
      volumes:
      - capacity:
          storage: 50Gi
        mountPath: /var/lib/containerd
        name: containerd
      - capacity:
          storage: 50Gi
        mountPath: /var/lib/kubelet
        name: kubelet
    - name: nodepool-1
      replicas: 1
      storageClass: vwt-storage-policy
      vmClass: best-effort-medium
Example 3 includes additional cluster metadata, such as a label.
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  annotations:
  labels:
    run.tanzu.vmware.com/tkr: v1.20.8---vmware.1-tkg.1
  name: tkgs-gpu-direct-rdma
  namespace: tkgs-ns
spec:
  settings:
    network:
      cni:
        name: antrea
      pods:
        cidrBlocks:
        - 192.168.0.0/16
      serviceDomain: cluster.local
      services:
        cidrBlocks:
        - 10.96.0.0/12
  topology:
    controlPlane:
      replicas: 3
      storageClass: tkgs-storage-policy
      vmClass: guaranteed-medium
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.1
    nodePools:
    - name: workers
      replicas: 5
      storageClass: tkgs-storage-policy
      vmClass: claire-gpu-direct-rdma
      volumes:
      - capacity:
          storage: 50Gi
        mountPath: /var/lib/containerd
        name: containerd
      - capacity:
          storage: 50Gi
        mountPath: /var/lib/kubelet
        name: kubelet
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.1

Operator Step 7: Provision the TKGS Cluster

Provision the cluster by running the following kubectl command.
kubectl apply -f CLUSTER-NAME.yaml
For example:
kubectl apply -f tkgs-gpu-cluster-1.yaml
Monitor the deployment of cluster nodes using kubectl.
kubectl get tanzukubernetesclusters -n NAMESPACE
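Beyond the summary listing, you can poll the cluster object's status and watch the node VMs come up. A sketch, assuming the cluster and namespace names from Example 1 (substitute your own):

```shell
# Names from Example 1 -- substitute your own cluster and namespace
CLUSTER=tkgs-cluster-gpu-a100
NAMESPACE=tkgs-gpu-operator

# Overall lifecycle phase of the cluster (creating, running, and so on)
kubectl get tanzukubernetescluster "$CLUSTER" -n "$NAMESPACE" -o jsonpath='{.status.phase}'

# The node VMs backing the cluster, as they are cloned and powered on
kubectl get virtualmachines -n "$NAMESPACE"
```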

Operator Step 8: Log In to the TKGS Cluster and Verify Provisioning

Using the vSphere Plugin for kubectl, log in to the TKGS cluster.
kubectl vsphere login --server=IP-ADDRESS --vsphere-username USERNAME \
--tanzu-kubernetes-cluster-name CLUSTER-NAME --tanzu-kubernetes-cluster-namespace NAMESPACE-NAME
Use the following commands to verify the cluster:
kubectl cluster-info
kubectl get nodes
kubectl get namespaces
kubectl api-resources
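Because the GPU nodes require the Ubuntu TKR, it is also worth confirming the node operating system after login. One way to do this, using kubectl's jsonpath output over standard node fields:

```shell
# Print each node name with its OS image; GPU worker nodes should report Ubuntu 20.04
JSONPATH='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}'
kubectl get nodes -o jsonpath="$JSONPATH"
```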

Operator Step 9: Prepare to Install the NVAIE GPU Operator

Prior to installing the GPU Operator with NVIDIA AI Enterprise, complete the following tasks for the TKGS cluster you provisioned. For additional guidance, see Prerequisite Tasks in the NVAIE documentation.
Note: If you are using the NVIDIA Delegated Licensing Server (DLS), refer to the following topic for instructions: Cluster Operator Addendum for Deploying AI/ML Workloads on TKGS Clusters (DLS)
  1. Create the Kubernetes namespace gpu-operator-resources. As a best practice, always deploy everything in this namespace.
    kubectl create ns gpu-operator-resources
  2. Create role bindings.

    Tanzu Kubernetes clusters have pod security policy enabled.

    Create rolebindings.yaml.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: psp:vmware-system-privileged:default
      namespace: default
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: psp:vmware-system-privileged
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: system:nodes
    - apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: system:serviceaccounts
    Apply the role binding.
    kubectl apply -f rolebindings.yaml
    Create post-rolebindings.yaml.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: psp:vmware-system-privileged:gpu-operator-resources
      namespace: gpu-operator-resources
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: psp:vmware-system-privileged
    subjects:
    - kind: Group
      apiGroup: rbac.authorization.k8s.io
      name: system:serviceaccounts
    Apply the role binding:
    kubectl apply -f post-rolebindings.yaml
  3. Create an image secret with NGC credentials that the cluster can use to pull container images from the NVIDIA GPU Cloud Catalog.
    kubectl create secret docker-registry registry-secret \
         --docker-server=server-name --docker-username='$oauthtoken' \
         --docker-password=<place_holder> \
         --docker-email=email-name -n gpu-operator-resources
  4. Create a configmap for the NVIDIA license server.
    kubectl create configmap licensing-config -n gpu-operator-resources --from-file=gridd.conf

    The gridd.conf file references the NVIDIA license server address, for example:

    # Description: Set License Server Address
    # Data type: string
    # Format:  "<address>"
    ServerAddress=<place_holder>
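Before moving on to the operator install, you can confirm that all four prerequisite objects exist. A quick sanity check, using the object names created in the steps above:

```shell
# Namespace, role binding, image secret, and license configmap from the steps above
NS=gpu-operator-resources
kubectl get ns "$NS"
kubectl get rolebinding "psp:vmware-system-privileged:$NS" -n "$NS"
kubectl get secret registry-secret -n "$NS"
kubectl get configmap licensing-config -n "$NS"
```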
    

Operator Step 10: Install the NVIDIA GPU Operator in the Cluster

Install the NVAIE GPU Operator version 1.8.0 in the TKGS cluster. For additional guidance, refer to the GPU Operator documentation.
Note: If you are using the NVIDIA Delegated Licensing Server (DLS), refer to the following topic for instructions: Cluster Operator Addendum for Deploying AI/ML Workloads on TKGS Clusters (DLS)
  1. Install Helm by referring to the Helm documentation.
  2. Add the gpu-operator Helm repository.
    helm repo add nvidia https://nvidia.github.io/gpu-operator
  3. Install the NVAIE GPU Operator by running the following command.

    Where necessary, substitute environment variable values with those that match your environment.

    export PRIVATE_REGISTRY="private/registry/path"
    export OS_TAG=ubuntu20.04
    export VERSION=460.73.01
    export VGPU_DRIVER_VERSION=460.73.01-grid
    export NGC_API_KEY=ZmJjMHZya...LWExNTRi
    export REGISTRY_SECRET_NAME=registry-secret
    
     helm install gpu-operator nvidia/gpu-operator \
        --set driver.repository=$PRIVATE_REGISTRY \
        --set driver.version=$VERSION \
        --set driver.imagePullSecrets="{$REGISTRY_SECRET_NAME}" \
        --set operator.defaultRuntime=containerd \
        --set driver.licensingConfig.configMapName=licensing-config
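After the Helm install completes, the operator and its operands should come up in the cluster. A hedged verification sketch (pod names vary by GPU Operator version and are abbreviated here; the daemonset pod name below is a placeholder):

```shell
# With GPU Operator v1.8.0 the operands run in the gpu-operator-resources namespace
kubectl get pods -n gpu-operator-resources

# Once the driver daemonset pod is Running, nvidia-smi inside it should list the vGPU.
# The pod name is a placeholder -- copy the real one from the listing above.
DRIVER_POD=nvidia-driver-daemonset-xxxxx
kubectl exec -n gpu-operator-resources "$DRIVER_POD" -- nvidia-smi
```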

Operator Step 11: Deploy an AI/ML Workload

The NVIDIA GPU Cloud Catalog offers several off-the-shelf container images you can use to run AI/ML workloads on your vGPU-enabled Tanzu Kubernetes clusters. For more information on the images available, see the NGC documentation.
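As a smoke test before deploying a full NGC workload, you can schedule a minimal pod that requests a vGPU resource. The manifest below is an illustrative sketch, not an NGC catalog image; the image reference is a placeholder and should be replaced with a CUDA-capable image you can pull.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: gpu-operator-resources
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    # Placeholder image reference -- substitute a CUDA-capable image from your registry
    image: REGISTRY/cuda-sample:TAG
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Apply the manifest with kubectl apply -f and check the pod log; nvidia-smi output listing the GRID vGPU device indicates that the driver and device plugin are working.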