To enable developers to deploy AI/ML workloads on TKGS clusters, as a Cluster Operator you configure the Kubernetes environment to support NVIDIA vGPU operations.
Cluster Operator Workflow for Deploying AI/ML Workloads on TKGS Clusters
Step | Action | Link |
---|---|---|
0 | Review system requirements. | See Operator Step 0: Review System Requirements. |
1 | Download kubectl and vSphere Plugin for Kubectl to local workstation. | See Operator Step 1: Install the Kubernetes CLI Tools for vSphere on Your Workstation. |
2 | Use kubectl to log in to the Supervisor Cluster, which populates .kube/config with the context for the new Supervisor Cluster. | See Operator Step 2: Log in to the Supervisor Cluster. |
3 | Use kubectl to switch context to the vSphere Namespace. | See Operator Step 3: Switch Context to the vSphere Namespace. |
4 | Use kubectl to list VM classes and verify that the NVIDIA vGPU-enabled class is present. | See Operator Step 4: Get the Custom VM Class for vGPU Workloads. |
5 | Use kubectl to list the available Tanzu Kubernetes releases and verify that the Ubuntu image is present. | See Operator Step 5: Get the Ubuntu Tanzu Kubernetes Release for GPU Nodes. |
6 | Craft the YAML specification for provisioning the GPU-enabled TKGS cluster; specify the TKR version and the VM class. | See Operator Step 6: Craft the YAML for Provisioning the vGPU-enabled TKGS Cluster. |
7 | Provision the TKGS cluster. | See Operator Step 7: Provision the TKGS Cluster. |
8 | Log in to the cluster and verify provisioning. | See Operator Step 8: Log In to the TKGS Cluster and Verify Provisioning. |
9 | Prepare to install the NVAIE GPU Operator by creating some prerequisite objects in the TKGS cluster, including a namespace, role bindings, image secret, and license configmap. | See Operator Step 9: Prepare to Install the NVAIE GPU Operator. |
10 | Install the NVAIE GPU Operator in the cluster. | See Operator Step 10: Install the NVIDIA GPU Operator in the Cluster. |
11 | Deploy AI/ML workloads to the vGPU-enabled TKGS cluster. | See Operator Step 11: Deploy an AI/ML Workload. |
Operator Step 0: Review System Requirements
Requirement | Description |
---|---|
vSphere Administrator has set up the environment for NVIDIA vGPU | See vSphere Administrator Workflow for Deploying AI/ML Workloads on TKGS Clusters (vGPU). |
TKR Ubuntu OVA | Ubuntu Tanzu Kubernetes release, for example v1.20.8---vmware.1-tkg.1 as referenced in the Operator Step 6 cluster specifications. |
TKG Cluster Provisioner | Tanzu Kubernetes Grid Service API version: run.tanzu.vmware.com/v1alpha2 |
NVIDIA GPU Operator | GPU Operator v1.8.0 |
NVIDIA GPU Driver Container | NVIDIA vGPU driver container image hosted in your private registry (see the PRIVATE_REGISTRY and VERSION values in Operator Step 10). |
Operator Step 1: Install the Kubernetes CLI Tools for vSphere on Your Workstation
Download and install the Kubernetes CLI Tools for vSphere.
If you are using Linux, you can run the following commands to download and install the tools.
curl -LOk https://${SC_IP}/wcp/plugin/linux-amd64/vsphere-plugin.zip
unzip vsphere-plugin.zip
mv -v bin/* /usr/local/bin/
For additional guidance, see Download and Install the Kubernetes CLI Tools for vSphere.
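To confirm the tools are installed correctly, you can optionally check that both binaries are on your PATH; this quick check assumes a Linux or macOS shell.
which kubectl kubectl-vsphere
kubectl version --client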
Operator Step 2: Log in to the Supervisor Cluster
kubectl vsphere login --server=IP-ADDRESS --vsphere-username USERNAME
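For illustration only, the following assumes the Supervisor Cluster address is still in the SC_IP variable from Step 1 and uses a hypothetical administrator account; add --insecure-skip-tls-verify only if the Supervisor Cluster uses a self-signed certificate.
kubectl vsphere login --server=${SC_IP} \
--vsphere-username administrator@vsphere.local \
--insecure-skip-tls-verify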
Operator Step 3: Switch Context to the vSphere Namespace
kubectl config get-contexts
kubectl config use-context TKGS-GPU-CLUSTER-NAMESPACE
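For example, if the target vSphere Namespace is tkgs-gpu-operator (the namespace used in the Operator Step 6 cluster specifications), the context name matches the namespace name:
kubectl config use-context tkgs-gpu-operator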
Operator Step 4: Get the Custom VM Class for vGPU Workloads
kubectl get virtualmachineclassbindings
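Optionally, inspect the vGPU-enabled class in more detail. The class name class-vgpu-a100 below is the one used in the Operator Step 6 examples; substitute the class your vSphere Administrator created.
kubectl describe virtualmachineclassbinding class-vgpu-a100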
Operator Step 5: Get the Ubuntu Tanzu Kubernetes Release for GPU Nodes
kubectl get tanzukubernetesreleases
kubectl get tkr
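Optionally, inspect a specific release to confirm it is the Ubuntu image and is marked ready. The release name below matches the Operator Step 6 examples; substitute the name returned in your environment.
kubectl get tkr v1.20.8---vmware.1-tkg.1 -o yaml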
Operator Step 6: Craft the YAML for Provisioning the vGPU-enabled TKGS Cluster
Construct the YAML file for provisioning a Tanzu Kubernetes cluster.
Start with one of the examples below. Use the information you gleaned from the output of the preceding commands to customize the cluster specification. Refer to the full list of configuration parameters: TKGS v1alpha2 API for Provisioning Tanzu Kubernetes Clusters.
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  #cluster name
  name: tkgs-cluster-gpu-a100
  #target vsphere namespace
  namespace: tkgs-gpu-operator
spec:
  topology:
    controlPlane:
      replicas: 3
      #storage class for control plane nodes
      #use `kubectl describe storageclasses`
      #to get available pvcs
      storageClass: vwt-storage-policy
      vmClass: guaranteed-medium
      #TKR NAME for Ubuntu ova supporting GPU
      tkr:
        reference:
          name: 1.20.8---vmware.1-tkg.1
    nodePools:
    - name: nodepool-a100-primary
      replicas: 3
      storageClass: vwt-storage-policy
      #custom VM class for vGPU
      vmClass: class-vgpu-a100
      #TKR NAME for Ubuntu ova supporting GPU
      tkr:
        reference:
          name: 1.20.8---vmware.1-tkg.1
    - name: nodepool-a100-secondary
      replicas: 3
      vmClass: class-vgpu-a100
      storageClass: vwt-storage-policy
      #TKR NAME for Ubuntu ova supporting GPU
      tkr:
        reference:
          name: 1.20.8---vmware.1-tkg.1
  settings:
    storage:
      defaultClass: vwt-storage-policy
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: ["198.51.100.0/12"]
      pods:
        cidrBlocks: ["192.0.2.0/16"]
      serviceDomain: managedcluster.local
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  name: tkc
  namespace: tkg-ns-auto
spec:
  distribution:
    fullVersion: v1.20.8+vmware.1-tkg.1
  topology:
    controlPlane:
      replicas: 3
      storageClass: vwt-storage-policy
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.1
      vmClass: best-effort-medium
    nodePools:
    - name: workers
      replicas: 3
      storageClass: k8s-storage-policy
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.1
      vmClass: vmclass-vgpu
      volumes:
      - capacity:
          storage: 50Gi
        mountPath: /var/lib/containerd
        name: containerd
      - capacity:
          storage: 50Gi
        mountPath: /var/lib/kubelet
        name: kubelet
    - name: nodepool-1
      replicas: 1
      storageClass: vwt-storage-policy
      vmClass: best-effort-medium
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  annotations:
  labels:
    run.tanzu.vmware.com/tkr: v1.20.8---vmware.1-tkg.1
  name: tkgs-gpu-direct-rdma
  namespace: tkgs-ns
spec:
  settings:
    network:
      cni:
        name: antrea
      pods:
        cidrBlocks:
        - 192.168.0.0/16
      serviceDomain: cluster.local
      services:
        cidrBlocks:
        - 10.96.0.0/12
  topology:
    controlPlane:
      replicas: 3
      storageClass: tkgs-storage-policy
      vmClass: guaranteed-medium
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.1
    nodePools:
    - name: workers
      replicas: 5
      storageClass: tkgs-storage-policy
      vmClass: claire-gpu-direct-rdma
      volumes:
      - capacity:
          storage: 50Gi
        mountPath: /var/lib/containerd
        name: containerd
      - capacity:
          storage: 50Gi
        mountPath: /var/lib/kubelet
        name: kubelet
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.1
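Before provisioning, you can optionally validate the customized specification against the Supervisor Cluster API without creating anything; the file name is a placeholder for whichever example you customized.
kubectl apply -f CLUSTER-NAME.yaml --dry-run=server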
Operator Step 7: Provision the TKGS Cluster
kubectl apply -f CLUSTER-NAME.yaml
For example:
kubectl apply -f tkgs-gpu-cluster-1.yaml
kubectl get tanzukubernetesclusters -n NAMESPACE
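Provisioning takes several minutes. To follow progress, you can watch the cluster object or describe it for status conditions; the cluster and namespace names below match the first Step 6 example.
kubectl get tanzukubernetescluster tkgs-cluster-gpu-a100 -n tkgs-gpu-operator -w
kubectl describe tanzukubernetescluster tkgs-cluster-gpu-a100 -n tkgs-gpu-operator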
Operator Step 8: Log In to the TKGS Cluster and Verify Provisioning
kubectl vsphere login --server=IP-ADDRESS --vsphere-username USERNAME \
--tanzu-kubernetes-cluster-name CLUSTER-NAME \
--tanzu-kubernetes-cluster-namespace NAMESPACE-NAME
kubectl cluster-info
kubectl get nodes
kubectl get namespaces
kubectl api-resources
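To confirm the worker nodes were built from the Ubuntu image, you can also use the wide output, which includes the OS image for each node.
kubectl get nodes -o wide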
Operator Step 9: Prepare to Install the NVAIE GPU Operator
- Create the Kubernetes namespace gpu-operator-resources. As a best practice, always deploy everything in this namespace.
kubectl create ns gpu-operator-resources
- Create role bindings.
Tanzu Kubernetes clusters have pod security policy enabled.
Create rolebindings.yaml.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp:vmware-system-privileged:default
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp:vmware-system-privileged
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:nodes
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts
Apply the role binding.
kubectl apply -f rolebindings.yaml
Create post-rolebindings.yaml.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp:vmware-system-privileged:gpu-operator-resources
  namespace: gpu-operator-resources
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp:vmware-system-privileged
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts
Apply the role binding:
kubectl apply -f post-rolebindings.yaml
- Create an image secret with NGC credentials that can be used by Docker to pull container images from the NVIDIA GPU Cloud Catalog.
kubectl create secret docker-registry registry-secret \
--docker-server=server-name --docker-username='$oauthtoken' \
--docker-password=<place_holder> \
--docker-email=email-name -n gpu-operator-resources
- Create a configmap for the NVIDIA license server.
kubectl create configmap licensing-config -n gpu-operator-resources --from-file=gridd.conf
The gridd.conf file references the NVIDIA license server address, for example:
# Description: Set License Server Address
# Data type: string
# Format: "<address>"
ServerAddress=<place_holder>
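Before moving on to Step 10, you can optionally confirm that the prerequisite objects exist in the cluster.
kubectl get ns gpu-operator-resources
kubectl get rolebindings -A | grep vmware-system-privileged
kubectl get secret registry-secret -n gpu-operator-resources
kubectl get configmap licensing-config -n gpu-operator-resources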
Operator Step 10: Install the NVIDIA GPU Operator in the Cluster
- Install Helm by referring to the Helm documentation.
- Add the gpu-operator Helm repository.
helm repo add nvidia https://nvidia.github.io/gpu-operator
- Install the NVAIE GPU Operator by running the following command.
Where necessary, substitute environment variable values with those that match your environment.
export PRIVATE_REGISTRY="private/registry/path"
export OS_TAG=ubuntu20.04
export VERSION=460.73.01
export VGPU_DRIVER_VERSION=460.73.01-grid
export NGC_API_KEY=ZmJjMHZya...LWExNTRi
export REGISTRY_SECRET_NAME=registry-secret

# --generate-name lets Helm choose a release name (Helm 3 requires one)
helm install --generate-name nvidia/gpu-operator \
--set driver.repository=$PRIVATE_REGISTRY \
--set driver.version=$VERSION \
--set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \
--set operator.defaultRuntime=containerd \
--set driver.licensingConfig.configMapName=licensing-config
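When the chart finishes deploying, you can verify that the operator pods come up and that the driver advertises the GPU resource on the worker nodes. The operand pods typically run in the gpu-operator-resources namespace created in Step 9; allow several minutes for the driver container to load.
kubectl get pods -n gpu-operator-resources
kubectl describe nodes | grep nvidia.com/gpu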
Operator Step 11: Deploy an AI/ML Workload
The NVIDIA GPU Cloud Catalog offers several off-the-shelf container images you can use to run AI/ML workloads on your vGPU-enabled Tanzu Kubernetes clusters. For more information on the images available, see the NGC documentation.
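As a minimal smoke test before deploying full AI/ML workloads, you can run a pod that requests a single vGPU. The manifest below is a sketch: the image name is illustrative (substitute an image from the NGC catalog appropriate for your environment), and it reuses the gpu-operator-resources namespace and registry-secret created in Step 9.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: gpu-operator-resources
spec:
  restartPolicy: OnFailure
  imagePullSecrets:
  - name: registry-secret
  containers:
  - name: cuda-vectoradd
    # Illustrative image name; substitute an image from the NGC catalog.
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1
    resources:
      limits:
        # Request one vGPU device exposed by the GPU Operator device plugin.
        nvidia.com/gpu: 1
Apply the manifest and check the pod log; the vectoradd sample reports a passed test when the GPU is usable.
kubectl apply -f cuda-vectoradd.yaml
kubectl logs cuda-vectoradd -n gpu-operator-resources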