To enable developers to deploy AI/ML workloads on TKG Service clusters, as a Cluster Operator you create one or more Kubernetes clusters and install the NVIDIA Network and GPU Operators on each.

Operator Step 1: Verify Prerequisites

These instructions assume that the vSphere administrator has set up the environment for NVIDIA GPU. See vSphere Administrator Workflow for Deploying AI/ML Workloads on TKGS Clusters.

These instructions assume that you are installing the NVIDIA AI Enterprise (NVAIE) edition of the GPU Operator, which is pre-configured and optimized for use with vSphere IaaS control plane. The NVAIE GPU Operator differs from the GPU Operator that is available in the public NGC catalog. See NVIDIA AI Enterprise for more information.

These instructions assume that you are using a version of the NVAIE GPU Operator and vGPU driver that has a matching VIB for ESXi. See NVIDIA GPU Operator Versioning for more information.

When provisioning the TKG cluster, you must use the Ubuntu edition of the TKR. With TKG on vSphere 8 Supervisor, the Ubuntu edition is specified in the cluster YAML using an annotation.
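
For example, the cluster metadata typically carries an annotation similar to the following. This is a hedged sketch; confirm the exact key and value against the TKR documentation for your vSphere release.

  metadata:
    annotations:
      run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu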

Operator Step 2: Provision a TKGS Cluster for NVIDIA vGPU

VMware offers native TKGS support for NVIDIA virtual GPUs on NVIDIA GPU Certified Servers with NVIDIA GPU Operator and NVIDIA Network Operator. You install these operators in a TKGS workload cluster. To provision a TKGS cluster for hosting vGPU workloads, complete the following steps.
  1. Install the Kubernetes CLI Tools for vSphere.

    See Install the Kubernetes CLI Tools for vSphere.

  2. Using the vSphere Plugin for kubectl, authenticate with Supervisor.
    kubectl vsphere login --server=IP-ADDRESS-or-FQDN --vsphere-username USERNAME
    Note: You can use the FQDN only if the Supervisor is enabled with it.
  3. Using kubectl, switch context to the vSphere Namespace that the vSphere administrator created for the TKGS vGPU cluster.
    kubectl config get-contexts
    kubectl config use-context TKG-GPU-CLUSTER-NAMESPACE
  4. Get the name of the custom VM Class with the vGPU profile that the vSphere Administrator created.
    kubectl get virtualmachineclass
    Note: The VM class must be bound to the target vSphere Namespace.
  5. Get the TKR NAME for the Ubuntu Tanzu Kubernetes release that the vSphere Administrator synchronized from the content library and added to the vSphere Namespace.
    kubectl get tkr
  6. Craft the YAML for Provisioning the vGPU-enabled TKG Cluster.
    1. Decide which TKGS cluster provisioning API you are going to use, the v1alpha3 API or the v1beta1 API. See TKG Cluster Provisioning APIs.
    2. Depending on which API you choose, refer to the Ubuntu example for that API.
      Note: You must use an Ubuntu OS image. You cannot use Photon OS.
    3. Use the information you gleaned from the output of the preceding commands to customize the TKGS cluster specification. A minimal example sketch follows this list of steps.
  7. Provision the cluster by running the following kubectl command.
    kubectl apply -f CLUSTER-NAME.yaml
    For example:
    kubectl apply -f tkg-gpu-cluster-1.yaml
  8. Verify cluster provisioning.
    Monitor the deployment of cluster nodes using kubectl.
    kubectl get tanzukubernetesclusters -n NAMESPACE
  9. Log in to the TKGS vGPU cluster using the vSphere Plugin for kubectl.
    kubectl vsphere login --server=IP-ADDRESS-or-FQDN --vsphere-username USERNAME \
    --tanzu-kubernetes-cluster-name CLUSTER-NAME --tanzu-kubernetes-cluster-namespace NAMESPACE-NAME
  10. Verify the cluster.
    Use the following commands to verify the cluster:
    kubectl cluster-info
    kubectl get nodes
    kubectl get namespaces
    kubectl api-resources
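
The following is a minimal v1alpha3 specification sketch for a vGPU-enabled cluster, assembled from the values gathered in the preceding steps. The cluster name, namespace, VM class, storage class, and TKR name shown here are placeholders; substitute the values returned by the commands above, or adapt the documented Ubuntu example for the API you selected.

  apiVersion: run.tanzu.vmware.com/v1alpha3
  kind: TanzuKubernetesCluster
  metadata:
    name: tkg-gpu-cluster-1                     # placeholder cluster name
    namespace: tkg-gpu-cluster-namespace        # vSphere Namespace from step 3
    annotations:
      run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu   # request the Ubuntu TKR
  spec:
    topology:
      controlPlane:
        replicas: 3
        vmClass: guaranteed-medium              # non-GPU class for control plane nodes
        storageClass: tkg-storage-policy        # placeholder storage class
        tkr:
          reference:
            name: TKR-NAME                      # Ubuntu TKR name from step 5
      nodePools:
      - name: nodepool-gpu
        replicas: 2
        vmClass: vmclass-vgpu-a30               # vGPU VM class from step 4 (placeholder)
        storageClass: tkg-storage-policy
        tkr:
          reference:
            name: TKR-NAME
        volumes:                                # optional: extra disk for large GPU driver images
        - name: containerd
          mountPath: /var/lib/containerd
          capacity:
            storage: 50Gi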

Operator Step 3: Install the NVIDIA Network Operator

The NVIDIA Network Operator leverages Kubernetes custom resources and the Operator framework to optimize the networking for vGPU. For more information, see NVIDIA Network Operator.
  1. Verify that you are logged into the TKGS vGPU workload cluster and that the context is set to the TKGS vGPU workload cluster namespace.

    Refer to the instructions Operator Step 2: Provision a TKGS Cluster for NVIDIA vGPU if necessary.

  2. Install Helm by referring to the Helm documentation.
  3. Fetch NVIDIA Network Operator Helm Chart.
    helm fetch https://helm.ngc.nvidia.com/nvaie/charts/network-operator-v1.1.0.tgz --username='$oauthtoken' --password=<YOUR API KEY> --untar
  4. Create a YAML file for the configuration values.
    vi values.yaml
  5. Populate the values.yaml file with the following information.
    deployCR: true
    ofedDriver:
      deploy: true
    rdmaSharedDevicePlugin:
      deploy: true
      resources:
        - name: rdma_shared_device_a
          vendors: [15b3]
          devices: [ens192]
  6. Install the NVIDIA Network Operator using the following command.
    helm install network-operator -f ./values.yaml -n network-operator --create-namespace --wait network-operator/
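
After the chart installation completes, it is worth confirming that the operator pods are running and that the NicClusterPolicy custom resource was created. This is a hedged check; the resource name reflects the CRD shipped with this chart version, so adjust it if your version differs.

  kubectl get pods -n network-operator
  kubectl get nicclusterpolicy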

Operator Step 4: Install the NVIDIA GPU Operator

NVIDIA provides a pre-configured GPU Operator for NVIDIA AI Enterprise customers. These instructions assume you are using this preconfigured version of the GPU Operator. These instructions are based on the instructions provided by NVIDIA for Installing the GPU Operator, but they have been updated for TKG on vSphere 8.

Complete the following steps to install the NVIDIA AI Enterprise GPU Operator on the TKG cluster you provisioned.
  1. Verify that you are logged into the TKGS vGPU workload cluster and that the context is set to the TKGS vGPU workload cluster namespace.

    Refer to the instructions Operator Step 2: Provision a TKGS Cluster for NVIDIA vGPU if necessary.

  2. Install Helm by referring to the Helm documentation, if it is not already installed.
  3. Create the gpu-operator Kubernetes namespace.
    kubectl create namespace gpu-operator
  4. Create an empty vGPU license configuration file.
    sudo touch gridd.conf
  5. Generate and download an NLS client license token.

    Refer to Section 4.6. Generating a Client Configuration Token of the NVIDIA License System User Guide.

  6. Rename the NLS client license token that you downloaded to client_configuration_token.tok.
  7. Create the licensing-config ConfigMap object in the gpu-operator namespace.
    Include the vGPU license configuration file (gridd.conf) and the NLS client license token (*.tok) in this ConfigMap.
    kubectl create configmap licensing-config \
        -n gpu-operator --from-file=gridd.conf --from-file=<path>/client_configuration_token.tok
  8. Create an image pull secret for the private registry that contains the containerized NVIDIA vGPU software graphics driver for Linux for use with NVIDIA GPU Operator.
    Create the image pull secret in the gpu-operator namespace with the registry secret name ngc-secret and the private registry name nvcr.io/nvaie. Include your NGC API key and email address in the indicated fields.
    kubectl create secret docker-registry ngc-secret \
    --docker-server='nvcr.io/nvaie' \
    --docker-username='$oauthtoken' \
    --docker-password=<YOUR_NGC_API_KEY> \
    --docker-email=<YOUR_EMAIL_ADDRESS> \
    -n gpu-operator
  9. Download the helm chart for NVAIE GPU Operator version 2.2.
    Replace <YOUR API KEY> with your NGC API key.
    helm fetch https://helm.ngc.nvidia.com/nvaie/charts/gpu-operator-2-2-v1.11.1.tgz --username='$oauthtoken' \
    --password=<YOUR API KEY>
  10. Install the NVAIE GPU Operator version 2.2 in the TKG cluster.
    helm install gpu-operator ./gpu-operator-2-2-v1.11.1.tgz -n gpu-operator
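
After the chart deploys, you can confirm that the operator pods are healthy and that the worker nodes advertise GPU capacity. The following is a hedged check; nvidia.com/gpu is the resource name normally exposed by the NVIDIA device plugin, and NODE-NAME is a placeholder for one of your GPU worker nodes.

  kubectl get pods -n gpu-operator
  kubectl describe node NODE-NAME | grep nvidia.com/gpu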

Operator Step 5: Deploy an AI/ML Workload

The NVIDIA GPU Cloud Catalog offers several off-the-shelf container images you can use to run AI/ML workloads on your vGPU-enabled Tanzu Kubernetes clusters. For more information on the images available, see the NGC documentation.
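
As a hedged illustration of how an NGC image is typically consumed on the cluster, the following pod requests a single vGPU through the nvidia.com/gpu resource and runs nvidia-smi to confirm the device is visible. The pod name and the image reference are placeholders; substitute the NGC image you intend to run.

  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-smoke-test                        # placeholder name
  spec:
    restartPolicy: OnFailure
    containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder NGC image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                     # one vGPU exposed by the GPU Operator

Apply the manifest with kubectl apply -f and check the pod log with kubectl logs gpu-smoke-test; the nvidia-smi output should list the vGPU profile assigned to the node.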