Cluster Operator Addendum for Deploying AI/ML Workloads on TKGS Clusters (DLS)

Refer to this delta topic if you are using the NVIDIA Delegated Licensing Server (DLS) for your NVIDIA AI Enterprise account.

NVIDIA provides a new NVIDIA Licensing Server (NLS) system called DLS, which stands for Delegated Licensing Server. For more information, refer to the NVIDIA documentation.

If you are using DLS for your NVAIE account, the steps for preparing for and deploying the NVAIE GPU Operator differ from those documented in Cluster Operator Workflow for Deploying AI/ML Workloads on TKGS Clusters. Specifically, Steps 9 and 10 are modified as follows.

Operator Step 9: Prepare to Install the NVAIE GPU Operator

Complete the following steps to prepare to install the GPU Operator using a DLS.

Create a Secret.

kubectl create secret docker-registry registry-secret \
  --docker-server=<users private NGC registry name> \
  --docker-username='$oauthtoken' \
  --docker-password=ZmJj…………Ri \
  --docker-email=<user-email-address> -n gpu-operator-resources

Note: The password is the user API Key that was previously created on the NVIDIA GPU Cloud (NGC) Portal.

Get a Client Token from the DLS Server.
A user who wants to use a vGPU license must obtain a token, called a "Client Token", from the DLS license server. The mechanism for doing this is described in the NVIDIA documentation.
Create a ConfigMap object in the TKGS cluster using the Client Token.
Save the Client Token to a file at <path>/client_configuration_token.tok.
Then, from the directory that contains the token file, run the following commands:
```
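# Remove any existing ConfigMap; a "not found" error on the first run is harmless.
kubectl delete configmap licensing-config -n gpu-operator-resources
: > gridd.conf   # create an empty gridd.conf file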
kubectl create configmap licensing-config \
  -n gpu-operator-resources --from-file=./gridd.conf --from-file=./client_configuration_token.tok
```
Note: The gridd.conf file used by the DLS is empty. However, both --from-file parameters are required.
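
The file preparation described above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not part of the NVIDIA or VMware tooling: the function name prepare_licensing_files is hypothetical, and it only checks that the Client Token file is present and non-empty and creates the empty gridd.conf file next to it.

```shell
# Hypothetical helper (an assumption for illustration, not an NVIDIA or VMware tool).
# Prepares the two files that the "kubectl create configmap" command expects.
prepare_licensing_files() {
  dir="${1:-.}"
  token="$dir/client_configuration_token.tok"

  # The Client Token must already exist and must not be empty.
  if [ ! -s "$token" ]; then
    echo "missing or empty: $token" >&2
    return 1
  fi

  # gridd.conf is intentionally empty for DLS; create or truncate it.
  : > "$dir/gridd.conf"
}

# Example usage (the directory is a placeholder):
#   prepare_licensing_files <path> && cd <path>
#   Then run the kubectl create configmap command shown above.
```

Failing early on a missing token avoids creating a ConfigMap that contains only the empty gridd.conf file.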

Operator Step 10: Install the NVAIE GPU Operator

Complete the following steps to install the NVAIE GPU Operator using a DLS. For additional guidance, refer to the GPU Operator documentation.

Install the NVAIE GPU Operator in the TKGS cluster.

Install Helm by referring to the Helm documentation.

Add the gpu-operator Helm repository.

helm repo add nvidia https://nvidia.github.io/gpu-operator

Install the GPU Operator using Helm.

export PRIVATE_REGISTRY="<user’s private registry name>"
export OS_TAG=ubuntu20.04
export VERSION=470.63.01
export VGPU_DRIVER_VERSION=470.63.01-grid
export NGC_API_KEY=Zm……………Ri  # The user's NGC API Key
export REGISTRY_SECRET_NAME=registry-secret

helm show chart .
kubectl delete crd clusterpolicies.nvidia.com
helm install gpu-operator . -n gpu-operator-resources \
  --set psp.enabled=true \
  --set driver.licensingConfig.configMapName=licensing-config \
  --set operator.defaultRuntime=containerd \
  --set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \
  --set driver.version=$VERSION \
  --set driver.repository=$PRIVATE_REGISTRY \
  --set driver.licensingConfig.nlsEnabled=true
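
Because the helm install command above depends on several exported variables, a pre-flight check can catch an unset value before the chart is installed. The following is a minimal sketch; the function name require_vars is hypothetical and is not part of Helm or the GPU Operator chart.

```shell
# Hypothetical pre-flight check (an assumption for illustration).
# Reports each unset or empty variable and returns non-zero if any is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "not set: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage before running helm install:
#   require_vars PRIVATE_REGISTRY VERSION REGISTRY_SECRET_NAME || exit 1
```

The helper uses eval for the indirect lookup, so pass only trusted variable names to it.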

Verify that DLS is working.
From within an NVIDIA Driver DaemonSet pod that was deployed by the GPU Operator, run the nvidia-smi command to verify that DLS is working.
First, run the following command to open a shell session in the pod. Replace nvidia-driver-daemonset-cvxx6 with the name of the driver pod in your cluster:
```
kubectl exec -it nvidia-driver-daemonset-cvxx6 -c nvidia-driver-ctr -n gpu-operator-resources -- bash
```
Now you can run the command to verify the DLS setup.
```
nvidia-smi
```
If DLS is set up correctly, the output of this command should include "Licensed".
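
The check can also be scripted instead of read by eye. The following is a minimal sketch: the function name license_status_ok is hypothetical, and it assumes that the detailed query output (nvidia-smi -q) contains a line of the form "License Status : Licensed". If the output on your driver version is formatted differently, adjust the pattern.

```shell
# Hypothetical helper (an assumption for illustration).
# Reads nvidia-smi output on stdin and succeeds only if a "License Status"
# line reports "Licensed". A value of "Unlicensed" does not match.
license_status_ok() {
  grep -Eq 'License Status[[:space:]]*:[[:space:]]*Licensed'
}

# Example usage from outside the pod (pod name as in the example above):
#   kubectl exec nvidia-driver-daemonset-cvxx6 -c nvidia-driver-ctr \
#     -n gpu-operator-resources -- nvidia-smi -q | license_status_ok \
#     && echo "DLS license active"
```

Matching on the full "License Status" line rather than on the bare word avoids a false positive from "Unlicensed".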