To enable developers to deploy AI/ML workloads on TKG Service clusters, as a Cluster Operator you create one or more Kubernetes clusters and install the NVIDIA Network and GPU Operators on each.
Operator Step 1: Verify Prerequisites
These instructions assume that the vSphere administrator has set up the environment for NVIDIA GPU. See vSphere Administrator Workflow for Deploying AI/ML Workloads on TKGS Clusters.
These instructions assume that you are installing the NVIDIA AI Enterprise (NVAIE) edition of the GPU Operator, which is pre-configured and optimized for use with vSphere IaaS control plane. The NVAIE GPU Operator differs from the GPU Operator that is available in the public NGC catalog. See NVIDIA AI Enterprise for more information.
These instructions assume that you are using a version of the NVAIE GPU Operator and vGPU driver that has a matching VIB for ESXi. See NVIDIA GPU Operator Versioning for more information.
When provisioning the TKG cluster, you must use the Ubuntu edition of the TKR. With TKG on vSphere 8 Supervisor, the Ubuntu edition is specified in the cluster YAML using an annotation.
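For example, a minimal sketch of how the Ubuntu edition can be requested in the cluster metadata (this assumes the v1alpha3 API; confirm the exact annotation key and value for your API version in the TKR documentation):
metadata:
  annotations:
    run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu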
Operator Step 2: Provision a TKGS Cluster for NVIDIA vGPU
- Install the Kubernetes CLI Tools for vSphere.
- Using the vSphere Plugin for kubectl, authenticate with Supervisor.
kubectl vsphere login --server=IP-ADDRESS-or-FQDN --vsphere-username USERNAME
Note: FQDN can only be used if Supervisor is enabled with it.
- Using kubectl, switch context to the vSphere Namespace that the vSphere administrator created for the TKGS vGPU cluster.
kubectl config get-contexts
kubectl config use-context TKG-GPU-CLUSTER-NAMESPACE
- Get the name of the custom VM Class with the vGPU profile that the vSphere Administrator created.
kubectl get virtualmachineclass
Note: The VM class must be bound to the target vSphere Namespace.
- Get the TKR NAME for the Ubuntu Tanzu Kubernetes release that the vSphere Administrator synchronized from the content library and added to the vSphere Namespace.
kubectl get tkr
- Craft the YAML for Provisioning the vGPU-enabled TKG Cluster.
- Decide which TKGS cluster provisioning API you are going to use: the v1alpha3 API or the v1beta1 API. See TKG Cluster Provisioning APIs.
- Depending on which API you choose, refer to the Ubuntu example for that API.
Note: You must use an Ubuntu OS image. You cannot use Photon OS.
- Use the information you gleaned from the output of the preceding commands to customize the TKGS cluster specification.
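For example, here is a minimal sketch of a v1alpha3 TanzuKubernetesCluster specification for a vGPU-enabled cluster. The cluster name, namespace, VM class, storage class, replica counts, and TKR name shown here are placeholders; substitute the values returned by the preceding commands and refer to the official Ubuntu example for your chosen API.
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: tkg-gpu-cluster-1
  namespace: TKG-GPU-CLUSTER-NAMESPACE
  annotations:
    run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
spec:
  topology:
    controlPlane:
      replicas: 3
      vmClass: guaranteed-medium
      storageClass: STORAGE-CLASS-NAME
      tkr:
        reference:
          name: TKR-NAME-FROM-KUBECTL-GET-TKR
    nodePools:
    - name: nodepool-vgpu
      replicas: 2
      vmClass: VGPU-VM-CLASS-NAME
      storageClass: STORAGE-CLASS-NAME
      tkr:
        reference:
          name: TKR-NAME-FROM-KUBECTL-GET-TKR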
- Provision the cluster by running the following kubectl command.
kubectl apply -f CLUSTER-NAME.yaml
For example:
kubectl apply -f tkg-gpu-cluster-1.yaml
- Verify cluster provisioning.
Monitor the deployment of cluster nodes using kubectl.
kubectl get tanzukubernetesclusters -n NAMESPACE
- Log in to the TKGS vGPU cluster using the vSphere Plugin for kubectl.
kubectl vsphere login --server=IP-ADDRESS-or-FQDN --vsphere-username USERNAME \
--tanzu-kubernetes-cluster-name CLUSTER-NAME --tanzu-kubernetes-cluster-namespace NAMESPACE-NAME
- Verify the cluster.
Use the following commands to verify the cluster:
kubectl cluster-info
kubectl get nodes
kubectl get namespaces
kubectl api-resources
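Optionally, to confirm that the cluster nodes are running the Ubuntu OS image, you can use the wide output, which includes an OS image column for each node:
kubectl get nodes -o wide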
Operator Step 3: Install the NVIDIA Network Operator
- Verify that you are logged into the TKGS vGPU workload cluster and that the context is set to the TKGS vGPU workload cluster namespace.
Refer to the instructions Operator Step 2: Provision a TKGS Cluster for NVIDIA vGPU if necessary.
- Install Helm by referring to the Helm documentation.
- Fetch NVIDIA Network Operator Helm Chart.
helm fetch https://helm.ngc.nvidia.com/nvaie/charts/network-operator-v1.1.0.tgz --username='$oauthtoken' --password=<YOUR API KEY> --untar
- Create a YAML file for the configuration values.
vi values.yaml
- Populate the values.yaml file with the following information.
deployCR: true
ofedDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
  resources:
  - name: rdma_shared_device_a
    vendors: [15b3]
    devices: [ens192]
- Install the NVIDIA Network Operator using the following command.
helm install network-operator -f ./values.yaml -n network-operator --create-namespace --wait network-operator/
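Once the chart is installed, you can confirm that the Network Operator components come up in the namespace (exact pod names vary by release):
kubectl get pods -n network-operator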
Operator Step 4: Install the NVIDIA GPU Operator
NVIDIA provides a pre-configured GPU Operator for NVIDIA AI Enterprise customers. These instructions assume you are using this preconfigured version of the GPU Operator. These instructions are based on the instructions provided by NVIDIA for Installing the GPU Operator but they have been updated for TKG on vSphere 8.
- Verify that you are logged into the TKGS vGPU workload cluster and that the context is set to the TKGS vGPU workload cluster namespace.
Refer to the instructions Operator Step 2: Provision a TKGS Cluster for NVIDIA vGPU if necessary.
- Install Helm by referring to the Helm documentation, if it is not already installed.
- Create the gpu-operator Kubernetes namespace.
kubectl create namespace gpu-operator
- Create an empty vGPU license configuration file.
sudo touch gridd.conf
- Generate and download an NLS client license token.
Refer to Section 4.6. Generating a Client Configuration Token of the NVIDIA License System User Guide.
- Rename the NLS client license token that you downloaded to client_configuration_token.tok.
- Create the licensing-config ConfigMap object in the gpu-operator namespace.
Include the vGPU license configuration file (gridd.conf) and the NLS client license token (*.tok) in this ConfigMap.
kubectl create configmap licensing-config \
-n gpu-operator --from-file=gridd.conf --from-file=<path>/client_configuration_token.tok
- Create an image pull secret for the private registry that contains the containerized NVIDIA vGPU software graphics driver for Linux for use with NVIDIA GPU Operator.
Create the image pull secret in the gpu-operator namespace with the registry secret name ngc-secret and the private registry name nvcr.io/nvaie. Include your NGC API key and email address in the indicated fields.
kubectl create secret docker-registry ngc-secret \
--docker-server='nvcr.io/nvaie' \
--docker-username='$oauthtoken' \
--docker-password=<YOUR_NGC_API_KEY> \
--docker-email=<YOUR_EMAIL_ADDRESS> \
-n gpu-operator
- Download the helm chart for NVAIE GPU Operator version 2.2.
Replace <YOUR API KEY> with your NGC API key.
helm fetch https://helm.ngc.nvidia.com/nvaie/charts/gpu-operator-2-2-v1.11.1.tgz --username='$oauthtoken' \
--password=<YOUR API KEY>
- Install the NVAIE GPU Operator version 2.2 in the TKG cluster.
helm install gpu-operator ./gpu-operator-2-2-v1.11.1.tgz -n gpu-operator
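After the installation completes, you can watch the GPU Operator pods come up; the operator, driver, and validation pods should eventually reach Running or Completed (exact pod names vary by release):
kubectl get pods -n gpu-operator --watch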
Operator Step 5: Deploy an AI/ML Workload
The NVIDIA GPU Cloud Catalog offers several off-the-shelf container images you can use to run AI/ML workloads on your vGPU-enabled TKG clusters. For more information on the images available, see the NGC documentation.
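As an illustration only, the following sketch runs a CUDA sample container from NGC and requests a single GPU through the resource name exposed by the GPU Operator device plugin. The image tag shown here is an assumption; substitute an image from the NGC catalog that matches your entitlement.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-sample
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-sample
    # Illustrative image tag; choose an image available in your NGC catalog.
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
Apply the manifest with kubectl apply -f cuda-sample.yaml and check the pod logs to confirm that the workload can access the vGPU.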