Tanzu Kubernetes Grid supports deploying workload clusters to specific types of GPU-enabled hosts and edge devices on vSphere 7.0+.
To use a node with a GPU in a vSphere workload cluster, you must enable PCI passthrough mode. This allows the cluster to access the GPU directly, bypassing the ESXi hypervisor, which provides a level of performance that is similar to the performance of the GPU on a native system. When using PCI passthrough mode, each GPU device is dedicated to a virtual machine (VM) in the vSphere workload cluster.
Note: To add GPU-enabled nodes to existing clusters, use the tanzu cluster node-pool set command.
To create a workload cluster of GPU-enabled hosts, follow these steps to enable PCI passthrough, build a custom machine image, create a cluster configuration file and Tanzu Kubernetes release, deploy the workload cluster, and install a GPU operator using Helm.
Add the ESXi hosts with the GPU cards to your vSphere Client.
Enable PCI passthrough and record the GPU IDs as follows:
Create a custom machine image for your cluster that uses Ubuntu 20.04 for the operating system, EFI for the boot option, and vmx-17 for the VM hardware version by following the procedure in Build a Linux Image.
Create a Tanzu Kubernetes release (TKr) for the image by following the steps in Create a TKr for the Linux Image.
Create a workload cluster configuration file using the template in Workload Cluster Template and include the following variables:
CLUSTER_PLAN: dev
...
VSPHERE_WORKER_PCI_DEVICES: "0x<VENDOR-ID>:0x<DEVICE-ID>"
VSPHERE_WORKER_CUSTOM_VMX_KEYS: 'pciPassthru.allowP2P=true,pciPassthru.RelaxACSforP2P=true,pciPassthru.use64bitMMIO=true,pciPassthru.64bitMMIOSizeGB=<GPU-SIZE>'
VSPHERE_IGNORE_PCI_DEVICES_ALLOW_LIST: "<BOOLEAN>"
WORKER_ROLLOUT_STRATEGY: "RollingUpdate"
<VENDOR-ID> and <DEVICE-ID> are the Vendor ID and Device ID you recorded in a previous step. For example, if the Vendor ID is 10DE and the Device ID is 1EB8, the value is "0x10DE:0x1EB8".
Note: You can only use one type of GPU per VM. For example, you cannot use both the NVIDIA V100 and NVIDIA Tesla T4 on a single VM, but you can use multiple GPUs with the same Vendor ID and Device ID.
<GPU-SIZE> is the total framebuffer memory, in GB, of all GPUs in the cluster, rounded up to the next power of two. For example, two 40GB GPUs total 80GB, which rounds up to 128GB, so the value is 128.
<BOOLEAN> is false if you are using the NVIDIA Tesla T4 GPU and true if you are using the NVIDIA V100 GPU.
Set WORKER_ROLLOUT_STRATEGY to RollingUpdate if you have extra PCI devices that the worker nodes can use during upgrades; otherwise use a different strategy.
Note: The tanzu CLI does not allow updating the WORKER_ROLLOUT_STRATEGY spec on the MachineDeployment. If the cluster upgrade is stuck due to unavailable PCI devices, VMware suggests editing the MachineDeployment strategy using the kubectl CLI. The rollout strategy is defined in the MachineDeployment spec.
For a complete list of variables you can configure for GPU-enabled clusters, see GPU-Enabled Clusters in Configuration File Variable Reference.
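The arithmetic behind <GPU-SIZE> and the format of the VSPHERE_WORKER_PCI_DEVICES value can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and are not part of the Tanzu CLI. The text above gives a single example (80GB rounds up to 128GB) and does not say what to do when the total is already a power of two, so the strictly_greater flag is an assumption that lets you choose either reading.

```python
from typing import Iterable


def next_power_of_two(total_gb: int, strictly_greater: bool = False) -> int:
    """Round total_gb up to a power of two.

    With strictly_greater=False an exact power of two is returned unchanged;
    with strictly_greater=True it is bumped to the following power of two.
    Which behavior applies is an assumption, not stated in the text above.
    """
    if total_gb < 1:
        raise ValueError("total_gb must be a positive integer")
    power = 1
    while power < total_gb or (strictly_greater and power == total_gb):
        power *= 2
    return power


def mmio_size_gb(gpu_memory_gb: Iterable[int], strictly_greater: bool = False) -> int:
    """Sum the framebuffer sizes (in GB) and round the total up to a power of two."""
    return next_power_of_two(sum(gpu_memory_gb), strictly_greater)


def pci_device_value(vendor_id: str, device_id: str) -> str:
    """Build the 0x<VENDOR-ID>:0x<DEVICE-ID> string from two hex IDs."""
    return f"0x{vendor_id.upper()}:0x{device_id.upper()}"


# The two examples given in the text above.
print(mmio_size_gb([40, 40]))            # two 40GB GPUs
print(pci_device_value("10DE", "1EB8"))  # Vendor ID 10DE, Device ID 1EB8
```

For the two 40GB GPUs in the example, mmio_size_gb returns 128, and the ID pair 10DE and 1EB8 yields 0x10DE:0x1EB8, matching the values quoted above.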
Create the workload cluster by running:
tanzu cluster create -f CLUSTER-CONFIG-NAME --tkr TKR-NAME
Where CLUSTER-CONFIG-NAME and TKR-NAME are the names of the cluster configuration file and TKr file you created in the previous steps.
Add the NVIDIA Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
Install the NVIDIA GPU Operator:
helm install --kubeconfig=./KUBECONFIG --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
Where KUBECONFIG is the name and location of the kubeconfig for your workload cluster. For more information, see Retrieve Workload Cluster kubeconfig.
For information about the parameters in this command, see Install the GPU Operator in the NVIDIA documentation.
Ensure the NVIDIA GPU Operator is running:
kubectl --kubeconfig=./KUBECONFIG get pods -A
The output is similar to:
NAMESPACE      NAME                                                              READY   STATUS     RESTARTS   AGE
gpu-operator   gpu-feature-discovery-4p6rs                                       0/1     Init:0/1   0          10s
gpu-operator   gpu-operator-1656477030-node-feature-discovery-master-56457lp8r   1/1     Running    0          50s
gpu-operator   gpu-operator-1656477030-node-feature-discovery-worker-9g2cm       1/1     Running    0          50s
gpu-operator   gpu-operator-1656477030-node-feature-discovery-worker-l296w       1/1     Running    0          50s
gpu-operator   gpu-operator-6688b48999-zssxv                                     1/1     Running    0          50s
gpu-operator   nvidia-container-toolkit-daemonset-r6nzz                          0/1     Init:0/1   0          10s
gpu-operator   nvidia-dcgm-exporter-m2vt8                                        0/1     Init:0/1   0          10s
gpu-operator   nvidia-device-plugin-daemonset-tp6qx                              0/1     Init:0/1   0          10s
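In the sample output, several pods are still initializing, so one check may not be enough. As a rough sketch, the following Python helper scans saved kubectl get pods -A output and lists the gpu-operator pods that are not yet fully ready. The function name and the SAMPLE_OUTPUT constant are hypothetical, and the parsing assumes the six-column layout shown above.

```python
from typing import List

# Abbreviated copy of the sample output above, used only for illustration.
SAMPLE_OUTPUT = """\
NAMESPACE      NAME                                       READY   STATUS     RESTARTS   AGE
gpu-operator   gpu-feature-discovery-4p6rs                0/1     Init:0/1   0          10s
gpu-operator   gpu-operator-6688b48999-zssxv              1/1     Running    0          50s
gpu-operator   nvidia-container-toolkit-daemonset-r6nzz   0/1     Init:0/1   0          10s
gpu-operator   nvidia-device-plugin-daemonset-tp6qx       0/1     Init:0/1   0          10s
"""


def pods_not_ready(kubectl_output: str, namespace: str = "gpu-operator") -> List[str]:
    """Return names of pods in `namespace` that are neither fully ready nor Completed."""
    pending = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[0] != namespace:
            continue
        name, ready, status = fields[1], fields[2], fields[3]
        if status == "Completed":
            continue  # a finished pod is not waiting on anything
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            pending.append(name)
    return pending


print(pods_not_ready(SAMPLE_OUTPUT))
```

An empty list suggests that every pod in the namespace has finished starting; otherwise, re-run the kubectl command after a short wait and check again.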
To test your GPU-enabled cluster, create a pod manifest for the cuda-vector-add example from the Kubernetes documentation and deploy it. The container will download, run, and perform a CUDA calculation with the GPU.
Create a file named cuda-vector-add.yaml and add the following:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "registry.k8s.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
Apply the file:
kubectl apply -f cuda-vector-add.yaml
Check the status of the pod:
kubectl get po cuda-vector-add
The output is similar to:
cuda-vector-add 0/1 Completed 0 91s
View the pod logs:
kubectl logs cuda-vector-add
The output is similar to:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Tanzu Kubernetes Grid v1.6+ supports deploying workload clusters to VMware SD-WAN Edge devices.
Topology: You can run edge workload clusters in production with a single control plane node and just one or two hosts. However, while this uses less CPU, memory, and network bandwidth, you do not have the same resiliency and recovery characteristics of standard production Tanzu Kubernetes Grid clusters. For more information, see VMware Tanzu Edge Solution Reference Architecture 1.0.
Local Registry: To minimize communication delays and maximize resilience, each edge cluster should have its own local Harbor container registry. For an overview of this architecture, see Container Registry in Architecture Overview. To install a local Harbor registry on the edge device, see the Knowledge Base article Building a Harbor Appliance (OVA) to Bootstrap Thick Edge Clusters and Airgap Environments with Tanzu Kubernetes Grid 1.6 (89416).
Timeouts: When an edge workload cluster is managed by a remote management cluster in a main datacenter, you may need to adjust certain timeouts to allow the management cluster enough time to connect with the workload cluster machines. To adjust these timeouts, see Extending Timeouts for Edge Clusters to Handle Higher Latency below.
If your management cluster is remotely managing workload clusters running on edge devices or managing more than 20 workload clusters, you can adjust specific timeouts so the Cluster API does not block or prune machines that may be temporarily offline or taking longer than 12 minutes to communicate with their remote management cluster, particularly if your infrastructure is underprovisioned.
There are three settings you can adjust to give your edge devices additional time to communicate with their control plane:
MHC_FALSE_STATUS_TIMEOUT: Extend the default 12m to, for example, 40m in the workload cluster configuration file to prevent the MachineHealthCheck controller from recreating the machine if its Ready condition remains False for more than 12 minutes. For more information about machine health checks, see Configure Machine Health Checks for Tanzu Kubernetes Clusters.
NODE_STARTUP_TIMEOUT: Extend the default 20m to, for example, 60m in the workload cluster configuration file to prevent the MachineHealthCheck controller from blocking new machines from joining the cluster because they took longer than 20 minutes to start up, which it considers unhealthy.
etcd-dial-timeout-duration: Extend the default 10s to, for example, 40s in the capi-kubeadm-control-plane-controller-manager manifest to prevent etcd clients on the management cluster from prematurely failing while scanning the health of etcd on the workload clusters. The management cluster uses its ability to connect with etcd as a yardstick for machine health. For example:
In a terminal, run:
kubectl edit deployment capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system
Change the value for --etcd-dial-timeout-duration:
- args:
  - --leader-elect
  - --metrics-bind-addr=localhost:8080
  - --feature-gates=ClusterTopology=false
  - --etcd-dial-timeout-duration=40s
  command:
  - /manager
  image: projects.registry.vmware.com/tkg/cluster-api/kubeadm-control-plane-controller:v1.0.1_vmware.1
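When you choose new values for the two configuration-file timeouts described above, it is easy to mix up units and set a value that is shorter than the default. The following Python sketch compares proposed values against the defaults stated in this section (12m for MHC_FALSE_STATUS_TIMEOUT and 20m for NODE_STARTUP_TIMEOUT). The helper names are hypothetical, and the parser is a simplifying assumption: it accepts only a whole number followed by a single s, m, or h unit, not compound durations.

```python
import re
from typing import Dict, List

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# Defaults as stated in the text above.
DEFAULTS = {
    "MHC_FALSE_STATUS_TIMEOUT": "12m",
    "NODE_STARTUP_TIMEOUT": "20m",
}


def parse_duration(value: str) -> int:
    """Convert a simple duration such as '40s', '12m', or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def not_extended(overrides: Dict[str, str]) -> List[str]:
    """Return the variable names whose proposed value does not exceed the default."""
    return [
        name
        for name, value in overrides.items()
        if parse_duration(value) <= parse_duration(DEFAULTS[name])
    ]


# The example values suggested in the text above.
print(not_extended({"MHC_FALSE_STATUS_TIMEOUT": "40m", "NODE_STARTUP_TIMEOUT": "60m"}))
```

An empty list means every proposed value is longer than its default; any name that is returned points to a value that would shorten the timeout rather than extend it.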