After the Supervisor is configured, complete the vSphere with Tanzu environment by deploying a Tanzu Kubernetes Grid (TKG) cluster on the Supervisor by using the kubectl command-line tool.
To perform these operations, you can also use the fully-automated self-service approach that is part of the VMware Private AI Foundation with NVIDIA solution add-on.
Create a Namespace for the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation
To run applications that require upstream Kubernetes compliance, you can provision a Tanzu Kubernetes Grid cluster.
Tanzu Kubernetes clusters are fully upstream-compliant Kubernetes clusters that run on top of your Supervisor.
To help you organize and manage your development projects, you can optionally divide the clusters into vSphere namespaces.
Procedure
- Log in to the VI workload domain vCenter Server at https://<vi_workload_vcenter_server_fqdn>/ui as administrator@vsphere.local.
- From the vSphere Client Menu, select Workload Management.
- On the Workload Management page, click the Namespaces tab and click New Namespace.
- In the Create Namespace dialog box, select the Supervisor, enter a name for the namespace, and click Create.
- Click the Storage tab for the newly created vSphere namespace.
- Under Storage Policies, click Edit.
- In the Select Storage Policies dialog box, select the storage policy that you created earlier and click OK.
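To optionally confirm the new namespace from the command line, you can log in to the Supervisor with kubectl and list the contexts that are created for the namespaces you can access. The server address and user name below are the example values used later in this guide; replace them with the values for your environment.
kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_User
kubectl config get-contexts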
Assign the New Tanzu Cluster Namespace Roles to Active Directory Groups for VMware Cloud Foundation
You assign roles for the namespace to Active Directory groups. You can later assign access to users by adding them to these groups. You assign separate Active Directory groups to the edit and view roles in the namespace. External identity providers such as Okta are also supported.
Procedure
- Log in to the VI workload domain vCenter Server at https://<vi_workload_vcenter_server_fqdn>/ui as administrator@vsphere.local.
- From the vSphere Client Menu, select Workload Management.
- On the Workload Management page, on the Namespaces tab, click the new namespace.
- Click the Permissions tab.
- Provide edit permissions to the Active Directory group intended for administrators of the namespace.
  - Click Add.
  - In the Add Permissions dialog box, enter the Identity source and User/Group for edit access according to your values in the VMware Cloud Foundation Planning and Preparation Workbook, set the Role to Can edit, and click OK.
- Provide read-only permissions to the Active Directory group intended for viewers of the namespace.
  - Click Add.
  - In the Add Permissions dialog box, enter the Identity source and User/Group for read-only access according to your values in the VMware Cloud Foundation Planning and Preparation Workbook, set the Role to Can view, and click OK.
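To optionally verify the assignment, a member of one of the Active Directory groups can log in to the Supervisor with kubectl and confirm that a context for the namespace is listed. The user name below is a hypothetical placeholder; use an account from one of your Active Directory groups.
kubectl vsphere login --server 192.168.21.2 --vsphere-username <ad_user@your_domain>
kubectl config get-contexts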
Add GPU-Enabled VM Classes for the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation
Before you deploy a GPU-enabled Tanzu Kubernetes Grid cluster that can run AI workloads, you must add one or more VM classes defining access to the GPUs. You then assign these VM classes to the worker nodes of the cluster.
This example uses a guaranteed-large configuration for the control plane nodes and vgpu-a100-16vcpu-128gb for the worker nodes.
Procedure
- Log in to the VI workload domain vCenter Server at https://<vi_workload_vcenter_server_fqdn>/ui as administrator@vsphere.local.
- From the vSphere Client Menu, select Workload Management.
- On the Workload Management page, on the Services tab, click the VM Service card.
- On the VM Service page, click the VM Classes tab.
- Click Create VM Class.
- On the Name page of the Create VM Class wizard, enter a name for the VM class, for example vgpu-a100-16vcpu-128gb, and click Next.
- On the Compatibility page, select ESXi 8.0 U2 and later and click Next.
- Click the Virtual Hardware tab.
- Add the GPU device to the VM class.
  - Click Add New Device and select PCI Device.
  - Select the desired NVIDIA GRID vGPU device from the list according to the GPU model and GPU sharing mode.
    There are two types of NVIDIA GRID vGPU profiles: time sharing and Multi-Instance GPU (MIG) sharing. The system detects the profile type when you select the device.
    Note: You can add only one NVIDIA GRID vGPU device with a MIG profile to a VM class.
  - Click Select.
    A New PCI Device appears on the Virtual Hardware tab.
- Configure the desired settings for CPU, Memory, New PCI Device, Video Card, and Security Devices.
  Table 1. CPU Configuration
  - CPU: Assign at least 16 virtual CPU cores.
  - CPU Topology: Assigned at power on.
  - Reservation: The reservation must be between 0 and 10 MHz.
  - Limit: The limit must be greater than or equal to 10 MHz.
  - Shares: Options are Low, Normal, High, and Custom.
  - Hardware virtualization: Select this option to expose hardware-assisted virtualization to the guest OS.
  - Performance Counters: Enable virtualized CPU performance counters.
  - Scheduling Affinity: Select a physical processor affinity for this virtual machine. Use '-' for ranges and ',' to separate values. For example, "0, 2, 4-7" indicates processors 0, 2, 4, 5, 6, and 7. Clear the string to remove affinity settings.
  - I/O MMU: Select to enable the I/O memory management unit.
  Table 2. Memory Configuration
  - Memory: Set at least 64 GB of memory.
  - Reservation: Specify the guaranteed minimum allocation for the virtual machine, or reserve all guest memory. If the reservation cannot be met, the VM cannot run.
  - Limit: Set a limit on the memory consumption of the VM.
  - Shares: Select the amount of memory to share. Shares represent a relative metric for allocating memory capacity. For more information, see Memory Sharing.
  - Memory Hot Plug: Enable to allow the addition of memory resources to a VM that is powered on. See Memory Hot Add Settings for details.
  Table 3. Configure Video Card
  - Video Card: Choose to auto-detect settings from the hardware or enter custom settings. If you select auto-detect, the other settings are not configurable.
  - Number of displays: Select the number of displays.
  - Total video memory: Enter the total video memory, in MB.
  - 3D Graphics: Select to enable 3D support.
  Table 4. Configure Security Devices
  - Security Device: If the SGX security device is installed, you can configure the VM settings here. Otherwise, this field is not configurable. See the SGX documentation for details.
- For the GPUDirect feature, click the Advanced Parameters tab and add the following attribute-value pairs.
pciPassthru.allowP2P=True
pciPassthru.RelaxACSforP2P=True
- Click Next and click Finish.
- Repeat the steps to create VM classes for the other vGPU profiles you plan to use for cluster worker nodes.
- Add the VM class to the namespace for the GPU-enabled Tanzu Kubernetes Grid clusters.
  - On the Workload Management page, on the Namespaces tab, click the namespace and click the Summary tab.
  - In the VM Service card, click the Manage VM Classes link.
  - Select the vgpu-a100-16vcpu-128gb GPU-enabled VM class and the guaranteed-large VM class.
  - Select any other VM classes required for your cluster control plane and worker nodes.
  - Click OK.
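You can optionally verify from the Supervisor that the VM classes are associated with the namespace by using kubectl. The following is a minimal sketch that assumes you are already logged in to the Supervisor; the namespace name is the example value used later in this guide, and resource names can vary slightly between vSphere versions.
kubectl config use-context Tanzu_Kubernetes_Namespace
kubectl get virtualmachineclasses
kubectl get virtualmachineclassbindings -n Tanzu_Kubernetes_Namespace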
Provision a Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation
Provision a Tanzu Kubernetes Grid cluster by using kubectl and a YAML file for input. The command prompt procedure uses example values from the VMware Cloud Foundation Planning and Preparation Workbook.
For the PowerShell procedure, you must know the path where the kubectl and kubectl-vsphere binaries are located. The path is required in the $kubectlBinLocation variable.
Command Prompt Procedure
- In a command prompt, log in to the Supervisor by using kubectl.
kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_User
- Switch the kubectl context to the sfo-w01-tkc01 namespace.
kubectl config use-context Tanzu_Kubernetes_Namespace
- Create a text file named sfo-w01-tkc01.yaml with the following specifications.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: sfo-w01-tkc01
  namespace: Tanzu_Kubernetes_Namespace
spec:
  clusterNetwork:
    services:
      cidrBlocks: ["198.51.100.0/12"]
    pods:
      cidrBlocks: ["192.0.2.0/16"]
    serviceDomain: "cluster.local"
  topology:
    class: tanzukubernetescluster
    version: v1.26.5+vmware.2-fips.1-tkg.1
    controlPlane:
      replicas: 3
      metadata:
        annotations:
          run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
    workers:
      machineDeployments:
        - class: node-pool
          name: node-pool-gpu
          replicas: 2
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
          variables:
            overrides:
              - name: vmClass
                value: vgpu-a100-16vcpu-128gb
    variables:
      - name: vmClass
        value: guaranteed-large
      - name: storageClass
        value: vsphere-with-tanzu-storage-policy
      - name: defaultStorageClass
        value: vsphere-with-tanzu-storage-policy
      - name: nodePoolVolumes
        value:
          - name: containerd
            capacity:
              storage: 50Gi
            mountPath: /var/lib/containerd
            storageClass: vsphere-with-tanzu-storage-policy
          - name: kubelet
            capacity:
              storage: 25Gi
            mountPath: /var/lib/kubelet
            storageClass: vsphere-with-tanzu-storage-policy
- Use kubectl to deploy the Tanzu Kubernetes Grid cluster from your YAML file input.
kubectl apply -f ./sfo-w01-tkc01.yaml
- After the deployment of the Tanzu Kubernetes Grid cluster completes, run kubectl to verify the Tanzu Kubernetes Grid cluster status.
kubectl get cluster

NAME            PHASE         AGE   VERSION
sfo-w01-tkc01   Provisioned   6m    v1.26.5+vmware.2-fips.1
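If the cluster does not reach the Provisioned phase, you can optionally inspect its conditions with kubectl, for example:
kubectl describe cluster sfo-w01-tkc01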
- Log in to the new Tanzu Kubernetes Grid cluster and run kubectl to verify the status of the control plane and worker nodes.
kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Admin --tanzu-kubernetes-cluster-namespace Tanzu_Kubernetes_Namespace --tanzu-kubernetes-cluster-name Tanzu_Kubernetes_Cluster_Name --insecure-skip-tls-verify
kubectl get nodes

NAME                                                STATUS   ROLES           AGE   VERSION
sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-5d6zn   Ready    <none>          13m   v1.26.5+vmware.2-fips.1
sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-nxscv   Ready    <none>          13m   v1.26.5+vmware.2-fips.1
sfo-w01-tkc01-vvjgd-2ptdr                           Ready    control-plane   15m   v1.26.5+vmware.2-fips.1
sfo-w01-tkc01-vvjgd-2vnx6                           Ready    control-plane   11m   v1.26.5+vmware.2-fips.1
sfo-w01-tkc01-vvjgd-66hxn                           Ready    control-plane   13m   v1.26.5+vmware.2-fips.1
Install the NVIDIA GPU Operator for Private AI Ready Infrastructure for VMware Cloud Foundation
Install the NVIDIA GPU Operator to automate the management of all NVIDIA software components needed to provision vGPU.
The command prompt steps use example values from the VMware Cloud Foundation Planning and Preparation Workbook.
For more information on the deployment and verification procedures, see the Appendix of Deploying Enterprise-Ready Generative AI on VMware Private AI.
Prerequisites
- Determine the required NVIDIA GPU Operator version according to the GPU model, required features, operating system version, and driver version.
See NVIDIA GPU Operator Component Matrix and the NVIDIA GPU Operator Release Notes.
- Verify that you have the NVIDIA vGPU license file, downloaded from the NVIDIA Licensing Portal.
- Verify that you have the API key to pull NVIDIA AI Enterprise (NVAIE) containers from the NVIDIA NGC enterprise catalog.
- On the machine with the Kubernetes CLI Tools, install Helm.
Procedure
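The detailed installation and verification steps are provided in the appendix referenced above. As an illustration only, a Helm-based installation of the NVIDIA GPU Operator typically resembles the following sketch. The namespace, secret and ConfigMap names, registry server, and chart version are assumptions for this example; confirm the exact chart options, container registry, and vGPU licensing configuration for your selected NVIDIA GPU Operator version against the NVIDIA documentation.
# Example only: create a namespace and a registry secret for pulling NVIDIA containers
kubectl create namespace gpu-operator
kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=<NGC_API_Key> -n gpu-operator
# Example only: provide the NVIDIA vGPU licensing configuration as a ConfigMap
kubectl create configmap licensing-config -n gpu-operator --from-file=<NVIDIA_vGPU_license_client_token_file>
# Example only: install the GPU Operator chart with Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --version <GPU_Operator_version>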
Install the NVIDIA Network Operator for Private AI Ready Infrastructure for VMware Cloud Foundation
The NVIDIA Network Operator leverages Kubernetes custom resources and the Kubernetes Operator framework to optimize the networking for vGPU.
The command prompt steps use example values from the VMware Cloud Foundation Planning and Preparation Workbook.
For more information on the deployment and verification procedures, see the Appendix of Deploying Enterprise-Ready Generative AI on VMware Private AI.
Prerequisites
- Determine the required NVIDIA Network Operator version.
- Provide an RDMA NIC on each host in the GPU-enabled VI workload domain.
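The detailed deployment and verification steps are provided in the appendix referenced above. As an illustration only, a Helm-based installation of the NVIDIA Network Operator typically resembles the following sketch; the release name, namespace, and chart version are assumptions for this example, so confirm the chart options for your NVIDIA Network Operator version against the NVIDIA documentation.
# Example only: install the Network Operator chart with Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install network-operator nvidia/network-operator -n nvidia-network-operator --create-namespace --version <Network_Operator_version>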