After the Supervisor is configured, complete the configuration of the vSphere with Tanzu environment by deploying a Tanzu Kubernetes Grid (TKG) cluster on the Supervisor with the kubectl command-line tool.

To perform these operations, you can also use the fully automated, self-service approach that is part of the VMware Private AI Foundation with NVIDIA solution add-on.

Create a Namespace for the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

To run applications that require upstream Kubernetes compliance, you can provision a Tanzu Kubernetes Grid cluster.

Tanzu Kubernetes clusters are fully upstream-compliant Kubernetes clusters that run on top of your Supervisor.

To help you to organize and manage your development projects, you can optionally divide the clusters into vSphere namespaces.

Procedure

  1. Log in to the VI workload domain vCenter Server at https://<vi_workload_vcenter_server_fqdn>/ui as administrator@vsphere.local.
  2. From the vSphere Client Menu, select Workload Management.

  3. On the Workload Management page, click the Namespaces tab and click New Namespace.

  4. In the Create Namespace dialog box, select the Supervisor, enter a name for the namespace, and click Create.

  5. Click the Storage tab for the newly created vSphere namespace.

  6. Under Storage Policies, click Edit.

  7. In the Select Storage Policies dialog box, select the storage policy that you created earlier and click OK.
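You can optionally confirm the result from the command line. After you log in to the Supervisor by using kubectl, the new vSphere namespace appears as a Kubernetes namespace. The server address and user name below are the example values used in the procedures later in this guide.

  kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_User
  kubectl get namespaces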

Assign the New Tanzu Cluster Namespace Roles to Active Directory Groups for VMware Cloud Foundation

You assign roles for the Namespace to Active Directory groups. You can later assign access to users by adding them to these groups. You assign access to separate Active Directory groups for the edit and view roles in the Namespace. External identity providers, such as Okta, are also supported.

Procedure

  1. Log in to the VI workload domain vCenter Server at https://<vi_workload_vcenter_server_fqdn>/ui as administrator@vsphere.local.
  2. From the vSphere Client Menu, select Workload Management.

  3. On the Workload Management page, on the Namespaces tab, click the new Namespace.

  4. Click the Permissions tab.
  5. Provide edit permissions to your Active Directory group intended for admins for the Namespace.

    1. Click Add.

    2. In the Add Permissions dialog box, enter the Identity source and User/Group for edit access according to your values in the VMware Cloud Foundation Planning and Preparation Workbook, set the Role to Can edit, and click OK.

  6. Provide read-only permissions to your Active Directory group intended for viewers for the Namespace.

    1. Click Add.

    2. In the Add Permissions dialog box, enter the Identity source and User/Group for read-only access according to your values in the VMware Cloud Foundation Planning and Preparation Workbook, set the Role to Can view, and click OK.

Add GPU-Enabled VM Classes for the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

Before you deploy a GPU-enabled Tanzu Kubernetes Grid cluster that can run AI workloads, you must add one or more VM classes defining access to the GPUs. You then assign these VM classes to the worker nodes of the cluster.

This example uses a guaranteed-large configuration for the control plane nodes and vgpu-a100-16vcpu-128gb for the worker nodes.

Procedure

  1. Log in to the VI workload domain vCenter Server at https://<vi_workload_vcenter_server_fqdn>/ui as administrator@vsphere.local.
  2. From the vSphere Client Menu, select Workload Management.

  3. On the Workload Management page, on the Services tab, click the VM Service card.

  4. On the VM Service page, click the VM Classes tab.

  5. Click Create VM Class.

  6. On the Name page of the Create VM Class wizard, enter a name for the VM class and click Next.

    For example: vgpu-a100-16vcpu-128gb.

  7. On the Compatibility page, select ESXi 8.0 U2 and later and click Next.
  8. Click the Configuration > Virtual Hardware tab.
  9. Add the GPU device to the VM class.
    1. Select Add New Device > PCI Device.
    2. Select the desired NVIDIA GRID vGPU device from the list according to the GPU model and GPU sharing mode.

      There are two types of NVIDIA GRID vGPU profiles: time sharing and Multi-Instance GPU (MIG) sharing. The system detects the profile type when you select the device.

      Note: You can add only one NVIDIA GRID vGPU device with a MIG profile to a VM class.
    3. Click Select.

      The new PCI device appears on the Virtual Hardware tab.

  10. Configure the desired settings for CPU, Memory, New PCI Device, Video Card, and Security Devices.

    Table 1. CPU Configuration
    • CPU: Assign at least 16 virtual CPU cores.
    • CPU Topology: Assigned at power on.
    • Reservation: Must be between 0 and 10 MHz.
    • Limit: Must be greater than or equal to 10 MHz.
    • Shares: Options are Low, Normal, High, and Custom.
    • Hardware virtualization: Select this option to expose hardware-assisted virtualization to the guest OS.
    • Performance Counters: Enable virtualized CPU performance counters.
    • Scheduling Affinity: Select a physical processor affinity for this virtual machine. Use '-' for ranges and ',' to separate values. For example, "0, 2, 4-7" indicates processors 0, 2, 4, 5, 6, and 7. Clear the string to remove affinity settings.
    • I/O MMU: Select to enable the I/O memory management unit (IOMMU).
    Table 2. Memory Configuration
    • Memory: Set at least 64 GB of memory.
    • Reservation: Specify the guaranteed minimum allocation for the virtual machine, or reserve all guest memory. If the reservation cannot be met, the VM cannot run.
    • Limit: Set a limit on the memory consumption of the VM.
    • Shares: Select the amount of memory to share. Shares represent a relative metric for allocating memory capacity. For more information, see Memory Sharing.
    • Memory Hot Plug: Enable to allow memory to be added to the VM while it is powered on. See Memory Hot Add Settings for details.
    Table 3. Video Card Configuration
    • Video Card: Choose to auto-detect settings from the hardware or enter custom settings. If you select auto-detect, the other settings are not configurable.
    • Number of displays: Select the number of displays.
    • Total video memory: Enter the total video memory, in MB.
    • 3D Graphics: Select to enable 3D support.
    Table 4. Security Devices Configuration
    • Security Device: If the SGX security device is installed, you can configure the VM settings here; otherwise, this field is not configurable. See the SGX documentation for details.
  11. For the GPUDirect feature, click the Advanced Parameters tab and add the following attribute-value pairs.
    • pciPassthru.allowP2P=True

    • pciPassthru.RelaxACSforP2P=True

  12. Click Next and click Finish.
  13. Repeat the steps to create VM classes for the other vGPU profiles you plan to use for cluster worker nodes.
  14. Add the VM class to the namespace for the GPU-enabled Tanzu Kubernetes Grid clusters.
    1. On the Workload Management page, on the Namespaces tab, click the namespace, and then click the Summary tab.
    2. In the VM Service card, click the Manage VM Classes link.

    3. Select the GPU-enabled vgpu-a100-16vcpu-128gb class and the guaranteed-large VM class.

    4. Select other VM classes required for your cluster control and worker nodes.
    5. Click OK.
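To confirm the VM classes associated with the namespace, you can optionally log in to the Supervisor by using kubectl, switch the context to the namespace, and list the classes.

  kubectl get virtualmachineclasses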

Provision a Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

Provision a Tanzu Kubernetes Grid cluster by using kubectl and a YAML file for input. The command prompt procedure uses example values from the VMware Cloud Foundation Planning and Preparation Workbook.

For the PowerShell procedure, you must know the path where kubectl and kubectl-vsphere binaries are located. The path is required in the $kubectlBinLocation variable.

Command Prompt Procedure

  1. In a command prompt, log in to the Supervisor by using kubectl.

    kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_User
  2. Switch the kubectl context to the vSphere namespace where you deploy the cluster.

    kubectl config use-context Tanzu_Kubernetes_Namespace
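    If you are not sure of the context name, list the available contexts first.
    kubectl config get-contexts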
  3. Create a sfo-w01-tkc01.yaml text file with the following specifications.

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: sfo-w01-tkc01
      namespace: Tanzu_Kubernetes_Namespace
    spec:
      clusterNetwork:
        services:
          cidrBlocks: ["198.51.100.0/12"]
        pods:
          cidrBlocks: ["192.0.2.0/16"]
        serviceDomain: "cluster.local"
      topology:
        class: tanzukubernetescluster
        version: v1.26.5---vmware.2-fips.1-tkg.1
        controlPlane:
          replicas: 3
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
        workers:
          machineDeployments:
            - class: node-pool
              name: node-pool-gpu
              replicas: 2
              metadata:
                annotations:
                  run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
              variables:
                overrides:
                - name: vmClass
                  value: vgpu-a100-16vcpu-128gb
        variables:
          - name: vmClass
            value: guaranteed-large
          - name: storageClass
            value: vsphere-with-tanzu-storage-policy
          - name: defaultStorageClass
            value: vsphere-with-tanzu-storage-policy
          - name: nodePoolVolumes
            value:
              - name: containerd
                capacity:
                  storage: 50Gi 
                mountPath: /var/lib/containerd
                storageClass: vsphere-with-tanzu-storage-policy
              - name: kubelet
                capacity:
                  storage: 25Gi 
                mountPath: /var/lib/kubelet
                storageClass: vsphere-with-tanzu-storage-policy
  4. Use kubectl to deploy the Tanzu Kubernetes Grid cluster from your YAML file input.
    kubectl apply -f ./sfo-w01-tkc01.yaml
  5. After the deployment of the Tanzu Kubernetes Grid cluster completes, run kubectl to verify the Tanzu Kubernetes Grid cluster status.
    kubectl get cluster
    NAME            PHASE         AGE   VERSION
    sfo-w01-tkc01   Provisioned   6m    v1.26.5+vmware.2-fips.1
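    While the cluster is provisioning, you can also watch the underlying Cluster API machine objects in the namespace as an optional check.
    kubectl get machines -n Tanzu_Kubernetes_Namespace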
  6. Log in to the new Tanzu Kubernetes Grid cluster and run kubectl to verify the status of the control plane and worker nodes.
    kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Admin --tanzu-kubernetes-cluster-namespace Tanzu_Kubernetes_Namespace --tanzu-kubernetes-cluster-name Tanzu_Kubernetes_Cluster_Name --insecure-skip-tls-verify
    
    kubectl get nodes
    
    NAME                                                STATUS   ROLES           AGE   VERSION
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-5d6zn   Ready    <none>          13m   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-nxscv   Ready    <none>          13m   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2ptdr                           Ready    control-plane   15m   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2vnx6                           Ready    control-plane   11m   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-66hxn                           Ready    control-plane   13m   v1.26.5+vmware.2-fips.1

Install the NVIDIA GPU Operator for Private AI Ready Infrastructure for VMware Cloud Foundation

Install the NVIDIA GPU Operator to automate the management of all NVIDIA software components needed to provision vGPU.

The command prompt steps use example values from the VMware Cloud Foundation Planning and Preparation Workbook.

For more information on the deployment and verification procedures, see the Appendix of Deploying Enterprise-Ready Generative AI on VMware Private AI.

Prerequisites

  • Determine the required NVIDIA GPU Operator version according to the GPU model, required features, operating system version, and driver version.

    See NVIDIA GPU Operator Component Matrix and the NVIDIA GPU Operator Release Notes.

  • Verify that you have the NVIDIA vGPU license file, downloaded from the NVIDIA Licensing Portal.
  • Verify that you have the API key to pull NVAIE containers from NVIDIA NGC enterprise catalog.
  • On the machine with the Kubernetes CLI Tools, install Helm.

Procedure

  1. In a command prompt on the machine with the Kubernetes CLI tools, log in to the Tanzu Kubernetes Grid cluster by running kubectl.
    kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_User --tanzu-kubernetes-cluster-namespace Tanzu_Kubernetes_Namespace --tanzu-kubernetes-cluster-name Tanzu_Kubernetes_Cluster_Name 
  2. Create a gpu-operator namespace.
    kubectl create namespace gpu-operator
  3. Verify that the namespace has been created.
    kubectl get namespaces
    
    NAME                           STATUS   AGE
    default                        Active   64m
    gpu-operator                   Active   6s
    kube-node-lease                Active   64m
    kube-public                    Active   64m
    kube-system                    Active   64m
    secretgen-controller           Active   62m
    tkg-system                     Active   63m
    vmware-system-antrea           Active   62m
    vmware-system-auth             Active   62m
    vmware-system-cloud-provider   Active   63m
    vmware-system-csi              Active   63m
    vmware-system-tkg              Active   63m
  4. Create a gridd.conf configuration file. When the license is supplied through the client configuration token, the file can remain empty.
    sudo touch gridd.conf
  5. Create a ConfigMap in the gpu-operator namespace.
    You can use a ConfigMap to store non-confidential data in key-value pairs. You add both the vGPU configuration file and the NVIDIA license token to this ConfigMap.
    kubectl create configmap licensing-config -n gpu-operator --from-file=<path>/gridd.conf --from-file=<path>/client_configuration_token.tok
  6. Verify that the contents of the ConfigMap have been populated by describing the ConfigMap.
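    kubectl describe configmap licensing-config -n gpu-operator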
    Name:         licensing-config
    Namespace:    gpu-operator
    Labels:       <none>
    Annotations:  <none>
    
    Data
    ====
    gridd.conf:
    ----
    
    client_configuration_token.tok:
    ----
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    
    BinaryData
    ====
    
    Events:  <none>
  7. Create a pull secret object in the gpu-operator namespace.
    A secret is an object that contains a small amount of sensitive data such as a password, a token, or a key. Such information might otherwise be put in a pod specification or in a container image. Using a secret object means that you do not need to include confidential data in your application code. This secret object is used to pull the required images from the NVIDIA NGC registry.
    export REGISTRY_SECRET_NAME=<your-ngc-secret>
    export PRIVATE_REGISTRY=nvcr.io/nvaie
    export NGC_API_KEY=<your-ngc-api-key>
    kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \
    --docker-server=${PRIVATE_REGISTRY} \
    --docker-username='$oauthtoken' \
    --docker-password=${NGC_API_KEY} \
    --docker-email='<your-email>' \
    -n gpu-operator
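    You can optionally confirm that the secret was created.
    kubectl get secrets -n gpu-operator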
  8. Add the NVAIE Helm repository. The password is the NGC API key for accessing the NVIDIA NGC catalog.
    helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=${NGC_API_KEY} && helm repo update
  9. Set the required Pod Security admission policy on the gpu-operator namespace.
    kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
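    You can optionally verify the label on the namespace.
    kubectl get namespace gpu-operator --show-labels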
  10. Install NVIDIA GPU Operator by using Helm.
    helm install --wait gpu-operator nvaie/gpu-operator-4-2 -n gpu-operator \
      --set driver.repository=nvcr.io/nvaie \
      --set operator.repository=nvcr.io/nvaie \
      --set driver.imagePullPolicy=Always \
      --set migStrategy=mixed \
      --set driver.rdma.enabled=True
  11. Verify that the GPU Operator pods are running.
    kubectl get pods -n gpu-operator
    
    NAME                                                          READY   STATUS      RESTARTS   AGE
    gpu-feature-discovery-9zv52                                   1/1     Running     0          7d6h
    gpu-feature-discovery-pv4p4                                   1/1     Running     0          7d6h
    gpu-feature-discovery-zms5s                                   1/1     Running     0          55d
    gpu-operator-dc844b566-w9mjl                                  1/1     Running     0          55d
    gpu-operator-node-feature-discovery-master-79bc547944-rzp4v   1/1     Running     0          55d
    gpu-operator-node-feature-discovery-worker-7m5ht              1/1     Running     0          7d6h
    gpu-operator-node-feature-discovery-worker-llz7k              1/1     Running     0          7d6h
    gpu-operator-node-feature-discovery-worker-zk7mt              1/1     Running     0          55d
    nvidia-container-toolkit-daemonset-pswbb                      1/1     Running     0          7d6h
    nvidia-container-toolkit-daemonset-tlqfn                      1/1     Running     0          7d6h
    nvidia-container-toolkit-daemonset-zm48q                      1/1     Running     0          55d
    nvidia-cuda-validator-fmwsh                                   0/1     Completed   0          55d
    nvidia-cuda-validator-qdz6r                                   0/1     Completed   0          7d6h
    nvidia-cuda-validator-x7mkj                                   0/1     Completed   0          7d6h
    nvidia-dcgm-exporter-c7dwd                                    1/1     Running     0          7d6h
    nvidia-dcgm-exporter-mc4x8                                    1/1     Running     0          55d
    nvidia-dcgm-exporter-xnpvp                                    1/1     Running     0          7d6h
    nvidia-device-plugin-daemonset-92pf4                          1/1     Running     0          7d6h
    nvidia-device-plugin-daemonset-m276d                          1/1     Running     0          55d
    nvidia-device-plugin-daemonset-v62nj                          1/1     Running     0          7d6h
    nvidia-device-plugin-validator-8d2jr                          0/1     Completed   0          7d6h
    nvidia-device-plugin-validator-cfkrl                          0/1     Completed   0          7d6h
    nvidia-device-plugin-validator-wltdz                          0/1     Completed   0          55d
    nvidia-driver-daemonset-7g6nj                                 1/1     Running     0          7d6h
    nvidia-driver-daemonset-8bwsx                                 1/1     Running     0          55d
    nvidia-driver-daemonset-fhz56                                 1/1     Running     0          7d6h
    nvidia-operator-validator-5zs4b                               1/1     Running     0          55d
    nvidia-operator-validator-hp5dt                               1/1     Running     0          7d6h
    nvidia-operator-validator-qrfj8                               1/1     Running     0          7d6h
  12. Verify that the license is valid.
    ctnname=`kubectl get pods -n gpu-operator | grep driver-daemonset | head -1 | cut -d " " -f1`
    
    kubectl -n gpu-operator exec -it $ctnname -- /bin/bash -c "/usr/bin/nvidia-smi -q | grep -i lic"
    
    Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
        vGPU Software Licensed Product
            License Status                    : Licensed (Expiry: 2024-2-28 21:22:44 GMT)
        Applications Clocks
        Default Applications Clocks
        Clock Policy

Install the NVIDIA Network Operator for Private AI Ready Infrastructure for VMware Cloud Foundation

The NVIDIA Network Operator leverages Kubernetes custom resources and the Kubernetes Operator framework to optimize the networking for vGPU.

The command prompt steps use example values from the VMware Cloud Foundation Planning and Preparation Workbook.

For more information on the deployment and verification procedures, see the Appendix of Deploying Enterprise-Ready Generative AI on VMware Private AI.

Procedure

  1. In a command prompt on the machine with the Kubernetes CLI tools, log in to the Tanzu Kubernetes Grid cluster by running kubectl.
    kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_Admin --tanzu-kubernetes-cluster-namespace Tanzu_Kubernetes_Namespace --tanzu-kubernetes-cluster-name Tanzu_Kubernetes_Cluster_Name --insecure-skip-tls-verify
  2. List all the nodes in the cluster.
    kubectl get nodes
    NAME                                                STATUS   ROLES           AGE   VERSION
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-5d6zn   Ready    <none>          18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-nxscv   Ready    <none>          18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2ptdr                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2vnx6                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-66hxn                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
  3. Label the worker nodes with the worker role.
    Do not change the control plane label.
    kubectl label node sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-5d6zn sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-nxscv node-role.kubernetes.io/worker=worker
  4. Verify that the worker nodes are properly labeled.
    kubectl get nodes
    NAME                                                STATUS   ROLES           AGE   VERSION
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-5d6zn   Ready    worker          18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-nxscv   Ready    worker          18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2ptdr                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2vnx6                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-66hxn                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
  5. Create a nvidia-network-operator namespace.
    kubectl create namespace nvidia-network-operator
  6. Verify that the namespace has been created.
    kubectl get namespaces
    NAME                           STATUS   AGE
    default                        Active   18h
    gpu-operator                   Active   17h
    kube-node-lease                Active   18h
    kube-public                    Active   18h
    kube-system                    Active   18h
    nvidia-network-operator        Active   5s
    secretgen-controller           Active   18h
    tkg-system                     Active   18h
    vmware-system-antrea           Active   18h
    vmware-system-auth             Active   18h
    vmware-system-cloud-provider   Active   18h
    vmware-system-csi              Active   18h
    vmware-system-tkg              Active   18h
  7. To be able to pull the Network Operator images during the Helm installation, create a secret in the nvidia-network-operator namespace.
    kubectl create secret docker-registry ngc-image-secret -n nvidia-network-operator --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password='YOUR NVIDIA API KEY' --docker-email='YOUR NVIDIA NGC EMAIL'
  8. Create the values.yaml file for the Network Operator deployment.
    nfd:
      enabled: true
    sriovNetworkOperator:
      enabled: false
    # NicClusterPolicy CR values:
    deployCR: true
    ofedDriver:
      deploy: true
    
    rdmaSharedDevicePlugin:
      deploy: true
      imagePullSecrets: <ngc-image-secret>
    
    sriovDevicePlugin:
      deploy: true
      imagePullSecrets: <ngc-image-secret>
      resources:
        - name: hostdev
          vendors: [15b3] # Mellanox (NVIDIA Networking) PCI vendor ID
    secondaryNetwork:
      deploy: true
      multus:
        deploy: true
      cniPlugins:
        deploy: true
      ipamPlugin:
        deploy: true
  9. Add the NVIDIA NGC Helm repository.
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --username='$oauthtoken' --password=${NGC_API_KEY} && helm repo update
  10. Install the NVIDIA Network Operator, passing values.yaml file as a parameter.
    helm install network-operator nvidia/network-operator -n nvidia-network-operator --create-namespace --version v23.5.0 -f values.yaml --debug
    Wait until the operation completes.
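    You can optionally check the release status while you wait.
    helm status network-operator -n nvidia-network-operator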
  11. Verify that the Network Operator pods are running.
    kubectl -n nvidia-network-operator get pods
    
    NAME                                                             READY   STATUS      RESTARTS   AGE
    cni-plugins-ds-jqd8f                                             1/1     Running     0          4d23h
    cni-plugins-ds-pqq4x                                             1/1     Running     0          4d23h
    kube-multus-ds-2h7sm                                             1/1     Running     0          4d23h
    kube-multus-ds-hwfdg                                             1/1     Running     0          4d23h
    mofed-ubuntu20.04-ds-27815                                       1/1     Running     0          4d23h
    mofed-ubuntu20.04-ds-4zth4                                       1/1     Running     0          4d23h
    network-operator-57cf95446-722tl                                 1/1     Running     0          4d23h
    network-operator-node-feature-discovery-master-848d8b8cdf-667wh  1/1     Running     0          4d23h
    network-operator-node-feature-discovery-master-worker-h5x74      1/1     Running     0          4d23h
    network-operator-node-feature-discovery-master-worker-j5stf      1/1     Running     0          4d23h
    rdma-shared-dp-ds-7g6s5                                          1/1     Running     0          4d23h
    rdma-shared-dp-ds-b6pgc                                          1/1     Running     0          4d23h
    rdma-shared-dp-ds-j2m84                                          0/1     Running     0          4d23h
    sriov-device-plugin-22cv9                                        0/1     Running     0          4d23h
    sriov-device-plugin-6ktpf                                        0/1     Running     0          4d23h
    whereabouts-c1951                                                0/1     Running     0          4d23h
    whereabouts-tkw8t                                                0/1     Running     0          4d23h
  12. Apply a host-device-net.yaml file.
    kubectl apply -f host-device-net.yaml
    host-device-net.yaml is the configuration file for the Kubernetes networking deployment. The file defines a hostdev custom resource that pods can request when they are created. The Whereabouts IPAM configuration can be customized according to your needs, as shown in the sketch after this step.
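    The following is a minimal sketch of host-device-net.yaml. It assumes the hostdev resource name defined in values.yaml and an example Whereabouts IP range; adjust both for your environment.
    apiVersion: mellanox.com/v1alpha1
    kind: HostDeviceNetwork
    metadata:
      name: hostdev-net
    spec:
      networkNamespace: "default"
      resourceName: "hostdev"
      ipam: |
        {
          "type": "whereabouts",
          "range": "192.168.3.225/28"
        }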
  13. Verify that the custom resource was created successfully.
    kubectl get HostDeviceNetwork
    NAME                   STATUS   AGE
    hostdev-net            ready    2024-02-28T17:22:38Z
  14. Verify that the nvidia-peermem-ctr container successfully loaded the nvidia-peermem kernel module.
    kubectl logs -n gpu-operator ds/nvidia-driver-daemonset -c nvidia-peermem-ctr
    Found 4 pods, using pod/nvidia-driver-daemonset-66rnx
    DRIVER_ARCH is x86_64
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    successfully loaded nvidia-peermem module, now waiting for signal