After the Supervisor is configured, complete the configuration of the vSphere with Tanzu environment by deploying a Tanzu Kubernetes Grid (TKG) cluster on the Supervisor with the kubectl command-line tool.

To perform these operations, you can also use the fully automated, self-service approach that is part of the VMware Private AI Foundation with NVIDIA solution add-on.

Create a Namespace for the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

To run applications that require upstream Kubernetes compliance, you can provision a Tanzu Kubernetes Grid cluster.

Tanzu Kubernetes clusters are fully upstream-compliant Kubernetes clusters that run on top of your Supervisor.

To help you to organize and manage your development projects, you can optionally divide the clusters into vSphere namespaces.

Procedure

  1. Log in to the VI workload domain vCenter Server at https://<vi_workload_vcenter_server_fqdn>/ui as administrator@vsphere.local.
  2. From the vSphere Client Menu, select Workload Management.

  3. On the Workload Management page, click the Namespaces tab and click New Namespace.

  4. In the Create Namespace dialog box, select the Supervisor, enter a name for the namespace, and click Create.

  5. Click the Storage tab for the newly created vSphere namespace.

  6. Under Storage Policies, click Edit.

  7. In the Select Storage Policies dialog box, select the storage policy that you created earlier and click OK.
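You can optionally confirm the result from the command line. After you log in to the Supervisor by using kubectl, the new vSphere namespace appears as a Kubernetes namespace. The server address and user name below are the example values used in the procedures later in this guide.

  kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_User
  kubectl get namespaces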

Assign the New Tanzu Cluster Namespace Roles to Active Directory Groups for VMware Cloud Foundation

You assign roles for the Namespace to Active Directory groups. You can later assign access to users by adding them to these groups. You assign access to separate Active Directory groups for the edit and view roles in the Namespace. External identity providers, such as Okta, are also supported.

Procedure

  1. Log in to the VI workload domain vCenter Server at https://<vi_workload_vcenter_server_fqdn>/ui as administrator@vsphere.local.
  2. From the vSphere Client Menu, select Workload Management.

  3. On the Workload Management page, on the Namespaces tab, click the new Namespace.

  4. Click the Permissions tab.
  5. Provide edit permissions to your Active Directory group intended for admins for the Namespace.

    1. Click Add.

    2. In the Add Permissions dialog box, enter the Identity source and User/Group for edit access according to your values in the VMware Cloud Foundation Planning and Preparation Workbook, set the Role to Can edit, and click OK.

  6. Provide read-only permissions to your Active Directory group intended for viewers for the Namespace.

    1. Click Add.

    2. In the Add Permissions dialog box, enter the Identity source and User/Group for read-only access according to your values in the VMware Cloud Foundation Planning and Preparation Workbook, set the Role to Can view, and click OK.

Add GPU-Enabled VM Classes for the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

Before you deploy a GPU-enabled Tanzu Kubernetes Grid cluster that can run AI workloads, you must add one or more VM classes defining access to the GPUs. You then assign these VM classes to the worker nodes of the cluster.

This example uses a guaranteed-large configuration for the control plane nodes and vgpu-a100-16vcpu-128gb for the worker nodes.

Procedure

  1. Log in to the VI workload domain vCenter Server at https://<vi_workload_vcenter_server_fqdn>/ui as administrator@vsphere.local.
  2. From the vSphere Client Menu, select Workload Management.

  3. On the Workload Management page, on the Services tab, click the VM Service card.

  4. On the VM Service page, click the VM Classes tab.

  5. Click Create VM Class.

  6. On the Name page of the Create VM Class wizard, enter a name for the VM class and click Next.

    For example: vgpu-a100-16vcpu-128gb.

  7. On the Compatibility page, select ESXi 8.0 U2 and later and click Next.
  8. Click the Configuration > Virtual Hardware tab.
  9. Add the GPU device to the VM class.
    1. Select Add New Device > PCI Device.
    2. Select the desired NVIDIA GRID vGPU device from the list according to the GPU model and GPU sharing mode.

      There are two types of NVIDIA GRID vGPU profiles: time sharing and Multi-Instance GPU (MIG) sharing. The system detects the profile type when you select the device.

      Note: You can add only one NVIDIA GRID vGPU device with a MIG profile to a VM class.
    3. Click Select.

      The new PCI device appears on the Virtual Hardware tab.

  10. Configure the desired settings for CPU, Memory, New PCI Device, Video Card, and Security Devices.

    Table 1. CPU Configuration
    • CPU: Assign at least 16 virtual CPU cores.
    • CPU Topology: Assigned at power on.
    • Reservation: Must be between 0 and 10 MHz.
    • Limit: Must be greater than or equal to 10 MHz.
    • Shares: Options are Low, Normal, High, and Custom.
    • Hardware virtualization: Select this option to expose hardware-assisted virtualization to the guest OS.
    • Performance Counters: Enable virtualized CPU performance counters.
    • Scheduling Affinity: Select a physical processor affinity for this virtual machine. Use '-' for ranges and ',' to separate values. For example, "0, 2, 4-7" indicates processors 0, 2, 4, 5, 6, and 7. Clear the string to remove affinity settings.
    • I/O MMU: Select to enable the I/O memory management unit (IOMMU).
    Table 2. Memory Configuration
    • Memory: Set at least 64 GB of memory.
    • Reservation: Specify the guaranteed minimum allocation for the virtual machine, or reserve all guest memory. If the reservation cannot be met, the VM cannot run.
    • Limit: Set a limit on the memory consumption of the VM.
    • Shares: Select the amount of memory to share. Shares represent a relative metric for allocating memory capacity. For more information, see Memory Sharing.
    • Memory Hot Plug: Enable to allow memory to be added to the VM while it is powered on. See Memory Hot Add Settings for details.
    Table 3. Video Card Configuration
    • Video Card: Choose to auto-detect settings from the hardware or enter custom settings. If you select auto-detect, the other settings are not configurable.
    • Number of displays: Select the number of displays.
    • Total video memory: Enter the total video memory, in MB.
    • 3D Graphics: Select to enable 3D support.
    Table 4. Security Devices Configuration
    • Security Device: If the SGX security device is installed, you can configure the VM settings here; otherwise, this field is not configurable. See the SGX documentation for details.
  11. For the GPUDirect feature, click the Advanced Parameters tab and add the following attribute-value pairs.
    • pciPassthru.allowP2P=True

    • pciPassthru.RelaxACSforP2P=True

  12. Click Next and click Finish.
  13. Repeat the steps to create VM classes for the other vGPU profiles you plan to use for cluster worker nodes.
  14. Add the VM class to the namespace for the GPU-enabled Tanzu Kubernetes Grid clusters.
    1. On the Workload Management page, on the Namespaces tab, click the namespace, and then click the Summary tab.
    2. In the VM Service card, click the Manage VM Classes link.

    3. Select the GPU-enabled vgpu-a100-16vcpu-128gb class and the guaranteed-large VM class.

    4. Select other VM classes required for your cluster control and worker nodes.
    5. Click OK.
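To confirm the VM classes associated with the namespace, you can optionally log in to the Supervisor by using kubectl, switch the context to the namespace, and list the classes.

  kubectl get virtualmachineclasses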

Provision a Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

Provision a Tanzu Kubernetes Grid cluster by using kubectl and a YAML file for input. The command prompt procedure uses example values from the VMware Cloud Foundation Planning and Preparation Workbook.

For the PowerShell procedure, you must know the path where kubectl and kubectl-vsphere binaries are located. The path is required in the $kubectlBinLocation variable.

Command Prompt Procedure

  1. In a command prompt, log in to the Supervisor by using kubectl.

    kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_User
  2. Switch the kubectl context to the vSphere namespace where you deploy the cluster.

    kubectl config use-context Tanzu_Kubernetes_Namespace
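    If you are not sure of the context name, list the available contexts first.
    kubectl config get-contexts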
  3. Create a sfo-w01-tkc01.yaml text file with the following specifications.

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: sfo-w01-tkc01
      namespace: Tanzu_Kubernetes_Namespace
    spec:
      clusterNetwork:
        services:
          cidrBlocks: ["198.51.100.0/12"]
        pods:
          cidrBlocks: ["192.0.2.0/16"]
        serviceDomain: "cluster.local"
      topology:
        class: tanzukubernetescluster
        version: v1.26.5---vmware.2-fips.1-tkg.1
        controlPlane:
          replicas: 3
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
        workers:
          machineDeployments:
            - class: node-pool
              name: node-pool-gpu
              replicas: 2
              metadata:
                annotations:
                  run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu
              variables:
                overrides:
                - name: vmClass
                  value: vgpu-a100-16vcpu-128gb
        variables:
          - name: vmClass
            value: guaranteed-large
          - name: storageClass
            value: vsphere-with-tanzu-storage-policy
          - name: defaultStorageClass
            value: vsphere-with-tanzu-storage-policy
          - name: nodePoolVolumes
            value:
              - name: containerd
                capacity:
                  storage: 50Gi 
                mountPath: /var/lib/containerd
                storageClass: vsphere-with-tanzu-storage-policy
              - name: kubelet
                capacity:
                  storage: 25Gi 
                mountPath: /var/lib/kubelet
                storageClass: vsphere-with-tanzu-storage-policy
  4. Use kubectl to deploy the Tanzu Kubernetes Grid cluster from your YAML file input.
    kubectl apply -f ./sfo-w01-tkc01.yaml
  5. After the deployment of the Tanzu Kubernetes Grid cluster completes, run kubectl to verify the Tanzu Kubernetes Grid cluster status.
    kubectl get cluster
    NAME            PHASE         AGE   VERSION
    sfo-w01-tkc01   Provisioned   6m    v1.26.5+vmware.2-fips.1
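    While the cluster is provisioning, you can also watch the underlying Cluster API machine objects in the namespace as an optional check.
    kubectl get machines -n Tanzu_Kubernetes_Namespace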
  6. Log in to the new Tanzu Kubernetes Grid cluster and run kubectl to verify the status of the control plane and worker nodes.
    kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Admin --tanzu-kubernetes-cluster-namespace Tanzu_Kubernetes_Namespace --tanzu-kubernetes-cluster-name Tanzu_Kubernetes_Cluster_Name --insecure-skip-tls-verify
    
    kubectl get nodes
    
    NAME                                                STATUS   ROLES           AGE   VERSION
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-5d6zn   Ready    <none>          13m   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-nxscv   Ready    <none>          13m   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2ptdr                           Ready    control-plane   15m   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2vnx6                           Ready    control-plane   11m   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-66hxn                           Ready    control-plane   13m   v1.26.5+vmware.2-fips.1

Install the NVIDIA GPU Operator for Private AI Ready Infrastructure for VMware Cloud Foundation

Install the NVIDIA GPU Operator to automate the management of all NVIDIA software components needed to provision vGPU.

The command prompt steps use example values from the VMware Cloud Foundation Planning and Preparation Workbook.

For more information on the deployment and verification procedures, see the Appendix of Deploying Enterprise-Ready Generative AI on VMware Private AI.

Prerequisites

  • Determine the required NVIDIA GPU Operator version according to the GPU model, required features, operating system version, and driver version.

    See NVIDIA GPU Operator Component Matrix and the NVIDIA GPU Operator Release Notes.

  • Verify that you have the NVIDIA vGPU license file, downloaded from the NVIDIA Licensing Portal.
  • Verify that you have the API key to pull NVAIE containers from NVIDIA NGC enterprise catalog.
  • On the machine with the Kubernetes CLI Tools, install Helm.

Procedure

  1. In a command prompt on the machine with the Kubernetes CLI tools, log in to the Tanzu Kubernetes Grid cluster by running kubectl.
    kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_User --tanzu-kubernetes-cluster-namespace Tanzu_Kubernetes_Namespace --tanzu-kubernetes-cluster-name Tanzu_Kubernetes_Cluster_Name 
  2. Create a gpu-operator namespace.
    kubectl create namespace gpu-operator
  3. Verify that the namespace has been created.
    kubectl get namespaces
    
    NAME                           STATUS   AGE
    default                        Active   64m
    gpu-operator                   Active   6s
    kube-node-lease                Active   64m
    kube-public                    Active   64m
    kube-system                    Active   64m
    secretgen-controller           Active   62m
    tkg-system                     Active   63m
    vmware-system-antrea           Active   62m
    vmware-system-auth             Active   62m
    vmware-system-cloud-provider   Active   63m
    vmware-system-csi              Active   63m
    vmware-system-tkg              Active   63m
  4. Create a gridd.conf configuration file. When the license is supplied through the client configuration token, the file can remain empty.
    sudo touch gridd.conf
  5. Create a ConfigMap in the gpu-operator namespace.
    You can use a ConfigMap to store non-confidential data in key-value pairs. You add both the vGPU configuration file and the NVIDIA license token to this ConfigMap.
    kubectl create configmap licensing-config -n gpu-operator --from-file=<path>/gridd.conf --from-file=<path>/client_configuration_token.tok
  6. Verify that the contents of the ConfigMap have been populated by describing the ConfigMap.
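    kubectl describe configmap licensing-config -n gpu-operator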
    Name:         licensing-config
    Namespace:    gpu-operator
    Labels:       <none>
    Annotations:  <none>
    
    Data
    ====
    gridd.conf:
    ----
    
    client_configuration_token.tok:
    ----
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    
    BinaryData
    ====
    
    Events:  <none>
  7. Create a pull secret object in the gpu-operator namespace.
    A secret is an object that contains a small amount of sensitive data such as a password, a token, or a key. Such information might otherwise be put in a pod specification or in a container image. Using a secret object means that you do not need to include confidential data in your application code. This secret object is used to pull the required images from the NVIDIA NGC registry.
    export REGISTRY_SECRET_NAME=<your-ngc-secret>
    export PRIVATE_REGISTRY=nvcr.io/nvaie
    export NGC_API_KEY=<your-ngc-api-key>
    kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \
    --docker-server=${PRIVATE_REGISTRY} \
    --docker-username='$oauthtoken' \
    --docker-password=${NGC_API_KEY} \
    --docker-email='<your-email>' \
    -n gpu-operator
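    You can optionally confirm that the secret was created.
    kubectl get secrets -n gpu-operator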
  8. Add the NVAIE Helm repository. The password is the NGC API key for accessing the NVIDIA NGC catalog.
    helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=${NGC_API_KEY} && helm repo update
  9. Set the required Pod Security admission policy on the gpu-operator namespace.
    kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
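    You can optionally verify the label on the namespace.
    kubectl get namespace gpu-operator --show-labels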
  10. Install NVIDIA GPU Operator by using Helm.
    helm install --wait gpu-operator nvaie/gpu-operator-4-2 -n gpu-operator \
      --set driver.repository=nvcr.io/nvaie \
      --set operator.repository=nvcr.io/nvaie \
      --set driver.imagePullPolicy=Always \
      --set migStrategy=mixed \
      --set driver.rdma.enabled=True
  11. Verify that the GPU Operator pods are running.
    kubectl get pods -n gpu-operator
    
    NAME                                                          READY   STATUS      RESTARTS   AGE
    gpu-feature-discovery-9zv52                                   1/1     Running     0          7d6h
    gpu-feature-discovery-pv4p4                                   1/1     Running     0          7d6h
    gpu-feature-discovery-zms5s                                   1/1     Running     0          55d
    gpu-operator-dc844b566-w9mjl                                  1/1     Running     0          55d
    gpu-operator-node-feature-discovery-master-79bc547944-rzp4v   1/1     Running     0          55d
    gpu-operator-node-feature-discovery-worker-7m5ht              1/1     Running     0          7d6h
    gpu-operator-node-feature-discovery-worker-llz7k              1/1     Running     0          7d6h
    gpu-operator-node-feature-discovery-worker-zk7mt              1/1     Running     0          55d
    nvidia-container-toolkit-daemonset-pswbb                      1/1     Running     0          7d6h
    nvidia-container-toolkit-daemonset-tlqfn                      1/1     Running     0          7d6h
    nvidia-container-toolkit-daemonset-zm48q                      1/1     Running     0          55d
    nvidia-cuda-validator-fmwsh                                   0/1     Completed   0          55d
    nvidia-cuda-validator-qdz6r                                   0/1     Completed   0          7d6h
    nvidia-cuda-validator-x7mkj                                   0/1     Completed   0          7d6h
    nvidia-dcgm-exporter-c7dwd                                    1/1     Running     0          7d6h
    nvidia-dcgm-exporter-mc4x8                                    1/1     Running     0          55d
    nvidia-dcgm-exporter-xnpvp                                    1/1     Running     0          7d6h
    nvidia-device-plugin-daemonset-92pf4                          1/1     Running     0          7d6h
    nvidia-device-plugin-daemonset-m276d                          1/1     Running     0          55d
    nvidia-device-plugin-daemonset-v62nj                          1/1     Running     0          7d6h
    nvidia-device-plugin-validator-8d2jr                          0/1     Completed   0          7d6h
    nvidia-device-plugin-validator-cfkrl                          0/1     Completed   0          7d6h
    nvidia-device-plugin-validator-wltdz                          0/1     Completed   0          55d
    nvidia-driver-daemonset-7g6nj                                 1/1     Running     0          7d6h
    nvidia-driver-daemonset-8bwsx                                 1/1     Running     0          55d
    nvidia-driver-daemonset-fhz56                                 1/1     Running     0          7d6h
    nvidia-operator-validator-5zs4b                               1/1     Running     0          55d
    nvidia-operator-validator-hp5dt                               1/1     Running     0          7d6h
    nvidia-operator-validator-qrfj8                               1/1     Running     0          7d6h
  12. Verify that the license is valid.
    ctnname=`kubectl get pods -n gpu-operator | grep driver-daemonset | head -1 | cut -d " " -f1`
    
    kubectl -n gpu-operator exec -it $ctnname -- /bin/bash -c "/usr/bin/nvidia-smi -q | grep -i lic"
    
    Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
        vGPU Software Licensed Product
            License Status                    : Licensed (Expiry: 2024-2-28 21:22:44 GMT)
        Applications Clocks
        Default Applications Clocks
        Clock Policy

Install the NVIDIA Network Operator for Private AI Ready Infrastructure for VMware Cloud Foundation

The NVIDIA Network Operator leverages Kubernetes custom resources and the Kubernetes Operator framework to optimize the networking for vGPU.

The command prompt steps use example values from the VMware Cloud Foundation Planning and Preparation Workbook.

For more information on the deployment and verification procedures, see the Appendix of Deploying Enterprise-Ready Generative AI on VMware Private AI.

Procedure

  1. In a command prompt on the machine with the Kubernetes CLI tools, log in to the Tanzu Kubernetes Grid cluster by running kubectl.
    kubectl vsphere login --server 192.168.21.2 --vsphere-username Supervisor_Cluster_Admin --tanzu-kubernetes-cluster-namespace Tanzu_Kubernetes_Namespace --tanzu-kubernetes-cluster-name Tanzu_Kubernetes_Cluster_Name --insecure-skip-tls-verify
  2. List all the nodes in the cluster.
    kubectl get nodes
    NAME                                                STATUS   ROLES           AGE   VERSION
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-5d6zn   Ready    <none>          18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-nxscv   Ready    <none>          18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2ptdr                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2vnx6                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-66hxn                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
  3. Label the worker nodes with the worker role.
    Do not change the control plane label.
    kubectl label node sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-5d6zn sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-nxscv node-role.kubernetes.io/worker=worker
  4. Verify that the worker nodes are properly labeled.
    kubectl get nodes
    NAME                                                STATUS   ROLES           AGE   VERSION
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-5d6zn   Ready    worker          18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-node-pool-gpu-9w5jr-768f85ccd-nxscv   Ready    worker          18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2ptdr                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-2vnx6                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
    sfo-w01-tkc01-vvjgd-66hxn                           Ready    control-plane   18h   v1.26.5+vmware.2-fips.1
  5. Create a nvidia-network-operator namespace.
    kubectl create namespace nvidia-network-operator
  6. Verify that the namespace has been created.
    kubectl get namespaces
    NAME                           STATUS   AGE
    default                        Active   18h
    gpu-operator                   Active   17h
    kube-node-lease                Active   18h
    kube-public                    Active   18h
    kube-system                    Active   18h
    nvidia-network-operator        Active   5s
    secretgen-controller           Active   18h
    tkg-system                     Active   18h
    vmware-system-antrea           Active   18h
    vmware-system-auth             Active   18h
    vmware-system-cloud-provider   Active   18h
    vmware-system-csi              Active   18h
    vmware-system-tkg              Active   18h
  7. To be able to pull the Network Operator images during the Helm installation, create a secret in the nvidia-network-operator namespace.
    kubectl create secret docker-registry ngc-image-secret -n nvidia-network-operator --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password='YOUR NVIDIA API KEY' --docker-email='YOUR NVIDIA NGC EMAIL'
  8. Create the values.yaml file for the Network Operator deployment.
    nfd:
      enabled: true
    sriovNetworkOperator:
      enabled: false
    # NicClusterPolicy CR values:
    deployCR: true
    ofedDriver:
      deploy: true
    
    rdmaSharedDevicePlugin:
      deploy: true
      imagePullSecrets: <ngc-image-secret>
    
    sriovDevicePlugin:
      deploy: true
      imagePullSecrets: <ngc-image-secret>
      resources:
        - name: hostdev
          vendors: [15b3] # Mellanox (NVIDIA Networking) PCI vendor ID
    secondaryNetwork:
      deploy: true
      multus:
        deploy: true
      cniPlugins:
        deploy: true
      ipamPlugin:
        deploy: true
  9. Add the NVIDIA NGC Helm repository.
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --username='$oauthtoken' --password=${NGC_API_KEY} && helm repo update
  10. Install the NVIDIA Network Operator, passing values.yaml file as a parameter.
    helm install network-operator nvidia/network-operator -n nvidia-network-operator --create-namespace --version v23.5.0 -f values.yaml --debug
    Wait until the operation completes.
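    You can optionally check the release status while you wait.
    helm status network-operator -n nvidia-network-operator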
  11. Verify that the Network Operator pods are running.
    kubectl -n nvidia-network-operator get pods
    
    NAME                                                             READY   STATUS      RESTARTS   AGE
    cni-plugins-ds-jqd8f                                             1/1     Running     0          4d23h
    cni-plugins-ds-pqq4x                                             1/1     Running     0          4d23h
    kube-multus-ds-2h7sm                                             1/1     Running     0          4d23h
    kube-multus-ds-hwfdg                                             1/1     Running     0          4d23h
    mofed-ubuntu20.04-ds-27815                                       1/1     Running     0          4d23h
    mofed-ubuntu20.04-ds-4zth4                                       1/1     Running     0          4d23h
    network-operator-57cf95446-722tl                                 1/1     Running     0          4d23h
    network-operator-node-feature-discovery-master-848d8b8cdf-667wh  1/1     Running     0          4d23h
    network-operator-node-feature-discovery-master-worker-h5x74      1/1     Running     0          4d23h
    network-operator-node-feature-discovery-master-worker-j5stf      1/1     Running     0          4d23h
    rdma-shared-dp-ds-7g6s5                                          1/1     Running     0          4d23h
    rdma-shared-dp-ds-b6pgc                                          1/1     Running     0          4d23h
    rdma-shared-dp-ds-j2m84                                          0/1     Running     0          4d23h
    sriov-device-plugin-22cv9                                        0/1     Running     0          4d23h
    sriov-device-plugin-6ktpf                                        0/1     Running     0          4d23h
    whereabouts-c1951                                                0/1     Running     0          4d23h
    whereabouts-tkw8t                                                0/1     Running     0          4d23h
  12. Apply a host-device-net.yaml file.
    kubectl apply -f host-device-net.yaml
    host-device-net.yaml is the configuration file for the Kubernetes networking deployment. The file defines a hostdev custom resource that pods can request when they are created. The Whereabouts IPAM configuration can be customized according to your needs, as shown in the sketch after this step.
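    The following is a minimal sketch of host-device-net.yaml. It assumes the hostdev resource name defined in values.yaml and an example Whereabouts IP range; adjust both for your environment.
    apiVersion: mellanox.com/v1alpha1
    kind: HostDeviceNetwork
    metadata:
      name: hostdev-net
    spec:
      networkNamespace: "default"
      resourceName: "hostdev"
      ipam: |
        {
          "type": "whereabouts",
          "range": "192.168.3.225/28"
        }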
  13. Verify that the custom resource was created successfully.
    kubectl get HostDeviceNetwork
    NAME                   STATUS   AGE
    hostdev-net            ready    2024-02-28T17:22:38Z
  14. Verify that the nvidia-peermem-ctr container successfully loaded the nvidia-peermem kernel module.
    kubectl logs -n gpu-operator ds/nvidia-driver-daemonset -c nvidia-peermem-ctr
    Found 4 pods, using pod/nvidia-driver-daemonset-66rnx
    DRIVER_ARCH is x86_64
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    waiting for mellanox ofed and nvidia drivers to be installed
    successfully loaded nvidia-peermem module, now waiting for signal