After you complete the implementation of the Private AI Ready Infrastructure for VMware Cloud Foundation validated solution and VMware Private AI Foundation with NVIDIA, you perform common operations on the environment, such as examining the operational state of the NVIDIA AI Enterprise (NVAIE) components added during the implementation.

Verify the operational state of the NVAIE Kubernetes Operators and ESXi host components by checking their state and health status.

Validate that the NVAIE components are properly functioning and ready for GPU-enabled workloads running on top of VMware Cloud Foundation.

Prerequisites

Install the vSphere kubectl plug-in to connect to the Supervisor as a vCenter Single Sign-On user. See Download and Install the Kubernetes CLI Tools for vSphere.

Verify the Status of the ESXi Host Components for Private AI Ready Infrastructure for VMware Cloud Foundation

Verify the operational state of the ESXi host by checking its state and health status.

Expected Outcomes

  • The ESXi host has access to the NVIDIA System Management Interface (nvidia-smi).

Procedure

  1. Enable SSH on the ESXi host.
    For instructions, see Enable Access to the ESXi Shell.
  3. Log in to the ESXi host as root over SSH.
  3. Run the nvidia-smi command.
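
    A healthy host returns the installed NVIDIA host driver version and one table row per GPU. The following is a minimal sketch of the check, assuming the NVIDIA vGPU host driver VIB is already installed on the host; the VIB name filter (nvd) is an assumption and can differ between driver releases.
    # List the GPUs and report the NVIDIA host driver version.
    nvidia-smi
    # Confirm that the NVIDIA host driver VIB is installed (the NVD name prefix is an assumption).
    esxcli software vib list | grep -i nvd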

Verify the Status of the GPU Operator for Private AI Ready Infrastructure for VMware Cloud Foundation

Verify the operational state of the NVIDIA GPU Operator by checking its state and health status.

Expected Outcomes

  • All GPU Operator pods have a Running status.
  • All GPU Operator pods have a Ready status.
  • The GPU Operator License returns a Licensed status.

Procedure

  1. Log in to the Supervisor as a vCenter Server Single Sign-On user by running the command.
    kubectl vsphere login --server Supervisor_cluster_IP_address --vsphere-username Supervisor_cluster_administrator --insecure-skip-tls-verify
  2. Verify that the GPU Operator pods have a Running status and are Ready by running the command. A scripted variant of this step and the next appears after this procedure.
    kubectl get pods -n gpu-operator
  3. Verify that the GPU Operator license has a Licensed status by running the command.
    ctnname=`kubectl get pods -n gpu-operator | grep driver-daemonset | head -1 | cut -d " " -f1`
    
    kubectl -n gpu-operator exec -it $ctnname -- /bin/bash -c "/usr/bin/nvidia-smi -q | grep -i lic"
    Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
        vGPU Software Licensed Product
            License Status                    : Licensed (Expiry: 2024-2-28 21:22:44 GMT)
        Applications Clocks
        Default Applications Clocks
        Clock Policy
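
The following is a minimal scripted sketch of steps 2 and 3, assuming the default gpu-operator namespace and the app=nvidia-driver-daemonset pod label used by a default GPU Operator deployment. It waits for all pods to become Ready and then queries the vGPU license status inside the driver container.

  # Wait until all GPU Operator pods report the Ready condition (5-minute timeout).
  kubectl wait --for=condition=Ready pods --all -n gpu-operator --timeout=300s
  # Pick the first driver daemon set pod and query its license status.
  ctnname=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
  kubectl -n gpu-operator exec $ctnname -c nvidia-driver-ctr -- /bin/bash -c "/usr/bin/nvidia-smi -q | grep -i 'license status'"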

What to do next

Troubleshooting Tips

Verify the Status of the Network Operator for Private AI Ready Infrastructure for VMware Cloud Foundation

Verify the operational state of the NVIDIA Network Operator by checking its state and health status.

Expected Outcomes

  • All Network Operator pods have a Running status.
  • The hostdev-net Network Operator custom resource has a Ready status.
  • The nvidia-peermem-ctr container has loaded the nvidia-peermem kernel module.

Procedure

  1. Log in to the Supervisor as a vCenter Server Single Sign-On user by running the command.
    kubectl vsphere login --server Supervisor_cluster_IP_address --vsphere-username Supervisor_cluster_administrator --insecure-skip-tls-verify
  2. Verify that the Network Operator pods have a Running status by running the command. A scripted variant of this and the following steps appears after this procedure.
    kubectl -n nvidia-network-operator get pods
  3. Verify that the Network Operator custom resource has a Ready status by running the command.
    kubectl get HostDeviceNetwork
  4. Verify that the nvidia-peermem-ctr container has loaded the nvidia-peermem kernel module by running the command.
    kubectl logs -n gpu-operator ds/nvidia-driver-daemonset -c nvidia-peermem-ctr
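
The following is a minimal scripted sketch of steps 2 to 4, assuming the nvidia-network-operator namespace, the hostdev-net custom resource name, and the nvidia-driver-daemonset daemon set used in this guide.

  # Confirm that all Network Operator pods report the Ready condition (5-minute timeout).
  kubectl wait --for=condition=Ready pods --all -n nvidia-network-operator --timeout=300s
  # Confirm that the hostdev-net custom resource reports a Ready status.
  kubectl get HostDeviceNetwork hostdev-net
  # Confirm that the nvidia-peermem kernel module was loaded by the nvidia-peermem-ctr container.
  kubectl logs -n gpu-operator ds/nvidia-driver-daemonset -c nvidia-peermem-ctr | grep -i peermem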

What to do next

Troubleshooting Tips