After you complete the implementation of the Private AI Ready Infrastructure for VMware Cloud Foundation validated solution and VMware Private AI Foundation with NVIDIA, you perform common operations on the environment, such as examining the operational state of the NVIDIA AI Enterprise (NVAIE) components added during the implementation.

Verify the operational state of the NVAIE Kubernetes Operators and ESXi host components by checking their state and health status.

Validate that the NVAIE components are properly functioning and ready for GPU-enabled workloads running on top of VMware Cloud Foundation.

Prerequisites

Install the vSphere kubectl plug-in to connect to the Supervisor as a vCenter Single Sign-On user. See Download and Install the Kubernetes CLI Tools for vSphere.

Verify the Status of the ESXi Host Components for Private AI Ready Infrastructure for VMware Cloud Foundation

Verify the operational state of the ESXi host by checking its state and health status.

Expected Outcomes

  • The ESXi host has access to the NVIDIA System Management Interface (nvidia-smi).

Procedure

  1. Enable SSH on the ESXi host.
    For instructions, see Enable Access to the ESXi Shell.
  3. Log in to the ESXi host as root over SSH.
  3. Run the nvidia-smi command.
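
    A healthy host returns the installed NVIDIA host driver version and one table row per GPU. The following is a minimal sketch of the check, assuming the NVIDIA vGPU host driver VIB is already installed on the host; the VIB name filter (nvd) is an assumption and can differ between driver releases.
    # List the GPUs and report the NVIDIA host driver version.
    nvidia-smi
    # Confirm that the NVIDIA host driver VIB is installed (the NVD name prefix is an assumption).
    esxcli software vib list | grep -i nvd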

Verify the Status of the GPU Operator for Private AI Ready Infrastructure for VMware Cloud Foundation

Verify the operational state of the NVIDIA GPU Operator by checking its state and health status.

Expected Outcomes

  • All GPU Operator pods have a Running status.
  • All GPU Operator pods have a Ready status.
  • The GPU Operator License returns a Licensed status.

Procedure

  1. Log in to the Supervisor as a vCenter Server Single Sign-On user by running the command.
    kubectl vsphere login --server Supervisor_cluster_IP_address --vsphere-username Supervisor_cluster_administrator --insecure-skip-tls-verify
  2. Verify that the GPU Operator pods have a Running status and are Ready by running the command. A scripted variant of this step and the next appears after this procedure.
    kubectl get pods -n gpu-operator
  3. Verify that the GPU Operator license has a Licensed status by running the command.
    ctnname=`kubectl get pods -n gpu-operator | grep driver-daemonset | head -1 | cut -d " " -f1`
    
    kubectl -n gpu-operator exec -it $ctnname -- /bin/bash -c "/usr/bin/nvidia-smi -q | grep -i lic"
    Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
        vGPU Software Licensed Product
            License Status                    : Licensed (Expiry: 2024-2-28 21:22:44 GMT)
        Applications Clocks
        Default Applications Clocks
        Clock Policy
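
The following is a minimal scripted sketch of steps 2 and 3, assuming the default gpu-operator namespace and the app=nvidia-driver-daemonset pod label used by a default GPU Operator deployment. It waits for all pods to become Ready and then queries the vGPU license status inside the driver container.

  # Wait until all GPU Operator pods report the Ready condition (5-minute timeout).
  kubectl wait --for=condition=Ready pods --all -n gpu-operator --timeout=300s
  # Pick the first driver daemon set pod and query its license status.
  ctnname=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
  kubectl -n gpu-operator exec $ctnname -c nvidia-driver-ctr -- /bin/bash -c "/usr/bin/nvidia-smi -q | grep -i 'license status'"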

What to do next

Troubleshooting Tips

Verify the Status of the Network Operator for Private AI Ready Infrastructure for VMware Cloud Foundation

Verify the operational state of the NVIDIA Network Operator by checking its state and health status.

Expected Outcomes

  • All Network Operator pods have a Running status.
  • The hostdev-net Network Operator custom resource has a Ready status.
  • The nvidia-peermem-ctr container has loaded the nvidia-peermem kernel module.

Procedure

  1. Log in to the Supervisor as a vCenter Server Single Sign-On user by running the command.
    kubectl vsphere login --server Supervisor_cluster_IP_address --vsphere-username Supervisor_cluster_administrator --insecure-skip-tls-verify
  2. Verify that the Network Operator pods have a Running status by running the command. A scripted variant of this and the following steps appears after this procedure.
    kubectl -n nvidia-network-operator get pods
  3. Verify that the Network Operator custom resource has a Ready status by running the command.
    kubectl get HostDeviceNetwork
  4. Verify that the nvidia-peermem-ctr container has loaded the nvidia-peermem kernel module by running the command.
    kubectl logs -n gpu-operator ds/nvidia-driver-daemonset -c nvidia-peermem-ctr
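
The following is a minimal scripted sketch of steps 2 to 4, assuming the nvidia-network-operator namespace, the hostdev-net custom resource name, and the nvidia-driver-daemonset daemon set used in this guide.

  # Confirm that all Network Operator pods report the Ready condition (5-minute timeout).
  kubectl wait --for=condition=Ready pods --all -n nvidia-network-operator --timeout=300s
  # Confirm that the hostdev-net custom resource reports a Ready status.
  kubectl get HostDeviceNetwork hostdev-net
  # Confirm that the nvidia-peermem kernel module was loaded by the nvidia-peermem-ctr container.
  kubectl logs -n gpu-operator ds/nvidia-driver-daemonset -c nvidia-peermem-ctr | grep -i peermem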

What to do next

Troubleshooting Tips