VMware Private AI Foundation with NVIDIA runs on top of VMware Cloud Foundation, adding support for AI workloads in VI workload domains with the vSphere IaaS control plane, provisioned by using kubectl and VMware Aria Automation.

Figure 1. Example Architecture for VMware Private AI Foundation with NVIDIA


Table 1. Components for Running AI Workloads in VMware Private AI Foundation with NVIDIA
Component Description
GPU-enabled ESXi hosts ESXi hosts that are configured in the following way:
  • Have an NVIDIA GPU that is supported for VMware Private AI Foundation with NVIDIA. The GPU is shared between workloads by using time slicing or the Multi-Instance GPU (MIG) mechanism. See Supported NVIDIA GPU Devices.
  • Have the NVIDIA vGPU host driver installed so that you can use vGPU profiles based on MIG or time slicing.
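As a hedged sketch, once the NVIDIA vGPU host driver is installed, you can verify the GPU and switch it to MIG mode with nvidia-smi on the ESXi host (the device index 0 below is an assumption):

```shell
# Report the GPU name and current MIG state for device 0
# (requires the NVIDIA vGPU host driver to be installed).
nvidia-smi -i 0 --query-gpu=name,mig.mode.current --format=csv

# Enable MIG mode on GPU 0 so that MIG-backed vGPU profiles
# become available; no workloads may be using the GPU at this point.
nvidia-smi -i 0 -mig 1
```

Leave MIG mode disabled if you plan to use time-slicing vGPU profiles instead.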
Supervisor One or more vSphere clusters enabled for vSphere IaaS control plane so that you can run virtual machines and containers on vSphere by using the Kubernetes API. A Supervisor is a Kubernetes cluster itself, serving as the control plane to manage workload clusters and virtual machines.
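Because a Supervisor exposes the Kubernetes API, interaction typically starts with the vSphere plugin for kubectl. A minimal sketch, in which the Supervisor address, user, and namespace name are placeholders:

```shell
# Log in to the Supervisor by using the vSphere plugin for kubectl.
kubectl vsphere login --server=SUPERVISOR-IP --vsphere-username administrator@vsphere.local

# Switch to a vSphere Namespace (name is a placeholder) and list the
# VM classes and TKG clusters available for AI workloads.
kubectl config use-context my-ai-namespace
kubectl get virtualmachineclasses
kubectl get tanzukubernetesclusters
```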
Harbor registry You can use a Harbor registry in the following cases:
  • In a disconnected environment, as a local image registry where you host the container images downloaded from the NVIDIA NGC catalog.
  • For storing validated ML models.
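For the disconnected case, relocating an image from the NVIDIA NGC catalog to Harbor can be sketched as follows; the Harbor address, project, and image tag are placeholders:

```shell
# On a connected host, pull a container image from the NVIDIA NGC catalog.
docker pull nvcr.io/nvidia/pytorch:24.05-py3

# Retag the image for the local Harbor registry and push it there
# (registry address and project name are placeholders).
docker tag nvcr.io/nvidia/pytorch:24.05-py3 harbor.example.com/nvidia/pytorch:24.05-py3
docker login harbor.example.com
docker push harbor.example.com/nvidia/pytorch:24.05-py3
```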
NSX Edge cluster A cluster of NSX Edge nodes that provides two-tier north-south routing for the Supervisor and the workloads it runs.

The Tier-0 gateway on the NSX Edge cluster is in active-active mode.

NVIDIA Operators
  • NVIDIA GPU Operator. Automates the management of all NVIDIA software components needed to provision GPUs to containers in a Kubernetes cluster. NVIDIA GPU Operator is deployed on a TKG cluster.
  • NVIDIA Network Operator. Automates the configuration of the appropriate Mellanox drivers for containers that use virtual functions for high-speed networking, RDMA, and GPUDirect.

    Network Operator works together with the GPU Operator to enable GPUDirect RDMA on compatible systems.

    NVIDIA Network Operator is deployed on a TKG cluster.
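Deploying the NVIDIA GPU Operator on a TKG cluster is typically done with Helm. A minimal sketch, in which the namespace and the name of the vGPU licensing ConfigMap (which holds the client configuration token from the NVIDIA Licensing Portal) are assumptions:

```shell
# Add the NVIDIA Helm repository and install the GPU Operator.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# driver.licensingConfig points the guest vGPU driver at the ConfigMap
# containing the client configuration token (name is an assumption).
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.licensingConfig.configMapName=licensing-config
```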

Vector database
  • A PostgreSQL database that has the pgvector extension enabled so that you can use it in Retrieval Augmented Generation (RAG) AI workloads.
  • A Milvus database as a reference sample.
NVIDIA Licensing Portal and NVIDIA Delegated License Service (DLS) You use the NVIDIA Licensing Portal to generate a client configuration token that assigns a license to the guest vGPU driver in the deep learning virtual machine and to the GPU Operators on TKG clusters.

In a disconnected environment, or to let your workloads obtain license information without an Internet connection, you host the NVIDIA licenses locally on a Delegated License Service (DLS) appliance.
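The PostgreSQL vector database described above relies on the pgvector extension. How a RAG workload might use it can be sketched as follows; the connection string, table, and sample vectors are placeholders:

```shell
# Connect to a PostgreSQL instance (for example, one provisioned by
# VMware Data Services Manager; the connection string is a placeholder).
psql "postgresql://pgadmin@pg-vector.example.com/ragdb" <<'SQL'
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS embeddings (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO embeddings (embedding) VALUES ('[1,0,0]'), ('[0,1,0]');
-- Nearest-neighbor search by L2 distance, as used for RAG retrieval
SELECT id FROM embeddings ORDER BY embedding <-> '[1,0.1,0]' LIMIT 1;
SQL
```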

Content library Content libraries store the images for the deep learning virtual machines and for the Tanzu Kubernetes releases. You use these images for AI workload deployment within the VMware Private AI Foundation with NVIDIA environment. In a connected environment, content libraries pull their content from VMware-managed public content libraries. In a disconnected environment, you must upload the required images manually or pull them from an internal content library mirror server.
NVIDIA GPU Cloud (NGC) catalog A portal of GPU-optimized containers for AI and machine learning that are tested and ready to run on supported NVIDIA GPUs on premises on top of VMware Private AI Foundation with NVIDIA.
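Pulling from the NGC catalog requires authenticating to the nvcr.io registry. A sketch, in which the image name and tag are placeholders:

```shell
# The user name for nvcr.io is the literal string $oauthtoken; the
# password is an NGC API key generated in your NGC account.
docker login nvcr.io --username '$oauthtoken'

# Then pull a GPU-optimized container, for example:
docker pull nvcr.io/nvidia/pytorch:24.05-py3
```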

As a cloud administrator, you use the management components in VMware Cloud Foundation in the following way:

Table 2. Management Components in VMware Private AI Foundation with NVIDIA
Management Component Description
Management vCenter Server Manage the ESXi hosts that are running the management components of the SDDC and support integration with other solutions for monitoring and management of the virtual infrastructure.
Management NSX Manager Provide networking services to the management workloads in VMware Cloud Foundation.
SDDC Manager
  • Deploy a GPU-enabled VI workload domain that is based on vSphere Lifecycle Manager images and add clusters to it.
  • Deploy an NSX Edge cluster in VI workload domains for use by Supervisor instances and in the management domain for the VMware Aria Suite components of VMware Private AI Foundation with NVIDIA.
  • Deploy a VMware Aria Suite Lifecycle instance that is integrated with the SDDC Manager repository.
VI Workload Domain vCenter Server Enable and configure a Supervisor.
VI Workload Domain NSX Manager SDDC Manager uses this NSX Manager to deploy and update NSX Edge clusters.
NSX Edge Cluster (AVN) Place the VMware Aria Suite components on a pre-defined configuration of NSX segments, called application virtual networks (AVNs), for dynamic routing and load balancing.
VMware Aria Suite Lifecycle Deploy and update VMware Aria Automation and VMware Aria Operations.
VMware Aria Automation Add self-service catalog items that DevOps engineers, data scientists, and MLOps engineers can use to deploy AI workloads.
VMware Aria Operations Monitor the GPU consumption in the GPU-enabled workload domains.
VMware Data Services Manager Create vector databases, such as a PostgreSQL database with the pgvector extension.