The Private AI Ready Infrastructure for VMware Cloud Foundation validated solution provides design, implementation, and operational guidance for AI GPU-enabled accelerated workload domains that run on vSphere with Tanzu in the Software-Defined Data Center (SDDC) as part of VMware Cloud Foundation.

A VMware by Broadcom validated solution is a well-architected and validated implementation, built and tested by VMware to help customers deliver common business use cases. VMware validated solutions are operational, cost-effective, reliable, and secure. Each solution contains detailed design, implementation, and operational guidance.

This validated solution for private AI ready infrastructure provides technical guidance on how to design and implement a robust private AI ready infrastructure based on VMware Cloud Foundation.

The validated solution also covers the VMware Private AI Foundation with NVIDIA add-on solution on top of VMware Cloud Foundation. You can use VMware Private AI Foundation with NVIDIA for the following use cases:

  • Running deep learning VMs for AI development based on NVIDIA GPUs and NVIDIA DL workloads.
  • Provisioning Tanzu Kubernetes Grid (TKG) clusters for running AI container workloads on top of NVIDIA GPUs.

Automation for This Design in VMware Cloud Foundation

VMware Cloud Foundation™ SDDC Manager® automates the implementation tasks for some design decisions. For the rest of the design decisions, as noted in the design implications, you must perform the implementation steps manually.

To provide a fast and efficient path for automating the AI Ready Infrastructure for VMware Cloud Foundation implementation, this document provides Microsoft PowerShell cmdlets using an open-source module as code-based alternatives to completing certain procedures in each SDDC component's user interface. You can directly reuse the PowerShell commands by replacing the provided sample values with values from your VMware Cloud Foundation Planning and Preparation Workbook.
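The substitution workflow described above can be sketched as follows. This is only an illustration written in Python and is not part of the validated solution. The command name `Invoke-SampleProcedure`, its parameters, and the workbook keys are hypothetical placeholders; the actual cmdlets and sample values are the ones given in each procedure of this document.

```python
# Hypothetical sketch: fill a command template with workbook values.
# The command name, parameters, and keys are illustrative assumptions.

COMMAND_TEMPLATE = (
    "Invoke-SampleProcedure -Server {sddc_manager_fqdn} "
    "-User {sddc_manager_user} -Domain {workload_domain}"
)


def render_command(template: str, workbook: dict) -> str:
    """Return the template with each placeholder replaced by a workbook value.

    str.format raises KeyError when the workbook lacks a value that the
    template needs, so a forgotten value is caught before the command runs.
    """
    return template.format(**workbook)


# Example values; replace them with entries from your own workbook.
workbook_values = {
    "sddc_manager_fqdn": "sddc-manager.example.com",
    "sddc_manager_user": "admin@example.com",
    "workload_domain": "wld-ai-01",
}

print(render_command(COMMAND_TEMPLATE, workbook_values))
```

Keeping the values in one dictionary means each workbook entry is typed once and reused across commands, which reduces the chance of mixing sample values with real ones.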

Intended Audience

The Private AI Ready Infrastructure for VMware Cloud Foundation document is intended for cloud architects, cloud administrators, and DevOps and MLOps practitioners who are familiar with VMware software and want to use it to deploy and manage a workload domain that runs vSphere with Tanzu workloads in the SDDC. The goal is to meet the specific and advanced technical requirements of AI workloads while providing the best possible performance. The document provides guidance for capacity, scalability, backup and restore, and extensibility for disaster recovery support.

Support Matrix

Private AI Ready Infrastructure for VMware Cloud Foundation is compatible with certain versions of the VMware products that are used for implementing the solution.

Table 1. Software Components in Private AI Ready Infrastructure for VMware Cloud Foundation

VMware Cloud Foundation Version | Product Group | Component Versions
5.1.1 | Products part of VMware Cloud Foundation | See VMware Cloud Foundation 5.1.0 Release Notes.
5.1.1 | Solution-added products | VMware Data Services Manager 2.0.2. See VMware Data Services Manager 2.0 Release Notes. VMware Data Services Manager is added as part of VMware Private AI Foundation with NVIDIA.
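As an illustration of how the versions in Table 1 might be checked against an environment, consider the following minimal Python sketch. The inventory dictionary, its keys, and the function name are assumptions made for this example, and the exact string comparison is a simplification; the release notes remain the authoritative source for compatible builds.

```python
# Sketch: compare an environment inventory with the versions in Table 1.
# The inventory format and exact-match rule are illustrative assumptions.

REQUIRED_VERSIONS = {
    "VMware Cloud Foundation": "5.1.1",
    "VMware Data Services Manager": "2.0.2",
}


def find_version_mismatches(inventory: dict) -> list:
    """Return (product, required, found) for each product that does not match."""
    mismatches = []
    for product, required in REQUIRED_VERSIONS.items():
        found = inventory.get(product)  # None when the product is missing
        if found != required:
            mismatches.append((product, required, found))
    return mismatches


# Example inventory with one outdated component.
inventory = {
    "VMware Cloud Foundation": "5.1.1",
    "VMware Data Services Manager": "2.0.1",
}

for product, required, found in find_version_mismatches(inventory):
    print(f"{product}: requires {required}, found {found}")
```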

Before You Apply This Guidance

To design and implement the Private AI Ready Infrastructure for VMware Cloud Foundation validated solution, your environment must have a certain configuration.

Table 2. Supported VMware Cloud Foundation Deployment

Workload Domain | Deployment Details
Management domain | Automated deployment by using VMware Cloud Builder. See the VMware Cloud Foundation documentation.
One or more virtual infrastructure (VI) workload domains with GPU-enabled ESXi hosts that use vSphere Lifecycle Manager images | Automated deployment by using SDDC Manager. See the VMware Cloud Foundation documentation. To view compatible NVIDIA GPU devices, see the VMware Compatibility Guide.
NSX Edge cluster in the VI workload domain | Automated deployment by using SDDC Manager. See the VMware Cloud Foundation documentation. Note: You must deploy the NSX Edge cluster with large-sized nodes. A cluster with smaller nodes is not compatible with Supervisor deployment.
VMware Cloud Foundation integrated with Active Directory | Manual or PowerShell automated configuration of Active Directory over LDAP. See the Identity and Access Management for VMware Cloud Foundation validated solution.

Table 3. Components Required for VMware Private AI Foundation with NVIDIA
Component | Deployment Details
NVIDIA GPU device | Before you start using VMware Private AI Foundation with NVIDIA, make sure that the GPUs on your ESXi hosts are supported by VMware: NVIDIA A100, NVIDIA L40S, or NVIDIA H100. Supported GPU sharing modes: time slicing and Multi-Instance GPU (MIG).
VMware Aria Automation | Manual or PowerShell automated deployment of VMware Aria Automation 8.16.2. See the Private Cloud Automation for VMware Cloud Foundation validated solution.
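The GPU requirements in Table 3 can be expressed as a simple membership check. The following Python sketch is a hedged illustration only: the host inventory format, host names, and function name are hypothetical, and the check does not verify whether a particular GPU model supports a particular sharing mode.

```python
# Sketch: validate GPU models and the sharing mode against Table 3.
# The inventory format and function name are illustrative assumptions.

SUPPORTED_GPU_MODELS = {"NVIDIA A100", "NVIDIA L40S", "NVIDIA H100"}
SUPPORTED_SHARING_MODES = {"time slicing", "multi-instance gpu (mig)"}


def validate_gpu_configuration(host_gpus: dict, sharing_mode: str) -> list:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for host, models in sorted(host_gpus.items()):
        for model in models:
            if model not in SUPPORTED_GPU_MODELS:
                problems.append(f"{host}: unsupported GPU model {model}")
    # Compare the sharing mode case-insensitively.
    if sharing_mode.strip().lower() not in SUPPORTED_SHARING_MODES:
        problems.append(f"unsupported GPU sharing mode {sharing_mode}")
    return problems


# Example inventory: host name mapped to the GPU models installed in it.
host_gpus = {
    "esxi-01.example.com": ["NVIDIA A100", "NVIDIA A100"],
    "esxi-02.example.com": ["NVIDIA T4"],
}

for problem in validate_gpu_configuration(host_gpus, "Time slicing"):
    print(problem)
```

A check like this is only a first filter. The VMware Compatibility Guide remains the reference for whether a specific device is supported.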

Overview of Private AI Ready Infrastructure for VMware Cloud Foundation

By applying the Private AI Ready Infrastructure for VMware Cloud Foundation validated solution, you implement Kubernetes natively on VMware vSphere within your VMware Cloud Foundation instance.

Table 4. Implementation Overview of Private AI Ready Infrastructure for VMware Cloud Foundation

Stage | Steps
Plan and prepare the Private AI Ready VMware Cloud Foundation environment | Work with the technology team in your organization on configuring the physical servers, network, and storage in the data center. Collect the environment details and write them down in the VMware Cloud Foundation Planning and Preparation Workbook.
Deploy and configure vSphere with Tanzu on VMware Cloud Foundation | Perform the following steps:
  1. Configure NSX for vSphere with Tanzu.
  2. Activate and configure a Supervisor in your VI workload domain.
  3. Configure vSphere for AI GPU-enabled workloads.
  4. Deploy a Tanzu Kubernetes Grid cluster on the Supervisor for AI ready workloads.
  5. Deploy and configure NVIDIA Kubernetes Operators.
VMware Private AI Foundation with NVIDIA | Perform the following steps:
  1. Create self-service catalog items for Service Broker in VMware Aria Automation.
  2. Deploy and configure VMware Data Services Manager.
  3. Configure a content library for deep learning VM images.
  4. If your environment is not connected to the Internet, perform additional configuration required for making deep learning VM images, TKr images, and container images from the NVIDIA NGC catalog available in the environment.
  5. Deploy AI workloads and deploy vector databases for RAG workloads by using the catalog items in Automation Service Broker.
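The last stage in Table 4 includes one step that applies only to environments without Internet access. As a rough sketch of that branching, the following Python function builds the ordered step list for either case. The function name and the shortened step wording are assumptions made for this illustration.

```python
# Sketch: build the ordered steps of the VMware Private AI Foundation
# with NVIDIA stage. The function name and wording are assumptions.


def private_ai_foundation_steps(internet_connected: bool) -> list:
    """Return the stage steps in order, adding the offline step if needed."""
    steps = [
        "Create self-service catalog items in VMware Aria Automation",
        "Deploy and configure VMware Data Services Manager",
        "Configure a content library for deep learning VM images",
    ]
    if not internet_connected:
        # Only disconnected environments need the images staged locally.
        steps.append(
            "Make deep learning VM, TKr, and NGC container images "
            "available in the environment"
        )
    steps.append("Deploy AI workloads and vector databases for RAG workloads")
    return steps


# Print a numbered plan for a disconnected environment.
for number, step in enumerate(private_ai_foundation_steps(False), start=1):
    print(f"{number}. {step}")
```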

Update History

The Private AI Ready Infrastructure for VMware Cloud Foundation validated solution is updated when necessary.

Revision | Description
28 MAY 2024 | Initial release.