The Private AI Ready Infrastructure for VMware Cloud Foundation validated solution provides design, implementation, and operational guidance for AI GPU-enabled accelerated workload domains that run on vSphere with Tanzu in the Software-Defined Data Center (SDDC) as part of VMware Cloud Foundation.
A VMware by Broadcom validated solution is a well-architected and validated implementation, built and tested by VMware to help customers deliver common business use cases. VMware validated solutions are operational, cost-effective, reliable, and secure. Each solution contains detailed design, implementation, and operational guidance.
This validated solution for private AI ready infrastructure provides technical guidance on how to design and implement a robust private AI ready infrastructure based on VMware Cloud Foundation.
The validated solution also covers the VMware Private AI Foundation with NVIDIA add-on solution on top of VMware Cloud Foundation. You can use VMware Private AI Foundation with NVIDIA for the following use cases:
- Running deep learning VMs for AI development based on NVIDIA GPUs and NVIDIA DL workloads.
- Provisioning Tanzu Kubernetes Grid (TKG) clusters for running AI container workloads on top of NVIDIA GPUs.
Automation for This Design in VMware Cloud Foundation
VMware Cloud Foundation™ SDDC Manager® automates the implementation tasks for some design decisions. For the rest of the design decisions, as noted in the design implications, you must perform the implementation steps manually.
To provide a fast and efficient path for automating the AI Ready Infrastructure for VMware Cloud Foundation implementation, this document provides Microsoft PowerShell cmdlets using an open-source module as code-based alternatives to completing certain procedures in each SDDC component's user interface. You can directly reuse the PowerShell commands by replacing the provided sample values with values from your VMware Cloud Foundation Planning and Preparation Workbook.
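The substitution of sample values with workbook values can be illustrated with a minimal Python sketch. It is not part of the PowerShell module. The cmdlet name `Invoke-SampleCmdlet`, its parameters, and the workbook keys are hypothetical placeholders chosen for illustration only.

```python
from string import Template

# Hypothetical command template. The cmdlet name and parameters are
# placeholders, not real cmdlets from the open-source PowerShell module.
COMMAND_TEMPLATE = Template(
    "Invoke-SampleCmdlet -Server $sddc_manager_fqdn "
    "-User $sddc_manager_user -Domain $workload_domain"
)


def render_command(template: Template, workbook_values: dict) -> str:
    """Fill every placeholder; raises KeyError if a workbook value is missing."""
    return template.substitute(workbook_values)


# Assumed sample values standing in for entries collected in the
# VMware Cloud Foundation Planning and Preparation Workbook.
workbook_values = {
    "sddc_manager_fqdn": "sddc-manager.example.com",
    "sddc_manager_user": "admin@example.local",
    "workload_domain": "wld01",
}

command = render_command(COMMAND_TEMPLATE, workbook_values)
print(command)
```

Using `substitute` rather than `safe_substitute` makes a missing workbook value fail loudly instead of leaving a placeholder in the rendered command.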
Intended Audience
The Private AI Ready Infrastructure for VMware Cloud Foundation document is intended for cloud architects, cloud administrators, and DevOps/MLOps practitioners who are familiar with VMware software and want to use it to deploy and manage a workload domain that runs vSphere with Tanzu workloads in the SDDC. The goal is to meet the specific and advanced technical requirements of AI workloads and to provide the best possible performance. The document provides guidance for capacity, scalability, backup and restore, and extensibility for disaster recovery support.
Support Matrix
Private AI Ready Infrastructure for VMware Cloud Foundation is compatible with certain versions of the VMware products that are used for implementing the solution.
VMware Cloud Foundation Version | Product Group | Component Versions |
---|---|---|
5.2.0 | Products part of VMware Cloud Foundation | |
5.2.0 | Solution-added products | VMware Data Services Manager 2.0.2. See VMware Data Services Manager 2.0 Release Notes. VMware Data Services Manager is added as part of VMware Private AI Foundation with NVIDIA. |
5.1.1 | Products part of VMware Cloud Foundation | |
5.1.1 | Solution-added products | VMware Data Services Manager 2.0.2. See VMware Data Services Manager 2.0 Release Notes. VMware Data Services Manager is added as part of VMware Private AI Foundation with NVIDIA. |
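The support matrix can also be kept as a small lookup structure, for example when a planning script needs to confirm that a target release is covered. The following Python sketch is an illustration only. It transcribes only the component versions that the table states, and the function name `solution_added_versions` is hypothetical.

```python
# Rows transcribed from the support matrix; only the solution-added
# component versions that the table states are included.
SUPPORT_MATRIX = {
    "5.2.0": {"VMware Data Services Manager": "2.0.2"},
    "5.1.1": {"VMware Data Services Manager": "2.0.2"},
}


def solution_added_versions(vcf_version: str) -> dict:
    """Return the solution-added component versions for a VCF release.

    Raises ValueError if the release is not listed in the support matrix.
    """
    if vcf_version not in SUPPORT_MATRIX:
        listed = ", ".join(sorted(SUPPORT_MATRIX))
        raise ValueError(
            f"VMware Cloud Foundation {vcf_version} is not listed; "
            f"listed versions: {listed}"
        )
    return SUPPORT_MATRIX[vcf_version]


print(solution_added_versions("5.2.0"))
```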
Before You Apply This Guidance
To design and implement the Private AI Ready Infrastructure for VMware Cloud Foundation validated solution, your environment must meet the following prerequisites.
Workload Domain | Deployment Details |
---|---|
Management Domain | Automated deployment by using VMware Cloud Builder. See the VMware Cloud Foundation documentation. |
One or more virtual infrastructure (VI) workload domains with GPU-enabled ESXi hosts that use vSphere Lifecycle Manager images | Automated deployment by using SDDC Manager. See the VMware Cloud Foundation documentation. To view compatible NVIDIA GPU devices, see the VMware Compatibility Guide. |
NSX Edge cluster in the VI workload domain | Automated deployment by using SDDC Manager. See the VMware Cloud Foundation documentation. Note: You must deploy the NSX Edge cluster with large-sized nodes. A cluster with smaller nodes is not compatible with Supervisor deployment. |
VMware Cloud Foundation integrated with Active Directory | Manual or PowerShell automated configuration of Active Directory over LDAP. See the Identity and Access Management for VMware Cloud Foundation validated solution. |
Deploy and configure vSphere with Tanzu on VMware Cloud Foundation | Manual or PowerShell automated configuration of vSphere with Tanzu. See the Developer Ready Infrastructure for VMware Cloud Foundation validated solution. |
Component | Deployment Details |
---|---|
NVIDIA GPU device | Before you start using VMware Private AI Foundation with NVIDIA, make sure that the GPUs on your ESXi hosts are supported by VMware. |
Supported GPU sharing mode | |
VMware Aria Automation | Manual or PowerShell automated deployment of VMware Aria Automation 8.18.0. See the Private Cloud Automation for VMware Cloud Foundation validated solution. |
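The prerequisites above can be turned into a simple pre-flight checklist. The following Python sketch is an assumption-laden illustration, not a VMware tool. The `PlannedEnvironment` class and its field names are hypothetical, and the checks are a literal reading of the requirements stated in the tables.

```python
from dataclasses import dataclass


@dataclass
class PlannedEnvironment:
    """Hypothetical summary of a planned environment (illustrative fields)."""
    edge_node_size: str                  # NSX Edge node form factor, e.g. "large"
    uses_lifecycle_manager_images: bool  # VI workload domain uses vLCM images
    active_directory_integrated: bool    # Active Directory over LDAP configured
    vsphere_with_tanzu_configured: bool  # vSphere with Tanzu configured
    aria_automation_version: str         # deployed VMware Aria Automation version


def unmet_prerequisites(env: PlannedEnvironment) -> list[str]:
    """Return a description of each prerequisite the environment does not meet."""
    problems = []
    # Literal reading of the note: large-sized nodes are required. The source
    # does not say how other form factors above "large" are treated.
    if env.edge_node_size.lower() != "large":
        problems.append("NSX Edge cluster must use large-sized nodes")
    if not env.uses_lifecycle_manager_images:
        problems.append("VI workload domain must use vSphere Lifecycle Manager images")
    if not env.active_directory_integrated:
        problems.append("VMware Cloud Foundation must be integrated with Active Directory")
    if not env.vsphere_with_tanzu_configured:
        problems.append("vSphere with Tanzu must be deployed and configured")
    # 8.18.0 is the version named in the table; other versions are not assessed.
    if env.aria_automation_version != "8.18.0":
        problems.append("VMware Aria Automation 8.18.0 is the version named in this guidance")
    return problems


planned = PlannedEnvironment("medium", True, True, True, "8.18.0")
for problem in unmet_prerequisites(planned):
    print(problem)
```

Returning a list of findings instead of raising on the first failure lets a planner see every gap in one pass.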
Overview of Private AI Ready Infrastructure for VMware Cloud Foundation
By applying the Private AI Ready Infrastructure for VMware Cloud Foundation validated solution, you implement Kubernetes natively on VMware vSphere within your VMware Cloud Foundation instance.
Stage | Steps |
---|---|
Plan and prepare the Private AI Ready VMware Cloud Foundation environment | Work with the technology team in your organization on configuring the physical servers, network, and storage in the data center. Collect the environment details and write them down in the VMware Cloud Foundation Planning and Preparation Workbook. |
Deploy and configure vSphere with Tanzu on VMware Cloud Foundation | |
VMware Private AI Foundation with NVIDIA | |
Update History
The Private AI Ready Infrastructure for VMware Cloud Foundation validated solution is updated when necessary.
Revision | Description |
---|---|
23 JUL 2024 | |
28 MAY 2024 | Initial release. |