The appendix aggregates all design decisions of the Private AI Ready Infrastructure for VMware Cloud Foundation validated solution. You can use this design decision list for reference related to the end state of the environment and potentially to track your level of adherence to the design and any justification for deviations.

For full design details, see Detailed Design of Private AI Ready Infrastructure for VMware Cloud Foundation and Detailed Design for VMware Private AI Foundation with NVIDIA for Private AI Ready Infrastructure for VMware Cloud Foundation.

Compute Design

Table 1. Design Decisions for Compute Configuration for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-COMPUTE-001

Select servers with CPUs with a high number of cores.

To optimize computational efficiency and minimize the need to scale out by adding more nodes, scale up the CPU core count in each server. CPUs with a high number of cores can handle multiple inference threads simultaneously, which maximizes hardware utilization and improves the capacity to run parallel inference workloads.

High-end CPUs might increase the overall cost of the solution.

AIR-COMPUTE-002

Select a fast-access memory.

Minimal latency for data retrieval is crucial for real-time inference applications. Increased latency reduces inference performance and results in a poor user experience.

Re-purposing available servers might not be a feasible option and overall cost of the solution might increase.

AIR-COMPUTE-003

Select CPUs with Advanced Vector Extensions (AVX, AVX2, or AVX-512).

CPUs with support for AVX or AVX2 can improve performance in deep learning tasks by accelerating vector operations.

Re-purposing available servers might not be a feasible option and overall cost of the solution might increase.

Network Design

Table 2. Design Decisions on Networking for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-TZU-NET-001

Set up networking for 100 Gbps or higher if possible.

100 Gbps networking provides enough bandwidth and very low latency for inference and fine-tuning use cases backed by vSAN ESA.

The cost of the solution is increased.

AIR-TZU-NET-002

Add a /28 overlay-backed NSX segment for use by the Supervisor control plane nodes.

Supports the Supervisor control plane nodes.

You must create the overlay-backed NSX segment.

AIR-TZU-NET-003

Use a dedicated /20 subnet for pod networking.

A single /20 subnet is sufficient to meet the design requirement of 2000 pods.

You must set up a private IP space behind a NAT that you can use in multiple Supervisors.

AIR-TZU-NET-004

Use a dedicated /22 subnet for services.

A single /22 subnet is sufficient to meet the design requirement of 2000 pods.

You must set up a private IP space behind a NAT that you can use in multiple Supervisors.

AIR-TZU-NET-005

Use a dedicated /24 or larger subnet on your corporate network for ingress endpoints.

A /24 subnet is sufficient to meet the design requirement of 2000 pods in most cases.

This subnet must be routable to the rest of the corporate network.

A /24 subnet will be sufficient for most use cases, but you should evaluate your ingress needs before deployment.

AIR-TZU-NET-006

Use a dedicated /24 or larger subnet on your corporate network for egress endpoints.

A /24 subnet is sufficient to meet the design requirement of 2000 pods in most cases.

This subnet must be routable to the rest of the corporate network.

A /24 subnet will be sufficient for most use cases, but you should evaluate your egress needs before deployment.
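
As a quick sanity check of these CIDR sizes (actual consumption depends on your namespace and service churn), the raw address counts work out as follows:

  • /28: 2^(32-28) = 16 addresses for the Supervisor control plane segment
  • /20: 2^(32-20) = 4,096 addresses for pod networking, covering the 2,000-pod requirement
  • /22: 2^(32-22) = 1,024 addresses for services
  • /24: 2^(32-24) = 256 addresses each for ingress and egress endpoints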

Accelerators Design

Table 3. Design Decisions on Accelerators for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-ACCELERATE-001

Select GPUs with high memory bandwidth.

AI workloads require high memory bandwidth to efficiently handle large amounts of data. Look for GPUs with high memory bandwidth specifications.

  • The cost of the solution is increased.
  • GPU choice might be limited.

AIR-ACCELERATE-002

Select GPUs with large memory capacity.

To handle LLMs efficiently, select GPUs equipped with substantial memory capacity. LLMs containing billions of parameters demand significant GPU memory for model fine-tuning and inference.

  • The cost of the solution is increased.
  • GPU choice might be limited.

AIR-ACCELERATE-003

Evaluate and compare compute performance of the available options of GPUs.

Assess the GPU's compute performance based on metrics such as CUDA cores (for NVIDIA GPUs) or stream processors (for AMD GPUs). Higher compute performance supports faster model training and inference, which is particularly beneficial for complex AI tasks.

  • The cost of the solution is increased.
  • GPU choice might be limited.

AIR-ACCELERATE-004

Evaluate cooling and power efficiency of GPUs.

To manage the strain large language models place on GPUs, prioritize systems with efficient cooling and power management to mitigate high power consumption and heat generation.

You must select server platforms designed for GPU workloads.
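
After candidate GPUs are installed, a quick way to confirm the device model and advertised memory capacity from any shell where the NVIDIA driver is present is the nvidia-smi query below; the output values are illustrative.

  # List GPU model and total memory per device (CSV output)
  nvidia-smi --query-gpu=name,memory.total --format=csv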

Storage Design

Table 4. Design Decisions on Storage for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-STORAGE-001

Use vSAN ESA with 100 Gbps networking and, if possible, RDMA.

Provides high performance and efficiency. Although the minimum bandwidth for vSAN ESA is 25 Gbps, 100 Gbps and faster provides the best performance in terms of bandwidth and latency for all AI use cases.

  • The cost of the solution is increased.
  • RDMA increases the design complexity.
  • The choice of vSAN ReadyNodes is limited to nodes that are approved for use with vSAN ESA.

AIR-STORAGE-002

Use vSAN ESA RAID 5 or RAID 6 erasure coding.

Provides performance equal to RAID 1 mirroring.

None.

AIR-STORAGE-003

Leave data compression enabled for vSAN ESA.

Enables transmitting data in a compressed state across hosts in the cluster. Data compression in vSAN ESA is controllable by using storage policies.

None.

Deployment Specification Design

Table 5. Design Decisions on vSphere Storage Policy Based Management for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-SPBM-CFG-001

Create a vSphere tag and tag category, and apply the vSphere tag to the vSAN datastore in the shared edge and workload vSphere cluster in the VI workload domain.

Supervisor activation requires the use of vSphere Storage Policy Based Management (SPBM).

To assign the vSAN datastore to the Supervisor, you need to create a vSphere tag and tag category to create an SPBM rule.

You must perform this operation manually or by using PowerCLI.

AIR-SPBM-CFG-002

Create a vSphere Storage Policy Based Management (SPBM) policy that specifies the vSphere tag you created for the Supervisor.

When you create the SPBM policy and define the vSphere tag for the Supervisor, you can then assign that SPBM policy during Supervisor activation.

You must perform this operation manually or by using PowerCLI.
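
As a sketch of AIR-SPBM-CFG-001 and AIR-SPBM-CFG-002, the following PowerCLI fragment creates the tag category, the tag, the tag assignment, and the tag-based storage policy. The vCenter Server FQDN, datastore name, and object names are placeholders for your environment.

  # Connect to the VI workload domain vCenter Server (placeholder FQDN)
  Connect-VIServer -Server wld01-vc01.example.com

  # Create a tag category and tag, and assign the tag to the vSAN datastore
  $category = New-TagCategory -Name "tkg-storage" -Cardinality Single -EntityType Datastore
  $tag = New-Tag -Name "tkg-storage-tag" -Category $category
  New-TagAssignment -Tag $tag -Entity (Get-Datastore -Name "wld01-cl01-vsan01")

  # Create the tag-based SPBM policy to assign during Supervisor activation
  $rule = New-SpbmRule -AnyOfTags $tag
  New-SpbmStoragePolicy -Name "tkg-storage-policy" -AnyOfRuleSets (New-SpbmRuleSet -AllOfRules $rule)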

Table 6. Design Decisions on the Supervisor for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-TZU-CFG-001

Activate vSphere with Tanzu on the shared edge and workload vSphere cluster in the VI workload domain.

The Supervisor is required to run Kubernetes workloads natively and to deploy Tanzu Kubernetes Grid clusters natively using Tanzu Kubernetes Grid Service.

Ensure the shared edge and workload vSphere cluster is sized to support the Supervisor control plane, any additional integrated management workloads, and any customer workloads.

AIR-TZU-CFG-002

Deploy the Supervisor with small-size control plane nodes.

Deploying the control plane nodes as small-size appliances gives you the ability to run up to 2,000 pods within your Supervisor.

If your pod count is higher than 2,000 for the Supervisor, you must deploy control plane nodes that can handle that level of scale.

You must consider the size of the control plane nodes.

AIR-TZU-CFG-003

Use NSX as provider of the software-defined networking for the Supervisor.

You can deploy a Supervisor by using either NSX or vSphere networking.

VMware Cloud Foundation uses NSX for software-defined networking across the SDDC. Deviating from this for vSphere with Tanzu would increase the operational overhead.

None.

AIR-TZU-CFG-004

Deploy the NSX Edge cluster with large-size nodes.

Large-size NSX Edge nodes are the smallest size supported to activate a Supervisor.

You must account for the size of the NSX Edge nodes.

AIR-TZU-CFG-005

Deploy a single-zone Supervisor.

A three-zone Supervisor requires three separate vSphere clusters.

No change to the existing design or procedures is required for a single-zone Supervisor.

Table 7. Design Decisions on the Harbor Supervisor Service for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-HRB-CFG-001

Deploy Contour as an Ingress Supervisor Service.

Harbor requires Contour on the target Supervisor to provide Ingress Service. The Ingress IP address provided by Contour must be resolved to the Harbor FQDN.

None.

AIR-HRB-CFG-002

Deploy the Harbor Registry as a Supervisor Service.

Harbor as a Supervisor Service has replaced the integrated registry in previous vSphere versions.

You must provide the following configuration:

  • Harbor FQDN

  • DNS record and pointer record (PTR) for the Harbor Registry IP address (this IP address is provided by the Contour Ingress Service)

  • Manage Supervisor Services privilege in vCenter Server.
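
A minimal way to verify the required forward and reverse resolution for the Harbor FQDN, for example from a Windows PowerShell session, is shown below; the FQDN and IP address are placeholders for the values provided by the Contour Ingress Service.

  # Forward (A) and reverse (PTR) lookups for the Harbor registry endpoint
  Resolve-DnsName -Name harbor.example.com -Type A
  Resolve-DnsName -Name 192.0.2.10 -Type PTR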

Table 8. Design Decisions on the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-TZU-CFG-006

Deploy a Tanzu Kubernetes Grid Cluster in the Supervisor.

For applications that require upstream Kubernetes compliance, a Tanzu Kubernetes Grid Cluster is required.

None.

AIR-TZU-CFG-007

For a disconnected environment, configure a local content library for Tanzu Kubernetes releases (TKrs) for use in the shared edge and workload vSphere cluster.

In a disconnected environment, the Supervisor is unable to pull TKr images from the central public content library maintained by VMware. To deploy a Tanzu Kubernetes Grid cluster on a Supervisor, you must configure a content library in the shared edge and workload vSphere cluster with the required images downloaded from the public library.

You must manually configure the content library.

AIR-TZU-CFG-008

Use Antrea as the container network interface (CNI) for your Tanzu Kubernetes Grid clusters.

Antrea is the default CNI for Tanzu Kubernetes Grid clusters.

New Tanzu Kubernetes Grid clusters are deployed with Antrea as the CNI, unless you specify Calico.
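
For AIR-TZU-CFG-007, the local content library can also be created with PowerCLI instead of the vSphere Client. The sketch below assumes a PowerCLI version that includes the content library cmdlets; the library and datastore names are placeholders.

  # Create a local content library on the shared edge and workload cluster datastore
  $ds = Get-Datastore -Name "wld01-cl01-vsan01"
  New-ContentLibrary -Name "tkr-local" -Datastore $ds
  # The TKr images downloaded from the public library are then imported into this
  # library (for example, through the vSphere Client or the content library API)
  # before Tanzu Kubernetes Grid clusters can be deployed.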

Table 9. Design Decisions on Sizing the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-TZU-CFG-009

Deploy Tanzu Kubernetes Grid clusters with a minimum of three control plane nodes.

Deploying three control plane nodes ensures that the control plane state of your Tanzu Kubernetes Grid cluster is maintained if a node failure occurs.

Horizontal and vertical scaling of the control plane is supported. See Scale a TKG Cluster on Supervisor Using Kubectl.

None.

AIR-TZU-CFG-010

For production environments, deploy Tanzu Kubernetes Grid clusters with a minimum of three worker nodes.

Deploying three worker nodes provides a higher level of availability of your workloads deployed to the cluster.

You must configure your customer workloads to effectively use the additional worker nodes in the cluster for high availability at the application level.

AIR-TZU-CFG-011

Deploy Tanzu Kubernetes Grid clusters with small-size control plane nodes if your cluster will have fewer than 10 worker nodes.

You must size the control plane of a Tanzu Kubernetes Grid cluster according to the number of worker nodes and the pod density.

The size of the cluster nodes impacts the scale of a given cluster. If you must add nodes to a cluster, consider using larger nodes. For GPU-enabled AI workloads, the GPU is the constraining factor for the number of worker nodes that can be deployed.
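
A minimal manifest reflecting AIR-TZU-CFG-009 through AIR-TZU-CFG-011 might look like the sketch below, applied with kubectl apply after logging in to the Supervisor with kubectl vsphere login. The cluster name, namespace, VM classes, storage class, and TKr version are placeholders; confirm the values available in your environment with kubectl get virtualmachineclass and kubectl get tkr.

  apiVersion: run.tanzu.vmware.com/v1alpha3
  kind: TanzuKubernetesCluster
  metadata:
    name: tkc-ai-01             # placeholder cluster name
    namespace: ai-namespace     # placeholder vSphere namespace
  spec:
    topology:
      controlPlane:
        replicas: 3                         # three control plane nodes (AIR-TZU-CFG-009)
        vmClass: best-effort-small          # small control plane nodes (AIR-TZU-CFG-011)
        storageClass: tkg-storage-policy    # SPBM policy exposed as a storage class
        tkr:
          reference:
            name: v1.26.5---vmware.2-tkg.1  # placeholder TKr; list with kubectl get tkr
      nodePools:
      - name: workers
        replicas: 3                         # three worker nodes (AIR-TZU-CFG-010)
        vmClass: best-effort-large          # placeholder worker VM class
        storageClass: tkg-storage-policy
        tkr:
          reference:
            name: v1.26.5---vmware.2-tkg.1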

Life Cycle Management Design

Table 10. Design Decisions on Life Cycle Management for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-TZU-LCM-001

For life cycle management of a GPU-enabled VI workload domain, use a vSphere Lifecycle Manager image with a custom ESXi image that includes the GPU driver and any other core components from the GPU vendor.

  • Eases maintaining the right host driver versions and daemons.
  • Introduces consistency across the GPU-enabled hosts.

You must create the custom vSphere Lifecycle Manager image before you deploy the VI workload domain.

AIR-TZU-LCM-002

Use the vSphere Client for life cycle management of a Supervisor.

Life cycle management of a Supervisor is not integrated in SDDC Manager.

You perform deployment, patching, updates, and upgrades of a Supervisor and its components manually.

AIR-TZU-LCM-003

Use kubectl for life cycle management of a Tanzu Kubernetes Grid cluster.

Life cycle management of a Tanzu Kubernetes Grid cluster is not integrated in SDDC Manager.

You perform deployment, patching, updates, and upgrades of a Tanzu Kubernetes Grid cluster and its components manually.
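
One way to assemble the custom image referenced in AIR-TZU-LCM-001 is with the PowerCLI Image Builder cmdlets, as sketched below. The depot file names, image profile names, and the GPU driver component name are placeholders for the bundles you obtain from VMware and the GPU vendor; the resulting offline bundle is then imported into the vSphere Lifecycle Manager depot.

  # Load the base ESXi depot and the GPU vendor depot (placeholder file names)
  Add-EsxSoftwareDepot .\VMware-ESXi-8.0U2-depot.zip
  Add-EsxSoftwareDepot .\nvd-gpu-driver-depot.zip

  # Clone the standard image profile and add the GPU driver component (placeholder names)
  New-EsxImageProfile -CloneProfile "ESXi-8.0U2-standard" -Name "ESXi-8.0U2-gpu" -Vendor "example.com"
  Add-EsxSoftwarePackage -ImageProfile "ESXi-8.0U2-gpu" -SoftwarePackage "NVD-GPU-Driver"

  # Export the profile as an offline bundle for import into the vLCM depot
  Export-EsxImageProfile -ImageProfile "ESXi-8.0U2-gpu" -ExportToBundle -FilePath .\ESXi-8.0U2-gpu.zip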

Information Security and Access Design

Table 11. Design Decisions on Authentication and Access Control for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-TZU-SEC-001

Create a security group in Active Directory for DevOps administrators. Add users who need edit permissions within a namespace to the group and grant Can Edit permissions to the namespace for that group.

If you require different permissions per namespace, create additional groups.

Necessary for auditable role-based access control within the Supervisor and Tanzu Kubernetes Grid clusters.

You must define and manage security groups, group membership, and security controls in Active Directory.

AIR-TZU-SEC-002

Create a security group in Active Directory for DevOps administrators. Add users who need read-only permissions in a namespace to the group, and grant Can View permissions to the namespace for that group.

If you require different permissions per namespace, create additional groups.

Necessary for auditable role-based access control within the Supervisor and Tanzu Kubernetes Grid clusters.

You must define and manage security groups, group membership, and security controls in Active Directory.

Table 12. Design Decisions on Certificate Management for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-TZU-SEC-003

Replace the default self-signed certificate for the Supervisor management interface with a PEM-encoded, CA-signed certificate.

Ensures that the communication between administrators and the Supervisor management interface is encrypted by using a trusted certificate.

You must replace and manage certificates manually, outside the certificate management automation of SDDC Manager.

AIR-TZU-SEC-004

Use a SHA-2 or higher algorithm when signing certificates.

The SHA-1 algorithm is considered less secure and has been deprecated.

Not all certificate authorities support SHA-2.
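
To confirm that a certificate meets AIR-TZU-SEC-004 before you replace the Supervisor management interface certificate, you can inspect its signature algorithm, for example with the PowerShell sketch below; the file name is a placeholder.

  # Load the PEM-encoded certificate and report its signature algorithm
  # (expect sha256RSA or stronger, not sha1RSA)
  $cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2("supervisor-cert.pem")
  $cert.SignatureAlgorithm.FriendlyName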

NVIDIA Licensing System Design

Table 13. Design Decisions on NVIDIA Licensing System Design for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-NVD-LIC-001

For Delegated License Service (DLS) instances, account for extra compute, storage, and network resources as part of your management domain.

DLS is deployed as a virtual appliance with specific hardware requirements. The appliance can also be configured in a high availability setup, independent from vSphere HA.

  • Increased resources for the management stack.
  • You must perform life cycle management of the DLS instance.

AIR-NVD-LIC-002

For Cloud License Service (CLS) instances, Internet access is required.

Internet access is required between a licensed client and a CLS instance. Ports 80 and 443 (Egress) must be allowed.

  • Introduces potential security risks.
  • You must enforce firewall rules, intrusion detection systems, and monitoring.
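
A simple way to validate the required egress connectivity from a licensed client to a CLS instance, for example from a Windows client, is shown below; the endpoint name is a placeholder for the NVIDIA licensing endpoint used in your environment.

  # Check outbound TCP reachability on ports 443 and 80 (placeholder endpoint)
  Test-NetConnection -ComputerName cls.example.nvidia.com -Port 443
  Test-NetConnection -ComputerName cls.example.nvidia.com -Port 80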

VMware Data Services Manager Design

Table 14. Design Decisions on VMware Data Services Manager Design for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID

Design Decision

Design Justification

Design Implication

AIR-DSM-001

Deploy VMware Data Services Manager in the management domain.

A 1:1 relationship between a VMware Data Services Manager appliance and a vCenter Server instance is required. The vCenter Server instances for the VI workload domains in a VMware Cloud Foundation instance run in the management domain.

You must deploy one VMware Data Services Manager appliance per vCenter Server instance, which impacts the required resources for the management domain and its clusters.

AIR-DSM-002

For production-grade deployments, deploy PostgreSQL databases in HA mode (3 or 5 nodes).

Provides high availability for the vector databases, increasing the overall availability of the systems that depend on them.

Resource consumption in the target VI workload domain is increased, and the number of IP addresses used is increased.

AIR-DSM-003

Allocate enough IP addresses for the IP pools of infrastructure policies.

You determine the number of IP addresses reserved for the IP pools according to the requirements and the high availability topology of the databases deployed by using VMware Data Services Manager. For example, a 5-node PostgreSQL cluster requires 7 IP addresses: one for each node, one for kube_VIP, and one for database load balancing.

You must account for this in IP planning and subnet sizing.

AIR-DSM-004

Define VM classes in VMware Data Services Manager that align to your resource requirements.

Consider the use case, types of workloads using the databases, amount of data, Transactions per Second (TPS), and other factors, such as target infrastructure overcommitment if applicable.

See Data Services Manager Documentation and Data Modernization with VMware Data Services Manager.

You must consider VMware Data Services Manager planning and design.

AIR-DSM-005

Configure LDAP as Directory Service for VMware Data Services Manager.

You can configure LDAP (with TLS if needed) as the identity provider to import users and assign roles in VMware Data Services Manager.

Increased security operation costs. You must allow port access from VMware Data Services Manager to the LDAP identity source:

  • LDAP - 389 TCP

  • LDAPS - 636 TCP/UDP

AIR-DSM-006

Configure the S3-compatible object store, for example, MinIO, with TLS.

The provider repositories for core VMware Data Services Manager storage, backups, logs, and database backups must be enabled with TLS.

  • Security and complexity are increased.
  • You must manage TLS certificates.

AIR-DSM-007

Create a VMware Tanzu Network account and use it to configure a refresh token in VMware Data Services Manager.

Database templates and software updates are uploaded to VMware Tanzu Network.

In a connected environment, you must configure a Tanzu Network refresh token as part of the VMware Data Services Manager setup. In a disconnected environment, you must download the air-gapped environment repository and upload it manually to the Provider Repository.

You must perform this operation manually.

AIR-DSM-008

If you plan to run databases managed by VMware Data Services Manager on vSAN ESA clusters, create a vSphere SPBM policy that is based on erasure coding.

Provides performance that is equivalent to RAID 1 with no compromises, and with better space efficiency.

The available erasure coding level, RAID 5 or RAID 6, depends on the size of the all-flash vSAN ESA cluster. RAID 5 erasure coding requires a minimum of 4 ESXi hosts, while RAID 6 erasure coding requires a minimum of 6 ESXi hosts.

  • Design complexity, cost, and management overhead of the solution are increased.
  • You must perform this operation manually or by using PowerCLI.

AIR-DSM-009

Use RAID 5 or RAID 6 erasure coding as the default vSAN storage policy for databases.

Eliminates the trade-off between performance and space efficiency. Set FTT=1 for RAID 5 and FTT=2 for RAID 6 according to the number of hosts in the vSAN ESA cluster and your data availability requirements.

Design complexity is increased.
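
As a hedged sketch of AIR-DSM-008 and AIR-DSM-009, the PowerCLI fragment below creates a RAID 5 (FTT=1) vSAN policy for the database VMs. The policy name is a placeholder, and the capability names and value strings exposed by your environment should be confirmed with Get-SpbmCapability before use.

  # List the vSAN capabilities available through SPBM to confirm names and values
  Get-SpbmCapability | Where-Object { $_.Name -like "VSAN.*" }

  # Example: RAID 5 erasure coding with failures to tolerate = 1 (assumed value strings)
  $ftt = New-SpbmRule -Capability (Get-SpbmCapability -Name "VSAN.hostFailuresToTolerate") -Value 1
  $raid = New-SpbmRule -Capability (Get-SpbmCapability -Name "VSAN.replicaPreference") -Value "RAID-5/6 (Erasure Coding) - Capacity"
  New-SpbmStoragePolicy -Name "dsm-db-raid5" -AnyOfRuleSets (New-SpbmRuleSet -AllOfRules $ftt, $raid)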