This appendix aggregates all design decisions of the Private AI Ready Infrastructure for VMware Cloud Foundation validated solution. You can use this design decision list as a reference for the end state of the environment and to track your adherence to the design, together with any justification for deviations.
For full design details, see Detailed Design of Private AI Ready Infrastructure for VMware Cloud Foundation and Detailed Design for VMware Private AI Foundation with NVIDIA for Private AI Ready Infrastructure for VMware Cloud Foundation.
Compute Design
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-COMPUTE-001 | Select servers with CPUs with a high number of cores. | To optimize computational efficiency and minimize the need for scaling out by adding more nodes, scale up the CPU core count in each server. CPUs with a high number of cores can handle multiple inference threads simultaneously, which maximizes hardware utilization and improves performance and resource utilization for parallel inference workloads. | High-end CPUs might increase the overall cost of the solution. |
AIR-COMPUTE-002 | Select fast-access memory. | Minimal latency for data retrieval is crucial for real-time inference applications. Increased latency reduces inference performance and gives a poor user experience. | Re-purposing available servers might not be a feasible option, and the overall cost of the solution might increase. |
AIR-COMPUTE-003 | Select CPUs with Advanced Vector Extensions (AVX, AVX2, or AVX-512). | CPUs with support for AVX or AVX2 can improve performance in deep learning tasks by accelerating vector operations. | Re-purposing available servers might not be a feasible option, and the overall cost of the solution might increase. |
Network Design
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-TZU-NET-001 | Set up networking for 100 Gbps or higher if possible. | 100 Gbps networking provides enough bandwidth and very low latency for inference and fine-tuning use cases backed by vSAN ESA. | The cost of the solution is increased. |
AIR-TZU-NET-002 | Add a /28 overlay-backed NSX segment for use by the Supervisor control plane nodes. | Supports the Supervisor control plane nodes. | You must create the overlay-backed NSX segment. |
AIR-TZU-NET-003 | Use a dedicated /20 subnet for pod networking. | A single /20 subnet is sufficient to meet the design requirement of 2,000 pods. See the subnet sizing sketch after this table. | You must set up a private IP space behind a NAT that you can reuse across multiple Supervisors. |
AIR-TZU-NET-004 | Use a dedicated /22 subnet for services. | A single /22 subnet is sufficient to meet the design requirement of 2,000 pods. | You must set up a private IP space behind a NAT that you can reuse across multiple Supervisors. |
AIR-TZU-NET-005 | Use a dedicated /24 or larger subnet on your corporate network for ingress endpoints. | A /24 subnet is sufficient to meet the design requirement of 2,000 pods in most cases. | This subnet must be routable to the rest of the corporate network. A /24 subnet is sufficient for most use cases, but you should evaluate your ingress needs before deployment. |
AIR-TZU-NET-006 | Use a dedicated /24 or larger subnet on your corporate network for egress endpoints. | A /24 subnet is sufficient to meet the design requirement of 2,000 pods in most cases. | This subnet must be routable to the rest of the corporate network. A /24 subnet is sufficient for most use cases, but you should evaluate your egress needs before deployment. |
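To sanity-check these prefix sizes against the 2,000-pod design requirement, a short calculation is enough. The following is a minimal sketch using only the Python standard library; the example network addresses are placeholders, and only the prefix lengths come from the table above.

```python
import ipaddress

# Prefix lengths from the network design decisions above;
# the address values themselves are illustrative placeholders.
subnets = {
    "Supervisor control plane segment (/28)": "192.168.10.0/28",
    "Pod networking (/20)": "10.244.0.0/20",
    "Services (/22)": "10.96.0.0/22",
    "Ingress or egress endpoints (/24)": "192.168.100.0/24",
}

for purpose, cidr in subnets.items():
    net = ipaddress.ip_network(cidr)
    # num_addresses counts every address in the block; usable counts are
    # slightly lower once network, broadcast, and gateway addresses are reserved.
    print(f"{purpose}: {net.num_addresses} addresses")

# A /20 yields 4096 addresses, comfortably above the 2,000-pod target,
# while a /22 (1024) and a /24 (256) cover services and ingress or egress
# endpoints for typical deployments.
```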
Accelerators Design
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-ACCELERATE-001 | Select GPUs with high memory bandwidth. | AI workloads require high memory bandwidth to efficiently handle large amounts of data. Look for GPUs with high memory bandwidth specifications. | |
AIR-ACCELERATE-002 | Select GPUs with large memory capacity. | To handle LLMs efficiently, select GPUs equipped with substantial memory capacities. LLMs containing billions of parameters demand significant GPU memory resources for model fine-tuning and inference. | |
AIR-ACCELERATE-003 | Evaluate and compare the compute performance of the available GPU options. | Assess the GPU's compute performance based on metrics such as CUDA cores (for NVIDIA GPUs) or stream processors (for AMD GPUs). Higher compute performance provides faster model training and inference, which is particularly beneficial for complex AI tasks. | |
AIR-ACCELERATE-004 | Evaluate the cooling and power efficiency of GPUs. | To manage the strain large language models place on GPUs, prioritize systems with efficient cooling and power management to mitigate high power consumption and heat generation. | You must select server platforms focused on GPU workloads. |
Storage Design
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-STORAGE-001 | Use vSAN ESA with 100 Gbps networking and, if possible, RDMA. | Provides high performance and efficiency. Although the minimum bandwidth for vSAN ESA is 25 Gbps, 100 Gbps or faster networking provides the best performance in terms of bandwidth and latency for all AI use cases. | |
AIR-STORAGE-002 | Use vSAN ESA RAID 5 or RAID 6 erasure coding. | Provides performance equal to RAID 1 mirroring. | None. |
AIR-STORAGE-003 | Leave data compression enabled for vSAN ESA. | Enables transmitting data in a compressed state across hosts in the cluster. Data compression in vSAN ESA is controllable using storage policies. | None. |
Deployment Specification Design
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-SPBM-CFG-001 | Create a vSphere tag and tag category, and apply the vSphere tag to the vSAN datastore in the shared edge and workload vSphere cluster in the VI workload domain. | Supervisor activation requires the use of vSphere Storage Policy Based Management (SPBM). To assign the vSAN datastore to the Supervisor, you need to create a vSphere tag and tag category to create an SPBM rule. | You must perform this operation manually or by using PowerCLI. See the sketch after this table. |
AIR-SPBM-CFG-002 | Create a vSphere Storage Policy Based Management (SPBM) policy that specifies the vSphere tag you created for the Supervisor. | When you create the SPBM policy and define the vSphere tag for the Supervisor, you can then assign that SPBM policy during Supervisor activation. | You must perform this operation manually or by using PowerCLI. |
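The tagging step in AIR-SPBM-CFG-001 can also be scripted. The following is a minimal sketch, assuming the vsphere-automation-sdk-python package and network access to vCenter Server; the server address, credentials, tag and category names, and the datastore managed object ID are placeholders. The tag-based storage policy itself (AIR-SPBM-CFG-002) is then created in the vSphere Client or with PowerCLI cmdlets such as New-SpbmStoragePolicy.

```python
import requests
from vmware.vapi.vsphere.client import create_vsphere_client
from com.vmware.cis.tagging_client import Category, CategoryModel, Tag
from com.vmware.vapi.std_client import DynamicID

session = requests.session()
session.verify = False  # lab only; use a trusted CA bundle in production

client = create_vsphere_client(
    server="vcenter.example.com",            # placeholder
    username="administrator@vsphere.local",  # placeholder
    password="********",
    session=session,
)

# Tag category and tag that the SPBM tag-based placement rule will reference.
category_id = client.tagging.Category.create(Category.CreateSpec(
    name="ai-supervisor-storage",             # placeholder name
    description="Category for Supervisor storage tags",
    cardinality=CategoryModel.Cardinality.SINGLE,
    associable_types=set(),
))

tag_id = client.tagging.Tag.create(Tag.CreateSpec(
    name="ai-vsan-datastore",                 # placeholder name
    description="Tag for the shared edge and workload vSAN datastore",
    category_id=category_id,
))

# Attach the tag to the vSAN datastore; look up the managed object ID
# beforehand, for example with client.vcenter.Datastore.list().
client.tagging.TagAssociation.attach(
    tag_id=tag_id,
    object_id=DynamicID(type="Datastore", id="datastore-1001"),  # placeholder
)
```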
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-TZU-CFG-001 | Activate vSphere with Tanzu on the shared edge and workload vSphere cluster in the VI workload domain. | The Supervisor is required to run Kubernetes workloads natively and to deploy Tanzu Kubernetes Grid clusters by using the Tanzu Kubernetes Grid Service. | Ensure the shared edge and workload vSphere cluster is sized to support the Supervisor control plane, any additional integrated management workloads, and any customer workloads. |
AIR-TZU-CFG-002 | Deploy the Supervisor with small-size control plane nodes. | Deploying the control plane nodes as small-size appliances gives you the ability to run up to 2,000 pods within your Supervisor. If your pod count is higher than 2,000 for the Supervisor, you must deploy control plane nodes that can handle that level of scale. | You must consider the size of the control plane nodes. |
AIR-TZU-CFG-003 | Use NSX as the provider of the software-defined networking for the Supervisor. | You can deploy a Supervisor by using either NSX or vSphere networking. VMware Cloud Foundation uses NSX for software-defined networking across the SDDC, so deviating from NSX for vSphere with Tanzu would increase the operational overhead. | None. |
AIR-TZU-CFG-004 | Deploy the NSX Edge cluster with large-size nodes. | Large-size NSX Edge nodes are the smallest size supported to activate a Supervisor. | You must account for the size of the NSX Edge nodes. |
AIR-TZU-CFG-005 | Deploy a single-zone Supervisor. | A three-zone Supervisor requires three separate vSphere clusters. | A single-zone Supervisor requires no change to the existing design or procedures. |
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-HRB-CFG-001 | Deploy Contour as an Ingress Supervisor Service. | Harbor requires Contour on the target Supervisor to provide the Ingress service. The Harbor FQDN must resolve to the ingress IP address provided by Contour. | None. |
AIR-HRB-CFG-002 | Deploy the Harbor Registry as a Supervisor Service. | Harbor as a Supervisor Service replaces the integrated registry of previous vSphere versions. | You must provide the following configuration: |
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-TZU-CFG-006 | Deploy a Tanzu Kubernetes Grid cluster in the Supervisor. | For applications that require upstream Kubernetes compliance, a Tanzu Kubernetes Grid cluster is required. | None. |
AIR-TZU-CFG-007 | For a disconnected environment, configure a local content library for Tanzu Kubernetes releases (TKrs) for use in the shared edge and workload vSphere cluster. | In a disconnected environment, the Supervisor is unable to pull TKr images from the central public content library maintained by VMware. To deploy a Tanzu Kubernetes Grid cluster on a Supervisor, you must configure a content library in the shared edge and workload vSphere cluster with the required images downloaded from the public library. | You must manually configure the content library. |
AIR-TZU-CFG-008 | Use Antrea as the container network interface (CNI) for your Tanzu Kubernetes Grid clusters. | Antrea is the default CNI for Tanzu Kubernetes Grid clusters. | New Tanzu Kubernetes Grid clusters are deployed with Antrea as the CNI, unless you specify Calico. |
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-TZU-CFG-009 | Deploy Tanzu Kubernetes Grid clusters with a minimum of three control plane nodes. | Deploying three control plane nodes ensures that the control plane state of your Tanzu Kubernetes Grid cluster is maintained if a node failure occurs. Horizontal and vertical scaling of the control plane is supported. See Scale a TKG Cluster on Supervisor Using Kubectl. An example cluster manifest follows this table. | None. |
AIR-TZU-CFG-010 | For production environments, deploy Tanzu Kubernetes Grid clusters with a minimum of three worker nodes. | Deploying three worker nodes provides a higher level of availability for the workloads deployed to the cluster. | You must configure your customer workloads to use the additional worker nodes in the cluster effectively for high availability at the application level. |
AIR-TZU-CFG-011 | Deploy Tanzu Kubernetes Grid clusters with small-size control plane nodes if your cluster will have fewer than 10 worker nodes. | You must size the control plane of a Tanzu Kubernetes Grid cluster according to the number of worker nodes and the pod density. | The size of the cluster nodes impacts the scale of a given cluster. If you must add nodes to a cluster, consider the use of larger nodes. For AI GPU-enabled workloads, the GPU is the constraining factor for the number of worker nodes that can be deployed. |
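The node counts and sizes in the decisions above map directly to the cluster manifest. The following is a minimal sketch, assuming the kubernetes Python client, a kubeconfig that is already authenticated to the Supervisor (for example, after kubectl vsphere login), and the run.tanzu.vmware.com/v1alpha3 TanzuKubernetesCluster API; the namespace, cluster name, VM class, storage class, and TKr names are placeholders that you adapt to your environment and to the TKrs available in your content library.

```python
from kubernetes import client, config

# Assumes the current kubeconfig context points at the Supervisor namespace.
config.load_kube_config()

cluster = {
    "apiVersion": "run.tanzu.vmware.com/v1alpha3",
    "kind": "TanzuKubernetesCluster",
    "metadata": {"name": "ai-tkg-01", "namespace": "ai-namespace"},  # placeholders
    "spec": {
        "topology": {
            "controlPlane": {
                "replicas": 3,                       # AIR-TZU-CFG-009
                "vmClass": "best-effort-small",      # AIR-TZU-CFG-011 (placeholder class)
                "storageClass": "ai-vsan-policy",    # placeholder storage class
                "tkr": {"reference": {"name": "example-tkr-name"}},  # placeholder TKr
            },
            "nodePools": [{
                "name": "workers",
                "replicas": 3,                       # AIR-TZU-CFG-010
                "vmClass": "best-effort-medium",     # placeholder class
                "storageClass": "ai-vsan-policy",
            }],
        }
    },
}

# No CNI is specified, so the cluster uses Antrea by default (AIR-TZU-CFG-008).
client.CustomObjectsApi().create_namespaced_custom_object(
    group="run.tanzu.vmware.com",
    version="v1alpha3",
    namespace="ai-namespace",
    plural="tanzukubernetesclusters",
    body=cluster,
)
```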
Life Cycle Management Design
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-TZU-LCM-001 | For life cycle management of a GPU-enabled VI workload domain, use a vSphere Lifecycle Manager image with a custom ESXi image that includes the GPU driver and any other core components from the GPU vendor. | | You must create the custom vSphere Lifecycle Manager image before you deploy the VI workload domain. |
AIR-TZU-LCM-002 | Use the vSphere Client for life cycle management of a Supervisor. | Life cycle management of a Supervisor is not integrated in SDDC Manager. | You perform deployment, patching, updates, and upgrades of a Supervisor and its components manually. |
AIR-TZU-LCM-003 | Use kubectl for life cycle management of a Tanzu Kubernetes Grid cluster. | Life cycle management of a Tanzu Kubernetes Grid cluster is not integrated in SDDC Manager. | You perform deployment, patching, updates, and upgrades of a Tanzu Kubernetes Grid cluster and its components manually. |
Information Security and Access Design
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-TZU-SEC-001 | Create a security group in Active Directory for DevOps administrators. Add users who need edit permissions within a namespace to the group, and grant Can Edit permissions to the namespace for that group. If you require different permissions per namespace, create additional groups. | Necessary for auditable role-based access control within the Supervisor and Tanzu Kubernetes Grid clusters. | You must define and manage security groups, group membership, and security controls in Active Directory. |
AIR-TZU-SEC-002 | Create a security group in Active Directory for DevOps administrators. Add users who need read-only permissions in a namespace to the group, and grant Can View permissions to the namespace for that group. If you require different permissions per namespace, create additional groups. | Necessary for auditable role-based access control within the Supervisor and Tanzu Kubernetes Grid clusters. | You must define and manage security groups, group membership, and security controls in Active Directory. |
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-TZU-SEC-003 | Replace the default self-signed certificate for the Supervisor management interface with a PEM-encoded, CA-signed certificate. | Ensures that the communication between administrators and the Supervisor management interface is encrypted by using a trusted certificate. | You must replace and manage certificates manually, outside the certificate management automation of SDDC Manager. |
AIR-TZU-SEC-004 | Use a SHA-2 or higher algorithm when signing certificates. | The SHA-1 algorithm is considered less secure and has been deprecated. | Not all certificate authorities support SHA-2. |
NVIDIA Licensing System Design
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-NVD-LIC-001 | For Delegated License Service (DLS) instances, account for extra compute, storage, and network resources as part of your management domain. | DLS is deployed as a virtual appliance with specific hardware requirements. The appliance can also be configured in a high-availability setup, independent from vSphere HA. | |
AIR-NVD-LIC-002 | For Cloud License Service (CLS) instances, provide Internet access. | Internet access is required between a licensed client and a CLS instance. Ports 80 and 443 (egress) must be allowed. | |
VMware Data Services Manager Design
Decision ID | Design Decision | Design Justification | Design Implication |
---|---|---|---|
AIR-DSM-001 | Deploy VMware Data Services Manager in the management domain. | A 1:1 relationship between a VMware Data Services Manager appliance and a vCenter Server instance is required. The vCenter Server instances for the VI workload domains in a VMware Cloud Foundation instance run in the management domain. | You must deploy one VMware Data Services Manager appliance per vCenter Server, which impacts the required resources for the management domain and its clusters. |
AIR-DSM-002 | For production-grade deployments, deploy PostgreSQL databases in HA mode (3 or 5 nodes). | Provides high availability for the vector databases, increasing the overall availability of the systems that depend on them. | Increased resource consumption in the target VI workload domain and an increased number of IP addresses used. |
AIR-DSM-003 | Allocate enough IP addresses for the IP pools of infrastructure policies. | You determine the number of IP addresses reserved for the IP pools according to the requirements and the high-availability topology of the database deployed by using VMware Data Services Manager. For example, a 5-node PostgreSQL cluster requires 7 IP addresses: one for each node, one for kube_VIP, and one for database load balancing. | You must plan IP address allocation and subnet sizing accordingly. |
AIR-DSM-004 | Define VM classes in VMware Data Services Manager that align to your resource requirements. | Consider the use case, the types of workloads using the databases, the amount of data, transactions per second (TPS), and other factors, such as target infrastructure overcommitment if applicable. See the Data Services Manager documentation and Data Modernization with VMware Data Services Manager. | You must consider VMware Data Services Manager planning and design. |
AIR-DSM-005 | Configure LDAP as the directory service for VMware Data Services Manager. | LDAP (with TLS if needed) can be configured as the identity provider to import users and assign roles in VMware Data Services Manager. | Increased security operation costs. You must allow port access from VMware Data Services Manager to the LDAP identity source: |
AIR-DSM-006 | Configure the S3-compatible object store, for example, MinIO, with TLS. | The provider repositories for core VMware Data Services Manager storage, backups, logs, and database backups must be enabled with TLS. | |
AIR-DSM-007 | Create a VMware Tanzu Network account and use it to configure a refresh token in VMware Data Services Manager. | Database templates and software updates are uploaded to VMware Tanzu Network. In a connected environment, you must configure a Tanzu Network refresh token as part of the VMware Data Services Manager setup. In a disconnected environment, you must download the air-gapped environment repository and upload it manually to the Provider Repository. | You must perform this operation manually. |
AIR-DSM-008 | If you plan to run databases managed by VMware Data Services Manager on vSAN ESA clusters, create a vSphere SPBM policy that is based on erasure coding. | Provides performance that is equivalent to RAID 1 but with better space efficiency. The available erasure coding, RAID 5 or RAID 6, depends on the size of the all-flash vSAN ESA cluster: RAID 5 erasure coding requires a minimum of 4 ESXi hosts, while RAID 6 erasure coding requires a minimum of 6 ESXi hosts. | |
AIR-DSM-009 | Use RAID 5 or RAID 6 erasure coding as the default vSAN storage policy for databases. | Eliminates the trade-off between performance and deterministic space efficiency. Set FTT=1 for RAID 5 and FTT=2 for RAID 6 according to the number of hosts in the vSAN ESA cluster and your data availability requirements. | Design complexity is increased. |