When vSphere with Tanzu is activated on a vSphere cluster running in a VI workload domain, a Kubernetes control plane is instantiated by using Photon OS virtual machines. This layer contains multiple objects that enable the capability to run Kubernetes workloads natively on the ESXi hosts, instantiating the Supervisor.

Deployment Model for Private AI Ready Infrastructure for VMware Cloud Foundation

You determine the use of the different services, the sizing of those resources, and how they are deployed and managed based on the design objectives for the Private AI Ready Infrastructure for VMware Cloud Foundation validated solution.

vSphere Storage Policy Based Management Configuration

Before activating a Supervisor, you must configure a datastore that meets the activation requirements. The Supervisor configuration requires vSphere Storage Policy Based Management (SPBM) policies for the control plane nodes, ephemeral disks, and image cache. These policies correlate to Kubernetes storage classes that can be assigned to vSphere Namespaces, and they are consumed in a Supervisor or a Tanzu Kubernetes Grid cluster.

Table 1. Design Decisions on vSphere Storage Policy Based Management for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID: AIR-SPBM-CFG-001
Design Decision: Create a vSphere tag and tag category, and apply the vSphere tag to the vSAN datastore in the shared edge and workload vSphere cluster in the VI workload domain.
Design Justification: Supervisor activation requires the use of vSphere Storage Policy Based Management (SPBM). To assign the vSAN datastore to the Supervisor, you need to create a vSphere tag and tag category so that you can create an SPBM rule.
Design Implication: You must perform this operation manually or by using PowerCLI (see the sketch after this table).

Decision ID: AIR-SPBM-CFG-002
Design Decision: Create a vSphere Storage Policy Based Management (SPBM) policy that specifies the vSphere tag you created for the Supervisor.
Design Justification: When you create the SPBM policy and define the vSphere tag for the Supervisor, you can then assign that SPBM policy during Supervisor activation.
Design Implication: You must perform this operation manually or by using PowerCLI (see the sketch after this table).
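The following PowerCLI sketch illustrates both decisions. It is a minimal example under stated assumptions, not a definitive implementation: the vCenter Server FQDN and the datastore, tag, category, and policy names are placeholders that you replace with values from your environment.

    # Connect to the VI workload domain vCenter Server (placeholder FQDN)
    Connect-VIServer -Server "vcenter-wld01.example.com"

    # AIR-SPBM-CFG-001: create a tag category and tag, and assign the tag
    # to the vSAN datastore in the shared edge and workload vSphere cluster
    $category = New-TagCategory -Name "tkg-storage" -Cardinality Single -EntityType Datastore
    $tag = New-Tag -Name "tkg-supervisor" -Category $category
    New-TagAssignment -Tag $tag -Entity (Get-Datastore -Name "vsan-ds-wld01")

    # AIR-SPBM-CFG-002: create an SPBM policy with a rule that matches the tag,
    # so the tagged vSAN datastore can be assigned during Supervisor activation
    New-SpbmStoragePolicy -Name "tkg-supervisor-storage-policy" `
        -AnyOfRuleSets (New-SpbmRuleSet -AllOfRules (New-SpbmRule -AnyOfTags $tag))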

Supervisor

A vSphere cluster that is activated for vSphere with Tanzu is called a Supervisor. After a Supervisor is instantiated, a vSphere administrator can create vSphere Namespaces. Developers can run modern applications that consist of containers running inside vSphere Pods, and can create Tanzu Kubernetes Grid clusters when upstream-compliant Kubernetes clusters are required.

The Supervisor uses ESXi hosts as worker nodes. This is achieved by using an additional process, Spherelet, that runs on each host. Spherelet is a version of kubelet ported natively to ESXi, which allows the host to join the Kubernetes cluster as a worker node.

You can use vSphere Zones and a multi-zone Supervisor architecture to implement high availability at the vSphere cluster level for workloads running in a Tanzu Kubernetes Grid cluster.
Table 2. Design Decisions on the Supervisor for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID: AIR-TZU-CFG-001
Design Decision: Activate vSphere with Tanzu on the shared edge and workload vSphere cluster in the VI workload domain.
Design Justification: The Supervisor is required to run Kubernetes workloads natively and to deploy Tanzu Kubernetes Grid clusters by using the Tanzu Kubernetes Grid Service.
Design Implication: Ensure the shared edge and workload vSphere cluster is sized to support the Supervisor control plane, any additional integrated management workloads, and any customer workloads.

Decision ID: AIR-TZU-CFG-002
Design Decision: Deploy the Supervisor with small-size control plane nodes.
Design Justification: Small-size control plane nodes support up to 2,000 pods in the Supervisor. If your pod count will exceed 2,000, you must deploy control plane nodes sized to handle that scale.
Design Implication: You must consider the size of the control plane nodes.

Decision ID: AIR-TZU-CFG-003
Design Decision: Use NSX as the provider of software-defined networking for the Supervisor.
Design Justification: You can deploy a Supervisor by using either NSX or vSphere networking. VMware Cloud Foundation uses NSX for software-defined networking across the SDDC, and deviating for vSphere with Tanzu would increase operational overhead.
Design Implication: None.

Decision ID: AIR-TZU-CFG-004
Design Decision: Deploy the NSX Edge cluster with large-size nodes.
Design Justification: Large-size NSX Edge nodes are the smallest size supported for activating a Supervisor.
Design Implication: You must account for the size of the NSX Edge nodes.

Decision ID: AIR-TZU-CFG-005
Design Decision: Deploy a single-zone Supervisor.
Design Justification: A three-zone Supervisor requires three separate vSphere clusters.
Design Implication: No change to existing design or procedures with a single-zone Supervisor.

Harbor Supervisor Service

To use Harbor with vSphere with Tanzu, you deploy it as a Supervisor Service. Before you install Harbor as a service, you must install Contour.

All Tanzu Kubernetes Grid clusters running on the Supervisor that hosts Harbor trust the Harbor Registry by default. Tanzu Kubernetes Grid clusters running on other Supervisors must have network connectivity to Harbor, must be able to resolve the Harbor FQDN, and must establish trust with the Harbor Registry.

Table 3. Design Decisions on the Harbor Supervisor Service for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID: AIR-HRB-CFG-001
Design Decision: Deploy Contour as an Ingress Supervisor Service.
Design Justification: Harbor requires Contour on the target Supervisor to provide the Ingress service. The Ingress IP address provided by Contour must resolve to the Harbor FQDN.
Design Implication: None.

Decision ID: AIR-HRB-CFG-002
Design Decision: Deploy the Harbor Registry as a Supervisor Service.
Design Justification: Harbor as a Supervisor Service replaces the integrated registry available in previous vSphere versions.
Design Implication: You must provide the following configuration:

  • Harbor FQDN

  • DNS address (A) record and pointer (PTR) record for the Harbor Registry IP (this IP is provided by the Contour Ingress service)

  • Manage Supervisor Services privilege in vCenter Server

Tanzu Kubernetes Grid Cluster

A Tanzu Kubernetes Grid cluster is a full distribution of the open-source Kubernetes container orchestration software that is packaged, signed, and supported by VMware. Tanzu Kubernetes Grid clusters are provisioned by the VMware Tanzu™ Kubernetes Grid™ Service in the Supervisor. A cluster consists of at least one control plane node and at least one worker node. The Tanzu Kubernetes Grid Service deploys the clusters as Photon OS appliances on top of the Supervisor. You determine the deployment parameters, such as the size and number of control plane and worker nodes and the Kubernetes distribution version, by using a YAML definition that you apply through kubectl.
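The following cluster definition is a minimal sketch against the v1alpha3 TanzuKubernetesCluster API; the cluster name, vSphere Namespace, VM classes, storage class, and TKr version are placeholder values you replace with your own.

    apiVersion: run.tanzu.vmware.com/v1alpha3
    kind: TanzuKubernetesCluster
    metadata:
      name: tkg-cluster-01          # placeholder cluster name
      namespace: tkg-namespace-01   # placeholder vSphere Namespace
    spec:
      topology:
        controlPlane:
          replicas: 3                             # three control plane nodes for availability
          vmClass: best-effort-small              # placeholder VM class
          storageClass: tkg-supervisor-storage-policy   # SPBM policy surfaced as a storage class
          tkr:
            reference:
              name: v1.26.5---vmware.2-tkg.1      # placeholder TKr version
        nodePools:
          - name: worker-pool-01
            replicas: 3                           # three worker nodes for availability
            vmClass: best-effort-medium           # placeholder VM class
            storageClass: tkg-supervisor-storage-policy

You apply the definition with kubectl apply -f after logging in to the Supervisor context.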

You can provide high availability to Tanzu Kubernetes Grid clusters by deploying them on a three-zone Supervisor. A vSphere zone maps to a vSphere cluster, which means that a Supervisor deployed across three vSphere zones uses the resources of all three underlying vSphere clusters. This architecture protects your Kubernetes workloads running on Tanzu Kubernetes Grid clusters from failure at the vSphere cluster level. In a single-zone deployment, vSphere HA provides high availability for Tanzu Kubernetes Grid clusters at the ESXi host level.

Table 4. Design Decisions on the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID: AIR-TZU-CFG-006
Design Decision: Deploy a Tanzu Kubernetes Grid cluster in the Supervisor.
Design Justification: Applications that require upstream Kubernetes compliance must run on a Tanzu Kubernetes Grid cluster.
Design Implication: None.

Decision ID: AIR-TZU-CFG-007
Design Decision: For a disconnected environment, configure a local content library for Tanzu Kubernetes releases (TKrs) for use in the shared edge and workload vSphere cluster.
Design Justification: In a disconnected environment, the Supervisor cannot pull TKr images from the central public content library maintained by VMware. To deploy a Tanzu Kubernetes Grid cluster on a Supervisor, you must configure a content library in the shared edge and workload vSphere cluster with the required images downloaded from the public library.
Design Implication: You must manually configure the content library.

Decision ID: AIR-TZU-CFG-008
Design Decision: Use Antrea as the container network interface (CNI) for your Tanzu Kubernetes Grid clusters.
Design Justification: Antrea is the default CNI for Tanzu Kubernetes Grid clusters.
Design Implication: New Tanzu Kubernetes Grid clusters are deployed with Antrea as the CNI unless you specify Calico (see the fragment after this table).
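If you need to set the CNI explicitly rather than rely on the Antrea default, the cluster specification accepts a CNI setting. The following fragment is a sketch against the v1alpha3 TanzuKubernetesCluster API used in the earlier example:

    spec:
      settings:
        network:
          cni:
            name: antrea   # or calico; when omitted, the default CNI (Antrea) is used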

Sizing Compute and Storage Resources for Private AI Ready Infrastructure for VMware Cloud Foundation

Consider compute and storage requirements when sizing the necessary resources for the validated solution.

You size the compute and storage requirements for the vSphere with Tanzu management workloads, the Tanzu Kubernetes Grid cluster management workloads, the NSX Edge nodes, and the GPU-enabled workloads deployed on vSphere, through the VM Service in the Supervisor, or on a Tanzu Kubernetes Grid cluster.

Table 5. Compute and Storage Resource Requirements for vSphere with Tanzu

Virtual Machine: Supervisor with small-size control plane for up to 10 workloads
Nodes: 3
Total vCPUs: 12
Total Memory: 48 GB
Total Storage: 200 GB

Virtual Machine: Harbor Supervisor Service
Nodes: N/A
Total vCPUs: 7
Total Memory: 7 GB
Total Storage: 200 GB

Virtual Machine: NSX Edge nodes (large-size)
Nodes: Minimum of 2
Total vCPUs: 16
Total Memory: 64 GB
Total Storage: 400 GB
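In aggregate, the components in Table 5 require approximately 35 vCPUs (12 + 7 + 16), 119 GB of memory (48 + 7 + 64), and 800 GB of storage (200 + 200 + 400), before any Tanzu Kubernetes Grid clusters or GPU-enabled workloads are added.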

Table 6. Design Decisions on Sizing the Tanzu Kubernetes Grid Cluster for Private AI Ready Infrastructure for VMware Cloud Foundation

Decision ID: AIR-TZU-CFG-009
Design Decision: Deploy Tanzu Kubernetes Grid clusters with a minimum of three control plane nodes.
Design Justification: Deploying three control plane nodes ensures that the control plane state of your Tanzu Kubernetes Grid cluster is preserved if a node failure occurs. Horizontal and vertical scaling of the control plane is supported (see the sketch after this table and Scale a TKG Cluster on Supervisor Using Kubectl).
Design Implication: None.

Decision ID: AIR-TZU-CFG-010
Design Decision: For production environments, deploy Tanzu Kubernetes Grid clusters with a minimum of three worker nodes.
Design Justification: Deploying three worker nodes provides a higher level of availability for the workloads deployed to the cluster.
Design Implication: You must configure your customer workloads to effectively use the additional worker nodes in the cluster for high availability at the application level.

Decision ID: AIR-TZU-CFG-011
Design Decision: Deploy Tanzu Kubernetes Grid clusters with small-size control plane nodes if your cluster will have fewer than 10 worker nodes.
Design Justification: You must size the control plane of a Tanzu Kubernetes Grid cluster according to the number of worker nodes and the pod density.
Design Implication: The size of the cluster nodes impacts the scale of a given cluster. If you must add nodes to a cluster, consider using larger nodes. For AI GPU-enabled workloads, the GPU is the constraining factor for the number of worker nodes that can be deployed.
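As a sketch of the horizontal scaling referenced in AIR-TZU-CFG-009, you can scale a provisioned cluster by editing the replica counts in its YAML definition and reapplying it with kubectl apply. The fragment below assumes the placeholder cluster definition shown earlier and scales out the worker node pool:

    spec:
      topology:
        nodePools:
          - name: worker-pool-01
            replicas: 5      # scaled out from the original 3 worker nodes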