To operate your Private AI Ready Infrastructure for VMware Cloud Foundation validated solution in an enterprise environment, you must be able to scale up and scale out efficiently.
Scaling Up Within a VI Workload Domain
Scaling up can be an effective method for providing additional resources to your workloads. The most straightforward mechanism for scaling up within a Tanzu-enabled VI workload domain is resizing the Supervisor Cluster control plane nodes. Resizing the control plane nodes expands your capacity to manage additional vSphere Pods and Tanzu Kubernetes clusters within a single vSphere Cluster. You must accommodate the increased CPU and memory requirements for the three Supervisor Cluster control plane nodes, as shown in the following table.
| Control plane node size | vCPUs per node | Memory per node | Maximum vSphere Pods |
|---|---|---|---|
| Tiny | 2 | 8 GB | 1000 |
| Small | 4 | 16 GB | 2000 |
| Medium | 8 | 24 GB | 4000 |
| Large | 16 | 32 GB | 8000 |
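For example, you can change the Supervisor control plane size after deployment through the vSphere Automation REST API. The following is a minimal sketch, not a definitive procedure: the vCenter Server name, cluster managed object ID, and credentials are placeholders, and the exact request payload can vary between vSphere versions.

```python
# Minimal sketch: resize Supervisor control plane nodes via the vSphere
# Automation REST API. Hostname, cluster ID, and credentials are placeholders.
import requests

VCENTER = "vcenter.example.com"   # assumed vCenter Server FQDN
CLUSTER_ID = "domain-c10"         # assumed managed object ID of the Supervisor cluster

# Authenticate and obtain an API session token.
session = requests.post(
    f"https://{VCENTER}/api/session",
    auth=("administrator@vsphere.local", "password"),
    verify=False,
)
session.raise_for_status()
token = session.json()

# Request a larger control plane size (TINY, SMALL, MEDIUM, or LARGE).
resp = requests.patch(
    f"https://{VCENTER}/api/vcenter/namespace-management/clusters/{CLUSTER_ID}",
    headers={"vmware-api-session-id": token},
    json={"size_hint": "LARGE"},
    verify=False,
)
resp.raise_for_status()
print("Resize request accepted; control plane nodes are redeployed one at a time.")
```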
You can also scale up individual worker nodes in a Supervisor Cluster. Those worker nodes are the ESXi hosts that make up the vSphere Cluster. You can scale up a cluster without incurring additional licensing cost by increasing the amount of physical memory in each ESXi host and by adding physical cores, up to 32 cores per socket, beyond which additional licensing is required. Because this design recommends at least N+1 sizing within the Supervisor Cluster, you can perform rolling hardware upgrades, one ESXi host at a time, until every host is upgraded.
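As an illustration of the rolling approach, the following sketch places a single ESXi host into maintenance mode with pyVmomi before a hardware upgrade, so that DRS evacuates its workloads onto the remaining N+1 capacity. The vCenter address, credentials, and host name are placeholders, and the example assumes the pyvmomi package is installed.

```python
# Minimal sketch: evacuate one ESXi host before a hardware upgrade.
# vCenter address, credentials, and host name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

context = ssl._create_unverified_context()  # lab only; use valid certificates in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    host = next(h for h in view.view if h.name == "esxi-01.example.com")
    view.DestroyView()

    # Enter maintenance mode; with DRS in fully automated mode, running
    # workloads are migrated to the other hosts in the Supervisor Cluster.
    WaitForTask(host.EnterMaintenanceMode_Task(timeout=0))
    # Shut down the host and perform the hardware upgrade afterward.
finally:
    Disconnect(si)
```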
Scaling up NSX resources for the software-defined network, load balancers, NAT, and so on, is less effective than scaling up compute resources. To support activation of a Supervisor Cluster, the associated NSX Edge cluster nodes must already use the Large form factor at a minimum, which leaves little headroom to expand within the deployed NSX Edge cluster nodes. Scaling out is much more effective for these resources.
Scaling up GPU resources on an existing vSphere Cluster depends on the number of GPUs that each physical server can accommodate, power, and cool. A homogeneous GPU configuration within the same cluster streamlines resource allocation, workload distribution, and troubleshooting processes.
Scaling Out Within a VI Workload Domain
There are multiple ways to scale out resources effectively within a VI workload domain. The primary approach is to scale out a Supervisor Cluster by adding ESXi hosts (an example of the host-addition workflow follows the list below). This approach is effective only up to a point, after which adding vSphere Clusters to the workload domain and activating them as Supervisor Clusters becomes the preferred method. The point at which this becomes necessary depends on many factors, including, but not limited to:
- Availability – concentrating too many workloads in a single cluster; some applications might require scaling beyond a single Kubernetes cluster or fault domain.
- Manageability – time to remediate when applying updates or upgrades, complexity of many namespaces or tenants within a domain.
- Performance – noisy neighbors, overloading individual components (network, CPU, memory), scalability limits within the domain.
- Recoverability – loss of an entire fault domain in larger clusters drives up recovery time, and recovery point objectives are harder to meet as backup activity within a domain grows.
- Security – separation of duties, RBAC.
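The host-addition workflow itself runs through SDDC Manager. The following is a minimal sketch of a cluster expansion call against the SDDC Manager public API; it assumes an unassigned host has already been commissioned and that you hold a valid access token. The URLs, IDs, and the exact expansion spec fields are placeholders and can differ between VMware Cloud Foundation versions.

```python
# Minimal sketch: expand an existing VI workload domain cluster with a
# commissioned host through the SDDC Manager API. IDs and token are placeholders.
import requests

SDDC_MANAGER = "https://sddc-manager.example.com"
TOKEN = "<access-token>"        # obtained from the SDDC Manager token endpoint
CLUSTER_ID = "<cluster-id>"     # target vSphere Cluster in the workload domain
HOST_ID = "<unassigned-host-id>"  # host already commissioned in SDDC Manager

# The expansion spec is simplified; production specs typically also carry
# host network details and licensing information.
expansion_spec = {
    "clusterExpansionSpec": {
        "hostSpecs": [
            {"id": HOST_ID}
        ]
    }
}

resp = requests.patch(
    f"{SDDC_MANAGER}/v1/clusters/{CLUSTER_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=expansion_spec,
    verify=False,
)
resp.raise_for_status()
print("Cluster expansion task submitted:", resp.json())
```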
When you decide to add another vSphere Cluster for use as a Supervisor Cluster, you must also decide how to scale your NSX Edge cluster resources. Additional Supervisor Clusters within the VI workload domain must have their own set of software-defined network (SDN) components, so you must instantiate a new NSX Edge cluster. At this point, you have two choices:
- Deploy an additional NSX Edge cluster within the workload domain with an additional Tier-0 Gateway, or
- Deploy an additional NSX Edge cluster without a Tier-0 Gateway and attach a Tier-1 Gateway running on the new NSX Edge cluster to the Tier-0 Gateway running on the original NSX Edge cluster (a sketch of this option follows).
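For the second option, attaching a Tier-1 Gateway to the existing Tier-0 Gateway can be expressed through the NSX Policy API. The following is a minimal sketch of that attachment pattern, assuming an NSX Manager at nsx.example.com, an existing Tier-0 Gateway with the policy ID vcf-t0, and a new NSX Edge cluster whose policy path is known; all names, IDs, and paths are placeholders.

```python
# Minimal sketch: create a Tier-1 Gateway on the new NSX Edge cluster and
# attach it to the existing Tier-0 Gateway via the NSX Policy API.
# Hostname, credentials, IDs, and paths are placeholders.
import requests

NSX_MANAGER = "https://nsx.example.com"
AUTH = ("admin", "password")

tier1_spec = {
    "display_name": "sup02-t1",
    # Path of the existing Tier-0 Gateway that provides north-south routing.
    "tier0_path": "/infra/tier-0s/vcf-t0",
    "route_advertisement_types": ["TIER1_CONNECTED", "TIER1_NAT"],
}
resp = requests.patch(
    f"{NSX_MANAGER}/policy/api/v1/infra/tier-1s/sup02-t1",
    auth=AUTH,
    json=tier1_spec,
    verify=False,
)
resp.raise_for_status()

# Place the Tier-1 services on the new Edge cluster by pointing its locale
# services at the new Edge cluster's policy path.
locale_spec = {
    "edge_cluster_path": "/infra/sites/default/enforcement-points/default/edge-clusters/<new-edge-cluster-id>",
}
resp = requests.patch(
    f"{NSX_MANAGER}/policy/api/v1/infra/tier-1s/sup02-t1/locale-services/default",
    auth=AUTH,
    json=locale_spec,
    verify=False,
)
resp.raise_for_status()
print("Tier-1 Gateway sup02-t1 attached to Tier-0 vcf-t0 on the new Edge cluster.")
```

Note that activating Workload Management creates its own segments and gateways against the Tier-0 Gateway and NSX Edge cluster you select, so this sketch only illustrates the Tier-1-to-Tier-0 attachment pattern behind that choice.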
The same choice applies when software-defined networking resources are scaled out by adding another NSX Edge cluster without adding more compute resources through an additional Supervisor Cluster. The Tier-0 Gateway in NSX provides dynamic routing from the SDN to the top-of-rack switches, and deploying another Tier-0 Gateway in this case might add management overhead without much benefit. The counterargument to that manageability concern is consistency across the NSX Edge clusters deployed in the environment: aberrant configurations can make troubleshooting harder, especially if the SDN environment is not well documented.