To operate your Developer Ready Infrastructure for VMware Cloud Foundation validated solution in an enterprise environment, you must be able to scale up and scale out efficiently.

Scaling Up Within a VI Workload Domain

Scaling up can be an effective method for providing additional resources for your workloads. The most straightforward mechanism for scaling up within a VI workload domain is resizing the Supervisor control plane nodes. When you resize the control plane nodes, you expand your ability to manage additional vSphere Pods and Tanzu Kubernetes clusters within a single vSphere Cluster. Resizing increases the CPU and memory requirements for the three Supervisor control plane nodes, as shown in Table 1.

Table 1. Supervisor control plane node sizing

Control plane node size | vCPUs per node | Memory per node | Maximum vSphere Pods
Tiny                    | 2              | 8 GB            | 1000
Small                   | 4              | 16 GB           | 2000
Medium                  | 8              | 24 GB           | 4000
Large                   | 16             | 32 GB           | 8000

The compute resource requirements for three Supervisor control plane nodes at the recommended Small starting size are 12 vCPUs and 48 GB of memory. Within that compute footprint, you can run up to 2000 vSphere Pods. To scale up to 8000 vSphere Pods within that Supervisor, the aggregate compute resource requirements for the control plane increase to 48 vCPUs and 96 GB of memory. At the resource levels required to run 8000 vSphere Pods and the user workloads they contain, resizing the control plane nodes is a good option for keeping the Supervisor operating effectively.
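
This arithmetic can be captured in a short calculation. The following is a minimal sketch that encodes Table 1 as a Python dictionary and computes the aggregate three-node footprint for a given control plane node size; the dictionary and function names are illustrative and not part of any product API.

```python
# Sketch: aggregate Supervisor control plane resource requirements per node size.
# The sizing values mirror Table 1; names here are illustrative only.

CONTROL_PLANE_SIZES = {
    #  size:    (vCPUs per node, memory GB per node, max vSphere Pods)
    "tiny":   (2, 8, 1000),
    "small":  (4, 16, 2000),
    "medium": (8, 24, 4000),
    "large":  (16, 32, 8000),
}

NODES_PER_SUPERVISOR = 3  # a Supervisor always runs three control plane nodes


def aggregate_requirements(size: str) -> dict:
    """Return the total control plane footprint for a given node size."""
    vcpus, mem_gb, max_pods = CONTROL_PLANE_SIZES[size]
    return {
        "total_vcpus": vcpus * NODES_PER_SUPERVISOR,
        "total_memory_gb": mem_gb * NODES_PER_SUPERVISOR,
        "max_vsphere_pods": max_pods,
    }


if __name__ == "__main__":
    for size in ("small", "large"):
        print(size, aggregate_requirements(size))
    # small -> 12 vCPUs, 48 GB of memory, up to 2000 vSphere Pods
    # large -> 48 vCPUs, 96 GB of memory, up to 8000 vSphere Pods
```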

You can also scale up the individual worker nodes in a Supervisor, that is, the ESXi hosts that make up the vSphere Cluster. You can scale a cluster up without incurring additional licensing cost by increasing the number of physical cores in each ESXi host, up to 32 cores per socket before additional licensing is required, and by adding physical memory. Because this design recommends at least N+1 sizing within the Supervisor, you can perform rolling hardware upgrades, taking one host out of service at a time, until every ESXi host is upgraded.
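
If you model the proposed host specification, you can sanity-check a scale-up plan against the 32-cores-per-socket licensing threshold and the N+1 headroom needed for rolling hardware upgrades. The sketch below uses illustrative host specifications and helper names; it is an assumption-laden example, not a sizing tool provided by the solution.

```python
# Sketch: sanity-check a scale-up plan for the ESXi worker nodes in a Supervisor.
# Host specifications and names are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class HostSpec:
    sockets: int
    cores_per_socket: int
    memory_gb: int


LICENSED_CORES_PER_SOCKET = 32  # per-socket core count before additional licensing applies


def extra_licenses_needed(host: HostSpec) -> bool:
    """True if the proposed host exceeds the per-socket core licensing threshold."""
    return host.cores_per_socket > LICENSED_CORES_PER_SOCKET


def n_plus_one_capacity(host: HostSpec, host_count: int) -> tuple[int, int]:
    """Usable cores and memory with one host out of service (rolling upgrade or failure)."""
    usable_hosts = host_count - 1
    cores = usable_hosts * host.sockets * host.cores_per_socket
    memory = usable_hosts * host.memory_gb
    return cores, memory


if __name__ == "__main__":
    proposed = HostSpec(sockets=2, cores_per_socket=32, memory_gb=768)
    print("extra licenses needed:", extra_licenses_needed(proposed))
    print("N+1 capacity (cores, GB):", n_plus_one_capacity(proposed, host_count=4))
```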

Scaling up NSX resources for the software-defined network, such as load balancers and NAT, is less effective than scaling up compute resources. To support activation of a Supervisor, the associated NSX Edge cluster nodes must already be sized Large at a minimum, which leaves little headroom to expand within the deployed NSX Edge nodes. Scaling out is much more effective here.

Scaling Out Within a VI Workload Domain

There are multiple ways in which resources can be effectively scaled out within a VI workload domain. The primary approach is to scale out a Supervisor by adding ESXi hosts. This approach is effective only to a certain point, beyond which adding vSphere Clusters to the workload domain and activating them as Supervisors is the preferred method. The point at which this becomes necessary depends on many factors, including but not limited to the following (a decision sketch follows this list):

  • Availability - a single large cluster places many eggs in one basket; some applications might require scale beyond a single Kubernetes cluster or fault domain.

  • Manageability - time to remediate when applying updates or upgrades, complexity of many namespaces or tenants within a domain.

  • Performance - noisy neighbors, overloading individual components (network, CPU, memory), scalability limits within the domain.

  • Recoverability - loss of an entire fault domain in larger clusters drives up recovery time, and recovery point objectives are harder to meet with more backup activity within a domain.

  • Security - separation of duties and role-based access control (RBAC).
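
The factors above can be summarized as a simple decision checklist. The following sketch is purely illustrative; the field names and the notion of a "comfortable" host count are assumptions made for the example, not documented product limits.

```python
# Sketch: an illustrative checklist for choosing between adding hosts to an existing
# Supervisor and standing up a new vSphere Cluster as another Supervisor.
# All thresholds and field names are assumptions, not product limits.

from dataclasses import dataclass


@dataclass
class DomainState:
    hosts_in_cluster: int
    max_hosts_comfortable: int        # operational comfort level, not a hard vSphere maximum
    needs_fault_isolation: bool       # availability: workloads needing a separate fault domain
    update_window_exceeded: bool      # manageability: remediation no longer fits the window
    noisy_neighbor_reports: bool      # performance: contention within the cluster
    rto_rpo_at_risk: bool             # recoverability: backup and restore objectives under pressure
    tenant_separation_required: bool  # security: separation of duties / RBAC boundaries


def recommend_scale_out(state: DomainState) -> str:
    """Return a scale-out recommendation when any listed factor is triggered."""
    triggers = [
        state.hosts_in_cluster >= state.max_hosts_comfortable,
        state.needs_fault_isolation,
        state.update_window_exceeded,
        state.noisy_neighbor_reports,
        state.rto_rpo_at_risk,
        state.tenant_separation_required,
    ]
    if any(triggers):
        return "Add a vSphere Cluster and activate it as another Supervisor"
    return "Add ESXi hosts to the existing Supervisor"
```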

When you decide to add another vSphere Cluster for use as a Supervisor, you must also decide how to scale your NSX Edge cluster resources. Additional Supervisors within the VI workload domain must have their own set of software-defined networking (SDN) components, so you must instantiate a new NSX Edge cluster. At this point, you have two choices (a sketch of the second option follows the list):

  • Deploy an additional NSX Edge cluster within the workload domain with an additional Tier-0 Gateway, or

  • Deploy an additional NSX Edge cluster without the Tier-0 Gateway and attach a Tier-1 Gateway running on the NSX Edge cluster to the Tier-0 Gateway running on the original NSX Edge cluster.
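
As an illustration of the second option, the following sketch creates a Tier-1 Gateway through the NSX Policy API, links it to the existing Tier-0 Gateway, and places its services on the newly deployed NSX Edge cluster. The manager address, credentials, and object IDs are placeholders, and the request paths and payload fields should be verified against the NSX API documentation for your NSX version.

```python
# Sketch: attach a new Tier-1 Gateway (running on a newly deployed NSX Edge cluster)
# to the existing Tier-0 Gateway, the second option described above.
# Manager address, credentials, and object IDs are placeholders; verify the payloads
# against the NSX Policy API documentation for your version.

import requests

NSX_MANAGER = "https://nsx-mgr.example.local"
AUTH = ("admin", "REPLACE_ME")
VERIFY_TLS = False  # lab-only shortcut; use proper certificate validation in production

TIER1_ID = "sup02-t1"
EXISTING_TIER0_PATH = "/infra/tier-0s/vi-wld-t0"
NEW_EDGE_CLUSTER_PATH = (
    "/infra/sites/default/enforcement-points/default/edge-clusters/sup02-edge-cluster"
)

# Create the Tier-1 Gateway and link it to the existing Tier-0 Gateway.
tier1_body = {
    "display_name": "sup02-t1",
    "tier0_path": EXISTING_TIER0_PATH,
    "route_advertisement_types": ["TIER1_CONNECTED", "TIER1_LB_VIP", "TIER1_NAT"],
}
requests.patch(
    f"{NSX_MANAGER}/policy/api/v1/infra/tier-1s/{TIER1_ID}",
    json=tier1_body, auth=AUTH, verify=VERIFY_TLS,
).raise_for_status()

# Place the Tier-1 services (load balancer, NAT) on the new NSX Edge cluster.
locale_body = {"edge_cluster_path": NEW_EDGE_CLUSTER_PATH}
requests.patch(
    f"{NSX_MANAGER}/policy/api/v1/infra/tier-1s/{TIER1_ID}/locale-services/default",
    json=locale_body, auth=AUTH, verify=VERIFY_TLS,
).raise_for_status()
```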

The same choice applies when you scale out software-defined networking resources by adding another NSX Edge cluster without adding more compute resources through an additional Supervisor. The Tier-0 Gateway in NSX provides dynamic routing from the SDN to the top-of-rack switches, so deploying an additional Tier-0 Gateway in this case might add management overhead without much benefit. The counterargument to that manageability concern is consistency across the NSX Edge clusters deployed in the environment: divergent configurations can make troubleshooting harder, especially if the SDN environment is not well documented.