A reference architecture for Enterprise Edge in the data center depends on understanding vCenter and infrastructure scaling limits, TKG cluster management, and networking design best practices for Enterprise Edge deployments. These topics are essential to architecting secure, scalable, and performant designs.
The reference design provided for the data center will include the minimum requirements to run Enterprise Edge and the corresponding components in the data center. It will not include the architecture for deploying NSX Advanced Load Balancer in the data center or discuss design considerations for VMware Cloud Foundation (VCF).
If you already use vSphere and other VMware infrastructure software in your data center, ECS Enterprise Edge does not require changes to your existing architecture. Only additional software components such as vCenter, vSAN Witness, TKG clusters, and optionally Harbor need to be deployed in the data center to meet the Enterprise Edge requirements.
Data Center Scaling Limits
An Enterprise Edge deployment consists of hundreds and potentially thousands of edge locations that the data center will have to support. It’s important to understand the scale limits for different components so administrators can plan and architect the data center resources accordingly.
Component | Recommended Scale Limit | Comments |
---|---|---|
vCenter | 2000 hosts per vCenter | Up to 15 vCenters can be linked to support additional hosts |
vSphere Distributed Switch (VDS) | 128 VDS per vCenter; 16 VDS per host | |
vSAN Witness | 64 2-node vSAN clusters per shared Witness Appliance | Requires Extra-large witness VM (6 vCPUs and 32 GB vRAM) |
TKG Management Cluster | Supports up to 200 TKG workload clusters | |
Using the sample vCenter deployment shown in the following diagram, we can see that the most challenging limits are the number of hosts per vCenter and the limit of 128 VDS per vCenter. Scaling beyond 2000 hosts requires deploying additional vCenter appliances; how many edge sites those 2000 hosts represent depends on the edge type. The limit of 128 VDS per vCenter allows for the following VDS design options for the edge sites:
One VDS per edge cluster. This limits each vCenter to managing a total of 128 edge sites.
One VDS per multiple edge clusters. This design allows the vCenter to scale beyond 128 edge sites, but it requires consistent VLAN schemes at edge locations that share a VDS.
Because single-node edges generally have a simple network design, the second option can be effective for single-node Enterprise Edges and may simplify layer 2 network management for those sites. However, because a shared VDS enforces a common template, any configuration change on the VDS affects every edge that shares it. We recommend that administrators evaluate the pros and cons against their own edge deployments.
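The trade-off between the two VDS options can be estimated with some simple arithmetic. The following is a minimal planning sketch, not a sizing tool: it uses only the limits from the scaling table above, and the function names and the `sites_per_vds` parameter are hypothetical. It also ignores any hosts or VDS instances consumed by the data center itself, so treat the results as upper bounds.

```python
import math

# Limits taken from the scaling table in this section.
MAX_HOSTS_PER_VCENTER = 2000
MAX_VDS_PER_VCENTER = 128


def sites_per_vcenter(hosts_per_site: int, sites_per_vds: int = 1) -> int:
    """Upper bound on the edge sites a single vCenter can manage.

    sites_per_vds=1 models "one VDS per edge cluster"; a larger value
    models several edge clusters sharing one VDS (hypothetical parameter).
    """
    if hosts_per_site < 1 or sites_per_vds < 1:
        raise ValueError("hosts_per_site and sites_per_vds must be >= 1")
    by_hosts = MAX_HOSTS_PER_VCENTER // hosts_per_site
    by_vds = MAX_VDS_PER_VCENTER * sites_per_vds
    # The tighter of the two limits decides the capacity.
    return min(by_hosts, by_vds)


def vcenters_needed(total_sites: int, hosts_per_site: int, sites_per_vds: int = 1) -> int:
    """Minimum number of vCenter appliances for the given number of edge sites."""
    capacity = sites_per_vcenter(hosts_per_site, sites_per_vds)
    if capacity == 0:
        raise ValueError("a single site exceeds the per-vCenter host limit")
    return math.ceil(total_sites / capacity)
```

For example, under these assumptions 1000 two-node sites with one VDS per site are bound by the VDS limit (128 sites per vCenter), while 1000 single-node sites sharing a VDS across 20 sites each are bound only by the host limit.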
TKG Clusters
Tanzu Kubernetes Grid utilizes an open-source project called Cluster API. Cluster API is a set of custom resource definitions (CRDs) and controllers that are deployed inside a Kubernetes cluster. It is responsible for the life cycle management of the Kubernetes clusters where applications run. We refer to the cluster containing Cluster API as the Management Cluster. When deploying Enterprise Edge in a Hub and Spoke topology, we recommend deploying one management cluster per vCenter in the data center for a group of edges. We also recommend using these edge management clusters only for the life cycle management of edge workload clusters. If you need to deploy workload clusters in the data center for running automation tools or a central Harbor, we recommend using a separate management cluster for the life cycle management of those Kubernetes clusters.
When TKG management clusters are provisioned, their nodes are assigned IP addresses from DHCP (TKG MGMT network). Likewise, when the management cluster provisions workload clusters, the VMs for TKG workload clusters at the edge are also assigned IP addresses from DHCP (TKG WL network). At the edge, the required DHCP range depends on the number of Kubernetes control plane nodes and worker nodes deployed. We recommend a range of at least twice the number of expected nodes/VMs, because patching an existing cluster temporarily stands up additional nodes/VMs as it rolls through all the nodes in the cluster.
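As a rough illustration of the sizing guidance above, the sketch below computes the DHCP range for an edge workload cluster and checks it against a candidate subnet. The function names, the `reserved` parameter, and the example subnets are assumptions made for illustration; the check assumes an ordinary IPv4 subnet in which the network and broadcast addresses are not usable.

```python
import ipaddress
import math


def dhcp_addresses_needed(control_plane_nodes: int, worker_nodes: int,
                          factor: float = 2.0) -> int:
    """DHCP range size: expected node count times a headroom factor.

    factor=2.0 reflects the "at least twice the expected nodes" guidance,
    leaving room for the extra nodes created during rolling patches.
    """
    return math.ceil((control_plane_nodes + worker_nodes) * factor)


def fits_in_subnet(cidr: str, needed: int, reserved: int = 0) -> bool:
    """Check whether `needed` DHCP addresses fit in an IPv4 subnet.

    `reserved` (hypothetical) counts addresses set aside for gateways,
    static IPs, load balancer VIPs, and similar non-DHCP uses.
    """
    network = ipaddress.ip_network(cidr)
    # Exclude the network and broadcast addresses.
    usable = max(network.num_addresses - 2, 0)
    return needed <= usable - reserved


needed = dhcp_addresses_needed(control_plane_nodes=3, worker_nodes=5)
print(needed, fits_in_subnet("192.168.10.0/27", needed, reserved=4))
```

With 3 control plane nodes and 5 workers, the guidance calls for 16 addresses, which a /27 can hold with a few addresses reserved but a /28 cannot.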
At the time of writing, TKG IPAM is in development and will provide an alternative to DHCP for VM IP assignment.
Take care when network connectivity between the data center and the edge is unreliable or has high latency. The following is a list of cross-network dependencies when provisioning and managing remote edge clusters:
Activity | Path | Lost or Degraded Connectivity Impact | Compensating operations |
---|---|---|---|
Stage TKG OVA templates locally on Edge storage | Datacenter vCenter to subscription source | Copy to edge fails | Use Content Library Service to propagate OVA templates as it supports retry and timeout logic, propagate OVA templates at a time when networking is known to be stable |
Management Cluster Provision k8s cluster nodes | Management cluster to vCenter, vCenter to Edge hosts | Timeout creating cluster | Delete cluster and retry once connectivity restored |
Workload Cluster Health Check | Management Cluster to Edge Workload Cluster | Health check threshold reached, possible recreate of cluster | Tune MachineHealthcheck for longer period or disable |
Edge workload Persistent Volume request | Edge vSphere CSI driver to vCenter | PVC cannot be created or remounted under new Node after Node failure | Reprovision app workload, make sure important business data is backed up using data service specific backup processes, use other Kubernetes storage provider that does not require connectivity to datacenter resources |
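To make the health check tuning in the table more concrete, the following sketch builds a Cluster API `MachineHealthCheck` object with a longer timeout and prints it as JSON, which `kubectl apply -f` accepts. It is an illustration under assumptions, not the TKG procedure: the `cluster.x-k8s.io/v1beta1` API version, the `cluster.x-k8s.io/cluster-name` selector label, the default values, and the cluster name `edge-site-01` should all be verified against the Cluster API version shipped with your TKG release.

```python
import json


def machine_health_check(cluster_name: str, namespace: str = "default",
                         timeout: str = "15m", max_unhealthy: str = "40%") -> dict:
    """Build a MachineHealthCheck manifest that tolerates longer outages.

    `timeout` is how long a node may report an unhealthy Ready condition
    before remediation starts; raise it for sites with unreliable links.
    All default values here are illustrative assumptions.
    """
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",  # assumed API version
        "kind": "MachineHealthCheck",
        "metadata": {"name": f"{cluster_name}-mhc", "namespace": namespace},
        "spec": {
            "clusterName": cluster_name,
            # Stop remediating if too many machines look unhealthy at once,
            # which usually indicates a network outage rather than node failure.
            "maxUnhealthy": max_unhealthy,
            "selector": {
                "matchLabels": {"cluster.x-k8s.io/cluster-name": cluster_name}
            },
            "unhealthyConditions": [
                {"type": "Ready", "status": "Unknown", "timeout": timeout},
                {"type": "Ready", "status": "False", "timeout": timeout},
            ],
        },
    }


# "edge-site-01" is a hypothetical workload cluster name.
print(json.dumps(machine_health_check("edge-site-01", timeout="30m"), indent=2))
```

Setting `maxUnhealthy` alongside a longer timeout is a design choice worth considering for edge sites: when the link to the data center drops, every node can appear unhealthy at once, and recreating the whole cluster is rarely the desired response.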
To back up and restore the workloads hosted by TKG workload clusters, you can use Velero, an open-source community standard tool for backing up and restoring Kubernetes cluster objects and persistent volumes. Velero supports a variety of storage providers to store its backups. For more details, see “Backup and Restore” under the “Lifecycle Management” section of the Tanzu Edge Solution Reference Architecture guide.