Site Availability

For site disaster recovery and maintenance, multiple sites can be used to implement geographic redundancy, so that a secondary site is available to complement the primary site.


Redundancy Choice	Design Considerations
Single Site	For non-critical services where longer downtime is acceptable. Services are stateless and traffic can be routed to another instance in case of failure. Partial site loss can be recovered by using vSphere HA and shared storage. Backup-based approaches are acceptable for VM state recovery.
Active-Standby	Suitable for a recovery point objective (RPO) with an acceptable maximum of five minutes. High-speed, high-capacity links can be established between the primary and secondary sites. Higher capital costs are required to maintain replicas in the recovery site.
Active-Active	Suitable for mission-critical services with almost no downtime. Services are deployed in an active-active configuration across multiple sites. Any site should be able to handle the load of both active sites in case of failure. Higher capital cost is required in each site in case either site fails. This configuration is not available in this release.
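
The Active-Active sizing rule in the table (any site should be able to handle the load of both active sites) comes down to a simple capacity check. The following is a minimal sketch only: the function name, site names, and figures are hypothetical, and capacity and load are assumed to be expressed in the same unit.

```python
def sites_able_to_absorb_failover(capacity, load):
    """Map each site to True if it can carry the combined load of all sites.

    capacity and load are dicts keyed by site name, both in the same
    (hypothetical) unit, for example vCPUs or concurrent sessions.
    """
    total_load = sum(load.values())
    return {site: capacity[site] >= total_load for site in capacity}


# Hypothetical figures for two active sites.
capacity = {"site-a": 100, "site-b": 60}
load = {"site-a": 40, "site-b": 30}

result = sites_able_to_absorb_failover(capacity, load)
print(result)  # {'site-a': True, 'site-b': False}
```

In this example site-b cannot absorb the combined load of 70, which illustrates why the Active-Active option carries a higher capital cost in each site.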

Availability Zones

Redundancy can be built into a single site, for example to protect against power outages. However, this design choice carries the cost overhead of duplicate hardware. This design is out of scope for this reference architecture.

By design, multiple availability zones belong to a single region. The physical distance between the availability zones is short enough to offer low, single-digit latency and large bandwidth between the zones. This architecture allows the cloud infrastructure in the availability zones to operate as a single virtual data center within a region. Workloads can be operated across multiple availability zones in the same region as if they were part of a single virtual data center. This supports a high-availability architecture that is suitable for mission-critical applications. When the distance between two equipment locations becomes too large, the locations can no longer function as two availability zones within the same region and must be treated as a multi-site design.
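
The distinction between availability zones and a multi-site design can be illustrated with a latency check. This is a sketch under stated assumptions: the text does not give a unit or threshold, so "single-digit latency" is read here as a round-trip time under 10 ms, and the function and constant names are hypothetical.

```python
# Assumption: "single-digit latency" is interpreted as under 10 ms round trip.
AZ_MAX_LATENCY_MS = 10.0


def classify_location_pair(round_trip_ms, max_latency_ms=AZ_MAX_LATENCY_MS):
    """Classify two equipment locations by their measured round-trip latency."""
    if round_trip_ms < 0:
        raise ValueError("latency cannot be negative")
    if round_trip_ms < max_latency_ms:
        return "availability-zones"
    return "multi-site"


print(classify_location_pair(2.5))   # availability-zones
print(classify_location_pair(35.0))  # multi-site
```

A real assessment would also consider the bandwidth between the locations, which this sketch leaves out.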

Multi-Region

Multiple sites support placing workloads closer to the CSP's customers, for example, by operating one site in US Northern California and one site in US Southern California. In this multi-site design, a secondary site can become a recovery site for the primary site. When components in the primary site become compromised, all nodes are brought up in the failover site.

VMware vSphere Replication is used to replicate VM disks from the primary site to the secondary site. In case of failure, the Resource Pod workloads can be brought up on the secondary site and routing is updated to divert subscriber traffic to the secondary site.
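
The failover sequence described above can be modeled as a small planning function. This is an illustrative sketch only: it does not call any vSphere Replication API, the function and workload names are hypothetical, and the five-minute default is an assumption taken from the recovery point objective mentioned for the Active-Standby option.

```python
from datetime import datetime, timedelta

# Assumption: five-minute RPO, as stated for the Active-Standby option.
DEFAULT_RPO = timedelta(minutes=5)


def failover_plan(workloads, last_replica_at, failure_at, rpo=DEFAULT_RPO):
    """Return the data-loss window and the ordered failover steps.

    The data-loss window is the time between the last completed
    replication and the failure of the primary site.
    """
    data_loss = failure_at - last_replica_at
    steps = [f"bring up {name} on the secondary site" for name in workloads]
    steps.append("update routing to divert subscriber traffic to the secondary site")
    return {"data_loss": data_loss, "rpo_met": data_loss <= rpo, "steps": steps}


# Hypothetical Resource Pod workloads and timestamps.
plan = failover_plan(
    workloads=["vnf-1", "vnf-2"],
    last_replica_at=datetime(2024, 1, 1, 12, 0),
    failure_at=datetime(2024, 1, 1, 12, 3),
)
print(plan["rpo_met"])  # True
for step in plan["steps"]:
    print(step)
```

Routing is updated last in this sketch so that subscriber traffic is diverted only after the workloads are running on the secondary site.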