Introduction to Stretched Clusters

Stretched clusters extend the Virtual SAN cluster from a single site to two sites for a higher level of availability and intersite load balancing. Stretched clusters are typically deployed in environments where the distance between data centers is limited, such as metropolitan or campus environments.

You can use stretched clusters to manage planned maintenance and avoid disaster scenarios, because maintenance or loss of one site does not affect the overall operation of the cluster. In a stretched cluster configuration, both sites are active sites. If either site fails, Virtual SAN uses the storage on the other site. vSphere HA restarts any VM that must be restarted on the remaining active site.

You must designate one site as the preferred site. The other site becomes the secondary, or nonpreferred, site. The preferred designation matters only when the network connection between the two active sites is lost. In that case, the site designated as preferred is the one that remains operational.

A Virtual SAN stretched cluster can tolerate one link failure at a time without data becoming unavailable. A link failure is a loss of network connection between the two sites or between one site and the witness host. During a site failure or loss of network connection, Virtual SAN automatically switches to the sites that remain fully functional.

For more information about working with stretched clusters, see the Virtual SAN Stretched Cluster Guide.

Witness Host

Each stretched cluster consists of two sites and one witness host. The witness host resides at a third site and contains the witness components of virtual machine objects. It contains only metadata, and does not participate in storage operations.

The witness host serves as a tiebreaker when the network connection between the two sites is lost and a decision must be made about the availability of datastore components. In this case, the witness host typically forms a Virtual SAN cluster with the preferred site. However, if the preferred site becomes isolated from both the secondary site and the witness host, the witness host forms a cluster with the secondary site. When the preferred site is online again, data is resynchronized to ensure that both sites have the latest copies of all data.

If the witness host fails, all corresponding objects become noncompliant but are fully accessible.
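
The tiebreak behavior described above can be sketched as a small decision function. This is a simplified model for illustration only, not Virtual SAN code: the names `Outcome` and `surviving_partition` and the three boolean link flags are hypothetical, and the model ignores per-object component placement.

```python
from enum import Enum


class Outcome(Enum):
    BOTH_SITES = "both sites active"
    PREFERRED_ONLY = "preferred site with witness"
    SECONDARY_ONLY = "secondary site with witness"
    UNAVAILABLE = "no partition can form a cluster"


def surviving_partition(site_link_up: bool,
                        preferred_witness_up: bool,
                        secondary_witness_up: bool) -> Outcome:
    """Return which sites keep serving data for a given set of link states.

    Hypothetical model of the rules in the text above.
    """
    if site_link_up:
        # The two data sites still see each other, so both stay active.
        # If the witness is unreachable, objects become noncompliant
        # but remain accessible.
        return Outcome.BOTH_SITES
    if preferred_witness_up:
        # Inter-site link lost: the witness sides with the preferred site.
        return Outcome.PREFERRED_ONLY
    if secondary_witness_up:
        # The preferred site is isolated from both the secondary site and
        # the witness, so the witness forms a cluster with the secondary site.
        return Outcome.SECONDARY_ONLY
    # Every link is down: no site can reach the witness or the other site.
    return Outcome.UNAVAILABLE


if __name__ == "__main__":
    # Loss of the inter-site link only: the preferred site stays operational.
    print(surviving_partition(False, True, True).value)
```

In this model the preferred designation affects the result only when the inter-site link is down, which matches the role of the preferred site described earlier.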

The witness host has the following characteristics:

The witness host can use low bandwidth/high latency links.
The witness host cannot run VMs.
A single witness host can support only one Virtual SAN stretched cluster.
The witness host must have one VMkernel adapter with Virtual SAN traffic enabled, with connections to all hosts in the cluster. It uses one VMkernel adapter for management and one for Virtual SAN data traffic, and it can have only one VMkernel adapter dedicated to Virtual SAN.
The witness host must be a standalone host dedicated to the stretched cluster. It cannot be added to any other cluster or moved in inventory through vCenter Server.
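
The characteristics listed above can be expressed as a simple checklist. The following sketch assumes you have already collected the relevant facts about a candidate witness host by other means. The `WitnessHost` class, its field names, and `check_witness` are hypothetical and are not part of any VMware API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WitnessHost:
    """Hypothetical description of a candidate witness host."""
    name: str
    runs_vms: bool                  # True if any VM runs on the host
    stretched_clusters: List[str]   # stretched clusters this witness serves
    vsan_vmkernel_adapters: int     # adapters with Virtual SAN traffic enabled
    member_of_other_cluster: bool   # True if the host was added to a cluster


def check_witness(host: WitnessHost) -> List[str]:
    """Return the listed characteristics that the host violates."""
    problems = []
    if host.runs_vms:
        problems.append("witness host cannot run VMs")
    if len(host.stretched_clusters) != 1:
        problems.append("witness host must serve exactly one stretched cluster")
    if host.vsan_vmkernel_adapters != 1:
        problems.append("witness host needs exactly one Virtual SAN VMkernel adapter")
    if host.member_of_other_cluster:
        problems.append("witness host must be a standalone host")
    return problems


if __name__ == "__main__":
    candidate = WitnessHost("witness-01", False, ["stretched-a"], 1, False)
    print(check_witness(candidate))  # an empty list means no violations found
```

The check covers only the characteristics that can be stated as yes/no facts. Link bandwidth and latency are left out because the text gives no thresholds for them.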

The witness host can be a physical host or an ESXi host running inside a VM. The VM witness host does not provide other types of functionality, such as storing or running VMs. Multiple witness hosts can run as VMs on a single physical server. For patching and basic networking and monitoring configuration, the VM witness host works in the same way as a typical ESXi host. You can manage it with vCenter Server, patch it and update it by using esxcli or vSphere Update Manager, and monitor it with standard tools that interact with ESXi hosts.

You can use a witness virtual appliance as the witness host in a stretched cluster. The witness virtual appliance is an ESXi host in a VM, packaged as an OVF or OVA. The appliance is available in different options, based on the size of the deployment.

Stretched Cluster Versus Fault Domains

Stretched clusters provide redundancy and failure protection across data centers in two geographical locations. Fault domains provide protection from rack-level failures within the same site. Each site in a stretched cluster resides in a separate fault domain.

A stretched cluster requires three fault domains: the preferred site, the secondary site, and a witness host.

In Virtual SAN 6.6 and later releases, you can provide an additional level of local fault protection for virtual machine objects in stretched clusters. When you configure a stretched cluster with four or more hosts in each site, the following policy rules are available for objects in the cluster:

Primary level of failures to tolerate (PFTT). This rule defines the number of host and device failures that a virtual machine object can tolerate across the two sites. The default value is 1, and the maximum value is 3.
Secondary level of failures to tolerate (SFTT). This rule defines the number of host and object failures that a virtual machine object can tolerate within a single site. The default value is 0, and the maximum value is 3.
Affinity. This rule is available only if the Primary level of failures to tolerate is set to 0. You can set the Affinity rule to None, Preferred, or Secondary. This rule enables you to restrict virtual machine objects to a selected site in the stretched cluster. The default value is None.

Note: When you configure the Secondary level of failures to tolerate for the stretched cluster, the Failure tolerance method rule applies to the Secondary level of failures to tolerate. The failure tolerance method used for the Primary level of failures to tolerate (PFTT) defaults to RAID 1.
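
The value ranges and the dependency between the policy rules above can be captured in a short validation sketch. The `StretchedClusterPolicy` class and its field names are hypothetical and only mirror the defaults and limits stated in this section. This is not how Virtual SAN stores or evaluates storage policies.

```python
from dataclasses import dataclass

# Allowed settings for the Affinity rule, as listed above.
AFFINITY_VALUES = ("None", "Preferred", "Secondary")


@dataclass
class StretchedClusterPolicy:
    """Hypothetical container for the three rules, with documented defaults."""
    pftt: int = 1            # Primary level of failures to tolerate
    sftt: int = 0            # Secondary level of failures to tolerate
    affinity: str = "None"   # site restriction, usable only when pftt == 0

    def validate(self) -> None:
        """Raise ValueError if the combination breaks a documented limit."""
        if not 0 <= self.pftt <= 3:
            raise ValueError("PFTT must be between 0 and 3")
        if not 0 <= self.sftt <= 3:
            raise ValueError("Secondary level of failures to tolerate must be between 0 and 3")
        if self.affinity not in AFFINITY_VALUES:
            raise ValueError("Affinity must be None, Preferred, or Secondary")
        if self.affinity != "None" and self.pftt != 0:
            raise ValueError("Affinity can be set only when PFTT is 0")


if __name__ == "__main__":
    # Keep objects on the preferred site only, with one local failure tolerated.
    policy = StretchedClusterPolicy(pftt=0, sftt=1, affinity="Preferred")
    policy.validate()
    print(policy)
```

For example, a policy with PFTT set to 1 and Affinity set to Preferred fails validation, because the Affinity rule is available only when PFTT is 0.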

In a stretched cluster with local fault protection, even when one site is unavailable, the cluster can perform repairs on missing or broken components in the available site.