Use the following best practices for Continuous Availability (CA).
Understand what Continuous Availability (CA) does and does not provide before enabling or disabling it
Like HA, enabling CA requires double the resources: data is stored redundantly across node pairs rather than on a single node as it is when CA is disabled. Because each piece of data is stored on two nodes, the usable capacity of the cluster is reduced by 50%. The sketch after the sizing note below illustrates the arithmetic.
Review the vRealize Operations Sizing Guidelines for more information.
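A minimal sketch of the capacity trade-off. The per-node object capacity below is a hypothetical placeholder; take the real figures for your node size from the Sizing Guidelines.

```python
# Capacity arithmetic for CA. OBJECTS_PER_NODE is a hypothetical
# placeholder; use the figures for your node size from the
# vRealize Operations Sizing Guidelines.

ANALYTICS_NODES = 8          # total analytics nodes in the cluster
OBJECTS_PER_NODE = 10_000    # hypothetical per-node capacity

# Without CA, every node contributes its full capacity.
capacity_without_ca = ANALYTICS_NODES * OBJECTS_PER_NODE

# With CA, nodes form pairs and each object is stored on both nodes
# of its pair, so usable capacity is halved.
capacity_with_ca = (ANALYTICS_NODES // 2) * OBJECTS_PER_NODE

print(f"Without CA: {capacity_without_ca:,} objects")  # 80,000
print(f"With CA:    {capacity_with_ca:,} objects")     # 40,000
```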
Deploy the witness node prior to enabling CA
The witness node must be deployed and added to the cluster in order to enable CA.
Deploy the witness node in a separate datacenter
The witness node serves as a tiebreaker when the network connection between the two fault domains is lost and a decision must be made about the availability of vRealize Operations. Keeping the witness node in a separate datacenter ensures that the cluster remains available if either fault domain's datacenter is lost.
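The following is a conceptual sketch of tiebreaking, not the product's actual arbitration logic: when the inter-domain link is down, reachability to the witness decides which fault domain keeps running. Which side wins a true split is an assumption here.

```python
# Conceptual tiebreaker sketch; not the actual vRealize Operations
# arbitration logic. FD1 and FD2 are the two fault domains.

def surviving_domain(fd1_reaches_fd2: bool,
                     fd1_reaches_witness: bool,
                     fd2_reaches_witness: bool,
                     preferred: str = "FD1") -> str:
    if fd1_reaches_fd2:
        return "both"        # no partition, nothing to arbitrate
    if fd1_reaches_witness and fd2_reaches_witness:
        return preferred     # split between domains: witness breaks the tie
    if fd1_reaches_witness:
        return "FD1"         # FD2 is isolated
    if fd2_reaches_witness:
        return "FD2"         # FD1 is isolated
    return "none"            # everything isolated; cluster goes offline

# Inter-domain link lost, both domains still reach the witness:
print(surviving_domain(False, True, True))   # tiebreaker picks one side
# One datacenter lost entirely:
print(surviving_domain(False, False, True))  # FD2 survives
```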
Ensure that the witness node has a reliable connection to both fault domains
The latency between the witness node and each fault domain must be no worse than the latency between the two fault domains, and it must be equal for both fault domains.
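A rough spot-check of this from the witness node, using TCP connect time as a latency proxy. The hostnames are hypothetical placeholders; for real validation, use your standard network monitoring tools.

```python
# Rough latency spot-check from the witness node using TCP connect
# time. Hostnames are hypothetical placeholders; substitute your own.

import socket
import time

FAULT_DOMAIN_NODES = {
    "fault-domain-1": "vrops-fd1-node1.example.com",  # hypothetical
    "fault-domain-2": "vrops-fd2-node1.example.com",  # hypothetical
}

def connect_time_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Average TCP connect time in milliseconds over several samples."""
    total = 0.0
    for _ in range(samples):
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=5):
            total += time.monotonic() - start
    return total / samples * 1000

for domain, host in FAULT_DOMAIN_NODES.items():
    print(f"{domain}: {connect_time_ms(host):.1f} ms")
# Both results should be roughly equal, and no worse than the latency
# measured between the fault domains themselves.
```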
Ensure the cluster has an even number of analytics nodes before enabling CA
If the current cluster consists of an odd number of analytics nodes, deploy one additional analytics node and add it to the cluster. The added node must be the same version and size as the existing analytics nodes.
Separate fault domains at the highest object level possible
Separating fault domains at the highest possible object level, preferring datacenters over clusters and clusters over hosts, ensures the highest level of availability during failures.
CA allows the cluster to remain functional after losing one fault domain
Understand and weigh the cost of the extra resources, and the placement of the fault domains, against the benefits that CA provides.
Enable CA only after all nodes in the cluster have been added and are online
Add an even number of data nodes, plus the witness node, to the cluster before enabling CA. On new deployments, add data nodes to build the cluster to the appropriate size, and then enable CA. If you are adding new data nodes to an existing cluster, add as many data nodes as necessary, keeping the total even, and then enable CA. The goal is to minimize the number of times you enable CA; the process can be very disruptive, so enable CA only when necessary. A pre-flight check such as the sketch below can help confirm the preconditions.
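A sketch of such a pre-flight check. The node inventory below is hypothetical and would come from your own records or the product's administration interface.

```python
# Hypothetical pre-flight check before enabling CA. The inventory is a
# stand-in; populate it from your records or the admin interface.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    role: str      # "analytics" or "witness"
    online: bool

nodes = [  # hypothetical inventory
    Node("vrops-node-01", "analytics", True),
    Node("vrops-node-02", "analytics", True),
    Node("vrops-node-03", "analytics", True),
    Node("vrops-node-04", "analytics", True),
    Node("vrops-witness", "witness", True),
]

analytics = [n for n in nodes if n.role == "analytics"]
checks = {
    "even number of analytics nodes": len(analytics) % 2 == 0,
    "witness node present": any(n.role == "witness" for n in nodes),
    "all nodes online": all(n.online for n in nodes),
}

for name, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
if all(checks.values()):
    print("Preconditions met; CA can be enabled.")
```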
Deploy all analytics nodes in the same data center for each fault domain
Within each fault domain, all analytics nodes must reside in the same data center so that latency requirements are consistently met, providing efficient cross-node communication and optimal cluster performance.
Deploy analytics cluster nodes on separate hosts in each fault domain
If possible, establish a 1:1 mapping of nodes to hosts. This minimizes the impact to the fault domain when a single host goes down.
Use anti-affinity rules to keep nodes on separate hosts in the vSphere cluster
Use anti-affinity rules to prevent multiple nodes from being placed on the same host. The goal is to avoid losing multiple nodes when a single host goes down; a sketch of creating such a rule follows.
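A sketch using pyVmomi, the open-source vSphere Python SDK, to create a DRS anti-affinity rule that keeps two analytics node VMs on separate hosts. The vCenter address, credentials, cluster name, and VM names are hypothetical placeholders; adapt them to your environment.

```python
# Sketch: create a DRS anti-affinity rule with pyVmomi. All names and
# credentials below are hypothetical placeholders.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first inventory object of the given type with this name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next(o for o in view.view if o.name == name)
    finally:
        view.Destroy()

ctx = ssl._create_unverified_context()  # lab only; verify certs in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    cluster = find_by_name(content, vim.ClusterComputeResource, "Cluster-FD1")
    vms = [find_by_name(content, vim.VirtualMachine, n)
           for n in ("vrops-node-01", "vrops-node-02")]

    # Anti-affinity rule: DRS keeps these VMs on different hosts.
    rule = vim.cluster.AntiAffinityRuleSpec(
        name="vrops-analytics-separate", enabled=True, mandatory=True, vm=vms)
    rule_spec = vim.cluster.RuleSpec(info=rule, operation="add")
    cluster.ReconfigureComputeResource_Task(
        vim.cluster.ConfigSpecEx(rulesSpec=[rule_spec]), modify=True)
finally:
    Disconnect(si)
```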
Name nodes independent of role
Node roles can change, so naming a node after a specific role can cause confusion. For example, a node named ‘Master’ may no longer be the actual master node after the replica node is promoted. Role-neutral names avoid the user confusion associated with a poor naming convention.
CA is not a substitute for a backup and recovery plan
CA allows the cluster to remain functional without data loss as long as at least one node from each node pair is available, but it is not a backup, so a separate backup and recovery solution must be used. See the vRealize Suite documentation for supported backup utilities and procedures.
CA is not a Disaster Recovery (DR) strategy
CA for vRealize Operations is not a disaster recovery mechanism, so a separate DR solution must be used; see the vRealize Suite documentation. CA allows the cluster to be stretched across two fault domains, with the ability to survive the failure of one fault domain and recover without cluster downtime. The entire cluster cannot recover if multiple node pairs, across fault domains, fail at the same time.
Hosts must use the same storage within each fault domain
For consistent performance, all hosts within a fault domain must use the same storage.