About VMware Aria Operations High Availability

VMware Aria Operations supports high availability (HA). HA creates a replica for the VMware Aria Operations primary node and protects the analytics cluster against the loss of a node.

With HA, all data stored in the primary node is always fully backed up on the replica node. To activate HA, you must have at least one data node deployed in addition to the primary node. If you have more than one data node, the data stored in the primary node can be stored and replicated in any of the other nodes. However, if the primary node fails, only the replica node can take over as the primary node.

HA is not a disaster recovery mechanism. HA protects the analytics cluster against the loss of only one node, and because only one loss is supported, you cannot stretch nodes across vSphere clusters in an attempt to isolate nodes or build failure zones.
When HA is activated, the replica can take over all functions that the primary provides if the primary fails for any reason. Failover to the replica is automatic and requires only two to three minutes of VMware Aria Operations downtime to resume operations and restart data collection.
When a primary node problem causes failover, the replica node becomes the primary node, and the cluster runs in degraded mode. To get out of degraded mode, take one of the following steps.
- Return to HA mode by correcting the problem with the primary node. When a primary node exits an HA-activated cluster, it does not rejoin the cluster without manual intervention. Therefore, restart the VMware Aria Operations Analytics process on the downed node to change its role to replica and rejoin the cluster.
- Remove the failed primary node, and then reactivate HA by converting a data node into a replica. Removed primary nodes cannot be repaired and re-added to VMware Aria Operations.
- Remove the failed primary node, and then change to non-HA operation by deactivating HA. Removed primary nodes cannot be repaired and re-added to VMware Aria Operations.
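The failover and the three recovery options above can be summarized as a small role model. The following Python sketch is purely illustrative: the `Cluster` class, its method names, and the node names are hypothetical and are not part of any VMware Aria Operations API. It tracks only which node holds which role and whether the cluster counts as degraded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cluster:
    """Toy role model of an analytics cluster (hypothetical, not a product API)."""
    primary: str
    replica: Optional[str] = None
    data_nodes: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    ha_activated: bool = True

    @property
    def degraded(self) -> bool:
        # Degraded: HA is activated but no replica currently backs the primary.
        return self.ha_activated and self.replica is None

    def fail_primary(self) -> None:
        """Automatic failover: the replica becomes the new primary."""
        if self.replica is None:
            raise RuntimeError("no replica available to take over")
        self.failed.append(self.primary)
        self.primary, self.replica = self.replica, None

    def rejoin_as_replica(self, node: str) -> None:
        """Option 1: fix the old primary so that it rejoins in the replica role."""
        self.failed.remove(node)  # raises ValueError if node is not a failed node
        self.replica = node

    def remove_and_promote(self, node: str, data_node: str) -> None:
        """Option 2: remove the failed node and convert a data node into the replica."""
        self.failed.remove(node)
        self.data_nodes.remove(data_node)
        self.replica = data_node

    def remove_and_deactivate_ha(self, node: str) -> None:
        """Option 3: remove the failed node and continue without HA."""
        self.failed.remove(node)
        self.ha_activated = False
```

In this model, a cluster leaves degraded mode either by regaining a replica (options 1 and 2) or by no longer expecting one (option 3).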
In the administration interface, after an HA replica node takes over and becomes the new primary node, the previous, offline primary node remains listed as a primary node and cannot be removed from the cluster. To update the display and make removal of the node available, refresh the browser.
When HA is activated, the cluster can survive the loss of one data node without losing any data. However, HA protects against the loss of only one node at a time, of any kind, so simultaneously losing a data node and the primary or replica node, or two or more data nodes, is not supported. VMware Aria Operations HA is intended as additional application-level data protection that ensures application-level availability.
Activating HA lowers VMware Aria Operations capacity and processing by half, because HA creates a redundant copy of data throughout the cluster as well as the replica backup of the primary node. Consider your potential use of HA when planning the number and size of your VMware Aria Operations cluster nodes. See Sizing the VMware Aria Operations Cluster.
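As a rough planning aid, the halving effect can be expressed as simple arithmetic. The sketch below is a minimal illustration, not a sizing tool: the function names and the per-node object figure are assumptions made for the example, and real per-node limits must come from the sizing guidelines.

```python
def effective_capacity(node_count: int, objects_per_node: int, ha_activated: bool) -> int:
    """Approximate object capacity of the cluster.

    objects_per_node is a placeholder figure, not a documented limit.
    With HA activated, capacity is halved, as the text describes.
    """
    total = node_count * objects_per_node
    return total // 2 if ha_activated else total


def nodes_needed(target_objects: int, objects_per_node: int, ha_activated: bool) -> int:
    """Approximate number of nodes needed to monitor target_objects."""
    required = target_objects * 2 if ha_activated else target_objects
    nodes = -(-required // objects_per_node)  # ceiling division
    # HA needs the primary node plus at least one data node.
    minimum = 2 if ha_activated else 1
    return max(nodes, minimum)
```

For example, with an assumed 10,000 objects per node, `nodes_needed(30000, 10000, True)` gives 6 nodes, compared with 3 nodes for the same target without HA.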
When HA is activated, deploy analytics cluster nodes on separate hosts for redundancy and isolation. One option is to use anti-affinity rules that keep the nodes on separate hosts in the vSphere cluster.
If you cannot keep the nodes separate, you should not activate HA. A host fault might cause the loss of more than one node, which is not supported, and all of VMware Aria Operations can become unavailable.

The opposite is also true. Without HA, keeping nodes on the same host makes no difference, because the loss of even one node can make all of VMware Aria Operations unavailable.
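The host-separation guidance can be checked against a node-to-host inventory before HA is activated. The following sketch assumes that you already have such a mapping from some inventory source. The function names, node names, and host names are hypothetical, and the code does not call any vSphere API.

```python
from collections import defaultdict
from typing import Dict, List


def shared_hosts(placement: Dict[str, str]) -> Dict[str, List[str]]:
    """Return each host that runs more than one analytics node, with those nodes.

    placement maps a node name to the host it currently runs on.
    """
    by_host = defaultdict(list)
    for node, host in placement.items():
        by_host[host].append(node)
    return {host: sorted(nodes) for host, nodes in by_host.items() if len(nodes) > 1}


def safe_for_ha(placement: Dict[str, str]) -> bool:
    """True if no single host fault would take down more than one node."""
    return not shared_hosts(placement)
```

A placement such as `{"primary": "esx-01", "replica": "esx-02", "data-1": "esx-02"}` fails the check, because a fault on `esx-02` would remove two nodes at the same time.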
If you power off a data node and change the network settings of its VM, the IP address of the data node can change. After this point, the HA cluster is no longer accessible and all the nodes have a status of "Waiting for analytics". To avoid this problem, verify that the node uses a static IP address.
When you remove a node from an HA-activated cluster, and that node has one or more vCenter adapters configured to collect data, the vCenter adapters associated with that node stop collecting. Before you remove the node, change the adapter configuration to pin the adapters to another node.
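The pre-removal check can be thought of as a lookup over adapter-to-node pinning. The sketch below is a hypothetical model only: the dictionary layout, function names, and adapter names are assumptions for illustration, and the actual pinning change is made in the adapter configuration of the product.

```python
from typing import Dict, List


def adapters_on_node(pinning: Dict[str, str], node: str) -> List[str]:
    """List the adapters currently pinned to the given node."""
    return sorted(adapter for adapter, pinned in pinning.items() if pinned == node)


def repin_before_removal(pinning: Dict[str, str], node: str, target: str) -> Dict[str, str]:
    """Return a new pinning map with every adapter on node moved to target."""
    if target == node:
        raise ValueError("target must differ from the node being removed")
    # Build a new map so that the original pinning stays unchanged.
    return {adapter: (target if pinned == node else pinned)
            for adapter, pinned in pinning.items()}
```

A node is ready for removal in this model when `adapters_on_node` returns an empty list for it.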
The Administration UI shows the resource cache count, which is created for active objects only, whereas the Inventory displays all objects. Therefore, after you remove a node from an HA-activated cluster and allow the vCenter adapters to collect data and the nodes to rebalance, the Inventory displays a different number of objects from the number shown in the Administration UI.