Recover a Non-Operational Controller Cluster

When two of the three NSX Advanced Load Balancer Controller nodes within a cluster are permanently down and not recoverable, the remaining Controller node in the cluster will be marked operationally down due to the lack of a cluster quorum.
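The quorum arithmetic behind this behavior can be sketched as follows. This is a minimal illustration that assumes a simple majority quorum; the function names quorum_size and has_quorum are hypothetical and are not part of the product.

```shell
#!/bin/sh
# Assumption: quorum means a strict majority of the cluster's nodes.

# Print the minimum number of healthy nodes needed for a majority.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed (exit status 0) when the healthy nodes form a majority.
# Arguments: <healthy nodes> <total nodes>
has_quorum() {
    healthy=$1
    total=$2
    [ "$healthy" -ge "$(quorum_size "$total")" ]
}

quorum_size 3
has_quorum 1 3 || echo "1 of 3 nodes: no quorum"
```

Under this assumption, a three-node cluster needs two healthy nodes, so a single surviving node cannot form a quorum on its own, whereas a single-node cluster trivially can.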

Note:

All SEs continue to operate in headless mode.

Follow the steps below to return to a highly available three-node cluster:

To recover the cluster, you must first convert the remaining healthy Controller node to a single-node cluster configuration. Thereafter, two new nodes can be added to the cluster.
A Controller can be recovered in two ways: with its configuration or without it. Recover one node with its configuration so that it becomes the cluster leader, and add the other nodes to the cluster as followers:
- To recover a Controller with configuration, use the /opt/avi/scripts/recover_cluster.py script.
- To recover a Controller without configuration (essentially a factory reset; rarely necessary), use the /opt/avi/scripts/clean_cluster.py script instead. This operation is not reversible, and the Controller takes longer to recreate the database. Note the following about the /opt/avi/scripts/clean_cluster.py script:
  - By default, the script reboots the connected SEs. To prevent the reboot, run the script with the --skip-se-reboot switch:
    /opt/avi/scripts/clean_cluster.py --skip-se-reboot
  - After the script runs, the only way to log in to the Controller node is to reset the admin password through the UI.
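The choice between the two scripts can be summarized in a small helper. This is only a sketch: the function name recovery_command and the mode names with-config and without-config are hypothetical, and the helper prints the command instead of running it. The script paths and the --skip-se-reboot switch are the ones named above.

```shell
#!/bin/sh
# Print (do not run) the recovery command for the chosen mode.
# Arguments: <mode> [yes|no]  (second argument: skip the SE reboot)
# The mode names are hypothetical labels used only in this sketch.
recovery_command() {
    mode=$1
    skip_se_reboot=${2:-no}
    case "$mode" in
        with-config)
            echo "/opt/avi/scripts/recover_cluster.py" ;;
        without-config)
            if [ "$skip_se_reboot" = "yes" ]; then
                echo "/opt/avi/scripts/clean_cluster.py --skip-se-reboot"
            else
                echo "/opt/avi/scripts/clean_cluster.py"
            fi ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1 ;;
    esac
}

recovery_command with-config
recovery_command without-config yes
```

Printing the command first gives the operator a chance to review it, which matters because the reset without configuration cannot be undone.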

Typical Recovery

To convert the remaining Controller node to a single-node cluster while preserving the NSX Advanced Load Balancer configuration, execute the following script from the root account. If you attempt to execute it from a non-root account, the script will fail with a Permission denied message. Run sudo and enter the admin password to be promoted to root before running the script.

root@controller1:/home/admin# /opt/avi/scripts/recover_cluster.py
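To avoid the Permission denied failure, a wrapper can check for root before invoking the script. The following is a minimal sketch; the helper names is_root_uid and run_as_root_only are hypothetical, and the actual invocation is left commented out so that pasting the sketch does not start a recovery.

```shell
#!/bin/sh
# Succeed only when the given numeric user ID is 0 (root).
is_root_uid() {
    [ "$1" -eq 0 ]
}

# Run the given command only when the current user is root.
run_as_root_only() {
    if ! is_root_uid "$(id -u)"; then
        echo "Must be run as root; switch to root with sudo first." >&2
        return 1
    fi
    "$@"
}

# Intended use on the remaining Controller node:
# run_as_root_only /opt/avi/scripts/recover_cluster.py
```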

The script requests confirmation as a precaution and reminds the user that it must be run as root.

It is highly recommended to power off the other Controllers that were part of the cluster when running the recover_cluster.py script. Failure to do so can put the current and other nodes in an inoperable state.

The script stops all services on the Controller and restarts them. The Controller will be down and inaccessible for a few minutes.

Once the script finishes, you can log in to the Controller node, which now runs as a single-node cluster. To make this a highly available three-node cluster, add two new, unconfigured Controller nodes to the cluster.

Note:

Ensure that the Controllers are on the same base and patch version.
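A version check of this kind can be sketched as a plain string comparison. The helper name same_version and the version strings below are hypothetical placeholders; how the version of each node is obtained is outside the scope of this sketch.

```shell
#!/bin/sh
# Succeed when every version string passed as an argument is identical.
same_version() {
    first=$1
    for v in "$@"; do
        [ "$v" = "$first" ] || return 1
    done
    return 0
}

# Placeholder strings only; substitute the version reported by each node.
if same_version "base-A-patch-1" "base-A-patch-1" "base-A-patch-2"; then
    echo "versions match"
else
    echo "version mismatch: align base and patch versions before adding nodes"
fi
```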