Recover a non-operational cluster where two of the three nodes are permanently down and not recoverable.

When two of the three NSX Advanced Load Balancer Controller nodes within a cluster are permanently down and not recoverable, the remaining Controller node in the cluster will be marked operationally down due to the lack of a cluster quorum.

Note:

All Service Engines continue to operate in “headless” mode, forwarding traffic based on their last-synchronized configuration.

The following steps describe how to return to a highly available three-node cluster:

  1. To recover the cluster, the remaining healthy Controller node must first be converted to a single-node cluster configuration. Two new nodes can then be added to the cluster.

  2. There are two ways of recovering a Controller: with configuration and without configuration. Recover one node with its configuration so that it becomes the Controller leader; the other nodes are then added to the cluster as followers:

    • To recover a Controller with configuration, use the /opt/avi/scripts/recover_cluster.py script.

    • To recover a Controller without configuration (essentially a factory reset; rarely necessary), use the /opt/avi/scripts/clean_cluster.py script instead. This operation is not reversible, and the Controller will take longer to recreate its database. The /opt/avi/scripts/clean_cluster.py script performs the following tasks:

      • By default, this script reboots the connected SEs, unless it is run with the --skip-se-reboot switch: /opt/avi/scripts/clean_cluster.py --skip-se-reboot. The only way to log in to the Controller node after running the script is to reset the admin password through the UI.
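The choice between the two scripts can be sketched as a small shell helper. The helper below is illustrative only and not part of the product; it prints the matching command instead of executing it, since the real scripts exist only on a Controller node.

```shell
# Hypothetical helper (not shipped with the product): print the recovery
# command that matches the desired outcome instead of executing it.
choose_recovery_script() {
  case "$1" in
    with-config)
      # Preserves the NSX Advanced Load Balancer configuration.
      echo "/opt/avi/scripts/recover_cluster.py"
      ;;
    without-config)
      # Factory reset; --skip-se-reboot leaves the connected SEs running.
      echo "/opt/avi/scripts/clean_cluster.py --skip-se-reboot"
      ;;
    *)
      echo "usage: choose_recovery_script with-config|without-config" >&2
      return 1
      ;;
  esac
}

# Review the command before running it on the remaining Controller:
choose_recovery_script with-config
```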

Typical Recovery

To convert the remaining Controller node to a single-node cluster while preserving the NSX Advanced Load Balancer configuration, execute the following script from the root account. If you attempt to run it from a non-root account, the script fails with a Permission denied message. Run sudo and enter the admin password to become root before running the script.

root@controller1:/home/admin# /opt/avi/scripts/recover_cluster.py

The script asks for confirmation as a precaution and reminds the user that it must be run as root.
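The Permission denied failure for non-root users typically comes from an effective-UID check at script startup. A minimal sketch of such a guard, with the UID passed in as a parameter so the logic is visible (an assumption; the shipped script's actual check may differ):

```shell
# Illustrative guard only; the shipped script's implementation may differ.
require_root() {
  # $1: effective UID of the caller (0 = root)
  if [ "$1" -ne 0 ]; then
    echo "Permission denied: run this script as root" >&2
    return 1
  fi
  return 0
}

# A real invocation would pass the caller's own UID:
#   require_root "$(id -u)" || exit 1
```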

It is highly recommended to power off the other Controllers that were part of the cluster when running the recover_cluster.py script. Failure to do so can put the current and other nodes in an inoperable state.

The script stops all services on the Controller and restarts them. The Controller will be down and inaccessible for a few minutes.
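While the services restart, it is convenient to poll until the Controller answers again rather than guessing at the downtime. A generic retry helper like the one below can wrap any probe command; the curl probe in the comment is a hypothetical example (the hostname and API path are assumptions, not taken from this article).

```shell
# Generic retry helper: run a command until it succeeds or the
# attempt budget is exhausted. Returns the probe's final status.
wait_until() {
  attempts=$1
  shift
  while [ "$attempts" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}

# Example probe (hypothetical URL; adjust to your Controller address):
#   wait_until 120 curl -skf https://controller1/login >/dev/null
```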

Once the script finishes, you can log in to the Controller node, which now operates as a single-node cluster. To return to a highly available three-node cluster, add two new, unconfigured Controller nodes to the cluster.