Recover from a VMware Cloud Director Appliance Primary Cell Failure in a High Availability Cluster

If the primary cell is not running properly, you recover the VMware Cloud Director database by making one of the standby cells the new primary cell and deploying a new standby. Depending on the failover mode, the VMware Cloud Director appliance either promotes a standby cell to primary automatically or you must promote one manually.

Depending on the failover mode of the VMware Cloud Director appliance, there are two different workflows for recovering from a primary cell failure. You can use these workflows to reuse the IP addresses and hostname of the failed primary when you deploy the new standby.

Recovery Workflow for Manual Failover Mode

If the primary cell is in the Not reachable or Failed state and the two standby cells are in the Running state, you can recover from the failure by using the appliance HTML5 user interface and the VMware Cloud Director appliance API.

To view the state of the cells in the cluster, see View Your VMware Cloud Director Appliance Cluster Health and Failover Mode.
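The precondition for this workflow can be written down as a small check. The following shell sketch is only an illustration: the `name|role|status` input format and the function name `manual_recovery_applicable` are hypothetical, the lines are assumed to be transcribed by hand from the appliance management UI, and nothing here queries the appliance itself.

```shell
# Hypothetical input on stdin: one line per cell, "name|role|status",
# copied by hand from the appliance management UI. This is not the
# output format of any VMware tool.
#
# Returns 0 only when the primary is Failed or Not reachable and
# exactly two standby cells are Running.
manual_recovery_applicable() {
  local name role status primary_down=0 standbys_running=0
  while IFS='|' read -r name role status || [ -n "$name" ]; do
    case "$role:$status" in
      "primary:Failed"|"primary:Not reachable") primary_down=1 ;;
      "standby:Running") standbys_running=$((standbys_running + 1)) ;;
    esac
  done
  [ "$primary_down" -eq 1 ] && [ "$standbys_running" -eq 2 ]
}

# Example use with hand-entered states:
printf '%s\n' 'cell-a|primary|Not reachable' 'cell-b|standby|Running' 'cell-c|standby|Running' \
  | manual_recovery_applicable && echo "manual recovery workflow applies"
```

If the check fails, for example because only one standby is Running, the cluster is not in the state this workflow assumes.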

If possible, shut down the VMware Cloud Director process by using the cell management tool. From the failed primary cell, run the following command:
```
/opt/vmware/vcloud-director/bin/cell-management-tool -u <sysadmin user> cell --shutdown
```
Power off the failed primary VM.
Promote a standby cell to become the new primary.
1. Log in as root to the appliance management UI of a running standby cell, https://standby_ip_address:5480.
2. In the Role column for the standby cell that you want to become the new primary cell, click Promote.
The management UI shows two cells with the primary role. The original primary has a failed status and the new primary has a running status. The cluster health is Degraded.
From any cell other than the failed primary, use the appliance API Unregister method to remove the failed primary appliance from the repmgr high availability cluster. See the VMware Cloud Director Appliance API documentation.
Remove the failed primary appliance from the VMware Cloud Director server group.
1. Log in as an administrator to the Service Provider Admin Portal.
2. From the top navigation bar, under Resources, select Cloud Resources.
3. In the left panel, click Cloud Cells.
4. Select the inactive cell and click Unregister.
If you want to reuse the IP address and hostname of the failed primary, ensure that the failed primary appliance remains powered off or use the vSphere Client to delete it.
Deploy a new standby appliance. You can deploy the appliance by using either the vSphere Client or the VMware OVF Tool.
After the new standby is deployed, confirm that the cluster health is Healthy.
If the VMware Cloud Director appliance FIPS mode was on before the restore, you must set it again by using the VMware Cloud Director appliance API.
The cell FIPS mode restores automatically.
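The "if possible" qualifier on the shutdown step at the start of this workflow can be made explicit in a small wrapper. This is a minimal sketch, assuming it runs on the failed primary cell: the function name `shutdown_cell_if_possible` and its optional second argument are hypothetical, and only the tool path and the `-u ... cell --shutdown` arguments come from the command shown earlier.

```shell
# Path of the cell management tool, as given in the procedure.
CMT=/opt/vmware/vcloud-director/bin/cell-management-tool

# Best-effort shutdown of the VMware Cloud Director process.
#   $1: system administrator user name
#   $2: optional override for the tool path (hypothetical, for testing)
# If the tool cannot be found, the step is skipped rather than treated
# as an error, because the procedure only asks for it "if possible".
shutdown_cell_if_possible() {
  local admin_user="$1" tool="${2:-$CMT}"
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "cell-management-tool not found, skipping graceful shutdown" >&2
    return 0
  fi
  "$tool" -u "$admin_user" cell --shutdown
}
```

A call such as `shutdown_cell_if_possible administrator` either runs the documented command or prints a note and continues, after which the failed primary VM is powered off as described.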

Recovery Workflow for Automatic Failover Mode

If the primary is in the Failed state, VMware Cloud Director automatically promotes a standby cell to be the new running primary. The cluster is then in the Degraded state because only one standby cell is running. You can recover from the failure by using the HTML5 user interface and the VMware Cloud Director appliance API.

To view the state of the cells in the cluster, see View Your VMware Cloud Director Appliance Cluster Health and Failover Mode.
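The state that this workflow starts from can also be expressed as a check. The sketch below is illustrative only: the `name|role|status` line format and the function name `automatic_failover_happened` are hypothetical, the input is assumed to be typed in from the appliance management UI, and the function does not contact the appliance.

```shell
# Hypothetical input on stdin: one line per cell, "name|role|status",
# copied by hand from the appliance management UI.
#
# Returns 0 when the cluster matches the post-failover picture described
# above: the old primary is Failed, a new primary is Running, and exactly
# one standby is Running.
automatic_failover_happened() {
  local name role status failed_primary=0 running_primary=0 running_standbys=0
  while IFS='|' read -r name role status || [ -n "$name" ]; do
    case "$role:$status" in
      "primary:Failed")  failed_primary=1 ;;
      "primary:Running") running_primary=1 ;;
      "standby:Running") running_standbys=$((running_standbys + 1)) ;;
    esac
  done
  [ "$failed_primary" -eq 1 ] && [ "$running_primary" -eq 1 ] && [ "$running_standbys" -eq 1 ]
}
```

If the check does not pass, for instance because no cell reports a running primary role, the automatic promotion has not completed and the steps below do not yet apply.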

If possible, shut down the VMware Cloud Director process by using the cell management tool. From the failed primary cell, run the following command:
```
/opt/vmware/vcloud-director/bin/cell-management-tool -u <sysadmin user> cell --shutdown
```
Power off the failed primary VM.
The management UI shows two cells with the primary role. The original primary has a failed status and the new primary has a running status. The cluster health is Degraded.
From any cell other than the failed primary, use the appliance API Unregister method to remove the failed primary appliance from the repmgr high availability cluster. See the VMware Cloud Director Appliance API documentation.
Remove the failed primary appliance from the VMware Cloud Director server group.
1. Log in as an administrator to the Service Provider Admin Portal.
2. From the top navigation bar, under Resources, select Cloud Resources.
3. In the left panel, click Cloud Cells.
4. Select the inactive cell and click Unregister.
If you want to reuse the IP address and hostname of the failed primary, ensure that the failed primary appliance is powered off or use the vSphere Client to delete it.
Deploy a new standby appliance. You can deploy the appliance by using either the vSphere Client or the VMware OVF Tool. After the new standby is deployed, confirm that the cluster health is Healthy.
From any cell other than the failed primary cell, use the appliance API Failover method to reset the cluster failover mode to Automatic. See the VMware Cloud Director Appliance API documentation.
If the VMware Cloud Director appliance FIPS mode was on before the restore, you must set it again by using the VMware Cloud Director appliance API.
The cell FIPS mode restores automatically.
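Both the Unregister and the Failover calls in this workflow are issued from a cell other than the failed primary. As a rough sketch of how to pick such a cell, the function below lists the management addresses of all cells that report Running. The `address|role|status` input format and the name `api_candidate_cells` are hypothetical. Port 5480 is the one given above for the appliance management UI; that the appliance API is reached at the same address is an assumption to confirm in the VMware Cloud Director Appliance API documentation.

```shell
# Hypothetical input on stdin: one line per cell, "address|role|status",
# copied by hand from the appliance management UI.
#
# Prints https://<address>:5480 for every cell whose status is Running,
# which excludes the failed primary.
api_candidate_cells() {
  local address role status
  while IFS='|' read -r address role status || [ -n "$address" ]; do
    if [ "$status" = "Running" ]; then
      echo "https://$address:5480"
    fi
  done
  return 0
}

# Example use with hand-entered states:
printf '%s\n' '10.0.0.1|primary|Failed' '10.0.0.2|primary|Running' '10.0.0.3|standby|Running' \
  | api_candidate_cells
# prints:
# https://10.0.0.2:5480
# https://10.0.0.3:5480
```

Any of the printed cells can serve as the starting point for the API calls. The method paths and authentication details are not shown here and should be taken from the API documentation.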