Two operations are crucial to keeping your data safe in the event of a disaster: failover and failback. Both require some manual intervention, because they must be performed carefully to ensure the clusters end up in the correct state. The two operations also come into play when carrying out a planned downtime event.

The failover operation makes your recovery cluster the primary production cluster while you perform the work needed to make the original primary cluster healthy again after a disaster, or while a scheduled downtime event is in progress.

The failback operation returns production responsibility to the original primary cluster once the disaster is resolved or a scheduled downtime event is complete.

This topic walks you through the steps for performing the failover and failback operations. We begin by defining two important terms.

Terminology

This topic employs the following terminology:

  • The acting primary cluster is the original recovery cluster after it has been promoted and activated as the primary following the disaster or planned downtime event.

  • The original primary cluster is the cluster that was the primary cluster up until the moment it went down.

Failover Steps

  1. On the original primary cluster, disable WAL archiving with the following two commands to ensure that no more WAL archive files are accidentally pushed to the repository:

    gpconfig -r archive_mode  
    gpconfig -r archive_command 
    
  2. Stop the original primary cluster to ensure that it is no longer pushing data to the repository:

    gpstop -a
    
  3. Promote the recovery cluster:

    gpdr promote
    
  4. Edit the pg_hba.conf files of the coordinator and segments to update the host records of this cluster. Note: This step is only necessary if GPDR provisioned the recovery cluster for you.
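    For reference, a pg_hba.conf host record takes the form host <database> <user> <address> <auth-method>. The values below are placeholders only; adjust them to match the hosts and authentication policy of this cluster:

      host    all    gpadmin    10.0.0.0/24    trust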

  5. Start up the recovery cluster:

    gpstart -a
    
  6. Optional. If the original primary cluster had mirrors and you want them in the acting primary cluster, set them up now.
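    One common way to add mirror segments in Greenplum is the gpaddmirrors utility. This is only a sketch: the configuration file path is a hypothetical placeholder, and you must first prepare a mirror configuration file that matches the hosts and data directories of this cluster:

      gpaddmirrors -a -i /home/gpadmin/mirror_config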

  7. Remove stale configuration files from the cluster:

    gpssh -e "rm -rf /usr/local/gpdr/configs/*"
    
  8. Install any extensions that were present on the original primary cluster.
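    For example, assuming the pgcrypto extension was installed on the original primary (the database and extension names here are placeholders), you might reinstall it with psql:

      psql -d mydb -c 'CREATE EXTENSION IF NOT EXISTS pgcrypto;'

    You can compare against the original primary's extension list if you captured it earlier, for example with SELECT extname FROM pg_extension;.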

  9. Update your production workloads to point to the acting primary cluster.
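    How you repoint workloads depends on your clients. For libpq-based tools, one minimal sketch is to update the standard connection environment variables; the host name below is a hypothetical placeholder for the coordinator of the acting primary cluster:

```shell
# Illustrative only: repoint client environments at the acting primary
# coordinator. Substitute your own coordinator host name and port.
export PGHOST=recovery-cdw.example.com
export PGPORT=5432
echo "clients now target ${PGHOST}:${PGPORT}"
```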

  10. Start backing up data from this cluster, using either a new repository or the same repository you were using before, as described below. The new-repository approach is safer because it reduces the chance of data conflicts.

    Establish New Repository

    1. Configure a new repository path; this ensures that there is no data conflict with the older primary cluster in case it still pushes data after failover. See the Preparation Tasks topic for help configuring the path.

    2. Configure the backup:

      gpdr configure backup --config-file <path-to-config-file> 
      
    3. Resume the backup workflow; at this point, you may only take a full backup:

      gpdr backup -t full
      

      The recovery cluster is now the acting primary cluster.

    Reuse Existing Repository

    With this approach, you continue to use the repository path you established when you set up the original primary cluster. You must ensure that the downed original primary cluster can never accidentally push data to this repository. You may reuse the same configuration file you used when you set up the original primary cluster.

    1. Configure the backup:

      gpdr configure backup --config-file <path-to-config-file> 
      
    2. Resume the backup workflow; at this point you may take either a full backup or an incremental backup.

      Taking an incremental backup may be faster if there are no significant changes between the last full backup and the current state of the cluster. If the divergence is significant, you may be better off taking a new full backup instead.

      gpdr backup -t full
      

      OR

      gpdr backup -t incr
      

      The recovery cluster is now the acting primary cluster.

  11. You may now proceed with the normal Greenplum Disaster Recovery flow, creating restore points and incremental backups.

Failback Steps

  1. Configure the original primary cluster for restore. You may use either an existing cluster, with the --use-existing-cluster option, or ask GPDR to configure a cluster for you, with the --recovery-cluster-config-file option. In the latter case, be sure to use the recovery configuration file specific to the cluster you are running on. Update the contents of this file to correctly configure hosts and data directories for this cluster.

    gpdr configure restore --config-file <path-to-config-file> --recovery-cluster-config-file <path-to-recovery-config-file>
    
  2. Perform an incremental restore, using the latest restore point:

    gpdr restore -t incr --restore-point latest
    
  3. On the acting primary cluster, disable WAL archiving with the following two commands to ensure that no more WAL archive files are accidentally pushed to the repository:

    gpconfig -r archive_mode  
    gpconfig -r archive_command 
    
  4. Stop the acting primary cluster, to ensure that it is no longer pushing data to the repository:

    gpstop -a
    
  5. Promote the original primary cluster:

    gpdr promote
    
  6. Edit the pg_hba.conf files of the coordinator and segments to update the host records of the original primary cluster. Note: This step is only necessary if GPDR provisioned this cluster for you.

  7. Start up the original primary cluster:

    gpstart -a
    
  8. Reconfigure the primary cluster for backup:

    gpdr configure backup --config-file <path-to-config-file> 
    

    Failback is now complete. The original primary cluster is once again the primary cluster, and the acting primary cluster is once again the recovery cluster.

  9. Resume taking backups:

    gpdr backup -t { full | incr }
    