This page describes how to set up a disaster recovery configuration for VMware SQL with Postgres for Kubernetes.
The VMware SQL with Postgres for Kubernetes operator allows you to create a Disaster Recovery scenario, where an instance on a primary site can fail over to an instance on a target (Disaster Recovery) site. VMware Postgres Operator Disaster Recovery relies on backups and WAL shipping from the primary site to a remote storage location.
The disaster recovery workflow can be used across namespaces in the same cluster, or across remote clusters, as long as the two instances have matching yaml configurations and both can access the backup location.
The VMware Postgres Operator instance in the primary Kubernetes cluster is referred to as the "source" instance, and the instance deployed in the Disaster Recovery Kubernetes cluster is referred to as the "target" instance.
Note: A target instance involved in a Disaster Recovery scenario cannot be part of an HA setup while being a Disaster Recovery target.
Before setting up a Disaster Recovery scenario, ensure that you meet the following prerequisite:

- Both the source and the target instance yaml include a `backupLocation` field that refers to the `PostgresBackupLocation` you will use for the Disaster Recovery scenario. For details, see Create the instance CRD.

Create a continuous restore flow by utilizing the backups on the primary site:
On the Disaster Recovery Kubernetes cluster, edit the target instance yaml and set the `deploymentOptions` fields accordingly:

```yaml
highAvailability:
  enabled: false
  readReplicas: 0
deploymentOptions:
  continuousRestoreTarget: true
  sourceStanzaName: <source-instance-stanza-name>
```
where `<source-instance-stanza-name>` is the name of the backup stanza created for the source instance. To get the stanza name of the source instance, run:

```
kubectl get postgres <source-instance> -o jsonpath={.status.stanzaName}
```

For example:

```
kubectl get postgres source-postgres-example -o jsonpath={.status.stanzaName}
```

has an output similar to:

```
default-source-postgres-example-5eaa4601-e903-467b-9833-80055e95d819
```
Deploy the target instance using `kubectl apply -f <target-instance>.yaml`.
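Assembled into a complete manifest, a minimal target instance might look like the following sketch. Only the `highAvailability` and `deploymentOptions` fields come from the steps above; the `apiVersion`, `kind`, metadata name, and `backupLocation` name are illustrative assumptions inferred from the resource group shown in this page's `kubectl` output, so adapt them to your own instance yaml.

```yaml
# Hypothetical target instance manifest; fields outside
# highAvailability/deploymentOptions are assumptions.
apiVersion: sql.tanzu.vmware.com/v1
kind: Postgres
metadata:
  name: target-postgres-example
spec:
  backupLocation:
    name: backuplocation-sample   # the PostgresBackupLocation shared with the source
  highAvailability:
    enabled: false                # must be false while continuousRestoreTarget is true
    readReplicas: 0
  deploymentOptions:
    continuousRestoreTarget: true
    sourceStanzaName: <source-instance-stanza-name>
```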
Note the following for the target instance:

- An instance with `continuousRestoreTarget: true` cannot create backups as described in Backing up and restoring VMware Postgres Operator.
- If the `highAvailability.enabled` field is set to `true` when you apply the yaml with `continuousRestoreTarget: true`, you will receive an error similar to: "spec.highAvailability.enabled must be false when spec.deploymentOptions.continuousRestoreTarget is set to 'true'"
To verify that the continuous restore is working properly on the target instance, check the `status.lastRestoreTime` field by running this command on the target Kubernetes cluster:

```
kubectl -n <namespace> get postgres postgres-sample -o jsonpath='{.status.lastRestoreTime}'
```
This value should match the last transaction in the most recent WAL file that exists on the remote storage.
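If you want to script this check, the helper below is a sketch that flags a stale `lastRestoreTime`. It assumes GNU `date` (for the `-d` flag) and an ISO 8601 timestamp; the `kubectl` line in the comment reuses the placeholder names from the command above.

```shell
#!/bin/sh
# Sketch: succeed only if a lastRestoreTime value is at most max_age seconds old.
# Assumes GNU date and an ISO 8601 timestamp such as 2024-01-01T00:00:00Z.
is_recent() {
  last="$1"
  max_age="$2"
  last_epoch=$(date -u -d "$last" +%s) || return 1
  now_epoch=$(date -u +%s)
  [ $(( now_epoch - last_epoch )) -le "$max_age" ]
}

# Usage against a live cluster (not run here):
#   last=$(kubectl -n <namespace> get postgres postgres-sample -o jsonpath='{.status.lastRestoreTime}')
#   is_recent "$last" 300 && echo "continuous restore is keeping up"
```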
Fail over to the Disaster Recovery site if the primary site goes down, if you want to test the disaster recovery flow, or if you wish to perform maintenance on the primary site. In a failover scenario, promote the target instance on the Disaster Recovery Kubernetes cluster so that it starts serving traffic, by following these steps:
Make sure that there is no application traffic against the source instance on the primary site.
Promote the target instance by setting the `spec.deploymentOptions.continuousRestoreTarget` field to `false` and applying the change. This update restarts the instance and initializes the Postgres server, which is then ready to accept read/write requests.
Wait until the target instance is in a Running state by running this command:

```
kubectl wait postgres <target-instance> --for=jsonpath={.status.currentState}=Running --timeout=12m
```

An output similar to the following is shown when the instance is ready to accept read-write connections:

```
postgres.sql.tanzu.vmware.com/postgres-sample condition met
```
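To double-check that the promoted instance really accepts writes, you can query `pg_is_in_recovery()`, which returns `f` on a server that is out of recovery. The `kubectl exec` line below is a sketch: the data pod name `<target-instance>-0` and the `postgres` user are assumptions, not confirmed by this page.

```shell
#!/bin/sh
# Sketch: interpret the output of "SELECT pg_is_in_recovery()".
# "f" means the server is out of recovery and accepts read-write traffic.
is_promoted() {
  [ "$1" = "f" ]
}

# Usage against a live cluster (pod name and user are assumptions):
#   state=$(kubectl exec <target-instance>-0 -- psql -U postgres -tAc 'SELECT pg_is_in_recovery()')
#   is_promoted "$state" && echo "target is serving read-write traffic"
```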
Take an on-demand backup on the target instance. For details, see Perform an On-Demand Backup.
On the primary site, edit the source instance yaml and set it as a continuous restore target of the target instance:
```yaml
highAvailability:
  enabled: false
  readReplicas: 0
deploymentOptions:
  continuousRestoreTarget: true
  sourceStanzaName: <target-instance-stanza-name>
```
Get the value of `<target-instance-stanza-name>` by running this command against the target instance:

```
kubectl get postgres <target-instance> -o jsonpath={.status.stanzaName}
```

For example:

```
kubectl get postgres target-postgres-example -o jsonpath={.status.stanzaName}
```

shows an output similar to:

```
default-target-postgres-example-7d4e2f84-f521-43c2-b3c9-73c3fde3dc8e
```
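Instead of hand-editing the source yaml, the two commands above can be combined with `kubectl patch`. The sketch below builds a JSON merge patch from the spec fields shown on this page; treat it as an untested convenience rather than a supported workflow.

```shell
#!/bin/sh
# Sketch: build a JSON merge patch that turns an instance into a
# continuous restore target of the given stanza.
build_restore_patch() {
  printf '{"spec":{"deploymentOptions":{"continuousRestoreTarget":true,"sourceStanzaName":"%s"}}}' "$1"
}

# Usage against a live cluster (not run here):
#   stanza=$(kubectl get postgres <target-instance> -o jsonpath={.status.stanzaName})
#   kubectl patch postgres <source-instance> --type merge -p "$(build_restore_patch "$stanza")"
```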
(Optional) After the target instance is up and running, you can enable `highAvailability` mode if necessary. Verify that the target instance is running as expected:

```
kubectl wait postgres <name-of-target-instance> --for=jsonpath={.status.currentState}=Running --timeout=10m
```

```
postgres.sql.tanzu.vmware.com/postgres-sample condition met
```

Then enable high availability in the target yaml:

```yaml
highAvailability:
  enabled: true
  readReplicas: 1
```
Apply the changes.
You can now reroute any application traffic from the source instance to the target instance.
Fail back to the primary site after the primary is back up, or after the maintenance is completed.
Make sure that there is no application traffic against the target instance.
Confirm that the source instance at the primary site is in continuous restore. Run the following command:

```
kubectl get postgres <source-instance> -o jsonpath="{.spec.deploymentOptions.continuousRestoreTarget}"
```

The output should match the following:

```
true
```

Note: Any value other than `true` is invalid, and the source instance must be updated to reflect the correct value.
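If the failback is scripted, the check above can be made into a guard that aborts when the source is no longer a continuous restore target. The `kubectl` line in the comment reuses the command shown above.

```shell
#!/bin/sh
# Sketch: fail unless the value read from the source instance spec is "true".
ensure_restore_target() {
  if [ "$1" = "true" ]; then
    return 0
  fi
  echo "source instance is not a continuous restore target" >&2
  return 1
}

# Usage against a live cluster (not run here):
#   val=$(kubectl get postgres <source-instance> -o jsonpath="{.spec.deploymentOptions.continuousRestoreTarget}")
#   ensure_restore_target "$val" || exit 1
```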
Take an on-demand backup on the target instance. For details, see Perform an On-Demand Backup.
Verify that WAL files shipped during the backup of the target instance have been restored on the source instance by checking the value of `status.lastRestoreTime` in the source instance. It should match the time of the last transaction performed on the target instance.
Update the target instance yaml to the original configuration, which reverts the instance to being a continuous restore target:

```yaml
highAvailability:
  enabled: false
  readReplicas: 0
deploymentOptions:
  continuousRestoreTarget: true
  sourceStanzaName: <source-instance-stanza-name>
```
On the source instance at the primary site, update the field `deploymentOptions.continuousRestoreTarget` to `false`. This restarts the instance and brings the Postgres server back up.
Wait until the source instance is in a Running state by running this command:

```
kubectl wait postgres <source-instance> --for=jsonpath={.status.currentState}=Running --timeout=12m
```

An output similar to the following is shown when the instance is ready to accept read-write connections:

```
postgres.sql.tanzu.vmware.com/postgres-sample condition met
```
(Optional) After the instance is up and running, you can enable `highAvailability` mode if necessary by updating the instance spec:

```yaml
highAvailability:
  enabled: true
  readReplicas: 1
```
Apply the changes.
You can now reroute your application traffic back to the source instance.