This page describes how to set up a disaster recovery configuration for VMware SQL with Postgres for Kubernetes.

Overview

The VMware SQL with Postgres for Kubernetes operator allows you to create a Disaster Recovery configuration, where an instance on a primary site can fail over to an instance on a target (Disaster Recovery) site. VMware Postgres Operator Disaster Recovery relies on synchronized backups that become restore targets for the instance on the Disaster Recovery site.

The disaster recovery workflow can be used across namespaces in the same cluster, or across remote clusters, as long as the two instances have matching yaml configurations and can both access the backup location.

The VMware Postgres Operator instance in the primary Kubernetes cluster is referred to as the "source" instance, and the instance deployed in the Disaster Recovery Kubernetes cluster is referred to as the "target" instance.

Note: A target instance cannot be part of a high availability (HA) setup while it is acting as a Disaster Recovery target.

Prerequisites

Before setting up a Disaster Recovery scenario, ensure that you meet the following prerequisites:

  • The primary and target clusters are running VMware Postgres Operator 1.8.0 or later.
  • You are familiar with the VMware Postgres Operator backup and restore process and the backup and restore custom resource definitions (CRDs). For details, review Backing Up and Restoring VMware Postgres Operator.
  • You have configured an S3 or Azure data store for the backups.
  • You have created and applied a matching PostgresBackupLocation object at both the primary and target sites. This object represents the backup target for the primary site and the restore source for the target site. For details on creating a PostgresBackupLocation object, see Configure the Backup Location. Ensure that you have configured the correct namespace in the CRD.
  • The selected source instance has a backupLocation field that refers to the PostgresBackupLocation you will use for the Disaster Recovery scenario, as shown in the sketch after this list. For details, see Create the instance CRD.
  • If your target instance is already created, ensure that its instance yaml matches the source instance's.
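
For reference, a minimal sketch of a PostgresBackupLocation and the matching backupLocation reference in the instance yaml might look like the following. The names, bucket, region, and credentials secret are placeholders, only a subset of the storage fields is shown, and the apiVersion may vary with your operator version; see Configure the Backup Location for the authoritative schema:

    apiVersion: sql.tanzu.vmware.com/v1     # may vary with your operator version
    kind: PostgresBackupLocation
    metadata:
      name: backuplocation-sample           # placeholder name
      namespace: default                    # must match the instance namespace
    spec:
      storage:
        s3:
          bucket: "dr-backups"              # placeholder bucket
          region: "us-east-1"               # placeholder region
          secret:
            name: backuplocation-creds      # Secret holding the S3 credentials
    ---
    # In the source (and target) instance yaml:
    spec:
      backupLocation:
        name: backuplocation-sample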

Set up continuous restore

Create a continuous restore flow that uses the backups from the primary site.

  1. On the Disaster Recovery Kubernetes cluster, edit the target instance yaml and set the highAvailability and deploymentOptions fields as follows:

    highAvailability:
      enabled: false
    deploymentOptions:
      continuousRestoreTarget: true
      sourceStanzaName: <source-instance-stanza-name>
    

    where <source-instance-stanza-name> is the name of the backup stanza created for the source instance. To get the stanza name of the source instance, run:

    kubectl get postgres <source-instance> -o jsonpath={.status.stanzaName}
    

    For example:

    kubectl get postgres source-postgres-example -o jsonpath={.status.stanzaName}
    

    which returns output similar to:

    default-source-postgres-example-5eaa4601-e903-467b-9833-80055e95d819
    
  2. Deploy the target instance using kubectl apply -f <target-instance>.yaml.

    Note the following for the target instance:

    • The Postgres server of a target instance is not initialized, and does not accept reads or writes. The target instance's Postgres server initializes only after a failover scenario, when the target instance is promoted to act as the primary.
    • An instance created as a continuousRestoreTarget cannot create backups as described in Backing up and restoring VMware Postgres Operator.
    • The target instance cannot have HA enabled. If the highAvailability.enabled field is set to "true", when you apply the yaml with continuousRestoreTarget: true, you will receive an error similar to:
      "spec.highAvailability.enabled must be false when spec.deploymentOptions.continuousRestoreTarget is set to 'true'"
      
  3. Verify that the target instance is running in continuous restore mode by checking the pg-container container logs in the target instance data pod:

    kubectl logs <target-instance-data-pod> -n <namespace> -c pg-container
    

    The log output is similar to:

    This is a target instance set up for continuous restore, postgres server won't be started.
    
  4. On the source cluster, set up scheduled backups for the source instance by creating a PostgresBackupSchedule object. For more details see Create Scheduled Backups.

    After the backups start running successfully, they will be synchronized to the namespace in the target Kubernetes cluster, and will be used for continuous restore on the target instance.
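
    A minimal sketch of such a schedule is shown below; the schedule name and cron expression are placeholders, and the apiVersion may vary with your operator version. See Create Scheduled Backups for the full schema:

    apiVersion: sql.tanzu.vmware.com/v1     # may vary with your operator version
    kind: PostgresBackupSchedule
    metadata:
      name: source-backup-schedule          # placeholder name
    spec:
      backupTemplate:
        spec:
          sourceInstance:
            name: <source-instance>         # the source instance on the primary site
          type: full
      schedule: "0 0 * * *"                 # placeholder: daily at midnight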

  5. Verify that continuous restore is working properly on the target instance by checking the field status.lastRestoreTime. Run the following command on the target Kubernetes cluster:

    kubectl -n <namespace> get postgres <target-instance> -o jsonpath='{.status.lastRestoreTime}'
    

    This value should match the status.timeCompleted of the latest backup.
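
    To compare, you can read the completion time of the most recent backup by querying its PostgresBackup resource on the source cluster, where <backup-name> is a placeholder for the name of that backup:

    kubectl -n <namespace> get postgresbackup <backup-name> -o jsonpath='{.status.timeCompleted}'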

Fail over to the Disaster Recovery site

Fail over to the Disaster Recovery site if the primary site goes down, if you want to test the disaster recovery flow, or if you need to perform maintenance on the primary site. In a failover scenario, promote the target instance on the Disaster Recovery Kubernetes cluster so that it starts serving traffic, by following these steps:

  1. On the primary site, edit the source instance yaml and set it as a continuous restore target of the target instance:

    highAvailability:
      enabled: false
    deploymentOptions:
      continuousRestoreTarget: true
      sourceStanzaName: <target-instance-stanza-name>
    

    Get the value of <target-instance-stanza-name> by running this command on the target Kubernetes cluster:

    kubectl get postgres <target-instance> -o jsonpath={.status.stanzaName}
    

    For example:

    kubectl get postgres target-postgres-example -o jsonpath={.status.stanzaName}
    

    which shows output similar to:

    default-target-postgres-example-7d4e2f84-f521-43c2-b3c9-73c3fde3dc8e
    
  2. Promote the target instance by setting the spec.deploymentOptions.continuousRestoreTarget field to false and applying the change. This update restarts the instance and initializes the Postgres server, after which the server is ready to accept read and write requests.
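
    For example, one way to apply this change without editing the yaml file is a kubectl patch; editing and re-applying the instance yaml works equally well:

    kubectl -n <namespace> patch postgres <target-instance> --type merge -p '{"spec":{"deploymentOptions":{"continuousRestoreTarget":false}}}'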

  3. (Optional) After the instance is up and running, you can enable highAvailability mode if necessary. First verify that the target instance is running as expected:

    kubectl wait postgres <name-of-target-instance> --for=jsonpath={.status.currentState}=Running --timeout=10m
    
    postgres.sql.tanzu.vmware.com/postgres-sample condition met
    

    Then enable high availability in the target yaml:

    highAvailability:
      enabled: true
    

    Apply the changes.

  4. You can now reroute any application traffic from the source instance to the target instance.

Fail back to the primary site

Fail back to the primary site after the primary is back up, or after the maintenance is completed.

  1. Take an on-demand backup on the target instance. For details see Perform an On-Demand Backup.
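
    A minimal sketch of such an on-demand backup, with a placeholder name (the apiVersion may vary with your operator version; see Perform an On-Demand Backup for the full schema):

    apiVersion: sql.tanzu.vmware.com/v1     # may vary with your operator version
    kind: PostgresBackup
    metadata:
      name: failback-backup                 # placeholder name
    spec:
      sourceInstance:
        name: <target-instance>             # the promoted target instance
      type: full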

  2. Update the target instance yaml to the original configuration, which reverts the instance to being a continuous restore target:

    highAvailability:
      enabled: false
    deploymentOptions:
      continuousRestoreTarget: true
      sourceStanzaName: <source-instance-stanza-name>
    
  3. Edit the source instance on the primary site and set spec.deploymentOptions.continuousRestoreTarget to true. Apply the changes.

  4. Verify that the latest backup taken on the target instance has been restored on the source instance by checking the value of status.lastRestoreTime in the source instance. It should equal the status.timeCompleted of the on-demand backup you took in step 1.
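
    For example, the following two commands, run on the primary and target Kubernetes clusters respectively, should return the same timestamp. <backup-name> is a placeholder for the name of the backup created in step 1:

    kubectl -n <namespace> get postgres <source-instance> -o jsonpath='{.status.lastRestoreTime}'
    kubectl -n <namespace> get postgresbackup <backup-name> -o jsonpath='{.status.timeCompleted}'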

  5. After the restore is done, update the field deploymentOptions.continuousRestoreTarget to false on the source instance at the primary site. This restarts the instance and brings the Postgres server back up.

  6. (Optional) After the instance is up and running, you can enable highAvailability mode if necessary. First verify that the source instance is running as expected:

    kubectl wait postgres <name-of-source-instance> --for=jsonpath={.status.currentState}=Running --timeout=10m
    
    postgres.sql.tanzu.vmware.com/postgres-sample condition met
    

    Then enable high availability in the source instance yaml:

    highAvailability:
      enabled: true
    

    Apply the changes.

  7. You can now reroute your application traffic back to the source instance.
