This page describes how to set up a disaster recovery configuration for VMware Tanzu for Postgres on Kubernetes.

Overview

The VMware Tanzu for Postgres on Kubernetes operator allows you to create a Disaster Recovery scenario, where an instance on a primary site can fail over to an instance on a target (Disaster Recovery) site. VMware Postgres Operator Disaster Recovery relies on backups and WAL file shipping from the primary site to a remote storage location.

The disaster recovery workflow can be used across namespaces in the same cluster, or across remote clusters, as long as the two instances have matching yaml configurations and can both access the backup location.

The VMware Postgres Operator instance in the primary Kubernetes cluster is referred to as the "source" instance, and the instance deployed in the Disaster Recovery Kubernetes cluster is referred to as the "target" instance.

Note: A target instance involved in a Disaster Recovery scenario cannot be part of an HA setup while being a Disaster Recovery target.

Prerequisites

Before setting up a Disaster Recovery scenario, ensure that you meet the following prerequisites:

  • You have upgraded the primary and target clusters to VMware Postgres Operator 3.0.0.
  • You have familiarity with the VMware Postgres Operator backup and restore processes and the backup and restore custom resource definitions (CRDs). For details, review Backing Up and Restoring VMware Postgres Operator.
  • You have configured an S3 or Azure data store for the backups.
  • You have created and applied a matching PostgresBackupLocation object at the primary and target sites, which represents the backup target for the primary site and the restore source for the target site. For details on creating a PostgresBackupLocation object, see Configure the Backup Location.
  • You have configured the correct namespace in the CRD.
  • The selected source instance has a backupLocation field that refers to the PostgresBackupLocation you want to use for the Disaster Recovery scenario. For details, see Create the instance CRD.
  • If your target instance is already created, ensure that its instance yaml matches that of the source instance.
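
To confirm which backup location the source instance references, you can inspect the instance spec. The following is a minimal sketch; it assumes the backup location reference is exposed at spec.backupLocation.name, so adjust the jsonpath if your CRD version differs:

  kubectl get postgres <source-instance> -o jsonpath='{.spec.backupLocation.name}'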

Set up continuous restore

Create a continuous restore flow by utilizing the backups on the primary site.

  1. On the Disaster Recovery Kubernetes cluster, edit the target instance yaml and set the deploymentOptions fields accordingly:

    highAvailability:
      enabled: false
      readReplicas: 0
    deploymentOptions:
      continuousRestoreTarget: true
      sourceStanzaName: <source-instance-stanza-name>
    

    Where <source-instance-stanza-name> is the name of the backup stanza created for the source instance. To get the stanza name of the source instance, run:

    kubectl get postgres <source-instance> -o jsonpath={.status.stanzaName}
    

    For example:

    kubectl get postgres source-postgres-example -o jsonpath={.status.stanzaName}
    

    This will have an output similar to:

    default-source-postgres-example-5eaa4601-e903-467b-9833-80055e95d819
    

    VMware Postgres Operator 3.0.0 introduces a feature called streaming replication. As described on the official PostgreSQL site, file-based shipping is asynchronous: a WAL file is not shipped until it reaches 16 MB by default, so there is a window for data loss if the source instance suffers a catastrophic failure. Streaming replication allows the target instance to stay more up to date than is possible with file-based shipping. The target instance connects to the source instance, which streams WAL records to the target as they are generated, without waiting for the WAL file to be filled. To enable it, set streamingOptions under the deploymentOptions field:

    highAvailability:
      enabled: false
      readReplicas: 0
    deploymentOptions:
      continuousRestoreTarget: true
      sourceStanzaName: <source-instance-stanza-name>
      streamingOptions:
        streamingFromPrimaryHost: true
        primaryHost: <source-instance-host-address>
        port: <postgres-port>
        certificate:
          name: <source-instance-replication-certificate-secret>
    

    Where <source-instance-host-address> is the address of the source instance. Normally, it’s the IP of the read-write service of the source instance.

    For example:

    kubectl get service source-postgres-example
    

    This will have an output similar to:

    NAME                      TYPE           CLUSTER-IP        EXTERNAL-IP     PORT(S)          AGE
    source-postgres-example   LoadBalancer   10.43.14.98       192.168.205.2   5432:31209/TCP   6h43m
    

    In this example, <source-instance-host-address> should be 192.168.205.2 and <postgres-port> should be 5432. If <postgres-port> is not specified, it defaults to 5432.

    Note

    31209 is the node port. Do not use it unless <source-instance-host-address> is set to the node IP.
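
    If you only need the external address of the read-write service, a jsonpath query can extract it directly. This is a minimal sketch that assumes the service is of type LoadBalancer with an IP-based ingress, as in the example above:

    kubectl get service source-postgres-example -o jsonpath='{.status.loadBalancer.ingress[0].ip}'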

    <source-instance-replication-certificate-secret> is the name of the replication certificate secret of the source instance. If the source instance uses generated certificates, that is, spec.certificateSecretName of the source instance is not specified, the secret is named <source-instance-name>-replication-ssl-secret.

    For example:

    kubectl get secret -l postgres-instance=source-postgres-example
    

    This will have an output similar to:

    NAME                                                 TYPE                           DATA   AGE
    source-postgres-example-db-secret                    Opaque                         5      24s
    source-postgres-example-read-only-user-db-secret     servicebinding.io/postgresql   8      24s
    source-postgres-example-read-write-user-db-secret    servicebinding.io/postgresql   8      24s
    source-postgres-example-app-user-db-secret           servicebinding.io/postgresql   8      24s
    source-postgres-example-pgbackrest-secret            Opaque                         3      24s
    source-postgres-example-pgbackrest-config-secret     Opaque                         3      24s
    source-postgres-example-monitor-secret               Opaque                         4      24s
    source-postgres-example-empty-secret                 Opaque                         0      24s
    source-postgres-example-metrics-secret               Opaque                         4      24s
    source-postgres-example-additional-db-creds          Opaque                         1      24s
    source-postgres-example-internal-ssl-secret          kubernetes.io/tls              3      24s
    source-postgres-example-metrics-tls-secret           kubernetes.io/tls              3      24s
    source-postgres-example-replication-ssl-secret       kubernetes.io/tls              3      23s
    

    In this example, <source-instance-replication-certificate-secret> is source-postgres-example-replication-ssl-secret.

    The following are some important considerations:

    • If the source instance uses generated certificates, that is, spec.certificateSecretName of the source instance is not specified, this secret must be copied manually to the namespace and Kubernetes cluster of the target instance. For example, on the source instance:

      kubectl get secret source-postgres-example-replication-ssl-secret -o yaml > /tmp/replication-certificate.yaml
      

      Copy /tmp/replication-certificate.yaml to the Kubernetes cluster of the target instance, modify the file to change the namespace to the target instance namespace, and then run the following in the cluster and namespace of the target instance:

      kubectl apply -f /tmp/replication-certificate.yaml
      

      The certificate is valid for 90 days and is renewed 30 days before it expires, that is, 60 days after it is issued. This means the certificate secret for the target instance must be updated after the certificate is renewed but before it expires. To get the expiration date, run the following command against the source instance:

      kubectl get certificate source-postgres-example-replication-ssl-certificate -o jsonpath="{.status.notAfter}"
      

      This will have an output similar to:

      2024-09-23T08:52:01Z
      

      To get the renewal date, run the following command against the source instance:

      kubectl get certificate source-postgres-example-replication-ssl-certificate -o jsonpath="{.status.renewalTime}"
      
      

      This will have an output similar to:

      2024-08-24T08:52:01Z
      

      This means the secret must be copied again to the namespace and Kubernetes cluster of the target instance between 2024-08-24T08:52:01 and 2024-09-23T08:52:01. After the copy, check the new renewal and expiration dates and perform the next secret copy between those dates.
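
      To print the renewal and expiration dates together, you can combine the two jsonpath expressions into one query; a minimal sketch:

      kubectl get certificate source-postgres-example-replication-ssl-certificate \
        -o jsonpath='{.status.renewalTime}{"\n"}{.status.notAfter}{"\n"}'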

    • If the source instance uses a custom Certificate Authority (CA) certificate, that is, spec.certificateSecretName of the source instance is specified, then you must generate the replication certificate secret manually, with the common name streaming_replication and the same CA as the one used for spec.certificateSecretName of the source instance, as shown in the sketch below.
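
      The following is a minimal sketch that generates such a certificate with openssl and stores it as a Kubernetes TLS secret. The file names, key size, validity period, secret name, and the ca.crt/tls.crt/tls.key key names are assumptions; adjust them to match your CA and the key layout of your existing certificate secrets.

      # Assumed inputs: ca.crt and ca.key belonging to the CA behind spec.certificateSecretName
      openssl genrsa -out replication.key 2048
      openssl req -new -key replication.key -subj "/CN=streaming_replication" -out replication.csr
      openssl x509 -req -in replication.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
        -days 365 -out replication.crt

      # Create the secret in the namespace of the target instance and reference its name
      # under spec.deploymentOptions.streamingOptions.certificate.name
      kubectl create secret generic <target-replication-certificate-secret> \
        --type=kubernetes.io/tls \
        --from-file=ca.crt=ca.crt \
        --from-file=tls.crt=replication.crt \
        --from-file=tls.key=replication.key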

    • A source instance supports multiple target instances with streaming replication enabled, but the GUC max_wal_senders (the operator default is 12) may need to be increased if many streaming targets are configured. Set this parameter at least slightly higher than the maximum number of target instances. To change the value of max_wal_senders, see Customizing the PostgreSQL Server. For example, the configmap could be the following:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: my-postgresql-configmap
        labels:
          app: postgres
      data:
        max_wal_senders: "30"
      

      And the customConfig field of the instance’s yaml should be modified accordingly:

      ......
      spec:
        customConfig:
          postgresql:
            name: my-postgresql-configmap
      ......
      
  2. Deploy the target instance using kubectl apply -f <target-instance>.yaml.

    Note the following for the target instance:

    • The Postgres server of a target instance won't be started until at least one backup has been performed on the source instance. After the target instance's Postgres server initializes, it is available for read-only queries until it is promoted to act as the primary.
    • An instance created as a continuousRestoreTarget cannot create backups as described in Backing up and restoring VMware Postgres Operator.
    • The target instance cannot have HA enabled. If the highAvailability.enabled field is set to true when you apply the yaml with continuousRestoreTarget: true, you receive an error similar to:
      "spec.highAvailability.enabled must be false when spec.deploymentOptions.continuousRestoreTarget is set to 'true'"
      
  3. To verify that the continuous restore is working properly on the target instance, check the status.lastRestoreTime field of the target instance by running this command on the target Kubernetes cluster:

    kubectl -n <namespace> get postgres <target-instance> -o jsonpath='{.status.lastRestoreTime}'
    

    This value should match the last transaction in the most recent WAL file that exists on the remote storage.
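
    To see which WAL file was most recently archived from the source, and when, you can query the pg_stat_archiver view on the source instance. This is a minimal sketch that assumes the source data pod is named <source-instance>-0 and that its default container runs Postgres:

    kubectl exec -it <source-instance>-0 -- psql -c "SELECT last_archived_wal, last_archived_time FROM pg_stat_archiver;"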

  4. If streamingFromPrimaryHost is enabled, to verify that the streaming replication is working properly on the target instance, you can refer to the status.streamingStatus field in the target instance by running this command on the target Kubernetes cluster:

    kubectl -n <namespace> get postgres <target-instance> -o jsonpath='{.status.streamingStatus}'
    

    The possible values are:

    • Inapplicable: When streamingFromPrimaryHost is not enabled.
    • Unknown: When failing to get streaming status.
    • Working: When streaming replication is working.
    • Pending: When streaming replication is pending.

    After the target instance is created, it downloads a backup for the restore and then downloads WAL files to replay the logs; during this time, streaming replication is pending. This means that after the target instance is created, the status.streamingStatus field is Pending at first and turns to Working within a few minutes. The status also remains Pending if an incorrect configuration or network problems cause the streaming replication to fail.
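
    You can also cross-check from the source side: active WAL senders are listed in the pg_stat_replication view. This is a minimal sketch that assumes the source data pod is named <source-instance>-0 and that its default container runs Postgres:

    kubectl exec -it <source-instance>-0 -- psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"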

Failover to the Disaster Recovery Site

Failover to the Disaster Recovery site occurs if the primary site goes down, if you want to test the disaster recovery flow, or if you wish to perform maintenance on the primary site. In a failover scenario, promote the target instance on the Disaster Recovery Kubernetes cluster so that it starts serving traffic by following these steps:

  1. Make sure that there is no application traffic against the source instance on the primary site.

  2. If the source instance is still available, then invoke the SQL command select pg_switch_wal() on the source instance to ensure the last WAL file gets archived.
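
    For example, the following is a minimal sketch that assumes the source data pod is named <source-instance>-0 and that its default container runs Postgres:

    kubectl exec -it <source-instance>-0 -- psql -c "SELECT pg_switch_wal();"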

  3. If the source instance is still available, verify that WAL files of the source instance have been restored on the target instance by checking the value of status.lastRestoreTime in the target instance. It should be equal to the last time a transaction was performed on the source instance.
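
    For example, run the following on the target Kubernetes cluster:

    kubectl -n <namespace> get postgres <target-instance> -o jsonpath='{.status.lastRestoreTime}'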

  4. Promote the target instance by setting the spec.deploymentOptions.continuousRestoreTarget and spec.deploymentOptions.streamingOptions.streamingFromPrimaryHost fields to false and applying the change. This update restarts the instance and initializes the Postgres server, which then becomes ready to accept read/write requests.
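
    If you prefer not to edit the yaml by hand, the same change can be applied with a merge patch. This is a minimal sketch for an instance that has streaming replication enabled; drop the streamingOptions portion if it is not set:

    kubectl patch postgres <target-instance> --type merge \
      -p '{"spec":{"deploymentOptions":{"continuousRestoreTarget":false,"streamingOptions":{"streamingFromPrimaryHost":false}}}}'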

    Wait until the target instance is in a Running state by running this command:

    kubectl wait postgres <target-instance> --for=jsonpath={.status.currentState}=Running --timeout=12m
    

    An output similar to the following is shown when the instance is ready to accept read-write connections:

    postgres.sql.tanzu.vmware.com/postgres-sample condition met
    
  5. Take an on-demand backup on the target instance. For more information, see Perform an On-Demand Backup.

  6. On the primary site, edit the source instance YAML and set it as a continuous restore target of the target instance:

    highAvailability:
      enabled: false
      readReplicas: 0
    deploymentOptions:
      continuousRestoreTarget: true
      sourceStanzaName: <target-instance-stanza-name>
    

    Get the value of <target-instance-stanza-name> by running this command on the target instance:

    kubectl get postgres <target-instance> -o jsonpath={.status.stanzaName}
    

    For example:

    kubectl get postgres target-postgres-example -o jsonpath={.status.stanzaName}
    

    This shows an output similar to:

    default-target-postgres-example-7d4e2f84-f521-43c2-b3c9-73c3fde3dc8e
    

    If needed, streaming replication can be set in this step.

    highAvailability:
      enabled: false
      readReplicas: 0
    deploymentOptions:
      continuousRestoreTarget: true
      sourceStanzaName: <target-instance-stanza-name>
      streamingOptions:
        streamingFromPrimaryHost: true
        primaryHost: <target-instance-host-address>
        port: <postgres-port>
        certificate:
          name: <target-instance-replication-certificate-secret>
    
  7. If streamingFromPrimaryHost is enabled, verify that the streaming replication is working properly on the source instance by referring to the field status.streamingStatus in the source instance with this command on the source Kubernetes cluster:

    kubectl -n <namespace> get postgres <source-instance> -o jsonpath='{.status.streamingStatus}'
    
  8. (Optional) After the target instance is up and running, you can enable highAvailability mode if necessary. Verify that the target instance is running as expected:

    kubectl wait postgres <target-instance> --for=jsonpath={.status.currentState}=Running --timeout=10m
    

    An output similar to the following is shown when the instance is running:

    postgres.sql.tanzu.vmware.com/postgres-sample condition met
    

    Then enable high availability in the target YAML:

    highAvailability:
      enabled: true
      readReplicas: 1
    

    Apply the changes.

  9. You can now reroute any application traffic from the source instance to the target instance.

Fail Back to the Primary Site

Fail back to the primary site after the primary is back up, or after the maintenance is completed.

  1. Make sure that there is no application traffic against the target instance.

  2. Confirm that the source instance at the primary site is in continuous restore.

    Run the following command:

    kubectl get postgres <source-instance> -o jsonpath="{.spec.deploymentOptions.continuousRestoreTarget}"
    

    The output should match the following:

    true
    
    Note

    Any value other than true is invalid and the source instance must be updated to reflect the correct value.
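
    If the value is not true, you can reapply the configuration described in the failover procedure, or patch the fields directly; a minimal sketch:

    kubectl patch postgres <source-instance> --type merge \
      -p '{"spec":{"deploymentOptions":{"continuousRestoreTarget":true,"sourceStanzaName":"<target-instance-stanza-name>"}}}'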

  3. Invoke the SQL command select pg_switch_wal() on the target instance to ensure the last WAL file gets archived.

  4. Verify that WAL files of the target instance have been restored on the source instance by checking the value of status.lastRestoreTime in the source instance. It should be equal to the last time a transaction was performed on the target instance.
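
    For example, run the following on the primary Kubernetes cluster:

    kubectl -n <namespace> get postgres <source-instance> -o jsonpath='{.status.lastRestoreTime}'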

  5. On the source instance at the primary site, update the spec.deploymentOptions.continuousRestoreTarget and spec.deploymentOptions.streamingOptions.streamingFromPrimaryHost fields to false. This restarts the instance and brings the Postgres server back up.

    Wait until the source instance is in a Running state by running this command:

    kubectl wait postgres <source-instance> --for=jsonpath={.status.currentState}=Running --timeout=12m
    

    An output similar to the following is shown when the instance is ready to accept read-write connections:

    postgres.sql.tanzu.vmware.com/postgres-sample condition met
    
  6. Take an on-demand backup on the source instance. For more information, see Perform an On-Demand Backup.

  7. Update the target instance YAML to the original configuration, which reverts the instance back to being a continuous restore target:

    highAvailability:
      enabled: false
      readReplicas: 0
    deploymentOptions:
      continuousRestoreTarget: true
      sourceStanzaName: <source-instance-stanza-name>
    

    If needed, streaming replication can be set in this step.

    highAvailability:
      enabled: false
      readReplicas: 0
    deploymentOptions:
      continuousRestoreTarget: true
      sourceStanzaName: <source-instance-stanza-name>
      streamingOptions:
        streamingFromPrimaryHost: true
        primaryHost: <source-instance-host-address>
        port: <postgres-port>
        certificate:
          name: <source-instance-replication-certificate-secret>
    
  8. (Optional) After the instance is up and running, you can enable highAvailability mode if necessary. You can do this by updating the instance spec:

    highAvailability:
      enabled: true
      readReplicas: 1
    

    Apply the changes.

  9. You can now reroute your application traffic back to the source instance.
