This page describes how to set up a disaster recovery configuration for VMware SQL with Postgres for Kubernetes.
The VMware SQL with Postgres for Kubernetes operator allows you to create a Disaster Recovery scenario, where an instance on a primary site can fail over to an instance on a target (Disaster Recovery) site. VMware Postgres Operator Disaster Recovery relies on backups and WAL shipping from the primary site to a remote storage location.
The disaster recovery workflow can be used across namespaces in the same cluster, or across remote clusters, as long as the two instances have matching yaml configurations and both can access the backup location.
The VMware Postgres Operator instance in the primary Kubernetes cluster is referred to as the "source" instance, and the instance deployed in the Disaster Recovery Kubernetes cluster is referred to as the "target" instance.
Note: A target instance involved in a Disaster Recovery scenario cannot be part of an HA setup while being a Disaster Recovery target.
Before setting up a Disaster Recovery scenario, ensure that you meet the following prerequisite:

- Both the source and the target instance yaml include a `backupLocation` field that refers to the `PostgresBackupLocation` you will use for the Disaster Recovery scenario. For details, see Create the instance CRD.

Create a continuous restore flow by utilizing the backups on the primary site:
On the Disaster Recovery Kubernetes cluster, edit the target instance yaml and set the `deploymentOptions` fields accordingly:

```yaml
highAvailability:
  enabled: false
  readReplicas: 0
deploymentOptions:
  continuousRestoreTarget: true
  sourceStanzaName: <source-instance-stanza-name>
```
where `<source-instance-stanza-name>` is the name of the backup stanza created for the source instance. To get the stanza name of the source instance, run:

```
kubectl get postgres <source-instance> -o jsonpath={.status.stanzaName}
```

For example:

```
kubectl get postgres source-postgres-example -o jsonpath={.status.stanzaName}
```

has an output similar to:

```
default-source-postgres-example-5eaa4601-e903-467b-9833-80055e95d819
```
Deploy the target instance using `kubectl apply -f <target-instance>.yaml`.
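Assembled into a complete manifest, a minimal target instance might look like the following sketch. Only the `highAvailability` and `deploymentOptions` fields come from the steps above; the `apiVersion`, `kind`, metadata name, and `backupLocation` name are illustrative assumptions inferred from the resource group shown in this page's `kubectl` output, so adapt them to your own instance yaml.

```yaml
# Hypothetical target instance manifest; fields outside
# highAvailability/deploymentOptions are assumptions.
apiVersion: sql.tanzu.vmware.com/v1
kind: Postgres
metadata:
  name: target-postgres-example
spec:
  backupLocation:
    name: backuplocation-sample   # the PostgresBackupLocation shared with the source
  highAvailability:
    enabled: false                # must be false while continuousRestoreTarget is true
    readReplicas: 0
  deploymentOptions:
    continuousRestoreTarget: true
    sourceStanzaName: <source-instance-stanza-name>
```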
Note the following for the target instance:

- An instance with `continuousRestoreTarget: true` cannot create backups as described in Backing up and restoring VMware Postgres Operator.
- If the `highAvailability.enabled` field is set to `true` when you apply the yaml with `continuousRestoreTarget: true`, you will receive an error similar to: "spec.highAvailability.enabled must be false when spec.deploymentOptions.continuousRestoreTarget is set to 'true'"
To verify that the continuous restore is working properly on the target instance, check the `status.lastRestoreTime` field by running this command on the target Kubernetes cluster:

```
kubectl -n <namespace> get postgres postgres-sample -o jsonpath='{.status.lastRestoreTime}'
```
This value should match the last transaction in the most recent WAL file that exists on the remote storage.
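If you want to script this check, the helper below is a sketch that flags a stale `lastRestoreTime`. It assumes GNU `date` (for the `-d` flag) and an ISO 8601 timestamp; the `kubectl` line in the comment reuses the placeholder names from the command above.

```shell
#!/bin/sh
# Sketch: succeed only if a lastRestoreTime value is at most max_age seconds old.
# Assumes GNU date and an ISO 8601 timestamp such as 2024-01-01T00:00:00Z.
is_recent() {
  last="$1"
  max_age="$2"
  last_epoch=$(date -u -d "$last" +%s) || return 1
  now_epoch=$(date -u +%s)
  [ $(( now_epoch - last_epoch )) -le "$max_age" ]
}

# Usage against a live cluster (not run here):
#   last=$(kubectl -n <namespace> get postgres postgres-sample -o jsonpath='{.status.lastRestoreTime}')
#   is_recent "$last" 300 && echo "continuous restore is keeping up"
```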
Fail over to the Disaster Recovery site if the primary site goes down, if you want to test the disaster recovery flow, or if you wish to perform maintenance on the primary site. In a failover scenario, promote the target instance on the Disaster Recovery Kubernetes cluster so that it starts serving traffic, by following these steps:
Make sure that there is no application traffic against the source instance on the primary site.
Promote the target instance by setting the `spec.deploymentOptions.continuousRestoreTarget` field to `false` and applying the change. This update restarts the instance and initializes the Postgres server, which is then ready to accept read/write requests.
Wait until the target instance is in a Running state by running this command:

```
kubectl wait postgres <target-instance> --for=jsonpath={.status.currentState}=Running --timeout=12m
```

An output similar to the following is shown when the instance is ready to accept read-write connections:

```
postgres.sql.tanzu.vmware.com/postgres-sample condition met
```
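To double-check that the promoted instance really accepts writes, you can query `pg_is_in_recovery()`, which returns `f` on a server that is out of recovery. The `kubectl exec` line below is a sketch: the data pod name `<target-instance>-0` and the `postgres` user are assumptions, not confirmed by this page.

```shell
#!/bin/sh
# Sketch: interpret the output of "SELECT pg_is_in_recovery()".
# "f" means the server is out of recovery and accepts read-write traffic.
is_promoted() {
  [ "$1" = "f" ]
}

# Usage against a live cluster (pod name and user are assumptions):
#   state=$(kubectl exec <target-instance>-0 -- psql -U postgres -tAc 'SELECT pg_is_in_recovery()')
#   is_promoted "$state" && echo "target is serving read-write traffic"
```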
Take an on-demand backup on the target instance. For details, see Perform an On-Demand Backup.
On the primary site, edit the source instance yaml and set it as a continuous restore target of the target instance:
```yaml
highAvailability:
  enabled: false
  readReplicas: 0
deploymentOptions:
  continuousRestoreTarget: true
  sourceStanzaName: <target-instance-stanza-name>
```
Get the value of `<target-instance-stanza-name>` by running this command against the target instance:

```
kubectl get postgres <target-instance> -o jsonpath={.status.stanzaName}
```

For example:

```
kubectl get postgres target-postgres-example -o jsonpath={.status.stanzaName}
```

shows an output similar to:

```
default-target-postgres-example-7d4e2f84-f521-43c2-b3c9-73c3fde3dc8e
```
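Instead of hand-editing the source yaml, the two commands above can be combined with `kubectl patch`. The sketch below builds a JSON merge patch from the spec fields shown on this page; treat it as an untested convenience rather than a supported workflow.

```shell
#!/bin/sh
# Sketch: build a JSON merge patch that turns an instance into a
# continuous restore target of the given stanza.
build_restore_patch() {
  printf '{"spec":{"deploymentOptions":{"continuousRestoreTarget":true,"sourceStanzaName":"%s"}}}' "$1"
}

# Usage against a live cluster (not run here):
#   stanza=$(kubectl get postgres <target-instance> -o jsonpath={.status.stanzaName})
#   kubectl patch postgres <source-instance> --type merge -p "$(build_restore_patch "$stanza")"
```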
(Optional) After the target instance is up and running, you can enable `highAvailability` mode if necessary. Verify that the target instance is running as expected:

```
kubectl wait postgres <name-of-target-instance> --for=jsonpath={.status.currentState}=Running --timeout=10m
```

```
postgres.sql.tanzu.vmware.com/postgres-sample condition met
```

Then enable high availability in the target yaml:

```yaml
highAvailability:
  enabled: true
  readReplicas: 1
```
Apply the changes.
You can now reroute any application traffic from the source instance to the target instance.
Fail back to the primary site after the primary is back up, or after the maintenance is completed.
Make sure that there is no application traffic against the target instance.
Confirm that the source instance at the primary site is in continuous restore. Run the following command:

```
kubectl get postgres <source-instance> -o jsonpath="{.spec.deploymentOptions.continuousRestoreTarget}"
```

The output should match the following:

```
true
```

Note: Any value other than `true` is invalid, and the source instance must be updated to reflect the correct value.
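If the failback is scripted, the check above can be made into a guard that aborts when the source is no longer a continuous restore target. The `kubectl` line in the comment reuses the command shown above.

```shell
#!/bin/sh
# Sketch: fail unless the value read from the source instance spec is "true".
ensure_restore_target() {
  if [ "$1" = "true" ]; then
    return 0
  fi
  echo "source instance is not a continuous restore target" >&2
  return 1
}

# Usage against a live cluster (not run here):
#   val=$(kubectl get postgres <source-instance> -o jsonpath="{.spec.deploymentOptions.continuousRestoreTarget}")
#   ensure_restore_target "$val" || exit 1
```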
Take an on-demand backup on the target instance. For details, see Perform an On-Demand Backup.
Verify that WAL files shipped during the backup of the target instance have been restored on the source instance by checking the value of `status.lastRestoreTime` in the source instance. It should match the time of the last transaction performed on the target instance.
Update the target instance yaml to the original configuration, which reverts the instance to being a continuous restore target:

```yaml
highAvailability:
  enabled: false
  readReplicas: 0
deploymentOptions:
  continuousRestoreTarget: true
  sourceStanzaName: <source-instance-stanza-name>
```
On the source instance at the primary site, update the field `deploymentOptions.continuousRestoreTarget` to `false`. This restarts the instance and brings the Postgres server back up.
Wait until the source instance is in a Running state by running this command:

```
kubectl wait postgres <source-instance> --for=jsonpath={.status.currentState}=Running --timeout=12m
```

An output similar to the following is shown when the instance is ready to accept read-write connections:

```
postgres.sql.tanzu.vmware.com/postgres-sample condition met
```
(Optional) After the instance is up and running, you can enable `highAvailability` mode if necessary by updating the instance spec:

```yaml
highAvailability:
  enabled: true
  readReplicas: 1
```
Apply the changes.
You can now reroute your application traffic back to the source instance.