This topic tells you how to configure and use Warm Standby Replication when you are using VMware Tanzu RabbitMQ for Tanzu Application Service.
Tanzu RabbitMQ for Tanzu Application Service supports Warm Standby Replication, a strategy that continuously replicates data, namely schema definitions and messages, from an upstream (primary) RabbitMQ instance to a downstream (standby) instance. In the event of a failure on the upstream (primary) instance, an administrator can quickly start the recovery process, putting the downstream (standby) instance into service as the new upstream (primary) instance with minimal downtime or data loss.
With the Warm Standby Replication feature, you get:
* Automated and continuous replication of schema definitions from the upstream (primary) instance to the downstream (standby) instance.
* Automated and continuous replication of enqueued messages from the upstream (primary) instance to the downstream (standby) instance.
Note: Schema synchronization does not synchronize Kubernetes objects.
The Warm Standby Replication process uses the following plug-ins:
* Continuous Schema Replication, which replicates schema definitions (described next).
* Standby Replication, which continuously replicates enqueued messages.
The Continuous Schema Replication plug-in connects the upstream (primary) instance to the downstream (standby) instance through a schema replication link. The downstream (standby) instances connect to their upstream (primary) instance and initiate sync operations, which synchronize the schema definitions on the downstream side with those on the upstream side. A node running in downstream mode (a follower) can be converted to an upstream (leader) on the fly. This makes the node disconnect from its original source, thereby stopping all syncing. The node then continues operating as a member of an independent instance, no longer associated with its original upstream. This conversion is called a downstream promotion and is performed during a disaster recovery event.
To ensure improved data safety and reduce the risk of data loss, it is not enough to automate the replication of RabbitMQ entities (schema objects). The Warm Standby Replication feature therefore implements a hybrid replication model: in addition to schema definitions, it also manages the automated and continuous replication of enqueued messages from the upstream (primary) instance. During the setup process, a replication policy is configured at the vhost level in the upstream (primary) instance, indicating which queues should be matched and targeted for message replication. Messages and relevant metrics from the upstream queues are then pushed to the downstream queues through a streaming log that the downstream instances subscribe to. Currently, quorum queues, classic queues, and stream queues are supported for message replication.
Important: For quorum and classic queues, RabbitMQ instances replicate messages in the queues to the downstream (standby) instance, but these messages are not published into the queues on the downstream (standby) instance until that instance is promoted to the upstream (primary) instance. Every 60 seconds (by default), the timestamp of the oldest message in each queue is sent to the downstream (standby) instance, and the promotion process uses that timestamp as a cutoff point for message recovery: all messages from the oldest timestamp to the current timestamp are recovered when promotion happens. With this process, the probability of duplicate messages in a busy RabbitMQ instance is high because of the timestamp refresh interval and stale messages pushing the timestamp further into the past. However, it also guarantees that no messages are lost or missed during the recovery process. Streams work differently: messages are replicated directly to streams on the downstream (standby) instance. However, you cannot publish new messages to streams on the downstream (standby) instance until it is promoted to the upstream (primary) instance.
Before continuing, ensure that all Requirements for Warm Standby Replication are in place.
Note: There can be multiple downstream (standby) instances linked to one upstream (primary) instance. This setup describes one upstream instance and one downstream instance.
Important: Before you begin, consider that Warm Standby Replication uses stream queues to log and copy changes. As a result, the number of messages can grow very large, but because RabbitMQ streams store large amounts of data efficiently, memory use remains minimal.
Configure an upstream plan in the appropriate upstream Tanzu Application Service foundation by setting Warm Standby Replication Mode to Upstream in the relevant on-demand plan configuration.
(Optional) Configure the Warm Standby Replication Disk Use Limit and Warm Standby Replication Retention Time Limit.
Create an upstream (primary) RabbitMQ instance:
cf create-service p.rabbitmq <UPSTREAM_PLAN> <INSTANCE_NAME>
Where:
* <UPSTREAM_PLAN> is the upstream on-demand plan.
* <INSTANCE_NAME> is an instance name of your choosing.
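For example, if the upstream plan is named primary-plan and you choose the instance name upstream-rabbit (both names are illustrative):

cf create-service p.rabbitmq primary-plan upstream-rabbit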
Determine the service GUID for the upstream RabbitMQ instance and retrieve the replication credentials for the upstream RabbitMQ instance from the upstream Runtime CredHub. The username is rabbitmq-replication-user and the password is stored in the CredHub secret /p-bosh/service-instance_<SERVICE_GUID>/rabbitmq_replication_password, where <SERVICE_GUID> is the service instance GUID for the upstream RabbitMQ instance. Also note the AMQP and Stream endpoints for the upstream RabbitMQ instance. If you are using Service Gateway, these endpoints are allocated from the TCP router. Record this information in a JSON file with the following format:
{
  "replication_credentials": {
    "amqp_endpoints": [<AMQP_ENDPOINT>],
    "stream_endpoints": [<STREAM_ENDPOINT>],
    "username": "rabbitmq-replication-user",
    "password": <REPLICATION_PASSWORD>
  }
}
Where:
* <AMQP_ENDPOINT> and <STREAM_ENDPOINT> are the AMQP and Stream endpoints for the upstream RabbitMQ instance.
* <REPLICATION_PASSWORD> is the replication password retrieved from the upstream Runtime CredHub.
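For example, assuming you have authenticated the CredHub CLI against the upstream Runtime CredHub, you can read the password with the CredHub get command (the path is the secret named earlier):

credhub get -n /p-bosh/service-instance_<SERVICE_GUID>/rabbitmq_replication_password

A completed credentials file might then look like the following sketch, where the host addresses, ports, and password are illustrative placeholders (5672 and 5552 are the default AMQP and stream ports):

{
  "replication_credentials": {
    "amqp_endpoints": ["10.0.8.4:5672"],
    "stream_endpoints": ["10.0.8.4:5552"],
    "username": "rabbitmq-replication-user",
    "password": "example-password"
  }
}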
Configure a replication policy with the definition {"remote-dc-replicate": true} that matches the queues that Warm Standby Replication should replicate. If all queues should be replicated, the policy pattern should match ^.* . To avoid policy clashes, merge this definition with any other policies that are applied in the upstream RabbitMQ instance.
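For example, a minimal sketch that applies such a policy to all queues on the default virtual host by running rabbitmqctl on the upstream instance (the policy name osr-replicate-all is illustrative; you can also set the policy through the Management UI):

rabbitmqctl set_policy --vhost "/" --apply-to queues osr-replicate-all "^.*" '{"remote-dc-replicate": true}'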
To configure the Warm Standby Replication for the downstream RabbitMQ instance, complete the following steps:
Configure a downstream plan in the appropriate downstream Tanzu Application Service foundation by setting Warm Standby Replication Mode to Downstream in the relevant on-demand plan configuration.
(Optional) Configure the Warm Standby Replication Disk Use Limit and Warm Standby Replication Retention Time Limit.
Create a downstream (standby) RabbitMQ instance, providing the replication credentials for the upstream instance through the additional JSON recorded previously:
cf create-service p.rabbitmq <DOWNSTREAM_PLAN> <INSTANCE_NAME> -c <ADDITIONAL_JSON>
Where:
* <DOWNSTREAM_PLAN> is the downstream on-demand plan.
* <INSTANCE_NAME> is an instance name of your choosing.
* <ADDITIONAL_JSON> is the previously recorded replication credentials in JSON format.
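For example, if the downstream plan is named standby-plan and the credentials were recorded in a file named replication-credentials.json (both names are illustrative; the cf CLI -c flag accepts either inline JSON or a file path):

cf create-service p.rabbitmq standby-plan downstream-rabbit -c replication-credentials.json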
Once configured, the downstream RabbitMQ instance replicates all state from the upstream RabbitMQ instance. This means that users created locally in the downstream instance are overwritten by the upstream. To access the downstream RabbitMQ instance, use the replication credentials or other credentials valid for the upstream RabbitMQ instance.
You can complete the following steps to verify that Warm Standby Replication is configured correctly.
Access the RabbitMQ Management UI to visualize the schema sync and warm standby replication information. A link to the RabbitMQ management interface can be found in the service instance details. You can use the replication user recorded when configuring warm standby replication to access the downstream RabbitMQ instance.
Select the Admin tab. You can then access the Schema Replication, Standby Replication, and Standby Promotion tabs in the Admin section. The following figure displays the Schema Replication tab.
If you published messages to streams covered by replication policies (see step 5 of Configuring Warm Standby Replication on the Upstream (Primary) Instance), you can list the number of messages replicated for each stream.
Note: Promotion and reconfiguration happen at runtime and do not involve RabbitMQ instance restarts or redeployment.
A downstream (standby) instance with synchronized schema and messages is only useful if it can be turned into a new upstream (primary) instance in case of a disaster event. This process is known as “downstream promotion”.
In the case of a disaster event, the recovery process involves several steps:
* The downstream (standby) instance is promoted to an upstream (primary) instance by the service operator. When this happens, all upstream links are closed and, for every virtual host, unacknowledged messages are re-published to their original destination queues. This applies only to messages belonging to classic and quorum queues; streams are already stored in the target stream on the downstream (standby) instance.
* Applications are redeployed or reconfigured to connect to the newly promoted upstream (primary) instance.
* Other downstream (standby) instances must be reconfigured to follow the newly promoted instance.
When “downstream promotion” happens, a promoted downstream (standby) instance is detached from its original upstream (primary) instance. It then operates as an independent instance that can be used as an upstream (primary) instance. It does not sync from its original upstream but can be configured to collect messages for off-site replication to another datacenter.
Be aware of the following:
To promote a downstream (standby) instance, that is, to start the disaster recovery process, send an authenticated HTTP POST request to the downstream management API at /tanzu/osr/downstream/promote:
curl -X POST --user "rabbitmq-replication-user:<REPLICATION_PASSWORD>" "<DOWNSTREAM_MANAGEMENT_URL>/tanzu/osr/downstream/promote"
Where:
* <REPLICATION_PASSWORD> is the password of the replication user.
* <DOWNSTREAM_MANAGEMENT_URL> is the URL of the downstream management UI, which can be found in the service instance information for the downstream service instance.

During promotion of the downstream instance, Warm Standby Replication does not support the recovery of messages that are routed to target queues by the AMQP 0.9.1 BCC header.
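For example, with illustrative values for the downstream management URL and the replication password:

curl -X POST --user "rabbitmq-replication-user:example-password" "https://rmq-downstream.example.com/tanzu/osr/downstream/promote"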
After the recovery process completes, the instance can be used as usual.
If needed, complete the following steps after promoting the downstream (standby) instance to be the upstream (primary) instance for disaster recovery.
After promotion, the replicated data on the old downstream (which is effectively the newly promoted upstream) can be erased from disk. For example, suppose an instance in Dublin is the upstream (primary) instance, an instance in London is the downstream (standby) instance, and the instance in London is promoted to be the upstream (primary) instance. After promotion, you can remove the previous downstream-related data from the instance in London (as it is now running as the upstream (primary) instance) by running the following command:
rabbitmqctl delete_all_data_on_standby_replication_cluster
To reverse the process and fail back to the original upstream instance, perform the following steps.
Temporarily configure the original upstream instance into downstream mode by running the following commands:
rabbitmqctl set_schema_replication_mode downstream
rabbitmqctl set_standby_replication_mode downstream
rabbitmqctl set_schema_sync_upstream_endpoints '{"endpoints": [<PROMOTED_AMQP_ENDPOINT>], "username": "rabbitmq-replication-user", "password": <REPLICATION_PASSWORD>}'
rabbitmqctl set_standby_replication_upstream_endpoints '{"endpoints": [<PROMOTED_STREAM_ENDPOINT>], "username": "rabbitmq-replication-user", "password": <REPLICATION_PASSWORD>}'
Where <PROMOTED_AMQP_ENDPOINT> and <PROMOTED_STREAM_ENDPOINT> are the AMQP and Stream endpoints of the promoted instance.
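For example, a sketch assuming the promoted instance exposes AMQP at 10.1.8.4:5672 and the stream protocol at 10.1.8.4:5552 (addresses and password are illustrative):

rabbitmqctl set_schema_sync_upstream_endpoints '{"endpoints": ["10.1.8.4:5672"], "username": "rabbitmq-replication-user", "password": "example-password"}'
rabbitmqctl set_standby_replication_upstream_endpoints '{"endpoints": ["10.1.8.4:5552"], "username": "rabbitmq-replication-user", "password": "example-password"}'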
Once the instance is replicating messages from the newly promoted (previously downstream) instance, perform the promotion process on the original upstream instance.
The original upstream instance is now once again an upstream. The downstream instance may now be restarted and returned to downstream operations.
The number of replicated messages for classic queues, quorum queues, and streams can be viewed in the Standby Replication section of the admin panel in the upstream RabbitMQ Management UI.
To start, stop, or restart the replication in both the Schema Replication (SchemaReplication) and Standby Replication (StandbyReplication) plug-ins at the same time, run the following commands as required:
rabbitmqctl enable_warm_standby
rabbitmqctl disable_warm_standby
rabbitmqctl restart_warm_standby
If the instance size changes, the virtual hosts “owned” by every node might change. To delete the data for the virtual hosts that nodes no longer own, run the following command:
rabbitmqctl delete_orphaned_data_on_standby_replication_downstream_cluster
To delete the internal streams on the upstream (primary) instance, run the following command:
rabbitmqctl delete_internal_streams_on_standby_replication_upstream_cluster
To inspect the size of the data replicated, run the following command:
rabbitmqctl display_disk_space_used_by_standby_replication_data
To disconnect the downstream and stop message replication, run the following command:
rabbitmqctl disconnect_standby_replication_downstream
To (re)connect the downstream and start or resume message replication, run:
rabbitmqctl connect_standby_replication_downstream
After promotion, replicated data on the old downstream (standby) instance, which is now effectively the newly promoted upstream (primary) instance, can be erased from disk with:
rabbitmqctl delete_all_data_on_standby_replication_cluster
If the previous command is run on an active downstream (standby) instance, it deletes all data transferred up to the time of deletion and might also stop the replication process. To ensure that replication continues, disconnect and then reconnect the downstream by using the commands listed earlier.
Learn how to isolate and resolve problems with Warm Standby Replication by using the following information.
Messages and message acknowledgements are continually stored in the upstream (primary) instance. The downstream (standby) instance connects to the upstream (primary) instance, reads from the internal stream on the upstream (primary) instance where the messages are stored, and then stores these messages in an internal stream in the downstream (standby) instance. Messages transferred to the downstream (standby) instance for streams are visible in the RabbitMQ Management UI automatically (before promotion). You can also see them by running the rabbitmqctl list_queues command.
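For example, to list queue and stream names with their types and message counts on the downstream instance (the virtual host name is illustrative):

rabbitmqctl list_queues --vhost "/" name type messages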
To inspect the information about the stored messages in the downstream (standby) instance, run the following command:
rabbitmq-diagnostics inspect_standby_downstream_metrics
# Inspecting standby downstream metrics related to recovery...
# queue timestamp vhost
# ha.qq 1668785252768 /
If the previous command returns the name of a queue called example, it means that the downstream (standby) instance has messages for the queue example ready to be re-published in the event of Promoting the Downstream (Standby) Instance for Disaster Recovery.
If the queue you are searching for is not displayed in the list, verify the following items in the upstream (primary) instance:
* Does the effective policy of the queue include remote-dc-replicate: true?
* Is the queue of type Quorum or Classic?
* Can the replicator user for Warm Standby Replication authenticate? Run the following command to verify:
rabbitmqctl authenticate_user rabbitmq-replication-user <REPLICATION_PASSWORD>
Where <REPLICATION_PASSWORD> is the password determined in step 4 of Configuring Warm Standby Replication on the Upstream (Primary) Instance.
If the previous checks are correct, check the downstream (standby) instance RabbitMQ logs for any related errors.
The number of replicated messages for classic queues, quorum queues, and streams can be viewed in the Standby Replication section of the admin panel in the downstream RabbitMQ Management UI.