VMware Tanzu RabbitMQ supports continuous schema definition and message replication to a remote cluster, which makes it easy to run a standby cluster for disaster recovery.
This feature is not available in the open source RabbitMQ distribution.
Standby Replication uses two plugins: the Continuous Schema Replication plugin and the Standby Message Replication plugin. These plugins are configured together using the Standby Offsite Replication Operator, as described in the following instructions.
The plugins' combined replication model has a number of features and limitations. In case of a disaster event, the recovery process involves several steps. As explained later in this guide, promotion and reconfiguration happen on the fly and do not involve RabbitMQ node restarts or redeployment.
This guide covers how to use the Standby Offsite Replication Operator to configure standby replication. If you have not installed the Standby Offsite Replication Operator, see the installation guide.
To set up an active-passive topology, you must enable the required plugins and provide specific RabbitMQ server configurations.
The following YAML is an example of an upstream (active) RabbitmqCluster with the required configurations:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: upstream-rabbit
spec:
  ...
  rabbitmq:
    additionalPlugins:
      - rabbitmq_stream
      - rabbitmq_schema_definition_sync
      - rabbitmq_schema_definition_sync_prometheus # optional
      - rabbitmq_standby_replication
    additionalConfig: |
      schema_definition_sync.operating_mode = upstream
      standby.replication.operating_mode = upstream
      # message stream retention limit (can either be size or time based)
      standby.replication.retention.size_limit.messages = 5000000000
      # standby.replication.retention.time_limit.messages = 12h
The following YAML is an example of a downstream (passive) RabbitmqCluster with the required configurations:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: downstream-rabbit
spec:
  ...
  rabbitmq:
    additionalPlugins:
      - rabbitmq_stream
      - rabbitmq_schema_definition_sync
      - rabbitmq_schema_definition_sync_prometheus
      - rabbitmq_standby_replication
    additionalConfig: |
      schema_definition_sync.operating_mode = downstream
      standby.replication.operating_mode = downstream
      schema_definition_sync.downstream.locals.users = ^default_user_
      schema_definition_sync.downstream.locals.global_parameters = ^standby
      # message stream retention limit (can either be size or time based)
      standby.replication.retention.size_limit.messages = 5000000000
      # standby.replication.retention.time_limit.messages = 12h
Check the status of the upstream and downstream RabbitmqClusters and ensure that the pods for both clusters are running.
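For example, a quick check with kubectl might look like the following. This is a minimal sketch assuming both clusters were created in the current namespace with the names used above; the pod label shown is the one the RabbitMQ Cluster Operator normally applies, so verify it against your deployment.
# list the clusters and confirm they report a ready status
kubectl get rabbitmqclusters upstream-rabbit downstream-rabbit
# confirm the pods for each cluster are Running
kubectl get pods -l app.kubernetes.io/name=upstream-rabbit
kubectl get pods -l app.kubernetes.io/name=downstream-rabbit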
To use the Standby Replication feature, you must configure the two plugins, Continuous Schema Replication and Standby Message Replication, using the Standby Offsite Replication Operator. Both plugins are already enabled in the RabbitmqCluster definitions above.
The Messaging Topology Operator can be used to create the needed configuration objects declaratively.
To configure the Continuous Schema Replication plugin for the upstream cluster, complete the following steps. First, create a replication user and its credentials. The following YAML provides an example of how to configure the user and Secret:
apiVersion: v1
kind: Secret
metadata:
  name: upstream-secret
type: Opaque
stringData:
  username: test-user
  password: test-password
---
apiVersion: rabbitmq.com/v1beta1
kind: User
metadata:
  name: rabbitmq-replicator
spec:
  rabbitmqClusterReference:
    name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  importCredentialsSecret:
    name: upstream-secret # the Secret defined above
Grant write, configure, and read permissions for the user on the rabbitmq_schema_definition_sync vhost. These permissions are required to ensure the schema definition sync plugin operates correctly. The following YAML provides an example of how to configure these permissions on the rabbitmq_schema_definition_sync vhost.
apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all
spec:
  vhost: "rabbitmq_schema_definition_sync" # name of a vhost
  userReference:
    name: rabbitmq-replicator
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
Next, create a SchemaReplication resource to set up schema synchronization. The endpoint uses the AMQP port; if you are securing schema replication with TLS, use port 5671 instead of 5672 below.
apiVersion: rabbitmq.com/v1beta1
kind: SchemaReplication
metadata:
  name: upstream
  namespace: upstream
spec:
  endpoints: "UPSTREAM_EXTERNAL_IP:5672"
  upstreamSecret:
    name: upstream-secret
  rabbitmqClusterReference:
    name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
Next, configure message replication with a StandbyReplication resource. In the following example, the standby replication plugin is configured to collect messages for all quorum queues in vhost test:
apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
kind: StandbyReplication
metadata:
  name: upstream-configuration
spec:
  operatingMode: "upstream" # must be "upstream" to configure an upstream RabbitMQ cluster; required value
  upstreamModeConfiguration: # list of policies that the Operator will create
    replicationPolicies:
      - name: test-policy # policy name; required value
        pattern: "^.*" # regular expression used to match quorum queue names; required value
        vhost: "test" # vhost name; must be an existing vhost; required value
  rabbitmqClusterReference:
    name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
Note that the spec.operatingMode field must be set to upstream to provide upstream-related configurations. spec.upstreamModeConfiguration.replicationPolicies is a list; name, pattern, and vhost are required values for each policy.
Note that vhost test must be an existing vhost; it can also be created with the Messaging Topology Operator:
apiVersion: rabbitmq.com/v1beta1
kind: Vhost
metadata:
  name: default
spec:
  name: "test" # vhost name
  tags: ["standby_replication"]
  rabbitmqClusterReference:
    name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
Note, the "standby_replication"
tag and the together with the permissions are used by the plugin to select the vhost to replicate. together with the necessary permissions:
apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.defaultvhost.all
spec:
  vhost: "test" # name of a vhost
  userReference:
    name: rabbitmq-replicator
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
You must also configure the Continuous Schema Replication and Standby Message Replication plugins using the Standby Offsite Replication Operator for the downstream cluster.
To configure the Continuous Schema Replication plugin for the downstream cluster, complete the following steps. First, create the Secret holding the upstream credentials and the replication user:
apiVersion: v1
kind: Secret
metadata:
  name: upstream-secret
type: Opaque
stringData:
  username: test-user
  password: test-password
---
apiVersion: rabbitmq.com/v1beta1
kind: User
metadata:
  name: rabbitmq-replicator
spec:
  rabbitmqClusterReference:
    name: downstream-rabbit # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  importCredentialsSecret:
    name: upstream-secret
Grant write, configure, and read permissions for the user on the rabbitmq_schema_definition_sync vhost. These permissions are required to ensure the schema definition sync plugin operates correctly. The following YAML provides an example of how to configure these permissions on the rabbitmq_schema_definition_sync vhost.
apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all
spec:
  vhost: "rabbitmq_schema_definition_sync" # name of a vhost
  userReference:
    name: rabbitmq-replicator
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: downstream-rabbit # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
Then, create a SchemaReplication resource pointing at the upstream cluster:
apiVersion: rabbitmq.com/v1beta1
kind: SchemaReplication
metadata:
  name: downstream
spec:
  endpoints: "UPSTREAM_EXTERNAL_IP:5672"
  upstreamSecret:
    name: upstream-secret
  rabbitmqClusterReference:
    name: downstream-rabbit # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
To configure the Standby Message Replication plugin, create a downstream StandbyReplication resource; the Operator sets the standby_replication_upstream global parameter in the downstream RabbitMQ. The following example connects the RabbitMQ cluster downstream-rabbit to the RabbitMQ cluster at the UPSTREAM_EXTERNAL_IP:5552 endpoint. Note the use of the RabbitMQ Stream Protocol port, 5552.
---
apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
kind: StandbyReplication
metadata:
  name: downstream-configuration
spec:
  operatingMode: "downstream" # must be "downstream" to configure a downstream RabbitMQ cluster; required value
  downstreamModeConfiguration:
    endpoints: "UPSTREAM_EXTERNAL_IP:5552" # comma-separated list of endpoints to the upstream RabbitMQ
    upstreamSecret:
      name: upstream-secret # an existing Kubernetes secret; required value
  rabbitmqClusterReference:
    name: downstream-rabbit # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
Note, the spec.operatingMode field must be set to downstream to provide downstream-related configurations. spec.downstreamModeConfiguration.endpoints is a comma-separated list of endpoints for connecting to the upstream RabbitMQ. Endpoints must be reachable from this downstream cluster on the stream protocol port. If you are securing standby replication with TLS, the stream protocol port is 5551 instead of 5552.
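For example, a downstream pointing at a three-node upstream could list one stream endpoint per node; the addresses below are placeholders:
endpoints: "10.0.0.1:5552,10.0.0.2:5552,10.0.0.3:5552"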
spec.downstreamModeConfiguration.upstreamSecret is the name of an existing Kubernetes secret in the same namespace. This secret must contain the username and password keys, and it is used as the credentials to connect to the upstream RabbitMQ. For example:
---
apiVersion: v1
kind: Secret
metadata:
  name: upstream-secret
type: Opaque
stringData:
  username: test-user # upstream cluster username
  password: test-password # upstream cluster password
You can update both upstream and downstream configurations. Be aware that once a StandbyReplication custom resource is created, you cannot update its spec.operatingMode and spec.rabbitmqClusterReference. To change these two immutable fields, you must delete and re-create the resource.
The Operator does not watch for updates to the Kubernetes Secret object set in spec.downstreamModeConfiguration.upstreamSecret. If the credentials in the Secret have been updated, you can force the Operator to reconcile by adding a temporary label or annotation to the custom resource.
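For example, a hypothetical way to touch the resource after rotating credentials; the annotation key below is arbitrary and only serves to trigger reconciliation:
# bump an arbitrary annotation so the Operator reconciles the resource
kubectl annotate standbyreplication downstream-configuration \
  example.com/credentials-rotated="$(date +%s)" --overwrite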
Note that updates to spec.upstreamModeConfiguration.replicationPolicies are not fully supported: the Operator does not clean up policies removed from the list. Other update operations, such as changing a policy's name, vhost, or pattern, and adding a new policy, are supported.
You can configure the upstream and downstream clusters to perform replication over TLS, which secures the communications between the clusters.
First, configure your clusters with Secrets containing TLS certificates by following this TLS Example.
You can then use these certificates by including the configuration parameters in the configuration file. Include these parameters in the same format as the ssl_options, which are detailed in Enabling TLS Support in RabbitMQ.
On the upstream cluster, set the parameters under schema_definition_sync.ssl_options:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: upstream-rabbit
spec:
  ...
  tls:
    secretName: tls-secret
  rabbitmq:
    additionalPlugins:
      - rabbitmq_stream
      - rabbitmq_schema_definition_sync
      - rabbitmq_schema_definition_sync_prometheus
      - rabbitmq_standby_replication
    additionalConfig: |
      schema_definition_sync.operating_mode = upstream
      standby.replication.operating_mode = upstream
      standby.replication.retention.size_limit.messages = 5000000000
      schema_definition_sync.ssl_options.cacertfile = /etc/rabbitmq-tls/ca.crt
      schema_definition_sync.ssl_options.certfile = /etc/rabbitmq-tls/tls.crt
      schema_definition_sync.ssl_options.keyfile = /etc/rabbitmq-tls/tls.key
      schema_definition_sync.ssl_options.verify = verify_none
      schema_definition_sync.ssl_options.fail_if_no_peer_cert = false
On the downstream cluster, set the parameters under schema_definition_sync.ssl_options and standby.replication.downstream.ssl_options:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: downstream-rabbit
spec:
  ...
  tls:
    secretName: tls-secret
  rabbitmq:
    additionalPlugins:
      - rabbitmq_stream
      - rabbitmq_schema_definition_sync
      - rabbitmq_schema_definition_sync_prometheus
      - rabbitmq_standby_replication
    additionalConfig: |
      schema_definition_sync.operating_mode = downstream
      standby.replication.operating_mode = downstream
      schema_definition_sync.downstream.locals.users = ^default_user_
      schema_definition_sync.downstream.locals.global_parameters = ^standby
      standby.replication.retention.size_limit.messages = 5000000000
      schema_definition_sync.ssl_options.cacertfile = /etc/rabbitmq-tls/ca.crt
      schema_definition_sync.ssl_options.certfile = /etc/rabbitmq-tls/tls.crt
      schema_definition_sync.ssl_options.keyfile = /etc/rabbitmq-tls/tls.key
      schema_definition_sync.ssl_options.verify = verify_none
      schema_definition_sync.ssl_options.fail_if_no_peer_cert = false
      standby.replication.downstream.ssl_options.cacertfile = /etc/rabbitmq-tls/ca.crt
      standby.replication.downstream.ssl_options.certfile = /etc/rabbitmq-tls/tls.crt
      standby.replication.downstream.ssl_options.keyfile = /etc/rabbitmq-tls/tls.key
      standby.replication.downstream.ssl_options.verify = verify_none
      standby.replication.downstream.ssl_options.fail_if_no_peer_cert = false
Important: Peer verification (normally configured by setting ssl_options.verify to verify_peer) is not supported for Standby Replication. schema_definition_sync.ssl_options.verify and standby.replication.downstream.ssl_options.verify must be set to verify_none.
You can remove upstream and downstream configurations by deleting the StandbyReplication custom resource. When an upstream configuration is deleted, the Operator removes all replication policies set in spec.upstreamModeConfiguration.replicationPolicies from the RabbitMQ cluster. When a downstream configuration is deleted, the Operator removes the standby_replication_upstream global parameter.
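For example, using the resource names from the examples above (the lowercase resource name is assumed to be how the StandbyReplication kind is registered in your cluster):
kubectl delete standbyreplication upstream-configuration
kubectl delete standbyreplication downstream-configuration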
Having a standby cluster with synchronised schema and messages is only useful if it can be turned into a new primary cluster in case of a disaster event. In this guide, we refer to such an event as a downstream promotion.
A promoted downstream cluster is detached from its upstream and operates independently. Promotion is never performed by the plugin itself; the decision is made by a human operator, and promotion is triggered via a CLI command.
A promoted downstream becomes a "regular" cluster that can, if needed, itself serve as an upstream. It does not sync from its original upstream but can be configured to collect messages for offsite replication to another datacenter.
When a cluster is promoted, a few things happen. The promotion process takes time, proportional to the retention period used, and is CPU and disk I/O intensive. Every downstream node is responsible for recovering the virtual hosts it "owns", which helps distribute the load between cluster members.
To list the virtual hosts available for promotion, that is, those that have local data to recover:
rabbitmqctl list_vhosts_available_for_standby_replication_recovery
To initiate a recovery procedure:
rabbitmqctl promote_standby_replication_downstream_cluster [--start-from-scratch] [--all-available] [--exclude-virtual-hosts \"<vhost1>,<vhost2>,<...>\"]
The --start-from-scratch flag recovers messages from the earliest available data instead of from the last previously recovered timestamp, even if information about the last recovery is available.
The --all-available flag forces recovery of all available messages when neither the last cutoff nor the last recovery information is available.
Virtual hosts can be excluded from promotion with the --exclude-virtual-hosts flag.
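For example, to promote the cluster while skipping one virtual host (the vhost name below is illustrative):
rabbitmqctl promote_standby_replication_downstream_cluster --exclude-virtual-hosts "internal-metrics"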
To display a summary of the promotion (if one was attempted):
rabbitmqctl display_standby_promotion_summary
The recovery process stores a summary on disk recording the last timestamp recovered. This makes recovery idempotent and avoids recovering the same set of messages twice.
After the recovery process completes, the cluster can be used as usual.
If the cluster size changes, the virtual hosts "owned" by each node might change. To delete the data for virtual hosts that nodes no longer own, use the following command:
rabbitmqctl delete_orphaned_data_on_standby_replication_downstream_cluster
To inspect the size of the data replicated:
rabbitmqctl display_disk_space_used_by_standby_replication_data
To disconnect the downstream, effectively stopping message replication, run:
rabbitmqctl disconnect_standby_replication_downstream
To (re)connect the downstream, effectively starting/resuming message replication, run:
rabbitmqctl connect_standby_replication_downstream
Note that if the promoted cluster is to be restarted, its operating mode must also be updated in the configuration file; otherwise, it will revert to its originally configured mode, downstream.
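For example, the promoted cluster's RabbitmqCluster configuration shown earlier would switch its operating modes. This is a minimal sketch assuming the promoted cluster will now act as an upstream and the rest of the configuration stays unchanged:
rabbitmq:
  additionalConfig: |
    # after promotion: act as an upstream instead of a downstream
    schema_definition_sync.operating_mode = upstream
    standby.replication.operating_mode = upstream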
The plugin does not make any assumptions about what happens to the original cluster that has experienced a disaster event. It can be gone permanently, brought back as a standby for the newly promoted one, or eventually promoted back.
After promotion, the replicated data on the old downstream can be erased from disk with:
rabbitmqctl delete_all_data_on_standby_replication_cluster
If this command is used on an active downstream, it deletes all data transferred up to that point, and it might also stop replication. To ensure replication continues, disconnect and reconnect the downstream using the commands listed above.
To delete the internal streams on the upstream:
rabbitmqctl delete_internal_streams_on_standby_replication_upstream_cluster
To inspect the number of messages replicated for each virtual host, exchange and routing key:
rabbitmq-diagnostics inspect_local_data_available_for_standby_replication_recovery
This is a very expensive operation because it reads and parses all data on disk, so use it with care. It can take a long time to run, even for moderate data sizes.