VMware Tanzu RabbitMQ supports continuous schema definition and message replication to a remote cluster, which makes it easy to run a standby cluster for disaster recovery.

This feature is not available in the open source RabbitMQ distribution.

How it works

Standby Replication uses two plugins: the Continuous Schema Replication plugin and the Standby Message Replication plugin. These plugins are configured together by using the Standby Offsite Replication Operator, as described in the following instructions.

The plugins' combined replication model has a number of features and limitations:

  • Schema syncing happens periodically, so with volatile topologies, followers (downstreams) will always trail the leader (upstream). With a sync interval of thirty seconds, the lag is usually within one minute.
  • The schema (virtual hosts, users, queues, and so on) on the downstream side is replaced with that on the upstream side
  • All communication between the sides is completely asynchronous and avoids introducing cluster co-dependencies
  • Except for the initial import, definitions are transferred and imported incrementally
  • Definitions are transferred in a compressed binary format to reduce bandwidth usage
  • Only Quorum Queues may be replicated
  • Messages are replicated immediately on the upstream, to minimize the window for lost messages in the event of an upstream failure
    • Replication happens before message routing, so, for example, queue length limits have not yet been enforced; messages that will ultimately be rejected are still replicated
  • Retention limits for replicated message data must be configured manually and appropriately (along with suitable node disk size provisioning), based on the expected number of messages held on the upstream at any given time

In case of a disaster event, the recovery process involves several steps:

  • A standby cluster will be promoted to active by the service operator
  • Applications will be redeployed or reconfigured to connect to the newly promoted cluster
  • Other standby clusters have to be reconfigured to follow the newly promoted cluster

As explained later in this guide, promotion and reconfiguration happen on the fly, and do not involve RabbitMQ node restarts or redeployment.

Configuration

This guide covers how to use the Standby Offsite Replication Operator to configure the Standby Offsite Replication plugin. If you have not installed the Standby Offsite Replication Operator, see the installation guide.

This guide is structured in the following sections:

  • Cluster Configurations
  • Configuring Upstream Replication
  • Configuring Downstream Replication
  • Updating Configurations
  • Replication over TLS
  • Deleting Configurations

Cluster Configurations

To set up an active-passive topology, you must enable the required plugins and provide specific RabbitMQ server configurations.

The following YAML is an example of an upstream (active) RabbitmqCluster with the required configuration:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: upstream-rabbit
spec:
...
  rabbitmq:
    additionalPlugins:
      - rabbitmq_stream
      - rabbitmq_schema_definition_sync
      - rabbitmq_schema_definition_sync_prometheus # optional
      - rabbitmq_standby_replication
    additionalConfig: |
      schema_definition_sync.operating_mode = upstream
      standby.replication.operating_mode = upstream
      # message stream retention limit (can either be size or time based)
      standby.replication.retention.size_limit.messages = 5000000000
      # standby.replication.retention.time_limit.messages = 12h

The following YAML is an example of a downstream (passive) RabbitmqCluster with the required configuration:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: downstream-rabbit
spec:
...
  rabbitmq:
    additionalPlugins:
      - rabbitmq_stream
      - rabbitmq_schema_definition_sync
      - rabbitmq_schema_definition_sync_prometheus
      - rabbitmq_standby_replication
    additionalConfig: |
      schema_definition_sync.operating_mode = downstream
      standby.replication.operating_mode = downstream
      schema_definition_sync.downstream.locals.users = ^default_user_
      schema_definition_sync.downstream.locals.global_parameters = ^standby
      # message stream retention limit (can either be size or time based)
      standby.replication.retention.size_limit.messages = 5000000000
      # standby.replication.retention.time_limit.messages = 12h

Check the status of the upstream and downstream RabbitmqClusters and ensure that the pods for both clusters are running.
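A quick way to verify this, assuming kubectl access to the namespace where the clusters run (resource and label names below follow the examples above), is:

```shell
# Check that both RabbitmqClusters report all replicas ready
kubectl get rabbitmqclusters upstream-rabbit downstream-rabbit

# Check that the pods backing each cluster are Running
kubectl get pods -l app.kubernetes.io/name=upstream-rabbit
kubectl get pods -l app.kubernetes.io/name=downstream-rabbit
```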

Configuring Upstream Replication

To use the Standby Replication feature, you must configure two plugins, Continuous Schema Replication and Standby Message Replication, by using the Standby Offsite Replication Operator. These plugins are already enabled in the RabbitmqCluster definitions above.
The Messaging Topology Operator can be used to declare the required configuration objects.

To configure the Continuous Schema Replication plugin for the upstream cluster, complete the following steps:

  1. Configure a Secret containing a schema replication user and the user's credentials.
    The downstream cluster uses this user to establish a connection and manage the replication.

The following YAML provides an example of how to configure the user and Secret:

apiVersion: v1
kind: Secret
metadata:
  name: upstream-secret
type: Opaque
stringData:
  username: test-user
  password: test-password
---
apiVersion: rabbitmq.com/v1beta1
kind: User
metadata:
  name: rabbitmq-replicator
spec:
  rabbitmqClusterReference:
    name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  importCredentialsSecret:
    name: upstream-secret
  2. Add the write, configure, and read permissions for the user on the rabbitmq_schema_definition_sync vhost. These permissions are required for the schema definition sync plugin to operate correctly.

The following YAML provides an example of how to configure these permissions on the rabbitmq_schema_definition_sync vhost:

apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all
spec:
  vhost: "rabbitmq_schema_definition_sync" # name of a vhost
  userReference:
    name: rabbitmq-replicator
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: upstream-rabbit  # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  3. Configure the SchemaReplication object using the following example YAML. Note that the endpoint is the external IP of the upstream cluster's service. If you are securing schema sync with TLS, the port in the endpoint is 5671 instead of 5672.
apiVersion: rabbitmq.com/v1beta1
kind: SchemaReplication
metadata:
  name: upstream
  namespace: upstream
spec:
  endpoints: "UPSTREAM_EXTERNAL_IP:5672"
  upstreamSecret:
    name: upstream-secret
  rabbitmqClusterReference:
    name: upstream-rabbit  # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  4. Configure the Standby Offsite Replication Operator. Note that configuring this operator also configures the Standby Message Replication plugin by default.
    You can use the Standby Offsite Replication Operator to configure which quorum queues the plugin collects messages for.

In the following example, the standby replication plugin is configured to collect messages for all quorum queues in the vhost test:

apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
kind: StandbyReplication
metadata:
  name: upstream-configuration
spec:
  operatingMode: "upstream" # has to be "upstream" to configure an upstream RabbitMQ cluster; required value
  upstreamModeConfiguration: # list of policies that Operator will create
    replicationPolicies:
      - name: test-policy # policy name; required value
        pattern: "^.*" # any regex expression that will be used to match quorum queues name; required value
        vhost: "test" # vhost name; must be an existing vhost; required value
  rabbitmqClusterReference:
    name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.

Note that the spec.operatingMode field must be set to upstream to provide upstream-related configuration.

spec.upstreamModeConfiguration.replicationPolicies is a list; name, pattern, and vhost are required for each policy.

Note that the vhost test must be an existing vhost, which can also be created with the Messaging Topology Operator:

apiVersion: rabbitmq.com/v1beta1
kind: Vhost
metadata:
  name: default
spec:
  name: "test" # vhost name
  tags: ["standby_replication"]
  rabbitmqClusterReference:
    name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.

Note, the "standby_replication" tag and the together with the permissions are used by the plugin to select the vhost to replicate. together with the necessary permissions:

apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.defaultvhost.all
spec:
  vhost: "test" # name of a vhost
  userReference: 
    name: rabbitmq-replicator
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: upstream-rabbit  # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.

Configuring Downstream Replication

You must also configure the Continuous Schema Replication and Standby Message Replication plugins by using the Standby Offsite Replication Operator for downstream replication.

To configure the Continuous Schema Replication plugin for the downstream cluster, complete the following steps:

  1. Configure a Secret containing a schema replication user and the user's credentials. The following YAML provides an example of how to configure the user and Secret:

apiVersion: v1
kind: Secret
metadata:
  name: upstream-secret
type: Opaque
stringData:
  username: test-user
  password: test-password
---
apiVersion: rabbitmq.com/v1beta1
kind: User
metadata:
  name: rabbitmq-replicator
spec:
  rabbitmqClusterReference:
    name: downstream-rabbit # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  importCredentialsSecret:
    name: upstream-secret
  2. Add the write, configure, and read permissions for the user on the rabbitmq_schema_definition_sync vhost. These permissions are required for the schema definition sync plugin to operate correctly.

The following YAML provides an example of how to configure these permissions on the rabbitmq_schema_definition_sync vhost:

apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all
spec:
  vhost: "rabbitmq_schema_definition_sync" # name of a vhost
  userReference:
    name: rabbitmq-replicator
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: downstream-rabbit  # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  3. Configure the SchemaReplication object using the following example YAML. Note that the endpoint is the external IP of the upstream cluster's service.

apiVersion: rabbitmq.com/v1beta1
kind: SchemaReplication
metadata:
  name: downstream
spec:
  endpoints: "UPSTREAM_EXTERNAL_IP:5672"
  upstreamSecret:
    name: upstream-secret
  rabbitmqClusterReference:
    name: downstream-rabbit  # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  4. Configure the Standby Offsite Replication Operator. Note that configuring this operator also configures the Standby Message Replication plugin by default.
    You can use the Standby Offsite Replication Operator to configure a downstream (passive) RabbitMQ cluster to connect to a specific upstream RabbitMQ cluster. The operator uses the provided endpoints and credentials to set the standby_replication_upstream global parameter in the downstream RabbitMQ.

The following example connects the RabbitMQ cluster downstream-rabbit to the upstream RabbitMQ cluster at the UPSTREAM_EXTERNAL_IP:5552 endpoint. Note the use of the RabbitMQ Stream protocol port, 5552.

---
apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
kind: StandbyReplication
metadata:
  name: downstream-configuration
spec:
  operatingMode: "downstream" # has to be "downstream" to configure an downstream RabbitMQ cluster
  downstreamModeConfiguration:
    endpoints: "UPSTREAM_EXTERNAL_IP:5552" # comma separated list of endpoints to the upstream RabbitMQ
    upstreamSecret:
      name: upstream-secret # an existing Kubernetes secret; required value
  rabbitmqClusterReference:
    name: downstream-rabbit # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.

Note that the spec.operatingMode field must be set to downstream to provide downstream-related configuration.

spec.downstreamModeConfiguration.endpoints is a comma-separated list of endpoints used to connect to the upstream RabbitMQ. Endpoints must be reachable from the downstream cluster on the stream protocol port. If you are securing standby replication with TLS, the stream protocol port is 5551 instead of 5552.
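For example, with TLS enabled, the endpoints value in the StandbyReplication spec above would change to the TLS stream protocol port:

```yaml
  downstreamModeConfiguration:
    endpoints: "UPSTREAM_EXTERNAL_IP:5551" # TLS stream protocol port
```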

spec.downstreamModeConfiguration.upstreamSecret is the name of an existing Kubernetes Secret in the same namespace. This Secret must contain the username and password keys, which are used as credentials to connect to the upstream RabbitMQ. For example:

---
apiVersion: v1
kind: Secret
metadata:
  name: upstream-secret
type: Opaque
stringData:
  username: test-user # upstream cluster username
  password: test-password # upstream cluster password

Updating Configurations

You can update both upstream and downstream configurations. However, once a StandbyReplication custom resource is created, you cannot update its spec.operatingMode and spec.rabbitmqClusterReference. To change these two immutable fields, you must delete and re-create the resource.
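For example, changing an immutable field might look like this (resource names follow the examples above; the manifest file name is an assumption):

```shell
# spec.operatingMode and spec.rabbitmqClusterReference are immutable,
# so delete the custom resource and re-create it with the new values
kubectl delete standbyreplications.rabbitmq.tanzu.vmware.com upstream-configuration
kubectl apply -f upstream-configuration.yaml
```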

The Operator does not watch for updates to the Kubernetes Secret set in spec.downstreamModeConfiguration.upstreamSecret. If the credentials in the Secret have been updated, you can force the Operator to reconcile by adding a temporary label or annotation to the custom resource.
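One way to trigger this, assuming the downstream resource from the examples above (the annotation key itself is arbitrary):

```shell
# Add or update a throwaway annotation so the Operator reconciles the
# resource and re-reads the credentials from the Secret
kubectl annotate standbyreplications.rabbitmq.tanzu.vmware.com downstream-configuration \
  sync-trigger="$(date +%s)" --overwrite
```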

Note that updates to spec.upstreamModeConfiguration.replicationPolicies are not fully supported: the Operator does not clean up policies removed from the list. Other update operations, such as changing a policy's name, vhost, or pattern, and adding a new policy, are supported.

Replication over TLS

You can configure the upstream and downstream clusters to perform replication over TLS, which secures the communications between the clusters.

First, configure your clusters with Secrets containing TLS certificates by following this TLS Example.

You can then use these certificates by including the configuration parameters in the configuration file. Include these parameters in the same format as the ssl_options, which are detailed in Enabling TLS Support in RabbitMQ.

On the upstream cluster, set the parameters under schema_definition_sync.ssl_options:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
   name: upstream-rabbit
spec:
...
  tls:
    secretName: tls-secret
  rabbitmq:
    additionalPlugins:
    - rabbitmq_stream
    - rabbitmq_schema_definition_sync
    - rabbitmq_schema_definition_sync_prometheus
    - rabbitmq_standby_replication
    additionalConfig: |
       schema_definition_sync.operating_mode = upstream
       standby.replication.operating_mode = upstream
       standby.replication.retention.size_limit.messages = 5000000000
       schema_definition_sync.ssl_options.cacertfile            = /etc/rabbitmq-tls/ca.crt
       schema_definition_sync.ssl_options.certfile              = /etc/rabbitmq-tls/tls.crt
       schema_definition_sync.ssl_options.keyfile               = /etc/rabbitmq-tls/tls.key
       schema_definition_sync.ssl_options.verify                = verify_none
       schema_definition_sync.ssl_options.fail_if_no_peer_cert  = false

On the downstream cluster, set the parameters under schema_definition_sync.ssl_options and standby.replication.downstream.ssl_options:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
   name: downstream-rabbit
spec:
...
  tls:
    secretName: tls-secret
  rabbitmq:
    additionalPlugins:
    - rabbitmq_stream
    - rabbitmq_schema_definition_sync
    - rabbitmq_schema_definition_sync_prometheus
    - rabbitmq_standby_replication
    additionalConfig: |
       schema_definition_sync.operating_mode = downstream
       standby.replication.operating_mode = downstream
       schema_definition_sync.downstream.locals.users = ^default_user_
       schema_definition_sync.downstream.locals.global_parameters = ^standby
       standby.replication.retention.size_limit.messages = 5000000000
       schema_definition_sync.ssl_options.cacertfile            = /etc/rabbitmq-tls/ca.crt
       schema_definition_sync.ssl_options.certfile              = /etc/rabbitmq-tls/tls.crt
       schema_definition_sync.ssl_options.keyfile               = /etc/rabbitmq-tls/tls.key
       schema_definition_sync.ssl_options.verify                = verify_none
       schema_definition_sync.ssl_options.fail_if_no_peer_cert  = false
       standby.replication.downstream.ssl_options.cacertfile            = /etc/rabbitmq-tls/ca.crt
       standby.replication.downstream.ssl_options.certfile              = /etc/rabbitmq-tls/tls.crt
       standby.replication.downstream.ssl_options.keyfile               = /etc/rabbitmq-tls/tls.key
       standby.replication.downstream.ssl_options.verify                = verify_none
       standby.replication.downstream.ssl_options.fail_if_no_peer_cert  = false

Important: Peer verification (normally configured by setting ssl_options.verify to verify_peer) is not supported for Standby Replication. schema_definition_sync.ssl_options.verify and standby.replication.downstream.ssl_options.verify must be set to verify_none.

Deleting Configurations

You can remove upstream and downstream configurations by deleting the StandbyReplication custom resource. When an upstream configuration is deleted, the Operator removes all replication policies set in spec.upstreamModeConfiguration.replicationPolicies from the RabbitMQ cluster.

When a downstream configuration is deleted, the Operator removes the standby_replication_upstream global parameter.

Downstream (Passive, Standby) Promotion

Having a standby cluster with synchronized schema and messages is only useful if it can be turned into a new primary cluster in case of a disaster event. This guide refers to such an event as a downstream promotion.

A promoted downstream cluster is detached from its upstream and operates independently. Promotion is never performed by the plugin itself; the decision is made by a human operator, and promotion is triggered via a CLI command.

A promoted downstream becomes a "regular" cluster that can, if needed, itself serve as an upstream. It does not sync from its original upstream but can be configured to collect messages for offsite replication to another datacenter.

When a cluster is promoted, a few things happen:

  • All upstream links are closed
  • For every virtual host, unacknowledged messages are re-published to their original destination queues

The promotion process takes time, proportional to the retention period used. The operation is CPU and disk I/O intensive.

Every downstream node will be responsible for recovering the virtual hosts it "owns". This helps distribute the load between cluster members.

To list the virtual hosts available for promotion (that is, those with local data to recover):

rabbitmqctl list_vhosts_available_for_standby_replication_recovery

To initiate a recovery procedure:

rabbitmqctl promote_standby_replication_downstream_cluster [--start-from-scratch] [--all-available] [--exclude-virtual-hosts \"<vhost1>,<vhost2>,<...>\"]

The --start-from-scratch flag recovers messages from the earliest available data instead of the last previously recovered timestamp, even if information about the last recovery is available.

The --all-available flag forces recovery of all available messages if neither the last cutoff nor last-recovery information is available.

Virtual hosts can be excluded from promotion with the --exclude-virtual-hosts flag.

To display a promotion summary (in case a promotion was attempted):

rabbitmqctl display_standby_promotion_summary

The recovery process stores a summary on disk recording the last recovered timestamp. This allows for idempotent recovery that avoids recovering the same set of messages twice.

After the recovery process completes, the cluster can be used as usual.

Additional Commands

If the cluster size changes, the virtual hosts "owned" by each node might change. To delete the data for virtual hosts that nodes no longer own, use the following command:

rabbitmqctl delete_orphaned_data_on_standby_replication_downstream_cluster

To inspect the size of the data replicated:

rabbitmqctl display_disk_space_used_by_standby_replication_data

To disconnect the downstream, effectively stopping message replication, run:

rabbitmqctl disconnect_standby_replication_downstream

To (re)connect the downstream, effectively starting/resuming message replication, run:

rabbitmqctl connect_standby_replication_downstream

Post-Promotion

Note that if the promoted cluster is to be restarted, its operating mode must also be updated in the configuration file; otherwise it will revert to its originally configured mode, downstream.
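For example, for a promoted cluster that was deployed with the downstream configuration shown earlier, and that is to act as a new upstream, the relevant additionalConfig lines would change to:

```ini
schema_definition_sync.operating_mode = upstream
standby.replication.operating_mode = upstream
```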

The plugin makes no assumptions about what happens to the original cluster that experienced the disaster event. It can be gone permanently, brought back as a standby for the newly promoted cluster, or eventually promoted back.

After promotion, the replicated data on the old downstream can be erased from disk with:

rabbitmqctl delete_all_data_on_standby_replication_cluster

If this command is used on an active downstream, it deletes all data transferred up to that point, but it might also stop replication. To ensure replication continues, disconnect and reconnect the downstream using the commands listed above.
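Putting those commands together, erasing replicated data on an active downstream without leaving replication stopped might look like:

```shell
# Stop message replication before deleting the transferred data
rabbitmqctl disconnect_standby_replication_downstream

# Erase the replicated data stored on this downstream
rabbitmqctl delete_all_data_on_standby_replication_cluster

# Resume message replication from the upstream
rabbitmqctl connect_standby_replication_downstream
```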

To delete the internal streams on the upstream:

rabbitmqctl delete_internal_streams_on_standby_replication_upstream_cluster

Diagnostics

To inspect the number of messages replicated for each virtual host, exchange, and routing key:

rabbitmq-diagnostics inspect_local_data_available_for_standby_replication_recovery

This is a very expensive operation, as it reads and parses all data on disk, so use it with care. It can take a long time to run, even for moderate data sizes.
