VMware RabbitMQ supports continuous schema definition and message replication to a remote cluster, which makes it easy to run a downstream (standby) cluster for disaster recovery.

Note This feature is only supported in VMware RabbitMQ. It is not supported in the Open Source RabbitMQ product.

Topics

Getting familiar with the Terminology

  • Upstream Cluster: The primary active cluster is most often called the “upstream” cluster. The “upstream” cluster reference is used in the configuration files, which you will come across when you are configuring the Warm Standby Replication feature. For the remainder of this documentation, it is referred to as the upstream (primary) cluster.
  • Downstream Cluster: The standby remote cluster is most often called the “downstream” cluster. The “downstream” cluster reference is used in the configuration files, which you will come across when you are configuring the Warm Standby Replication feature. For the remainder of this documentation, it is referred to as the downstream (standby) cluster.
  • Schema: Nodes and clusters store information that can be referred to as schema, metadata or topology. Users, vhosts, queues, exchanges, bindings, runtime parameters are all included in this category. This metadata is called definitions in RabbitMQ.
  • Sync Request: A sync request carries a payload that allows the upstream (primary) side to compute the difference between the schemas on the upstream (primary) and downstream (standby) clusters.
  • Sync Response: A sync response carries the difference plus all the definitions that are only present on the upstream (primary) side or conflict. The downstream (standby) side uses this information to apply the definitions. Any entities only present on the downstream are deleted, which ensures that downstreams follow their upstream's schema as closely as possible.
  • Sync Operation: A sync operation is a request/response sequence that involves a sync request sent by the downstream (standby) cluster and a sync response that is sent back by the upstream (primary) cluster.
  • Schema Replication: The automated process of continually replicating schema definitions from an upstream (primary) cluster to one or more downstream (standby) clusters.
  • Message Replication: The automated process of continually replicating published messages from an upstream (primary) cluster to one or more downstream (standby) clusters.
  • Loose Coupling: The upstream and its followers (downstreams) are loosely connected. If one end of the schema replication connection fails, the delta between the clusters' schemas grows, but neither cluster is affected in any other way. (This applies to the message replication connection as well.) If an upstream is under too much load to serve a definition request, or the sync plugin is unintentionally disabled, the downstream does not receive responses for sync requests for a period of time. If a downstream fails to apply definitions, the upstream is not affected and neither are its downstream peers. Therefore, the availability of each side is independent of the other. When multiple downstreams sync from a shared upstream, they do not interfere or coordinate with each other. Both sides have to do a little bit more work. On the upstream side, this load is shared between all cluster nodes. On the downstream side, the load should be minimal in practice, assuming that sync operations are applied successfully, so the delta does not accumulate.
  • Downstream Promotion: Promoting the downstream (standby) cluster to the Upstream (primary) cluster.

Why use Warm Standby Replication

  • It is an automated disaster recovery process that significantly reduces both the time to recover and the risk of data loss.
  • Provides a way of transferring schema definitions in a compressed binary format which reduces bandwidth usage.
  • Avoids cluster co-dependencies because all communication between the sides is completely asynchronous. For example, a downstream (standby) cluster can run a different version of RabbitMQ.
  • Links to other clusters are easy to configure, which is important for disaster recovery (for example, if you are setting up more than one downstream (standby) cluster).

What is Replicated/What is not?

Replicated

  • Schema definitions such as vhosts, quorum queues, users, exchanges, bindings, runtime parameters, and so on.
  • Messages that are published to quorum queues.

Not Replicated

Schema synchronization does not synchronize Kubernetes objects.

How Warm Standby Replication Works

The Warm Standby Replication process uses the following plugins:

Continuous Schema Replication (SchemaReplication) Plugin

The Continuous Schema Replication plugin connects the upstream (primary) cluster to the downstream (standby) cluster via a schema replication link. The downstream (standby) clusters connect to their upstream (primary) cluster and initiate sync operations. These operations synchronize the schema definitions on the downstream side with those on the upstream side. A node running in downstream mode (a follower) can be converted to an upstream (leader) on the fly. This makes the node disconnect from its original source, therefore stopping all syncing. The node then continues operating as a member of an independent cluster, no longer associated with its original upstream. Such a conversion is called a downstream promotion and should be completed in case of a disaster recovery event.

Standby Message Replication Plugin

To ensure improved data safety and reduce the risk of data loss, it is not enough to automate the replication of RabbitMQ entities (schema objects). The Warm Standby Replication feature implements a hybrid replication model: in addition to schema definitions, it also manages the automated and continuous replication of enqueued messages from the upstream (primary) cluster. During the setup process, a replication policy is configured at the vhost level in the upstream (primary) cluster, indicating which queues should be matched and targeted for message replication. Messages and relevant metrics from the upstream queues are then pushed to a streaming log which the downstream(s) subscribe to. Currently, both quorum queues and stream-based queues are supported for message replication.

Important Queue messages are also automatically replicated but they are not directly copied into the queues until the downstream (standby) cluster is promoted to the upstream (primary) cluster. The replicated data includes the messages and a log that records information about the message state, such as a record of acknowledgements. This way, when a disaster recovery event occurs and the downstream (standby) cluster is promoted to the upstream (primary) cluster, it uses the replicated log to recover only those messages that were not already processed on the upstream (primary) cluster.

Requirements for Warm Standby Replication

  • The Standby Replication Operator is installed on the Kubernetes cluster. Note, if the upstream (primary) cluster and the downstream (standby) cluster are on different Kubernetes clusters, then the Standby Replication Operator must be installed on both of these clusters. This operator is installed by default when the VMware RabbitMQ package is installed. The Standby Replication Operator is used to configure the Continuous Schema Replication and Standby Message Replication plugins.
  • The Continuous Schema Replication (SchemaReplication) and Standby Message Replication (StandbyReplication) plugins are enabled. Run the following commands to check whether the plugins are enabled:
    rabbitmq-plugins list rabbitmq_schema_definition_sync
    rabbitmq-plugins list rabbitmq_standby_replication
    
    If the following output is returned (example output here is for Continuous Schema Replication plugin), then the plugins are enabled:
    rabbitmq [ ~ ]$ rabbitmq-plugins list rabbitmq_schema_definition_sync
    Listing plugins with pattern "rabbitmq_schema_definition_sync" ...
     Configured: E = explicitly enabled; e = implicitly enabled
     | Status: * = running on rabbit@6b4e8ac05412
     |/
    [E*] rabbitmq_schema_definition_sync
    
  • Know the credentials (username and password) that you want to use for Warm Standby Replication.
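If either plugin is not listed as enabled, it can typically be enabled with rabbitmq-plugins, using the plugin names from the list commands above. (This is a sketch; on Kubernetes, the preferred approach is to declare the plugins in additionalPlugins in the RabbitmqCluster spec, as shown in the setup steps that follow.)

```shell
# Enable both plugins required for Warm Standby Replication on a node
rabbitmq-plugins enable rabbitmq_schema_definition_sync rabbitmq_standby_replication
```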

Setting up and Configuring Warm Standby Replication

Before continuing, ensure that all Requirements for Warm Standby Replication are in place.

Note There can be multiple downstream (standby) clusters linked to one upstream (primary) cluster. This setup describes one upstream cluster and one downstream cluster.

Setting up the Upstream and Downstream RabbitMQ Clusters

  1. Set up the upstream (primary) and downstream (standby) clusters with the required plugins: Continuous Schema Replication (SchemaReplication) and Standby Message Replication by using the following yaml syntax.

    The following is an example of an upstream (primary) RabbitmqCluster cluster configuration which you can use:

    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    metadata:
      name: upstream-rabbit
    spec:
    ...
      rabbitmq:
        additionalPlugins:
          - rabbitmq_stream
          - rabbitmq_schema_definition_sync
          - rabbitmq_schema_definition_sync_prometheus # optional
          - rabbitmq_standby_replication
        additionalConfig: |
          schema_definition_sync.operating_mode = upstream
          standby.replication.operating_mode = upstream
          # message stream retention limit (can either be size or time based)
          standby.replication.retention.size_limit.messages = 5000000000
          # standby.replication.retention.time_limit.messages = 12h
    

    The following is an example of a downstream (standby) RabbitmqCluster cluster configuration which you can use:

    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    metadata:
      name: downstream-rabbit
    spec:
    ...
      rabbitmq:
        additionalPlugins:
          - rabbitmq_stream
          - rabbitmq_schema_definition_sync
          - rabbitmq_schema_definition_sync_prometheus
          - rabbitmq_standby_replication
        additionalConfig: |
          schema_definition_sync.operating_mode = downstream
          standby.replication.operating_mode = downstream
          schema_definition_sync.downstream.locals.users = ^default_user_
          schema_definition_sync.downstream.locals.global_parameters = ^standby
          # message stream retention limit (can either be size or time based)
          standby.replication.retention.size_limit.messages = 5000000000
          # standby.replication.retention.time_limit.messages = 12h
    
  2. Check the status of the upstream (primary) and downstream (standby) RabbitmqClusters and ensure that the pods for these clusters are running before continuing with the next steps. To check that the pods in the upstream (primary) and downstream (standby) clusters are running, run the following command.

    Note rbtmq-cluster in the following command is the name of an example namespace.

    kubectl get pod -n rbtmq-cluster
    

    Output similar to the following should be returned:

    NAME READY STATUS RESTARTS AGE
    downstream-rabbit-server-0 1/1 Running 1 28d
    downstream-rabbit-server-1 1/1 Running 1 28d
    downstream-rabbit-server-2 1/1 Running 1 28d
    upstream-rabbit-server-0 1/1 Running 1 28d
    upstream-rabbit-server-1 1/1 Running 1 28d
    upstream-rabbit-server-2 1/1 Running 1 28d
    

    You can also check that the upstream (primary) and downstream (standby) services were created properly by running this command:

    kubectl get svc -n rbtmq-cluster
    

    Output similar to the following should be returned:

    NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
    downstream-rabbit NodePort 10.100.198.3 <none> 5672:31414/TCP,15672:30479/TCP,5552:30399/TCP,15692:32563/TCP 28d
    downstream-rabbit-nodes ClusterIP None <none> 4369/TCP,25672/TCP 28d
    upstream-rabbit NodePort 10.100.153.24 <none> 5672:32516/TCP,15672:31792/TCP,5552:30702/TCP,15692:31009/TCP 28d
    upstream-rabbit-nodes ClusterIP None <none> 4369/TCP,25672/TCP 28d
    
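The UPSTREAM_EXTERNAL_IP placeholder used in later steps is derived from this upstream service. As a sketch, assuming the service name and namespace from the examples above (the AMQP port name amqp is an assumption based on the Cluster Operator's defaults):

```shell
# NodePort service (as in the example output): find the node port mapped
# to AMQP, then use any node's address with that port
kubectl get svc upstream-rabbit -n rbtmq-cluster \
  -o jsonpath='{.spec.ports[?(@.name=="amqp")].nodePort}'

# LoadBalancer service: read the external IP directly
kubectl get svc upstream-rabbit -n rbtmq-cluster \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```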

Configuring Warm Standby Replication on the Upstream (Primary) Cluster

To use the Warm Standby Replication feature, you must now configure the Continuous Schema Replication (SchemaReplication) and Standby Message Replication plugins using the Standby Replication Operator.

To configure the Continuous Schema Replication plugin for the upstream cluster, complete the following steps:

  1. Configure a secret to contain a replication-schema user and the user's credentials.
    This user is used by the downstream (standby) cluster to establish a connection and manage the replication.

    The following Standby Replication Operator yaml code provides an example of how to configure the user and secret.

    apiVersion: v1
    kind: Secret
    metadata:
      name: upstream-secret
    type: Opaque
    stringData:
      username: test-user
      password: test-password
    ---
    apiVersion: rabbitmq.com/v1beta1
    kind: User
    metadata:
      name: rabbitmq-replicator
    spec:
      rabbitmqClusterReference:
        name:  upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
      importCredentialsSecret:
        name: upstream-secret
    
  2. Add the write, configure, and read permissions for the user on the rabbitmq_schema_definition_sync vhost. These permissions are required to ensure the Continuous Schema Replication (SchemaReplication) plugin operates correctly. The following yaml code provides an example of how to configure these permissions on the rabbitmq_schema_definition_sync vhost.

    apiVersion: rabbitmq.com/v1beta1
    kind: Permission
    metadata:
      name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all
    spec:
      vhost: "rabbitmq_schema_definition_sync" # name of a vhost
      userReference:
        name: rabbitmq-replicator
      permissions:
        write: ".*"
        configure: ".*"
        read: ".*"
      rabbitmqClusterReference:
        name: upstream-rabbit  # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
    
  3. Configure the SchemaReplication object using the following yaml example code. Note, the endpoint is the service external IP of the upstream (primary) cluster. If you are securing Warm Standby Replication with TLS (refer to Optional: Configuring Warm Standby Replication with TLS), use port 5671 in the endpoint instead of 5672.

    apiVersion: rabbitmq.com/v1beta1
    kind: SchemaReplication
    metadata:
      name: upstream
      namespace: upstream
    spec:
      endpoints: "UPSTREAM_EXTERNAL_IP:5672"
      upstreamSecret:
        name: upstream-secret
      rabbitmqClusterReference:
        name: upstream-rabbit  # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
    
  4. Configure the Standby Message Replication plugin using the following yaml example code. Ensure that kind is set to StandbyReplication, which directs the Standby Replication Operator to configure Warm Standby Replication.
    You can use the Standby Replication Operator to configure which quorum queues the plugin should collect messages for.

    In the following example, the plugin is configured to collect messages for all quorum queues in the vhost test:

    apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
    kind: StandbyReplication
    metadata:
      name: upstream-configuration
    spec:
      operatingMode: "upstream" # has to be "upstream" to configure an upstream RabbitMQ cluster; required value
      upstreamModeConfiguration: # list of policies that Operator will create
        replicationPolicies:
          - name: test-policy # policy name; required value
            pattern: "^.*" # any regex expression that will be used to match quorum queues name; required value
            vhost: "test" # vhost name; must be an existing vhost; required value
      rabbitmqClusterReference:
        name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
    

    Note The spec.operatingMode field must be set to upstream to provide upstream related configurations.

    Note spec.upstreamModeConfiguration.replicationPolicies is a list, and name, pattern, vhost are the required values for the operator policies.

    Note The vhost test must be an existing vhost. It can also be created with the Messaging Topology Operator, for example:

    apiVersion: rabbitmq.com/v1beta1
    kind: Vhost
    metadata:
      name: default
    spec:
      name: "test" # vhost name
      tags: ["standby_replication"]
      rabbitmqClusterReference:
        name: upstream-rabbit # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
    

    Note The "standby_replication" tag and the permissions are used by the plugin to select the vhost to replicate.

    apiVersion: rabbitmq.com/v1beta1
    kind: Permission
    metadata:
      name: rabbitmq-replicator.defaultvhost.all
    spec:
      vhost: "test" # name of a vhost
      userReference: 
        name: rabbitmq-replicator
      permissions:
        write: ".*"
        configure: ".*"
        read: ".*"
      rabbitmqClusterReference:
        name: upstream-rabbit  # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
    
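A sketch of applying and checking the upstream resources above, assuming they are all saved to a single file named upstream-replication.yaml (the filename is illustrative):

```shell
# Create the Secret, User, Permission, SchemaReplication, and
# StandbyReplication resources on the upstream cluster's namespace
kubectl apply -f upstream-replication.yaml -n rbtmq-cluster

# Verify that the custom resources were accepted by the Operator
kubectl get schemareplication,standbyreplication -n rbtmq-cluster
```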

Configuring Warm Standby Replication on the Downstream (Standby) Cluster

To use the Warm Standby Replication feature, you must configure the Continuous Schema Replication (SchemaReplication) and Standby Message Replication plugins on the downstream (standby) cluster as well as on the upstream (primary) cluster, using the Standby Replication Operator. Before continuing with these steps, ensure that you have configured the plugins on the upstream (primary) cluster first; refer to Configuring Warm Standby Replication on the Upstream (Primary) Cluster for more information.

To configure the Continuous Schema Replication plugin for the downstream cluster, complete the following steps:

  1. Configure a secret to contain a replication-schema user and the user's credentials. The following yaml code provides an example of how to configure the user and secret.

    apiVersion: v1
    kind: Secret
    metadata:
      name: upstream-secret
    type: Opaque
    stringData:
      username: test-user
      password: test-password
    ---
    apiVersion: rabbitmq.com/v1beta1
    kind: User
    metadata:
      name: rabbitmq-replicator
    spec:
      rabbitmqClusterReference:
        name:  downstream-rabbit # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
      importCredentialsSecret:
        name: upstream-secret
    
  2. Add the write, configure, and read permissions for the user on the rabbitmq_schema_definition_sync vhost. These permissions are required to ensure the Continuous Schema Replication (SchemaReplication) plugin operates correctly.

    The following yaml code provides an example of how to configure these permissions on the rabbitmq_schema_definition_sync vhost.

    apiVersion: rabbitmq.com/v1beta1
    kind: Permission
    metadata:
      name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all
    spec:
      vhost: "rabbitmq_schema_definition_sync" # name of a vhost
      userReference:
        name: rabbitmq-replicator
      permissions:
        write: ".*"
        configure: ".*"
        read: ".*"
      rabbitmqClusterReference:
        name: downstream-rabbit  # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
    
  3. Configure the SchemaReplication object using the following yaml example code. Note, the endpoint is the service external IP of the upstream cluster.

    apiVersion: rabbitmq.com/v1beta1
    kind: SchemaReplication
    metadata:
      name: downstream
    spec:
      endpoints: "UPSTREAM_EXTERNAL_IP:5672"
      upstreamSecret:
        name: upstream-secret
      rabbitmqClusterReference:
        name: downstream-rabbit  # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
    
  4. Configure the Standby Message Replication plugin using the following yaml example code. Ensure that kind is set to StandbyReplication, which directs the Standby Replication Operator to configure Warm Standby Replication.
    You can use the Standby Replication Operator to configure a downstream (standby) RabbitMQ cluster to connect to a specific upstream RabbitMQ cluster. The operator uses the endpoints and credentials that are provided to set the standby_replication_upstream global parameter in the downstream (standby) RabbitMQ cluster.

    The following example connects the RabbitMQ cluster downstream-rabbit to the RabbitMQ cluster at the UPSTREAM_EXTERNAL_IP:5552 endpoint. Note, the use of the RabbitMQ Stream Protocol port: 5552.

    ---
    apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
    kind: StandbyReplication
    metadata:
      name: downstream-configuration
    spec:
      operatingMode: "downstream" # has to be "downstream" to configure a downstream RabbitMQ cluster
      downstreamModeConfiguration:
        endpoints: "UPSTREAM_EXTERNAL_IP:5552" # comma separated list of endpoints to the upstream RabbitMQ
        upstreamSecret:
          name: upstream-secret # an existing Kubernetes secret; required value
      rabbitmqClusterReference:
        name: downstream-rabbit # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
    

    Note The spec.operatingMode field must be set to downstream to provide downstream related configurations.

    Note spec.downstreamModeConfiguration.endpoints is a comma separated list containing endpoints to connect to the upstream RabbitMQ. Endpoints must be reachable from this downstream cluster with the stream protocol port. If you are securing Warm Standby Replication with TLS, the stream protocol port is 5551 instead of 5552.

    Note spec.downstreamModeConfiguration.upstreamSecret is the name of an existing Kubernetes secret in the same namespace. This secret must contain the `username` and `password` keys. It is used as credentials to connect to the upstream RabbitMQ. For example:

    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: upstream-secret
    type: Opaque
    stringData:
      username: test-user # upstream cluster username
      password: test-password # upstream cluster password
    
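To confirm that the Operator applied the downstream configuration, you can check that the standby_replication_upstream global parameter was set on the downstream cluster. A sketch, assuming the pod and namespace names from the earlier examples:

```shell
# List global parameters on a downstream node; the output should
# include standby_replication_upstream with the configured endpoints
kubectl exec -n rbtmq-cluster downstream-rabbit-server-0 -- \
  rabbitmqctl list_global_parameters
```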

Updating the Replication Configuration

You can update the replication configurations on the upstream (primary) and downstream (standby) clusters after the StandbyReplication custom resources are created on these clusters.

Important Notes:

  • Update operations such as updating the policy name, vhost, patterns, and adding a new policy are supported.
  • Updating the spec.upstreamModeConfiguration.replicationPolicies field is not fully supported. If you remove the definition of an existing policy from the yaml file, that policy is not removed when you update the configuration; it still exists in the RabbitMQ definitions. However, if you add a new policy to the yaml file, the new policy is created in RabbitMQ. The Standby Replication Operator does not clean up policies that are removed from the list.
  • The spec.operatingMode and spec.rabbitmqClusterReference fields cannot be changed. If you need to update these fields, then you must delete the Warm Standby Replication custom resources on the upstream (primary) and downstream (standby) clusters, and then complete the previous sections again to configure them.
  • If you update the user credentials, the Standby Replication Operator does not monitor these updates on the Kubernetes secret object. To apply these updates, you can force the Standby Replication Operator to reconcile by adding a temporary label or annotation to the custom resource.
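For example, a temporary label can be added to the custom resource to trigger a reconciliation after rotating credentials (the label key and value here are illustrative, and the resource name is taken from the earlier examples):

```shell
# Touch the StandbyReplication resource so the Operator re-reads
# the referenced secret; --overwrite allows repeating the command
kubectl label standbyreplication upstream-configuration -n rbtmq-cluster \
  credentials-rotated=manual-refresh --overwrite
```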

Deleting the Replication Configuration

You can remove upstream (primary) and downstream (standby) configurations by deleting the StandbyReplication custom resource.

  • When you delete an upstream configuration, the Standby Replication Operator removes all replication policies set in spec.upstreamModeConfiguration.replicationPolicies from RabbitMQ.
  • When you delete a downstream configuration, the Standby Replication Operator removes the standby_replication_upstream global parameter.
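For example, assuming the resource names used in the earlier configuration steps:

```shell
# Upstream: removes the replication policies created by the Operator
kubectl delete standbyreplication upstream-configuration -n rbtmq-cluster

# Downstream: removes the standby_replication_upstream global parameter
kubectl delete standbyreplication downstream-configuration -n rbtmq-cluster
```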

Optional: Configuring Warm Standby Replication with TLS

You can configure the upstream (primary) and downstream (standby) clusters to complete replication over TLS, which secures communications between the clusters.

Complete the following steps:

  1. Configure your clusters with secrets containing TLS certificates by following this TLS Example.
  2. You can then use these certificates by including the ssl_options configuration parameters in the configuration file. Include these parameters in the same format as the ssl_options, which are detailed in Enabling TLS Support in RabbitMQ.

On the upstream cluster, set the parameters under schema_definition_sync.ssl_options:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
   name: upstream-rabbit
spec:
...
  tls:
    secretName: tls-secret
  rabbitmq:
    additionalPlugins:
    - rabbitmq_stream
    - rabbitmq_schema_definition_sync
    - rabbitmq_schema_definition_sync_prometheus
    - rabbitmq_standby_replication
    additionalConfig: |
       schema_definition_sync.operating_mode = upstream
       standby.replication.operating_mode = upstream
       standby.replication.retention.size_limit.messages = 5000000000
       schema_definition_sync.ssl_options.certfile              = /etc/rabbitmq-tls/tls.crt
       schema_definition_sync.ssl_options.keyfile               = /etc/rabbitmq-tls/tls.key
       schema_definition_sync.ssl_options.verify                = verify_none
       schema_definition_sync.ssl_options.fail_if_no_peer_cert  = false

On the downstream cluster, set the parameters under schema_definition_sync.ssl_options and standby.replication.downstream.ssl_options:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
   name: downstream-rabbit
spec:
...
  tls:
    secretName: tls-secret
  rabbitmq:
    additionalPlugins:
    - rabbitmq_stream
    - rabbitmq_schema_definition_sync
    - rabbitmq_schema_definition_sync_prometheus
    - rabbitmq_standby_replication
    additionalConfig: |
       schema_definition_sync.operating_mode = downstream
       standby.replication.operating_mode = downstream
       schema_definition_sync.downstream.locals.users = ^default_user_
       schema_definition_sync.downstream.locals.global_parameters = ^standby
       standby.replication.retention.size_limit.messages = 5000000000
       schema_definition_sync.ssl_options.certfile              = /etc/rabbitmq-tls/tls.crt
       schema_definition_sync.ssl_options.keyfile               = /etc/rabbitmq-tls/tls.key
       schema_definition_sync.ssl_options.verify                = verify_none
       schema_definition_sync.ssl_options.fail_if_no_peer_cert  = false
       standby.replication.downstream.ssl_options.certfile              = /etc/rabbitmq-tls/tls.crt
       standby.replication.downstream.ssl_options.keyfile               = /etc/rabbitmq-tls/tls.key
       standby.replication.downstream.ssl_options.verify                = verify_none
       standby.replication.downstream.ssl_options.fail_if_no_peer_cert  = false

Important Peer verification (normally configured by setting ssl_options.verify to verify_peer) is not supported for Warm Standby Replication. schema_definition_sync.ssl_options.verify and standby.replication.downstream.ssl_options.verify must be set to verify_none.

Optional: Using Vault for Secrets

The Standby Replication Operator supports fetching RabbitMQ credentials from HashiCorp Vault. If you use Vault to store your RabbitMQ user credentials, you can provide a path to read credentials from Vault instead of providing a Kubernetes secret. The credentials must already be written to Vault before you reference them in custom resources. The Vault secret must have the fields username and password.

The following example shows how to configure SchemaReplication custom resource to get credentials from Vault:

---
apiVersion: rabbitmq.com/v1beta1
kind: SchemaReplication
metadata:
  name: downstream
spec:
  secretBackend:
    vault:
      secretPath: path/to/rabbitmq/creds # instead of spec.upstreamSecret
  ...

The following example shows how to configure StandbyReplication custom resource to get credentials from Vault:

---
apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
kind: StandbyReplication
metadata:
  name: downstream
spec:
  operatingMode: "downstream"
  downstreamModeConfiguration:
    secretBackend:
      vault:
        secretPath: path/to/rabbitmq/creds # instead of spec.downstreamModeConfiguration.upstreamSecret
  ...

Verifying Warm Standby Replication is Configured Correctly

You can complete the following steps to verify that Warm Standby Replication is configured correctly.

  1. Check whether the topology objects are replicated in the downstream (standby) RabbitMQ cluster. You can do this in two ways: either by logging into the RabbitMQ management UI for that specific cluster or by running the rabbitmqctl command with kubectl exec from the command line. With either method, your upstream RabbitMQ topology (vhosts, users, queues, exchanges, policies, and so on) should be listed in the downstream cluster. If you don't see these topology objects, check the logs of the Standby Replication Operator and of the upstream (primary) and downstream (standby) RabbitMQ clusters to investigate the issue.

  2. Use kubectl exec to access the downstream RabbitMQ pods one by one and run the following command to list the vhosts with local data to recover. This list should contain all vhosts that you tagged with "standby_replication":

    rabbitmqctl list_vhosts_available_for_standby_replication_recovery
    
  3. If you published messages to queues in the vhosts that are tagged with "standby_replication", and covered by replication policies (refer to Configuring Warm Standby Replication on the Upstream (Primary) Cluster, step 4), you can list the number of messages replicated for each virtual host, exchange, and routing key by running the following command:

    rabbitmq-diagnostics inspect_local_data_available_for_standby_replication_recovery
    

    Note This command returns replicated messages per exchange and routing key, including messages that were previously consumed. It does not return messages per queue. The routing of messages to their destination queues and the consumption of these messages do not happen until promotion is initiated, which must be taken into consideration when you are interpreting these numbers. If you set up replication a short time ago, the number of available messages should be small and this command should run quickly. This command can take longer to run if the amount of available data is substantial.
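The checks above can be run without an interactive shell by combining them with kubectl exec. A sketch, assuming the pod and namespace names from the earlier examples:

```shell
# List vhosts that have local replicated data to recover
kubectl exec -n rbtmq-cluster downstream-rabbit-server-0 -- \
  rabbitmqctl list_vhosts_available_for_standby_replication_recovery

# Inspect replicated message counts per vhost, exchange, and routing key
kubectl exec -n rbtmq-cluster downstream-rabbit-server-0 -- \
  rabbitmq-diagnostics inspect_local_data_available_for_standby_replication_recovery
```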

Promoting the Downstream (Standby) Cluster for Disaster Recovery

Note Promotion and reconfiguration happen on the fly, and do not involve RabbitMQ node restarts or redeployment.

A downstream (standby) cluster with synchronized schema and messages is only useful if it can be turned into a new upstream (primary) cluster in case of a disaster event. This process is known as “downstream promotion”.

In the case of a disaster event, the recovery process involves several steps:

  • The downstream (standby) cluster is promoted to an upstream (primary) cluster by the service operator. When this happens, all upstream links are closed and for every virtual host, unacknowledged messages are re-published to their original destination queues.
  • Applications are redeployed or reconfigured to connect to the newly promoted upstream (primary) cluster.
  • Other downstream (standby) clusters must be reconfigured to follow the new promoted cluster.

When “downstream promotion” happens, a promoted downstream (standby) cluster is detached from its original upstream (primary) cluster. It then operates as an independent cluster which can be used as an upstream (primary) cluster. It does not sync from its original upstream but can be configured to collect messages for offsite replication to another datacenter.

Notes:

  • The downstream promotion process takes time. The amount of time it takes is proportional to the retention period used. This operation is CPU and disk I/O intensive.

  • Every downstream node is responsible for recovering the virtual hosts it "owns", which helps distribute the load between cluster members. To list the virtual hosts available for downstream promotion (that is, those that have local data to recover), run the following command:

    rabbitmqctl list_vhosts_available_for_standby_replication_recovery
    

To promote a downstream (standby) cluster, that is, start the disaster recovery process, run the following command:

rabbitmqctl promote_standby_replication_downstream_cluster [--start-from-scratch] [--all-available] [--exclude-virtual-hosts "<vhost1>,<vhost2>,<...>"]

Where:

  • --start-from-scratch recovers messages from the earliest available data instead of the last timestamp recovered previously, even if information about the last recovery is available.
  • --all-available forces the recovery of all messages that are available if neither the last cutoff nor the last recovery information is available.
  • --exclude-virtual-hosts excludes the listed virtual hosts from promotion.
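For example, the flags can be combined as follows. The virtual host names here are hypothetical; substitute your own:

```shell
# Promote the downstream, recovering from the earliest available data
# and skipping the hypothetical vhosts "test" and "staging".
rabbitmqctl promote_standby_replication_downstream_cluster \
  --start-from-scratch \
  --exclude-virtual-hosts "test,staging"
```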

To display the promotion summary (in case a promotion was attempted), run the following command:

rabbitmqctl display_standby_promotion_summary

The recovery process stores a summary on disk that records the last recovered timestamp. This prevents the same set of messages from being recovered twice.

After the recovery process completes, the cluster can be used as usual.

Post Promotion

Complete the following steps, as needed, after promoting the downstream (standby) cluster to be the upstream (primary) cluster for disaster recovery.

  • After the downstream (standby) cluster is promoted, if you need to restart the promoted cluster, you must change operatingMode: "downstream" to operatingMode: "upstream" in its definition file, because this modification does not happen automatically when the cluster is restarted. If you don't change it, the promoted downstream (standby) cluster (which is now the upstream (primary) cluster) will run in downstream mode, because its definition file still declares it a downstream cluster.

  • What happens to the original upstream (primary) cluster that experienced a disaster event? It can be brought back as a downstream (standby) cluster for the newly promoted upstream (primary) cluster, it can be promoted back as the upstream (primary) cluster, or it may not be used at all.

  • After promotion, the replicated data on the old downstream (which is effectively the new promoted upstream) can be erased from disk. For example: a cluster in Dublin is the upstream (primary) cluster and a cluster in London is the downstream (standby) cluster. After the London cluster is promoted to upstream (primary), you can remove the previous downstream-related data from it by running the following command:

    rabbitmqctl delete_all_data_on_standby_replication_cluster
    
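The operatingMode change described above can be sketched as follows. The exact location of the key depends on your cluster definition file; this fragment only illustrates the value that must change:

```yaml
# Before promotion, the cluster definition declares:
#   operatingMode: "downstream"
# After promotion, update the definition so a restart keeps the
# cluster in its new role:
operatingMode: "upstream"
```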

Running Diagnostics

Important Running diagnostics is a very time-consuming operation because it reads and parses all data on disk, so it should be used with care. This operation can take a substantial time to run even for medium data sizes.

To inspect the number of messages replicated for each virtual host, exchange, and routing key, run the following command:

rabbitmq-diagnostics inspect_local_data_available_for_standby_replication_recovery

Other Useful Replication Commands

  • If the cluster size changes, the virtual hosts "owned" by every node might change. To delete the data for the virtual hosts that nodes no longer own, run the following command:

    rabbitmqctl delete_orphaned_data_on_standby_replication_downstream_cluster
    
  • To delete the internal streams on the upstream (primary) cluster, run the following command:

    rabbitmqctl delete_internal_streams_on_standby_replication_upstream_cluster
    
  • To inspect the size of the data replicated, run the following command:

    rabbitmqctl display_disk_space_used_by_standby_replication_data
    
  • To disconnect the downstream to stop message replication, run the following command:

    rabbitmqctl disconnect_standby_replication_downstream
    
  • To (re)connect the downstream, to start/resume message replication, run:

    rabbitmqctl connect_standby_replication_downstream
    
  • After promotion, replicated data on the old downstream (standby) cluster (which is now effectively the newly promoted upstream (primary) cluster) can be erased from disk with:

    rabbitmqctl delete_all_data_on_standby_replication_cluster
    

    If the previous command is run on an active downstream (standby) cluster, it deletes all data transferred up to the time of deletion and might also stop the replication process. To ensure replication continues, disconnect and reconnect the downstream using the commands listed above.

Troubleshooting Warm Standby Replication

Learn how to isolate and resolve problems with Warm Standby Replication using the following information.

Messages are collected in the Upstream (Primary) Cluster but are not delivered to the Downstream (Standby) Cluster

Messages and message acknowledgements are continually stored in the upstream (primary) cluster. The downstream (standby) cluster connects to the upstream (primary) cluster, reads from the internal stream on the upstream where the messages are stored, and stores these messages in an internal stream of its own. Messages transferred to the downstream (standby) cluster are not visible in the RabbitMQ Management UI until the downstream (standby) cluster is promoted.

To inspect the information about the stored messages in the downstream (standby) cluster, run the following command:

rabbitmq-diagnostics inspect_standby_downstream_metrics
# Inspecting standby downstream metrics related to recovery...
# queue	timestamp	vhost
# ha.qq	1668785252768	/

If the previous command returns the name of a queue (ha.qq in the example output), it means that the downstream (standby) cluster has messages for that queue ready to be re-published in the event of Promoting the Downstream (Standby) Cluster for Disaster Recovery.
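When checking many queues, the tab-separated table above can be parsed programmatically. The following is a minimal sketch (not part of the product): the function name and the captured `sample` output are illustrative, based on the example output shown above.

```python
def queue_has_recoverable_data(output: str, queue: str, vhost: str = "/") -> bool:
    """Return True if the metrics table lists `queue` in `vhost`.

    `output` is the text printed by
    `rabbitmq-diagnostics inspect_standby_downstream_metrics`.
    """
    # Drop the banner line, keep the header row and data rows.
    lines = [l for l in output.splitlines() if l and not l.startswith("Inspecting")]
    header, *rows = [l.split("\t") for l in lines]
    for row in rows:
        record = dict(zip(header, row))
        if record.get("queue") == queue and record.get("vhost") == vhost:
            return True
    return False

# Example output captured from the command above.
sample = (
    "Inspecting standby downstream metrics related to recovery...\n"
    "queue\ttimestamp\tvhost\n"
    "ha.qq\t1668785252768\t/\n"
)
print(queue_has_recoverable_data(sample, "ha.qq"))  # True
```

A queue missing from this table is not replicated; in that case, work through the checks below on the upstream (primary) cluster.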

If the queue you are searching for is not displayed in the list, verify the following items in the upstream (primary) cluster:

  • Does the effective policy for the queue have the definition remote-dc-replicate: true?

  • Is the queue type Quorum or Stream?

  • Can the replicator user for Warm Standby Replication authenticate? Run the following command to verify:

    rabbitmqctl authenticate_user some-user some-password
    

If all the previous checks pass, check the downstream (standby) cluster's RabbitMQ logs for any related errors.

Verify that the Downstream (Standby) Cluster has received the Replicated Messages before Promotion

Important Running the commands in this section is resource intensive. These commands can take a long time to complete when the amount of data to be recovered is substantial.

Before running the promotion command (for more information, refer to Promoting the Downstream (Standby) Cluster for Disaster Recovery), you can verify which queues have messages and acknowledgements recovered, and the exact number of messages collected.

To verify which queues have messages available for recovery, run the following command in the downstream (standby) cluster:

rabbitmq-diagnostics inspect_standby_downstream_metrics
# Inspecting standby downstream metrics related to recovery...
# queue	timestamp	vhost
# ha.qq	1668785252768	/

To inspect the number of messages, their routing key, and the vhost, run the following command in the downstream (standby) cluster:

rabbitmq-diagnostics inspect_local_data_available_for_standby_replication_recovery
# Inspecting local data replicated for multi-DC recovery
# exchange	messages	routing_key	vhost
# myexchange	2	demo	/
# 	2	ha.qq	/

The previous command reports how many messages can be published to a specific exchange with a specific routing key. If the exchange field is empty, the messages were published to the default exchange. Note that the routing key might not be the same as the name of the destination queue.
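To interpret these numbers in aggregate, the tab-separated output can be totalled per virtual host. The following is a minimal sketch (not part of the product): the function name and the captured `sample` output are illustrative, based on the example output shown above.

```python
from collections import defaultdict

def messages_per_vhost(output: str) -> dict:
    """Total replicated messages per vhost.

    `output` is the text printed by `rabbitmq-diagnostics
    inspect_local_data_available_for_standby_replication_recovery`.
    An empty exchange field means the default exchange.
    """
    # Keep only tab-separated rows (header + data); drop the banner line.
    rows = [l.split("\t") for l in output.splitlines() if "\t" in l]
    header, *data = rows
    totals = defaultdict(int)
    for row in data:
        record = dict(zip(header, row))
        totals[record["vhost"]] += int(record["messages"])
    return dict(totals)

# Example output captured from the command above.
sample = (
    "Inspecting local data replicated for multi-DC recovery\n"
    "exchange\tmessages\trouting_key\tvhost\n"
    "myexchange\t2\tdemo\t/\n"
    "\t2\tha.qq\t/\n"
)
print(messages_per_vhost(sample))  # {'/': 4}
```

Comparing these per-vhost totals before and after promotion is one way to confirm the recovery covered the data you expected.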

How to get a License

To get a license for VMware RabbitMQ products, fill out the VMware RabbitMQ support contact form and we will get back to you with a tailored quote.
