Unlocking the Power of On-Demand VMware RabbitMQ for Tanzu Application Service

This topic tells you how to benefit from the on-demand service plans for VMware Tanzu RabbitMQ for Tanzu Application Service.

Introduction

VMware Tanzu RabbitMQ for Tanzu Application Service responds to the demands of operators by offering a RabbitMQ on-demand cluster for their app developer teams, in addition to the single-node on-demand plan.

The on-demand cluster plan is designed for workloads that require the same resilience requirements as the pre-provisioned offering, but also require their workloads be isolated. The platform operations team can configure a Tanzu RabbitMQ for Tanzu Application Service cluster to meet their business requirements and empower app development teams to self-serve their own RabbitMQ cluster.

Tanzu RabbitMQ for Tanzu Application Service also provides smoke tests for the on-demand plans so that operations teams can validate the app developer workflow for on-demand services. See Smoke Tests.

With the on-demand cluster plan, platform operators can offer their app developers three types of Tanzu RabbitMQ for Tanzu Application Service service plans:

On-Demand single node—For app developer teams requiring greater isolation than provided by the pre-provisioned approach. App development teams can have full access to their own message broker to adapt the runtime parameters to their requirements. For more information about these parameters, see Parameters and Policies in the RabbitMQ documentation.
On-Demand cluster—For an increased level of message resilience and cluster availability, as well as the benefits of workload isolation mentioned above.
Pre-provisioned—For light to moderate messaging needs, this service is fully operated and managed by platform operators as a service.

For information about the pre-provisioned plan, see Deploying the Tanzu RabbitMQ for Tanzu Application Service Pre-Provisioned Service. For information about using pre-provisioned plans to isolate workloads, see Creating Isolation with the Tile Replicator.

Deciding Which Service Plan to Use

VMware recommends the on-demand service, which is designed for independent isolated Tanzu RabbitMQ for Tanzu Application Service service instances. The existing pre-provisioned offering has many Tanzu RabbitMQ for Tanzu Application Service service instances on a single VM, in a multi-tenancy model. In this multi-tenancy model, a single misbehaving app can take down the entire cluster for everyone.

From research and feedback on the issues customers had when using the pre-provisioned service, VMware is making different design decisions for the on-demand service.

To provide you enough visibility to decide which service to use, the table below describes the current feature discrepancies between the pre-provisioned and on-demand services, and plans for addressing these discrepancies. Please give feedback on how to meet your use case requirements with the on-demand service.

Feature	Pre-Provisioned Service	On-Demand Service
Configuration	Enabled in base-64 encoded text box	Plan to address
Plug-ins	Tier-1 plug-ins enabled using checkboxes in UI.	A selection of tier-1 plug-ins are enabled by default on all instances. Other tier-1 plug-ins can be optionally enabled by an app developer. See RabbitMQ Server Plug-ins.
RabbitMQ admin credentials to access RabbitMQ Management UI	Can set password using tile UI. For more information, see [Access RabbitMQ Management UI for Pre-Provisioned Service Instances](./managing.html#access-ui-pp).	Can access by creating a service key, see Create an Admin User for a Service Instance.
Erlang cookie	Operator can change. This might cause problems. For more information, see Changing the Erlang Cookie Value Known Issue.	This is managed by the service. No operator intervention needed.
Tanzu RabbitMQ for Tanzu Application Service TLS versions	Available due to security concerns about the TLS packaged with the pre-provisioned service	All instances only have TLS v1.1 and TLS v1.2 available.
Resource Sharing	Yes. Service instances share resources on the same VM and can affect one another.	No. On-Demand ensures isolation between service instances by creating a separate VM per service instance.
External load balancer DNS name	Available	Plan to address IaaS-specific load balancers can still be used.
Disk free alarm limit	Configurable	No plans to address. The default persistent disk size is controlled at the plan level and is set relative to memory. This removes the ability to mis-configure the alarm limit.
Load-balancing	Available using HAProxy	Available using BOSH DNS
RabbitMQ servers static IP	Exists	Evaluating based on customer needs and feedback
Policy for new instances	Configurable	Plan to address
Almost-instant provisioning	Instantly provisioned	No plans to address. Instant provisioning for on-demand is limited by the IaaS. This limitation might be addressed by containers.
Individual instance upgrade	Possible when using tile replicator	Plan to address
Network partition behavior	Defaults to `pause_minority` Configurable	Defaults to `pause_minority` Configurable for each plan
TLS	TLS enabled between client and RabbitMQ broker, possible to configure peer validation	TLS between client and the RabbitMQ broker has been implemented.

On-Demand Single Node Plan

This plan is designed to be simple to configure, deploy, and use. It gives app developer teams fast access to the power of the leading open source message broker backed by BOSH to meet all but the most demanding high availability app messaging requirements.

This plan can suit high-performance workloads requiring messaging resilience and asynchronous messaging replication. RabbitMQ copies messages to disk for resilience and allows asynchronous messaging replication through the RabbitMQ Federation plug-in.

This plan offers:

Fast access to an isolated instance of RabbitMQ scoped for the app developer teams
Org and space admin access to the RabbitMQ Management UI so app developer teams can have full control over the node
Updates and upgrades initiated and controlled by the operator to keep the instance up-to-date with the latest security patches and issue fixes
Message resilience provided through RabbitMQ exchange, queue Federation, and Shovel plug-ins.

On-Demand Cluster Plan

Like the single node plan, this plan is designed to be simple to configure, deploy and use. It gives app developer teams fast access to the power of the leading Open Source message broker backed by BOSH to meet all but the most demanding high availability app messaging requirements.

This plan can suit high performance workloads requiring messaging resilience (copied to disk) and asynchronous messaging replication through the RabbitMQ Federation plug-in. With this plan, however, you also scale out Tanzu RabbitMQ for Tanzu Application Service to multiple nodes.

This plan offers:

Fast access to an isolated, clustered instance of RabbitMQ scoped to the app developer team orgs and spaces
Admin access to the RabbitMQ Management UI to give app developer teams full control over the cluster
Updates and upgrades initiated and controlled by the operator to keep the instance up-to-date with the latest security patches and issue fixes.
Message resilience provided by mirroring queues across RabbitMQ nodes, and the option to use the Federation and Shovel plug-ins.

General Principles of the Cluster Plan

The following are some general principles to be aware of when configuring the cluster plan:

Designed for Consistency

RabbitMQ clustering is not primarily a solution for increased availability. Instead, it is designed for consistency and partition tolerance, as described in the CAP theorem. RabbitMQ clustering provides increased message consistency through queue mirroring. This means that messages accessed in one queue are exactly the same as in another queue. For more information, see Consistency or Availability Trade-Off.

Other options can be used for availability requirements, such as the use of federation between exchanges or queues.

For a detailed description of distributed RabbitMQ brokers, see the RabbitMQ documentation.

Number of Nodes

Every node in the on-demand cluster maintains a complete database of all metadata, and all changes to the metadata are confirmed by every node in the cluster. Therefore, going beyond seven nodes can have a significant negative impact on performance. For optimum resilience and performance, VMware recommends three nodes for most workloads.

Network Latency

RabbitMQ clusters are only recommended for deployment in low latency networks, which normally means that it is not advisable to deploy these clusters across availability zones (AZs). The stability and performance of the RabbitMQ cluster is heavily influenced by the workload on the nodes, replication choices, and network latency.

For this reason, VMware recommends that you deploy RabbitMQ clusters into a single Ops Manager AZ. However, where different AZs are in the same data center, with reliable low latency links, spanning AZs can be used.

For cloud IaaS deployments, VMware does not recommend that deployments span regions. For example, in Amazon Web Services (AWS) terms, deploying a RabbitMQ cluster across AZs within a region should provide high enough network performance to prevent impacting cluster stability. However, deploying across AWS regions is likely to lead to cluster instability. For more information, see the AWS documentation.

Consistency or Availability Trade-Off

In a distributed messaging system, a trade-off must be made between availability or consistency when a network partition event occurs and one or more nodes are not able to communicate with each other. The cluster plan lets operators decide how they want the RabbitMQ cluster to react in the event of a network partition.

VMware recommends keeping the default cluster partition option of pause_minority because this satisfies most use cases. Choosing the pause_minority partition-handling strategy favors message consistency over availability. For more information about the options for handling partitions, see the RabbitMQ documentation. For a detailed description of the options available in Tanzu RabbitMQ for Tanzu Application Service, see Clustering and Network Partitions.

Here is an example of how pause_minority works. If you create a RabbitMQ cluster with three nodes and one node becomes unable to communicate with the other two, this node is in the minority. The node that is in the minority is paused, and the other two nodes continue serving traffic. If each of the nodes loses connectivity with the other two, then the entire cluster is paused to preserve data as no majority can be established. The cluster heals when two or more nodes are able to communicate with each other.

RabbitMQ Queue Availability

It is important to be aware that message queue availability is different from cluster availability. So, having cluster availability does not mean that all of the messages within the queues are also available.

By default, queues within a RabbitMQ cluster are located on a single node—the node on which they were first declared. However, queues can be configured to mirror across multiple nodes, so that any message published to the queue is replicated to all mirrors. Enabling mirroring can have a negative impact on queue performance because messages must be copied to all mirrors before being acknowledged.

Each mirrored queue consists of one leader replica and one or more mirrors, with the oldest mirror being promoted to the new leader replica if the old leader replica disappears for any reason. Consumers are connected to the leader replica regardless of which node they connect to, and mirrors drop messages that have been acknowledged at the leader replica. Queue mirroring enhances queue availability, but does not distribute load across nodes because each of the participating nodes must still do all the work.

App developers must decide if they want to use queue mirroring and determine the policy they want to apply to their queues. These choices have significant impact on the availability of their queues. For more information, see the RabbitMQ documentation.

Unlike the pre-provisioned plan, the cluster plan does not ship with a default load balancer. Therefore, developers must configure their app to use the array of hosts provided in VCAP_SERVICES. If developers enable queue mirroring, they must also ensure their apps have re-try logic and reconnection logic that iterates over the range of hosts provided. Most common RabbitMQ clients have this logic built into them. For more information, see the Spring Advanced Message Queuing Protocol (Spring AMQP) documentation.

Because the cluster plan is designed to enable app developer teams to self-serve, not having a load balancer in front of the RabbitMQ cluster has these benefits:

Manage resources better, as fewer VMs are needed.
Help with troubleshooting. Client IP is now the IP of the source container and not the HAProxy.
Reduce the number of hops between apps and broker. This helps with latency.
Determine queue placement. This makes sense for larger scale deployments.
Empower app developer teams to manage their cluster in the best way for their app.
Require re-try logic in an app if it needs HA access to a queue. Thus, all nodes can route to a queue if it is available.

Managing On-Demand Resources Through Plans

In configuring each plan, there are a number of operational controls that platform operations teams can use to manage the resources consumed by on-demand RabbitMQ:

Control Access—Operators can choose the app development orgs and spaces for which the plans are available and visible. Each plan can be activated or deactivated, and service access and visibility can either be global, or enabled per org and space through the command line.

For example, you might decide to enable the single node on-demand plan across all app developer teams to meet their demand to isolate their workload. You might then choose to offer the on-demand cluster plan only to a subset of app developer teams who require the extra resources.
Set Quotas—You can set a global quota for all on-demand instances that takes precedence over each plan quota. This lets you guard against the risk of over-committing resources, but allows the flexibility of over-committing each plan, so you can meet the fluctuating demands of your app developers.
Control Resource Consumption—Each plan offers more fine-grained control over individual plan resource consumption. At the highest level, you can use the plan quota to control the number of instances that can be deployed within a foundation. For each plan, you can also configure the number of nodes that constitute a cluster (3, 5, or 7), the instance type, and persistent disk storage size to best suit your requirements.
Monitor—You can monitor the number of instances that have been deployed against the quota you have set so that you can plan future resource requirements.

Customizing Plan Options

The Tanzu RabbitMQ for Tanzu Application Service on-demand plans expose a number of configuration options. In most cases, the default configurations meet most app demands. However, it is important for an operations team to consider the options to ensure that they provide the best service to their app developers. This section explains these options.

Configuration Options

Single Node and Cluster Plans

Activate or deactivate the plan
Determine which orgs and spaces can see and access the plan
Set Service Instance Quota
Select AZ placement (where applicable)
Set Tanzu RabbitMQ for Tanzu Application Service service instance size (CPU and Memory)
Set persistent disk size (Persisted Message Store) for the Tanzu RabbitMQ for Tanzu Application Service service instance. Ensure the size of the persistent disk is at least twice as large as the instance memory.

Cluster Plan Only

Set number of nodes to 3, 5, or 7
Determine network partition behavior. See Consistency or Availability Trade-Off above.

Note A load balancer, such as HAProxy, is not deployed with on-demand cluster plans.

Things That Are Preconfigured

The following are preconfigured for both the single node and the cluster plans:

RabbitMQ VM Type—When installing using Ops Manager, each RabbitMQ node is configured to have the following properties:
- CPUs: 2
- RAM: 8 GB
- Ephemeral disk: 16 GB
You can change these settings in the Service Plan Configuration page. Changing these settings affects all nodes.
Persistent Disk Type—When installing using Ops Manager, each RabbitMQ node is configured to have 30 GB of persistent disk space.

You can change this setting in the Service Plan Configuration page. VMware recommends you set this value to be twice the amount of RAM of the selected RabbitMQ VM Type.
Metrics—Emitted to the Loggregator Firehose for all on-demand instances. The polling interval is set in Ops Manager, in the Metrics polling interval field, in the Pre-Provisioned RabbitMQ tab of the Tanzu RabbitMQ for Tanzu Application Service tile. Due to the impact of some of the cluster settings detailed below, VMware strongly recommends that you monitor the exposed metrics and configure alarms as recommended in Monitoring and KPIs for On-Demand VMware Tanzu RabbitMQ for Tanzu Application Service. See also Monitoring On-Demand RabbitMQ Clusters below.
Logs—On-demand Tanzu RabbitMQ for Tanzu Application Service service instance logs are forwarded using the same configuration as contained in the Syslog and Metrics tab of the Tanzu RabbitMQ for Tanzu Application Service tile.
Disk free space limit—The disk free space limit is set to 150% of RAM of the instance type you select. For example, if you select an instance type with 10 GB of RAM, the disk free space limit is set to 15 GB. A cluster-wide alarm is triggered if the amount of free disk space drops below this, and all publishers are blocked. Instances must be configured to have persistent disks that are at least twice the size of instance RAM. For more information, see the RabbitMQ documentation.
Memory threshold for triggering flow control—Threshold at which flow control is triggered is set to 40% of the instance RAM. This means that when the alarm is triggered, all connections publishing messages are blocked cluster-wide until the alarm is cleared.

For example, if you select an instance type with 10 GB of RAM, when more than 4 GB of memory is used, all publishing connections are blocked. For more information, see Memory Alarms in the RabbitMQ documentation.
Memory paging threshold—This is the level at which RabbitMQ tries to free up memory by instructing queues to page their contents out to disk. This is done to try to avoid reaching the high watermark and blocking publishers. This threshold is set to 50% of the configured high watermark, which is 20% of configured memory.

For example, if you select an instance type with 10 GB of RAM, when more than 2 GB of memory is used, all queues start writing all queue contents to disk. For more information, see the RabbitMQ documentation.

Monitoring On-Demand RabbitMQ Clusters

It is important to monitor and compare the number of instances that have been deployed against the quota you set using the metric exposed on the Firehose.
Each instance is pre-configured to emit metrics to the Firehose and can be identified by the deployment tag, which has the service instance ID. It is important to monitor these metrics as recommended in Monitoring and KPIs for On-Demand VMware Tanzu RabbitMQ for Tanzu Application Service.

About Migrating a Pre-provisioned Instance to an On-Demand Instance

VMware recommends the on-demand service for production workloads due to its workload isolation.

For instructions for developers about migrating from a pre-provisioned to an on-demand instance, see Migrating a Tanzu RabbitMQ for Tanzu Application Service Service Instance to Another Service Instance. For how operators can turn off the pre-provisioned service, see Turning Off the Pre-Provisioned Service.

When migrating from a pre-provisioned to an on-demand offering, be aware of the following:

Pre-provisioned service instances have very low resource consumption, that is, a virtual host within an existing cluster. However, every on-demand service consists of dedicated VMs. Therefore, you must select an on-demand service plan that provides adequate resources, but avoids unnecessary resource consumption.
- For example: you might select a single-node plan with a small VM for development purposes, but select a three-node cluster of large VMs for a mission critical system.
VMware recommends that you define all required structures in your app to ensure they get defined if you connect to a new instance. These structures include:
- Exchanges
- Queues
- Bindings
If your pre-provisioned instance uses any of the following, ensure that you apply them to the on-demand instance:

Policies
virtual host-specific parameters, such as max_connections

You can apply them using the RabbitMQ Management UI or using APIs.

You lose messages that have not been consumed when you delete your old service instance. If you do not want to lose messages, do one of the following:
- Switch your producers to the new instance but keep the consumers bound to the old instance until the queues are empty.
- Use shovel or federation plug-ins to consume messages from the old instance.

Differences between Pre-provisioned and On-Demand Services Instances

There are some differences between pre-provisioned and on-demand services instances that you should be aware of:

On-Demand instances are not fronted by a load balancer, therefore, ensure the following:
- That you configure the RabbitMQ client in your app with all available nodes, not just one.
- That your re-connection logic can handle a node failure. VMware recommends this behavior for Spring AMQP clients.
The instance you are migrating to might use a different version of RabbitMQ than your old instance. For more information, see the VMware Tanzu RabbitMQ for Tanzu Application Service Release Notes and the RabbitMQ Changelog.
Critical tier-1 plug-ins are enabled for on-demand. However, on-demand does not yet have the same plug-ins enabled as pre-provisioned. If you are missing a plug-in, contact your VMware representative.
You might have configured your on-demand instance differently. For example, you might have changed:
- VM sizing—CPU, RAM, and disk
- RabbitMQ network partition behavior, both offerings default to pause_minority
If your Tanzu RabbitMQ for Tanzu Application Service service instance uses TLS, ensure that you enable TLS for on-demand instances. See Enable TLS for Your Service Instance.