This topic explains how to manage slow receivers in VMware Tanzu GemFire.
If the receiver fails to receive a message, the sender continues to attempt to deliver the message as long as the receiving member is still in the cluster.
During the retry cycle, Tanzu GemFire throws warnings that include this string:
will reattempt
The warnings are followed by an informational message when the delivery finally succeeds.
For distributed regions, the scope of a region determines whether distribution acknowledgments and distributed synchronization are required. Partitioned regions ignore the scope attribute, but for the purposes of this discussion you should think of them as having an implicit distributed-ack scope.
By default, distribution between system members is performed synchronously. With synchronous communication, when one member is slow to receive, it can cause its producers to slow down as well. This, of course, can lead to general performance problems in the cluster.
If you are experiencing slow performance and are sending large objects (multiple megabytes), before implementing these slow receiver options make sure your socket buffer sizes are appropriate for the size of the objects you distribute. The socket buffer size is set using socket-buffer-size in the gemfire.properties
file.
Managing Slow distributed-no-ack Receivers
You can configure your consumer members so their messages are queued separately when they are slow to respond. The queueing happens in the producer members when the producers detect slow receipt and allows the producers to keep sending to other consumers at a normal rate. Any member that receives data distribution can be configured as described in this section.
The specifications for handling slow receipt primarily affect how your members manage distribution for regions with distributed-no-ack scope, where distribution is asynchronous, but the specifications can affect other distributed scopes as well. If no regions have distributed-no-ack scope, the mechanism is unlikely to kick in at all. When slow receipt handling does kick in, however, it affects all distribution between the producer and that consumer, regardless of scope.
Note: These slow receiver options are deactivated in systems using SSL. See SSL.
Each consumer member determines how its own slow behavior is to be handled by its producers. The settings are specified as distributed system connection properties. This section describes the settings and lists the associated properties.
Configuring Async Queue Conflation
When the scope is distributed-no-ack scope, you can configure the producer to conflate entry update messages in its queues, which may further speed communication. By default, distributed-no-ack entry update messages are not conflated. The configuration is set in the producer at the region level.
Forcing the Slow Receiver to Disconnect
If either of the queue timeout or maximum queue size limits is reached, the producer sends the consumer a high-priority message (on a different TCP connection than the connection used for cache messaging) telling it to disconnect from the cluster. This prevents growing memory consumption by the other processes that are queuing changes for the slow receiver while they wait for that receiver to catch up. It also allows the slow member to start fresh, possibly clearing up the issues that were causing it to run slowly.
When a producer gives up on a slow receiver, it logs one of these types of warnings:
When a process disconnects after receiving a request to do so by a producer, it logs a warning message of this type:
These messages only appear in your logs if logging is enabled and the log level is set to a level that includes warning (which it does by default). See Logging.
If your consumer is unable to receive even high priority messages, only the producer’s warnings will appear in the logs. If you see only producer warnings, you can restart the consumer process. Otherwise, the Tanzu GemFire failure detection code will eventually cause the member to leave the cluster on its own.
Use Cases
These are the main use cases for the slow receiver specifications:
Managing Slow distributed-ack Receivers
When using a distribution scope other than distributed-no-ack, alerts are issued for slow receivers. A member that is not responding to messages may be sick, slow, or missing. Sick or slow members are detected in message transmission and reply-wait processing code, triggering a warning alert first. If a member still is not responding, a severe warning alert is issued, indicating that the member may be disconnected from the cluster. This alert sequence is enabled by setting the ack-wait-threshold and the ack-severe-alert-threshold to some number of seconds.
When ack-severe-alert-threshold is set, regions are configured to use ether distributed-ack or global scope, or use the partition data policy. Tanzu GemFire will wait for a total of ack-wait-threshold seconds for a response to a cache operation, then it logs a warning alert (“Membership: requesting removal of entry(#). Disconnected as a slow-receiver”). After waiting an additional ack-severe-alert-threshold seconds after the first threshold is reached, the system also informs the failure detection mechanism that the receiver is suspect and may be disconnected, as shown in the following figure.
The events occur in this order:
When a member fails suspect processing, its cache is closed and its CacheListeners are notified with the afterRegionDestroyed notification. The RegionEvent passed with this notification has a CACHE_CLOSED operation and a FORCED_DISCONNECT operation, as shown in the FORCED_DISCONNECT example.
public static final Operation FORCED_DISCONNECT
= new Operation("FORCED_DISCONNECT",
true, // isLocal
true, // isRegion
OP_TYPE_DESTROY,
OP_DETAILS_NONE
);
A cache closes due to being expelled from the cluster by other members. Typically, this happens when a member becomes unresponsive and does not respond to heartbeat requests within the member-timeout period, or when ack-severe-alert-threshold has expired without a response from the member.
Note: This is marked as a region operation.
Other members see the normal membership notifications for the departing member. For instance, RegionMembershipListeners receive the afterRemoteRegionCrashed notification, and SystemMembershipListeners receive the memberCrashed notification.