Handling Forced Cache Disconnection Using Auto-reconnect

This topic explains how to use auto-reconnect in VMware Tanzu GemFire to handle forced cache disconnection.

A Tanzu GemFire member may be forcibly disconnected from a Tanzu GemFire cluster if the member is unresponsive for a period of time, or if a network partition separates one or more members into a group that is too small to act as the cluster.

How the Auto-reconnection Process Works

After being disconnected from a cluster, a Tanzu GemFire member shuts down and, by default, automatically restarts into a “reconnecting” state, while periodically attempting to rejoin the cluster by contacting a list of known locators. If the member succeeds in reconnecting to a known locator, the member rebuilds its view of the cluster from existing members and receives a new distributed member ID.

If the member cannot connect to a known locator, it checks whether it is itself a locator (or hosts an embedded locator process). If it is, the member performs a quorum-based reconnect: it attempts to contact a quorum of the members that were in the membership view just before it was disconnected. If a quorum of members can be contacted, startup of the cluster is allowed to begin. Because the reconnecting member does not know which members survived the network partition event, all members in a reconnecting state keep their membership ports open and respond to ping requests.

Membership quorum is determined using the same member weighting system used in network partition detection. See Membership Coordinators, Lead Members, and Member Weighting.
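The weighted quorum check can be illustrated with a minimal sketch. It is not the product's implementation: the class name QuorumSketch, the member identifiers, and the "more than half of the total weight" threshold are all assumptions made for illustration. The actual weights and threshold are described in the member weighting topic referenced above.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a weight-based quorum check.
final class QuorumSketch {

    // previousViewWeights: weight of each member in the last known membership view.
    // responders: members that answered a ping during the reconnect attempt.
    // Returns true when the responders hold more than half of the total weight.
    // The "more than half" threshold is an assumption, not taken from the product.
    static boolean hasQuorum(Map<String, Integer> previousViewWeights, Set<String> responders) {
        int total = 0;
        int reachable = 0;
        for (Map.Entry<String, Integer> entry : previousViewWeights.entrySet()) {
            total += entry.getValue();
            if (responders.contains(entry.getKey())) {
                reachable += entry.getValue();
            }
        }
        return total > 0 && reachable * 2 > total;
    }
}
```

The point of the sketch is that quorum depends on the combined weight of the members that respond, not on how many of them respond.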

Note that when a locator is in the reconnecting state, it provides no discovery services for the cluster.

The default settings for reconfiguring the cache after a reconnect assume that the cluster configuration service has a valid (XML) configuration. This is not the case if the cluster was configured using API calls. To handle this case, either deactivate auto-reconnect by setting the following property:

disable-auto-reconnect = true

or deactivate the cluster configuration service by setting the following property:

enable-cluster-configuration = false
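For members that are configured programmatically, the two alternatives above can be sketched as java.util.Properties objects. The helper class ReconnectSettings and its method names are hypothetical; only the two property names and values come from this topic.

```java
import java.util.Properties;

// Hypothetical helper; only the property names and values are taken from this topic.
final class ReconnectSettings {

    // Alternative 1: deactivate auto-reconnect for this member.
    static Properties withoutAutoReconnect() {
        Properties props = new Properties();
        props.setProperty("disable-auto-reconnect", "true");
        return props;
    }

    // Alternative 2: keep auto-reconnect, but deactivate the cluster configuration service.
    static Properties withoutClusterConfiguration() {
        Properties props = new Properties();
        props.setProperty("enable-cluster-configuration", "false");
        return props;
    }
}
```

Either Properties object could then be supplied when the cache is created, for example through the CacheFactory(Properties) constructor.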

After the cache has reconnected, applications must fetch a reference to the new Cache, Regions, DistributedSystem and other artifacts. Old references will continue to throw cancellation exceptions like CacheClosedException(cause=ForcedDisconnectException).
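One way to avoid stale references is to keep the current cache behind a single holder that application code always consults. The following is a minimal sketch under that assumption. CurrentCacheHolder is a hypothetical class, written generically so that it compiles without the product libraries. With the real API, the lookup function could be Cache::getReconnectedCache.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Hypothetical holder: callers ask it for the current cache instead of keeping
// their own reference, which would become stale after a forced disconnect.
final class CurrentCacheHolder<C> {
    private final AtomicReference<C> current;
    // Maps a stale cache to its replacement, for example Cache::getReconnectedCache.
    private final UnaryOperator<C> reconnectedLookup;

    CurrentCacheHolder(C initial, UnaryOperator<C> reconnectedLookup) {
        this.current = new AtomicReference<>(initial);
        this.reconnectedLookup = reconnectedLookup;
    }

    // Returns the most recently installed cache reference.
    C get() {
        return current.get();
    }

    // Call once the reconnect has completed. Returns true if a new reference was installed.
    boolean refresh() {
        C old = current.get();
        C replacement = reconnectedLookup.apply(old);
        if (replacement == null || replacement == old) {
            return false; // nothing new to install
        }
        return current.compareAndSet(old, replacement);
    }
}
```

After a successful refresh(), Regions and other artifacts would also need to be looked up again from the new cache, because references obtained from the old cache remain closed.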

See the Tanzu GemFire DistributedSystem and Cache Java API documentation for more information.

Managing the Auto-reconnection Process

By default, a Tanzu GemFire member tries to reconnect until it is told to stop through the DistributedSystem.stopReconnecting() or Cache.stopReconnecting() method. You can deactivate automatic reconnection entirely by setting the disable-auto-reconnect property to true.

You can use DistributedSystem and Cache callback methods to perform actions during the reconnect process, or to cancel the reconnect process if necessary.

The DistributedSystem and Cache APIs provide several methods that you can use while a member is reconnecting to the cluster:

DistributedSystem.isReconnecting() returns true if the member is in the process of reconnecting and recreating the cache after having been removed from the system by other members.
DistributedSystem.waitUntilReconnected(long, TimeUnit) waits for a period of time, and then returns a boolean value to indicate whether the member has reconnected to the DistributedSystem. Use a value of -1 seconds to wait indefinitely until the reconnect completes or the member shuts down. Use a value of 0 seconds as a quick probe to determine if the member has reconnected.
DistributedSystem.getReconnectedSystem() returns the reconnected DistributedSystem.
DistributedSystem.stopReconnecting() stops the reconnection process and ensures that the DistributedSystem stays in a disconnected state.
Cache.isReconnecting() returns true if the cache is attempting to reconnect to a cluster.
Cache.waitUntilReconnected(long, TimeUnit) waits for a period of time, and then returns a boolean value to indicate whether the DistributedSystem has reconnected. Use a value of -1 seconds to wait indefinitely until the reconnect completes or the cache shuts down. Use a value of 0 seconds as a quick probe to determine if the member has reconnected.
Cache.getReconnectedCache() returns the reconnected Cache.
Cache.stopReconnecting() stops the reconnection process and ensures that the DistributedSystem stays in a disconnected state.
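The methods listed above might be combined as in the following sketch, which waits a bounded time for the reconnect and cancels it if the deadline passes. To keep the example self-contained, it uses a stand-in interface, ReconnectingCache, that mirrors only the Cache methods named above. The interface, the ReconnectWaiter class, and the declared InterruptedException are assumptions, not the product's types.

```java
import java.util.concurrent.TimeUnit;

// Stand-in for the subset of Cache methods listed above; not the real Cache type.
interface ReconnectingCache {
    boolean isReconnecting();
    boolean waitUntilReconnected(long time, TimeUnit units) throws InterruptedException;
    ReconnectingCache getReconnectedCache();
    void stopReconnecting();
}

// Hypothetical helper that waits for a reconnect or cancels it after a timeout.
final class ReconnectWaiter {

    // Returns the reconnected cache, or null if the reconnect was cancelled
    // or no reconnected cache is available.
    static ReconnectingCache awaitOrCancel(ReconnectingCache stale, long timeoutSeconds)
            throws InterruptedException {
        if (stale.isReconnecting()) {
            boolean reconnected = stale.waitUntilReconnected(timeoutSeconds, TimeUnit.SECONDS);
            if (!reconnected) {
                // Deadline passed: stop retrying so the member stays disconnected.
                stale.stopReconnecting();
                return null;
            }
        }
        return stale.getReconnectedCache();
    }
}
```

Whether to cancel after a timeout or to keep waiting is an application decision; passing -1 as the wait time, as described above, waits indefinitely instead.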

Operator Intervention

You may need to intervene in the auto-reconnect process if processes or hardware have crashed or have otherwise been shut down before the network connection is healed. In this case, the members in a “reconnecting” state cannot find the lost processes and do not rejoin the system until they can contact a locator.