Understanding and Recovering from Network Outages

This topic describes how network outages affect VMware Tanzu GemFire and how to recover from them.

The safest response to a network outage is to restart all the processes and bring up a fresh data set.

However, if you know the architecture of your system well, and you are sure you will not resurrect old data, you can do a selective restart. At the very least, you must restart all the members on one side of the network failure, because a network outage splits the cluster into separate clusters that cannot rejoin automatically.

What Happens During a Network Outage

One member acts as the membership coordinator. If that coordinator loses contact with some other members, for example because of a network problem, the coordinator treats those members as crashed.

If the weight of the crashed members is less than half the total weight of members in the cluster before the network partition, then the coordinator will continue to operate and will distribute a new view with the crashed members removed.
If, however, the total weight of the crashed members is greater than or equal to half the weight of members in the cluster before the network partition, then the coordinator will shut down.
If members with weight totaling half or more of the original weight of all the cluster members survive and can communicate with each other, then they will elect a new coordinator and cluster processing can continue.
If there is no surviving group of members with sufficient weight, then no new coordinator will be chosen.
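
The weight rules above can be sketched as a pair of comparisons. This is only an illustration of the arithmetic described in this topic, not GemFire's implementation: the `Member` type, the function names, and the member weights are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    weight: int  # hypothetical weight, chosen only for illustration


def coordinator_continues(all_members, crashed_names):
    """True if the crashed weight is strictly less than half the original total."""
    total = sum(m.weight for m in all_members)
    crashed = sum(m.weight for m in all_members if m.name in crashed_names)
    # Compare 2 * crashed against total to avoid fractional weights.
    return 2 * crashed < total


def group_can_elect_coordinator(all_members, group_names):
    """True if a surviving group holds half or more of the original total weight."""
    total = sum(m.weight for m in all_members)
    surviving = sum(m.weight for m in all_members if m.name in group_names)
    return 2 * surviving >= total


# Cluster membership before the partition (illustrative weights, total 38).
cluster = [
    Member("locator1", 3),
    Member("server1", 15),
    Member("server2", 10),
    Member("server3", 10),
]

# The coordinator loses one server (weight 10 of 38): it keeps operating.
lost_one = coordinator_continues(cluster, {"server3"})

# The coordinator loses two servers (weight 25 of 38): it shuts down.
lost_two = coordinator_continues(cluster, {"server1", "server2"})

# Those two servers can still see each other (weight 25 of 38),
# so they hold enough weight to elect a new coordinator.
can_elect = group_can_elect_coordinator(cluster, {"server1", "server2"})
```

Note the boundary case: when the crashed members hold exactly half of the original weight, the coordinator shuts down, while a surviving group that holds exactly half is still large enough to elect a new coordinator.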

In addition, members that have been disconnected either via network partition or due to unresponsiveness will automatically try to reconnect to the cluster unless configured otherwise. See Handling Forced Cache Disconnection Using Auto-reconnect.

Recovery Procedure

For deployments that have network partition detection or auto-reconnect deactivated, to recover from a network outage:

1. Decide which applications and cache servers to restart, based on the architecture of the cluster. Assume that any process other than a data source is bad and needs restarting. For example, if an outside data feed is coming in to one member, which then redistributes to all the others, you can leave that process running and restart the other members.
2. Shut down all the processes that need restarting.
3. Restart them in the usual order.

The members recreate the data as they return to active work. For details, see Recovering from Application and Cache Server Crashes.

Effect of Network Failure on Partitioned Regions

Both sides of the cluster continue to run as though the members on the other side were not running. If the members that participate in a partitioned region are on both sides of the network failure, both sides of the partitioned region also continue to run as though the data stores on the other side did not exist. In effect, you now have two partitioned regions.

When the network recovers, the members may be able to see each other again, but they are not able to merge back together into a single cluster and combine their buckets back into a single partitioned region. You can be sure that the data is in an inconsistent state. Whether you are configured for data redundancy or not, you do not really know what data was lost and what was not. Even if you have redundant copies and they survived, different copies of an entry may have different values reflecting the interrupted workflow and inaccessible data.

Effect of Network Failure on Distributed Regions

By default, both sides of the cluster continue to run as though the members on the other side were not running. For distributed regions, however, the region's reliability policy configuration can change this default behavior.

When the network recovers, the members may be able to see each other again, but they are not able to merge back together into a single cluster.

Effect of Network Failure on Persistent Regions

A network failure when using persistent regions can cause conflicts in your persisted data. When you recover your system, you will likely encounter ConflictingPersistentDataExceptions when members start up.

For this reason, enable-network-partition-detection must be set to true if you are using persistent regions.
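
A pre-deployment check for this setting might look like the following sketch. It assumes the property is set explicitly in a properties-style file such as gemfire.properties. The parser is deliberately simplified (plain `key=value` lines only), the helper names are hypothetical, and the check does not account for the product's default value when the property is absent.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def partition_detection_enabled(props):
    """True only if the property is explicitly set to true."""
    value = props.get("enable-network-partition-detection", "")
    return value.lower() == "true"


# Hypothetical file content for a member that hosts persistent regions.
SAMPLE = """
# example settings
enable-network-partition-detection=true
"""

explicitly_enabled = partition_detection_enabled(parse_properties(SAMPLE))
```

A check like this reports only whether the property is explicitly set to true. A missing property is reported as not enabled, even though the member may still use the product's default.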

For information about how to recover from ConflictingPersistentDataException errors, see Recovering from ConflictingPersistentDataExceptions.

Effect of Network Failure on Client/Server Installations

If a client loses contact with all of its servers, the effect is the same as if it had crashed. You need to restart the client. See Recovering from Client Failure. If a client loses contact with some servers, but not all of them, the effect on the client is the same as if the unreachable servers had crashed. See Recovering from Server Failure.

Servers, like applications, are members of a cluster, so the effect of network failure on a server is the same as for an application. Exactly what happens depends on the configuration of your site.