This topic discusses network outages with VMware GemFire and how to recover from them.
The safest response to a network outage is to restart all the processes and bring up a fresh data set.
However, if you know the architecture of your system well, and you are sure you won’t be resurrecting old data, you can do a selective restart. At the very least, you must restart all the members on one side of the network failure, because a network outage causes separate clusters that can’t rejoin automatically.
One member acts in the role of membership coordinator. If that coordinator loses contact with some other members, say, due to a network problem, the coordinator treats those members as crashed.
In addition, members that have been disconnected either via network partition or due to unresponsiveness will automatically try to reconnect to the cluster unless configured otherwise. See Handling Forced Cache Disconnection Using Auto-reconnect.
For deployments that have network partition detection or auto-reconnect deactivated, to recover from a network outage:
The members recreate the data as they return to active work. For details, see Recovering from Application and Cache Server Crashes.
Both sides of the cluster continue to run as though the members on the other side were not running. If the members that participate in a partitioned region are on both sides of the network failure, both sides of the partitioned region also continue to run as though the data stores on the other side did not exist. In effect, you now have two partitioned regions.
When the network recovers, the members may be able to see each other again, but they are not able to merge back together into a single cluster and combine their buckets back into a single partitioned region. You can be sure that the data is in an inconsistent state. Whether you are configured for data redundancy or not, you don’t really know what data was lost and what wasn’t. Even if you have redundant copies and they survived, different copies of an entry may have different values reflecting the interrupted workflow and inaccessible data.
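The divergence described above can be illustrated with a toy model. This is a minimal sketch that uses plain Java maps to stand in for the two sides of a split partitioned region. It does not use any GemFire API, and the class name, method name, keys, and values are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

public class SplitBrainSketch {

    // Returns the keys whose values differ between the two sides,
    // including keys that exist on only one side.
    public static Set<String> conflictingKeys(Map<String, String> sideA,
                                              Map<String, String> sideB) {
        Set<String> allKeys = new TreeSet<>(sideA.keySet());
        allKeys.addAll(sideB.keySet());

        Set<String> conflicts = new TreeSet<>();
        for (String key : allKeys) {
            if (!Objects.equals(sideA.get(key), sideB.get(key))) {
                conflicts.add(key);
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        // Before the outage both sides hold the same entry.
        Map<String, String> sideA = new HashMap<>();
        sideA.put("order-1", "NEW");
        Map<String, String> sideB = new HashMap<>(sideA);

        // During the outage each side keeps accepting updates on its own.
        sideA.put("order-1", "SHIPPED");
        sideB.put("order-1", "CANCELLED");
        sideB.put("order-2", "NEW");

        // After the network recovers, the copies disagree.
        System.out.println(conflictingKeys(sideA, sideB)); // prints [order-1, order-2]
    }
}
```

Even in this simplified model, nothing in the data itself indicates which value of `order-1` is correct, which is why restarting one side and rebuilding its data is the recommended path.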
By default, both sides of the cluster continue to run as though the members on the other side were not running. For distributed regions, however, the region’s reliability policy configuration can change this default behavior.
When the network recovers, the members may be able to see each other again, but they are not able to merge back together into a single cluster.
A network failure when using persistent regions can cause conflicts in your persisted data. When you recover your system, you will likely encounter ConflictingPersistentDataExceptions when members start up. For this reason, enable-network-partition-detection must be set to true if you are using persistent regions. For information on how to recover from ConflictingPersistentDataException errors should they occur, see Recovering from ConflictingPersistentDataExceptions.
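As a minimal sketch of the setting named above, the following Java snippet builds a standard `java.util.Properties` object with network partition detection turned on. Only the property name comes from this topic. The class and method names are hypothetical, and how the properties reach a member (for example, through a properties file or at cache creation) depends on your deployment.

```java
import java.util.Properties;

public class PartitionDetectionConfig {

    // Builds member properties with network partition detection enabled.
    // The property name is taken from the text above; everything else
    // in this class is illustrative.
    public static Properties memberProperties() {
        Properties props = new Properties();
        props.setProperty("enable-network-partition-detection", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = memberProperties();
        // Print the settings in key=value form, as they would appear
        // in a properties file.
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```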
If a client loses contact with all of its servers, the effect is the same as if it had crashed. You need to restart the client. See Recovering from Client Failure. If a client loses contact with some servers, but not all of them, the effect on the client is the same as if the unreachable servers had crashed. See Recovering from Server Failure.
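The two client cases above can be summarized as a small decision function. This is a sketch only: the class, enum, and method names are hypothetical, and it models the decision described in this topic, not any GemFire client API.

```java
public class ClientOutageSketch {

    // Possible responses, following the two cases described above.
    public enum Action { NO_ACTION, HANDLE_AS_SERVER_FAILURE, RESTART_CLIENT }

    // Chooses a recovery path from the number of servers the client is
    // configured to use and the number it can still reach.
    public static Action classify(int configuredServers, int reachableServers) {
        if (configuredServers <= 0
                || reachableServers < 0
                || reachableServers > configuredServers) {
            throw new IllegalArgumentException("invalid server counts");
        }
        if (reachableServers == 0) {
            // Contact lost with every server: same effect as a client crash.
            return Action.RESTART_CLIENT;
        }
        if (reachableServers < configuredServers) {
            // Some servers unreachable: same effect as those servers crashing.
            return Action.HANDLE_AS_SERVER_FAILURE;
        }
        return Action.NO_ACTION;
    }

    public static void main(String[] args) {
        System.out.println(classify(3, 0)); // prints RESTART_CLIENT
        System.out.println(classify(3, 2)); // prints HANDLE_AS_SERVER_FAILURE
        System.out.println(classify(3, 3)); // prints NO_ACTION
    }
}
```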
Servers, like applications, are members of a cluster, so the effect of network failure on a server is the same as for an application. Exactly what happens depends on the configuration of your site.