This topic describes membership coordinators, lead members, and member weighing in VMware Tanzu GemFire.
Network partition detection uses a designated membership coordinator and a weighting system that accounts for a lead member to determine whether a network partition has occurred.
The membership coordinator is a member that manages the entry and exit of other members of the cluster. With network partition detection enabled, the coordinator can be any Tanzu GemFire member, but locators are preferred. In a locator-based system, if all locators are in the reconnecting state, the system continues to function, but new members cannot join until a locator has successfully reconnected. Once a locator has reconnected, it takes over the role of coordinator.
When a coordinator is shutting down, it sends out a new membership view that removes itself from the list, and the remaining members must determine which member becomes the new coordinator.
The lead member is determined by the coordinator. Any member that has enabled network partition detection, is not hosting a locator, and is not an administrator interface-only member is eligible to be designated as the lead member by the coordinator. The coordinator chooses the longest-lived member that fits the criteria.
The purpose of the lead member role is to provide extra weight in network partition calculations. It performs no other function.
By default, individual members are assigned the following weights (these values are consistent with the example calculations below):

- Each cache server or application member: 10
- The lead member: 15 (the default 10 plus an additional 5)
- Each locator: 3
You can modify the default weight for a specific member by setting the gemfire.member-weight system property when you start the member.
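For example, gfsh can pass the property to the member's JVM with the --J option; a minimal sketch, where the member name and the weight value of 20 are illustrative:

```shell
# Start a cache server with a custom membership weight of 20
# (the server name "server1" is an example).
gfsh start server --name=server1 --J=-Dgemfire.member-weight=20
```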
When the membership view changes, the total weight of the members in the previous view is compared with the total weight of the members that crashed. For the coordinator to continue in its role, it must remain in contact with a majority (by weight) of the members. If the total weight of crashed members is greater than or equal to half of the previous view's total weight, the majority is not reachable and a network partition is declared. When a network partition is declared, the membership coordinator initiates a shutdown.
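The majority rule above can be sketched in a few lines of Java. This is an illustrative calculation only, not GemFire's implementation; the class and method names are hypothetical, and the weights are the defaults described in this topic (lead member 15, other members 10, locators 3):

```java
// Hypothetical sketch of the quorum-weight rule: a partition is declared
// when the crashed members' weight is >= half the previous view's weight.
public class QuorumCheck {
    static final int LEAD_WEIGHT = 15;    // lead member (10 + 5)
    static final int MEMBER_WEIGHT = 10;  // other cache servers / applications
    static final int LOCATOR_WEIGHT = 3;  // locators

    /** True when the surviving side still holds a strict majority by weight. */
    static boolean hasQuorum(int previousViewWeight, int crashedWeight) {
        // Avoid floating point: crashed < half-of-total  <=>  2*crashed < total.
        return crashedWeight * 2 < previousViewWeight;
    }

    public static void main(String[] args) {
        // Example 1 from this topic: 2 locators + 1 lead + 9 other servers.
        int total = 2 * LOCATOR_WEIGHT + LEAD_WEIGHT + 9 * MEMBER_WEIGHT;
        System.out.println(total);                 // 111
        System.out.println(hasQuorum(total, 55));  // true  (55 < 55.5)
        System.out.println(hasQuorum(total, 56));  // false (56 >= 55.5)
    }
}
```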
This section provides some example calculations. The following view-change scenarios are described from the perspective of the coordinator; in each example, the coordinator is a locator.
Example 1: Cluster with 12 members: 2 locators and 10 cache servers, one of which is the lead member. One locator is the coordinator. The view's total weight is (9 x 10) + 15 + (2 x 3) = 111. If members with a combined weight of 55.5 or more crash, the majority is lost.
Example 2: Cluster with 4 members: 2 locators and 2 cache servers, one of which is the lead member. One locator is the coordinator. The view's total weight is (1 x 10) + 15 + (2 x 3) = 31. If members with a combined weight of 15.5 or more crash, the majority is lost.
Even if network partition detection is not enabled, when quorum loss is detected due to unresponsive processes, the locator logs a severe-level message to identify the failed processes:
Possible loss of quorum detected due to loss of {0} cache processes: {1}
where {0} is the number of processes that failed and {1} lists the processes.
Enabling network partition detection allows only one subgroup to survive a split. The rest of the system is disconnected and the caches are closed.
When a shutdown occurs, the members that are shut down will log the following alert message:
Exiting due to possible network partition event due to loss of {0} cache processes: {1}
where {0} is the count of lost members and {1} is the list of lost member IDs.