The VMware Blockchain Replica Network uses a BFT consensus protocol to ensure that an identical state on all the Replica nodes.
See the SBFT research paper, https://arxiv.org/pdf/1804.01626.pdf, and Concord open source project GitHub repository, https://github.com/vmware/concord-bft.
BFT Protocol
VMware Blockchain implements the Scalable-BFT (SBFT) protocol based on MIT's PBFT State Machine Replication protocol. SBFT addresses the challenges of scalability and decentralization and guarantees that the Client nodes have safety and liveness even if some Replica nodes are faulty or malicious.
A BFT protocol has two functions:
Liveness- Guarantees that the Client node requests are executed promptly, no matter how forcefully the malicious agent controlling f replicas tries to block them.
Safety- Guarantees that all Replica nodes and Client nodes have identical data, no matter how forcefully the malicious agent controlling f replicas tries to cause a disagreement.
The failed (f) parameter captures the maximum number of Replica nodes a malicious agent can fully control or Replica nodes that fail for any other reason. For example, a malicious agent can cause Replica nodes to fail, send fake messages, or even send different messages to different Replica nodes. Besides, Replica nodes can fail without being maliciously attacked. Replica nodes can fail, for example, due to power, network, or hypervisor failure.
To survive failures of up to f Replica nodes, the VMware Blockchain requires n Replica nodes, where n = 3f + 1. For example, if one Replica node fails, then f = 1, so n = 3 * 1 + 1 = 4 Replica nodes must run VMware Blockchain. If two Replica nodes fail simultaneously, then f = 2, so n = 3 * 2 + 1 = 7 Replica nodes must run VMware Blockchain. The Replica nodes are assumed to be in independent fault domains.
The BFT protocol is the communication between the Replica nodes that allows them to remain synchronized to maintain the same state on all the Replica nodes. Thus, the BFT protocol can tolerate some Replica nodes being slower than others, some Replica nodes being disconnected, failing, or even some Replica nodes acting maliciously.
One Replica node in each view is designated as the primary node, and the other Replica nodes in the network are backups. The primary Replica node communicates with the Client node, distributes the request received from the Client node to the backup Replica nodes, and coordinates the agreement amongst all the Replica nodes.
A view determines the primary Replica node and the backup Replica nodes. The view is represented by a unique integer number that identifies each view. If there is a need to replace the primary Replica node, then a view change process is initiated to designate another Replica node as the new primary node, and the unique view number changes.
A primary Replica node is replaced when it is down or slow to respond. Messages exchanged between the Replica nodes are only accepted from the current view. When a Replica node receives messages from previous views, as a precautionary measure, the node discards them.
Checkpoint
Some Replica nodes process commands faster than other Replica nodes, for example, in the case of network latency. Therefore, the local state of some Replica nodes can be different compared to the slower Replica nodes. The faster Replica nodes occasionally stop processing commands until enough Replica nodes have caught up to prevent a substantial gap in the local states between the Replica nodes.
The Replica nodes agree on a checkpoint every 150 sequence numbers to maintain synchronization between the Replica nodes and make sure that there is no significant gap in the local state of the Replica nodes. This checkpoint is a persisted state synchronization point between the Replica nodes.
When a Replica node reaches the checkpoint, the node sends out a checkpoint message containing the last block's hash value to all other Replica nodes. Then the Replica node waits until 2f+1 Replica nodes reach this checkpoint. A checkpoint is considered stable when 2f+1 Replica nodes reach that checkpoint level. After reaching a stable checkpoint, the Replica nodes can continue to process new commands.
Slow Replica Nodes
When a Replica node is taken down and added back, it must be synchronized to the state of the rest of the Replica nodes on the Replica Network at startup.
Depending on the gap the Replica node must close, there are two ways to perform such a synchronization process.
Request execution- When the gap between the Replica node state and the network state is within the same checkpoint, it requests the neighboring Replica nodes for batch requests and proofs to verify that 2f + 1 Replica nodes accepted the message. After the Replica node receives all the requests, it executes them and commits its local store results.
Resulting state- When the gap between the Replica node state and the network state is beyond the last stable checkpoint, the Replica node activates the State Transfer mechanism. The mechanism starts with finding out the last checkpoint of the other Replica nodes and looks for f+1 replicas with the same checkpoint and hash. After the requested checkpoint is identified, the Replica node simultaneously asks several Replica nodes for the last block of that checkpoint and continues until it gets to a familiar block.