The VMware Blockchain Replica Network uses the BFT State Machine Replication protocol and Merkle Tree data structures to ensure that the data on the Replica and Client nodes is identical.

See the SBFT research paper at https://arxiv.org/pdf/1804.01626.pdf and the Concord open-source project GitHub repository at https://github.com/vmware/concord-bft.

BFT Protocol

VMware Blockchain implements the Scalable-BFT (SBFT) protocol, which is based on the PBFT State Machine Replication protocol developed at MIT. SBFT addresses the challenges of scalability and decentralization, and it guarantees safety and liveness for Client nodes even if some of the Replica nodes are faulty or even malicious.

A BFT protocol provides two guarantees:

  • Liveness - Guarantees that Client node requests are executed promptly, no matter how forcefully a malicious agent controlling f Replica nodes tries to block them.

  • Safety - Guarantees that all Replica nodes and Client nodes have identical data, no matter how forcefully a malicious agent controlling f Replica nodes tries to cause a disagreement.

The failed (f) parameter captures the maximum number of Replica nodes that a malicious agent can fully control, or that fail for any other reason. A malicious agent can cause Replica nodes to fail, send fake messages, or even send different messages to different Replica nodes. Replica nodes can also fail without being maliciously attacked, for example, due to power, network, or hypervisor failure.

To survive failures of up to f Replica nodes, VMware Blockchain requires n Replica nodes, where n = 3f + 1. For example, to tolerate one failed Replica node, f = 1, so n = 3 * 1 + 1 = 4 Replica nodes must run VMware Blockchain. To tolerate two Replica nodes failing simultaneously, f = 2, so n = 3 * 2 + 1 = 7 Replica nodes must run VMware Blockchain. The Replica nodes are assumed to be in independent fault domains.
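The following Python sketch only illustrates the n = 3f + 1 sizing arithmetic and the related 2f + 1 quorum size used later in this topic; the function names are illustrative and are not part of the VMware Blockchain or Concord-BFT implementation.

    def replicas_required(f):
        """Minimum number of Replica nodes needed to tolerate f faulty ones."""
        return 3 * f + 1

    def quorum_size(f):
        """Number of matching Replica nodes needed to make progress (2f + 1)."""
        return 2 * f + 1

    for f in (1, 2):
        print(f"f={f}: n={replicas_required(f)} replicas, quorum={quorum_size(f)}")
        # f=1: n=4 replicas, quorum=3
        # f=2: n=7 replicas, quorum=5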

The BFT protocol is the communication between the Replica nodes that keeps them synchronized so that all of them maintain the same state. The protocol can tolerate some Replica nodes being slower than others, being disconnected or failing, or even acting maliciously.

One Replica node in each view is designated as the primary node, and the other Replica nodes in the network are backups. The primary Replica node communicates with the Client node, distributes the requests it receives from the Client node to the backup Replica nodes, and coordinates the agreement between all the Replica nodes.

A view determines the primary Replica node and the backup Replica nodes and is identified by a unique integer number. If the primary Replica node must be replaced, a view change process is initiated to agree on designating another Replica node as the new primary node, and the view number changes.

A primary Replica node is replaced when it is down or slow to respond. Messages exchanged between the Replica nodes are accepted only if they belong to the current view. When a Replica node receives messages from previous views, it discards them as a precautionary measure.
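As a minimal illustration of how a view relates to the primary node and to message acceptance, consider the following Python sketch. The modulo-based primary selection is a common convention in PBFT-style protocols and is shown here as an assumption; the class and method names are hypothetical and do not reflect the Concord-BFT implementation.

    class ViewState:
        def __init__(self, n):
            self.n = n              # total number of Replica nodes
            self.view = 0           # current view number (unique integer)

        def primary(self):
            # Assumed PBFT-style convention: the primary rotates with the view number.
            return self.view % self.n

        def on_view_change(self):
            # Replace a down or slow primary by agreeing on a new view.
            self.view += 1

        def accept(self, message_view):
            # Messages are accepted only from the current view; messages from
            # previous views are discarded as a precautionary measure.
            return message_view == self.view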

Checkpoint

A checkpoint is a persisted synchronization point for the state of the Replica nodes. Some Replica nodes process commands and commit their results quickly, so their local state can be more advanced than the state of the slower Replica nodes. To prevent a large gap between the local states of the Replica nodes, the faster Replica nodes occasionally stop processing commands until enough Replica nodes have caught up.

The Replica nodes agree on a checkpoint every 150 sequence numbers. Each Replica node maintains a last stable sequence number, which is the sequence number of the block at the last agreed checkpoint.

When a Replica node reaches a checkpoint, the node sends a checkpoint message to all other Replica nodes and waits until 2f + 1 Replica nodes reach this checkpoint. The Replica node identifies the Replica nodes that reached the checkpoint by comparing their sequence numbers and a hash value representing the last block. After enough Replica nodes reach the checkpoint, it is considered a stable checkpoint, the last stable sequence number is updated, and the Replica node continues to process new requests.
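The checkpoint logic described above can be sketched in Python as follows. The class, its fields, and the counting of checkpoint messages are illustrative assumptions only, not the Concord-BFT implementation; only the 150-sequence-number interval and the 2f + 1 threshold come from this topic.

    CHECKPOINT_INTERVAL = 150   # Replica nodes agree on a checkpoint every 150 sequence numbers

    class CheckpointTracker:
        def __init__(self, f):
            self.quorum = 2 * f + 1
            self.last_stable_seq = 0
            self.votes = {}         # (seq_num, state_hash) -> set of replica ids

        def on_checkpoint_message(self, replica_id, seq_num, state_hash):
            """Record a checkpoint message; mark the checkpoint stable at 2f + 1 matches."""
            key = (seq_num, state_hash)
            self.votes.setdefault(key, set()).add(replica_id)
            if len(self.votes[key]) >= self.quorum and seq_num > self.last_stable_seq:
                self.last_stable_seq = seq_num   # stable checkpoint reached
                return True                      # faster Replica nodes can resume processing
            return False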

Slow Replica Nodes

When a Replica node is taken down and added back, at startup it must synchronize with the state of the rest of the Replica nodes on the Replica Network.

Depending on the gap that the Replica node must close, there are two ways to perform the synchronization.

  • Request execution - When the gap between the Replica node state and the network state is within the same checkpoint, the node requests the missing batches of requests from neighboring Replica nodes, together with proofs that each message was accepted by 2f + 1 Replica nodes. After the Replica node receives all the requests, it executes them and commits the results to its local store.

  • Resulting state - When the gap between the Replica node state and the network state is beyond the last stable checkpoint, the Replica node activates the State Transfer mechanism. The mechanism first queries the other Replica nodes for their last checkpoint and looks for f + 1 Replica nodes that report the same checkpoint and hash. After the target checkpoint is identified, the Replica node simultaneously asks several Replica nodes for the last block of that checkpoint and keeps fetching blocks until it reaches a block that it already has (see the sketch after this list).
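The following Python sketch outlines the State Transfer flow described above. The peer and local-store helper calls (last_checkpoint, get_block, has_block, put_block) are hypothetical stand-ins introduced only for this illustration, and the sketch fetches from a single peer for simplicity rather than several in parallel; it is not the Concord-BFT implementation.

    from collections import Counter

    def state_transfer(local_store, peers, f):
        """Catch up a lagging Replica node whose gap is beyond the last stable checkpoint."""
        # 1. Ask the other Replica nodes for their last checkpoint and its hash,
        #    and look for f + 1 Replica nodes that agree on the same (checkpoint, hash).
        answers = Counter(peer.last_checkpoint() for peer in peers)   # hypothetical peer API
        (checkpoint_seq, checkpoint_hash), count = answers.most_common(1)[0]
        if count < f + 1:
            return  # no checkpoint is vouched for by f + 1 Replica nodes yet

        # 2. Fetch blocks backward from the checkpoint's last block until reaching
        #    a block that is already in the local store.
        seq = checkpoint_seq
        while seq > 0 and not local_store.has_block(seq):             # hypothetical store API
            block = peers[0].get_block(seq)                           # simplified: one peer instead of several in parallel
            local_store.put_block(seq, block)
            seq -= 1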