Troubleshoot a Malfunctioning Replica Node on AWS

In some scenarios, a Replica node might fail.

Problem

A Replica node has failed.

Cause

The failure can be caused by a lost connection to the Replica Network or by a hardware problem.

Solution

When a Replica node fails to connect to the Replica Network but can connect a few hours later.

The Replica node automatically catches up on the data it missed after the connection is reestablished. This mechanism is called State Transfer. The system operator therefore does not need to perform any operation to bring the Replica node up to date after it has been offline.

When a Replica node sustains a failure, a system operator can resolve the problem without changing the Replica node IP address or its set of cryptographic keys.

Catching up on the lost data can be handled either by the State Transfer mechanism or by restoring the Replica node from backup. Which option to use depends on how long the Replica node was down and how much data it needs to catch up on. A backup can be created in either of the following ways:
- Perform a backup when the system is wedged, after the last pruning process has completed. The backup can be performed on any of the Replica nodes.
- Take one of the Replica nodes offline while the system is up and executing transactions, and create a backup of that Replica node.
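
The choice between State Transfer and restoring from backup can be sketched as a small helper. This is only an illustration: the `Outage` type, the `choose_recovery` function, and the `max_hours` and `max_gb` limits are hypothetical names and operator-chosen values, not settings defined by the product.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    hours_down: float        # how long the Replica node was offline
    missing_data_gb: float   # estimated data the node must catch up on


def choose_recovery(outage, max_hours, max_gb):
    """Return 'state_transfer' or 'restore_from_backup'.

    max_hours and max_gb are hypothetical limits that an operator would
    pick for their own deployment; the product does not define them.
    """
    # Short outage with little missing data: let the node catch up by itself.
    if outage.hours_down <= max_hours and outage.missing_data_gb <= max_gb:
        return "state_transfer"
    # Otherwise restoring from a recent backup is likely the faster path.
    return "restore_from_backup"
```

For example, with limits of 12 hours and 50 GB, `choose_recovery(Outage(3, 1.5), max_hours=12, max_gb=50)` selects State Transfer, while an outage of 72 hours falls through to restoring from backup.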

When a Replica node encounters a hardware problem, the Replica node must be recreated.

Unlike in the previous cases, the Replica node cannot be restored from backup or catch up automatically, because of blockchain integrity issues. In this case, the system operator must use reconfiguration to remove the failed Replica node from the system and add it back as a new Replica node. This can be done with the scale-down and scale-up operations. See VMware Blockchain Node Scaling Operations on AWS.

When you cannot start a failed Replica node quickly.

The blockchain can be started with the N-f working Replica nodes until the failed Replica nodes are recovered. The blockchain operates seamlessly but is not Byzantine Fault Tolerant during this period. Starting the blockchain with N-f working Replica nodes is possible only during reconfiguration, not during the deployment of a new blockchain.
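
The N-f arithmetic can be illustrated with a short sketch. Here N is taken to be the total number of Replica nodes and f the number of faulty Replica nodes the deployment tolerates. The function names are hypothetical, and the sizing rule N = 3f + 1 used in `assumed_fault_tolerance` is the classic BFT assumption, not something this topic states; check the sizing of your own deployment before relying on it.

```python
def assumed_fault_tolerance(n):
    """Return f under the ASSUMED classic BFT sizing N = 3f + 1."""
    if n < 4:
        raise ValueError("the assumed sizing needs at least 4 Replica nodes")
    return (n - 1) // 3


def nodes_needed_to_start(n, f):
    """Return N - f, the number of working Replica nodes needed to start."""
    if not 0 <= f < n:
        raise ValueError("f must satisfy 0 <= f < n")
    return n - f


# Under the assumed sizing, a 4-node blockchain can start with 3 nodes
# and a 7-node blockchain can start with 5 nodes.
for total in (4, 7):
    f = assumed_fault_tolerance(total)
    print(total, f, nodes_needed_to_start(total, f))
```

Running with only N-f nodes leaves no spare capacity for a further fault, which is consistent with the blockchain not being Byzantine Fault Tolerant until the failed Replica nodes are recovered.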