You can use intermittent backup in large production environments with limited maintenance windows to investigate errors, recreate defects, or identify workflow problems.
With intermittent backup, you run a background job on the Replica nodes. After the baseline backup is established, the scheduled backup jobs run every hour to synchronize the Replica node data with the backup server. Subsequent backup jobs are faster because only the delta changes are copied. When a maintenance window becomes available, stop the VMware Blockchain nodes and manually synchronize the data to the backup directory. When the maintenance window ends, restart the VMware Blockchain nodes.
Note:
You can schedule either a RocksDB checkpoint-based backup or an intermittent backup. You cannot configure both types of backup processes to run simultaneously.
Restore the backup data to a clone VMware Blockchain node on a separate server, and run analytics on the cloned VMware Blockchain node.
Note:
You must create unique backup directories for each Replica node so that the data is not overwritten by the backup job of another Replica node.
The intermittent backup process applies to Replica nodes only. Client node data is not backed up.
Procedure
- Configure SSH without a password between the agent and backup server.
sudo docker exec -ti agent bash
ssh-keygen -t rsa -b 4096 -C "vmbc" -N '' -f /vmbc/.ssh/id_rsa
ssh-copy-id vmbc@<backup_server>
- Run the intermittent backup API to schedule a backup job on each Replica node.
curl -vX POST localhost:8546/api/backup?action=schedule -H "Content-Type:application/json" -d '{"backup_destination": "10.72.216.118:/<backup_dir_name>", "schedule_frequency": "HOURLY", "rsync_user_name": "vmbc", "backup_time_zone": "time_zone"}'
The backup_destination value identifies the backup server and the backup directory that host the Replica node backups. The source database directory defaults to /config/concord/rocksdbdata. The rsync_user_name value is the user name used to log in to the backup server. The backup_time_zone value is the time zone that your deployment uses.
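As a sketch, the schedule request can also be sent with the JSON payload kept in a file, which makes the quoting less error-prone. The destination address, directory name, and time zone below are illustrative placeholders, not values from your deployment:

```shell
# Build the backup-schedule payload in a file (all values are illustrative;
# substitute your own backup server, directory, and time zone).
cat > /tmp/backup_schedule.json <<'EOF'
{
  "backup_destination": "10.72.216.118:/replica1_backup",
  "schedule_frequency": "HOURLY",
  "rsync_user_name": "vmbc",
  "backup_time_zone": "UTC"
}
EOF
# Post it to the node agent (commented out here; run this on the Replica node):
# curl -vX POST "localhost:8546/api/backup?action=schedule" \
#   -H "Content-Type: application/json" -d @/tmp/backup_schedule.json
```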
- Stop all the applications that invoke connection requests to the Daml Ledger.
- Stop the Client node components.
curl -X POST 127.0.0.1:8546/api/node/management?action=stop
vmbc@localhost [ ~ ]# curl -X POST 127.0.0.1:8546/api/node/management?action=stop
vmbc@localhost [ ~ ]# sudo docker ps -a
CONTAINER ID   IMAGE                                                   COMMAND                  CREATED        STATUS        PORTS                      NAMES
218a1bdaddd6   vmwaresaas.jfrog.io/vmwblockchain/operator:1.7.0.0.55   "/operator/operator_…"   18 hours ago   Up 18 hours                              operator
cd476a6b3d6c   vmwaresaas.jfrog.io/vmwblockchain/agent:1.7.0.0.55      "java -jar node-agen…"   18 hours ago   Up 18 hours   127.0.0.1:8546->8546/tcp   agent
vmbc@localhost [ ~ ]#
- Repeat the stop operation on each Client node in the Client group.
- Verify that all the containers except the agent and deployed operator containers are stopped.
sudo docker ps -a
If the sudo docker ps -a command output shows that containers other than the agent and the deployed operator container are still running, rerun the command or use the sudo docker stop <container_name> command to stop them.
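The leftover containers can also be stopped in one pass by filtering the running container names. A minimal sketch follows; the filter_stoppable helper and the sample container names are illustrative, and the docker invocation itself is shown commented out:

```shell
# Drop agent and operator from a list of container names read on stdin,
# leaving only the containers that still need to be stopped.
filter_stoppable() {
  grep -vE '^(agent|operator)$' || true
}
# On the node itself you could combine it with docker (not run here):
# sudo docker ps --format '{{.Names}}' | filter_stoppable | xargs -r -n1 sudo docker stop
# Demonstration with sample container names:
printf 'agent\noperator\ndaml_ledger_api\ntelegraf\n' | filter_stoppable
# -> prints daml_ledger_api and telegraf
```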
- Verify that the following metrics indicate that your blockchain network is operating properly.
Metrics | Description
Blocks per second metrics | All the blockchain nodes must process blocks because new blocks are constantly being added. A node is considered healthy when it reports a positive blocks-per-second value.
FastPaths | All Replica nodes must report in the fast path, with none reporting in the slow path. When the Blocks per second metrics indicate an unhealthy state, the wedge status is always false until all the nodes have stopped at the same checkpoint.
- Pause all the Replica nodes at the same checkpoint from the operator container and check the status periodically until all the Replica nodes' status is true.
Any blockchain nodes that are in state transfer or down for other reasons cause the wedge status command to return false. The wedge status command returns true when state transfer completes and all Replica nodes are healthy, allowing all Replica nodes to stop at the same checkpoint successfully.
The wedge command might take some time to complete. The metrics dashboards indicate which nodes have stopped processing blocks because they are wedged. If you notice a false report in the dashboard, contact VMware Blockchain support to diagnose the Replica nodes experiencing the problem. If the wedge command times out, run it again.
sudo docker exec -it operator sh -c './concop wedge stop'
{"succ":true}
sudo docker exec -it operator sh -c './concop wedge status'
{"192.168.100.107":true,"192.168.100.108":true,"192.168.100.109":true,"192.168.100.110":true}
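With many Replica nodes, scanning the wedge status output by eye is error-prone. A minimal sketch of an automated check on a captured status line follows; the IP addresses and the status value are illustrative:

```shell
# Sample output captured from './concop wedge status' (illustrative addresses).
status='{"192.168.100.107":true,"192.168.100.108":true,"192.168.100.109":false}'
# Every Replica node must report true before it is safe to stop the nodes.
if printf '%s' "$status" | grep -q 'false'; then
  echo "not all replicas wedged; wait and re-check"
else
  echo "all replicas wedged"
fi
```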
- Stop the Replica node.
curl -X POST 127.0.0.1:8546/api/node/management?action=stop
vmbc@localhost [ ~ ]# curl -X POST 127.0.0.1:8546/api/node/management?action=stop
vmbc@localhost [ ~ ]# sudo docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3b7135c677cf vmwaresaas.jfrog.io/vmwblockchain/agent:1.7.0.0.55 "java -jar node-agen…" 20 hours ago Up 20 hours 127.0.0.1:8546->8546/tcp agent
- Repeat the stop operation on each Replica node in the Replica Network.
- Verify that all the containers except for the agent are stopped.
sudo docker ps -a
If the sudo docker ps -a command output shows that containers besides the agent are still running, rerun the command or use the sudo docker stop <container_name> command to stop them.
- Check that all the Replica nodes are stopped in the same state.
Verify the LastReachableBlockID and LastBlockID sequence numbers of each stopped Replica node to determine whether any nodes lag.
If there is a lag when you power on the Replica Network, some Replica nodes might have to catch up in state-transfer mode. Otherwise, the lag can result in a failed consensus and require restoring each Replica node from the latest single copy.
sudo docker run -it --rm --entrypoint="" --mount type=bind,source=/mnt/data/rocksdbdata,target=/concord/rocksdbdata <image_name> /concord/kv_blockchain_db_editor /concord/rocksdbdata getLastBlockID
sudo docker run -it --rm --entrypoint="" --mount type=bind,source=/mnt/data/rocksdbdata,target=/concord/rocksdbdata <image_name> /concord/kv_blockchain_db_editor /concord/rocksdbdata getLastReachableBlockID
The <image_name> is the Concord-core image name in the blockchain.
vmwaresaas.jfrog.io/vmwblockchain/concord-core:1.7.0.0.55
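The block IDs reported by each Replica node must match across the Replica Network. A minimal sketch of the comparison follows; check_lag is a hypothetical helper, and the block numbers are illustrative values you would collect from each node's getLastBlockID output:

```shell
# Prints "in sync" when all supplied block IDs are identical,
# "lag detected" when any node reports a different block ID.
check_lag() {
  if [ "$(printf '%s\n' "$@" | sort -u | wc -l)" -eq 1 ]; then
    echo "in sync"
  else
    echo "lag detected"
  fi
}
check_lag 152882 152882 152882 152882   # -> in sync
check_lag 152882 152882 152879 152882   # -> lag detected
```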
- On the destination Client node, change the ownership of the database directory.
sudo chown -R vmbc /mnt/data/db
- Back up the data on each of the Client nodes.
sudo tar cvzf <backup_name> /config/daml-ledger-api/environment-vars /config/daml-index-db/environment-vars /config/telegraf/telegraf.conf
# For data greater than 64 GB
cd /mnt/data/
sudo nohup tar cvzf <backup_name> db &
tail -f nohup.out
# Wait for the tar to complete
The <backup_name> must end in .tar.gz, and this file can be located on the VM or externally mounted storage. For example, db-backup.tar.gz.
The rsync command might time out because of SSH inactivity. If it does, rerun the command; the transfer resumes incrementally.
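Before relying on the archive, it helps to confirm that it lists cleanly. A minimal sketch follows, using a throwaway directory and archive name for illustration; on a Client node you would point tar at your real backup file instead:

```shell
# Create a small sample archive to illustrate the check (illustrative paths).
mkdir -p /tmp/db_demo
echo "sample ledger data" > /tmp/db_demo/ledger.dat
tar czf /tmp/db-backup.tar.gz -C /tmp db_demo
# List the contents without extracting; a truncated or corrupt archive
# makes tar exit with a non-zero status.
tar tzf /tmp/db-backup.tar.gz
```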
- Run the backup job API to finish the daily backup.
curl -vX POST localhost:8546/api/backup?action=run_now -H "Content-Type:application/json" -d '{"backup_destination": "10.72.216.118:/backup_dir_name", "rsync_user_name": "vmbc", "backup_time_zone": "time_zone"}'
- Start all the Replica nodes.
curl -X POST 127.0.0.1:8546/api/node/management?action=start
- From the operator container, unwedge the system.
# Unwedge all the Replica nodes
./concop unwedge
{'succ': True}
# Check the wedge status of the Replica nodes
./concop wedge status
- Start all the Client node components.
curl -X POST 127.0.0.1:8546/api/node/management?action=start
- Start all the applications that invoke connection requests to the Daml Ledger.
Results
The intermittent backup process creates a daily backup of each Replica node on the backup destination server.