To perform operations such as troubleshooting, backup, or scaling, you can gracefully shut down and start up VMware Blockchain nodes.

Prerequisites

  • Identify the following details from the VMware Blockchain Orchestrator output directory, /home/blockchain/output:

    Blockchain ID

    Current blockchain version

    Replica node IP address

    Client node IP address

    VMware Blockchain vmbc user password for all the Replica and Client node VMs

  • If the Concord operator containers were deployed, verify that the Concord operator container is running. See Instantiate the Concord Operator Container for AWS.

Procedure

  1. Stop all the applications that invoke connection requests to the Daml Ledger.
  2. SSH into the Client node.
  3. Stop all the Client node components.
    curl -X POST 127.0.0.1:8546/api/node/management?action=stop
  4. Verify that all the containers except the agent and the deployed Concord operator container have stopped on the selected Client node.
    docker ps -a

    If the docker ps -a output shows that containers other than the agent and the deployed Concord operator container are still running, rerun the stop command from step 3 or use the docker stop <container_name> command to stop them.
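
The container check in steps 4 and 8 can be scripted. The following is a minimal sketch, not part of the product tooling. It assumes that the agent container is named agent and the Concord operator container is named operator, as the docker exec commands in this procedure suggest. The function name unexpected_running is hypothetical.

```shell
#!/bin/sh
# Read container names on stdin and print every name that is NOT expected
# to keep running after the stop command.
# Assumption: the agent container is named "agent" and the Concord operator
# container is named "operator". Adjust the names for your deployment.
unexpected_running() {
    # -x matches whole lines only, so "agent" does not hide "agent2".
    # grep exits with 1 when nothing is left over; "|| true" keeps that
    # case from being treated as an error.
    grep -v -x -e 'agent' -e 'operator' || true
}

# Usage on a node (docker ps without -a lists running containers only):
#   docker ps --format '{{.Names}}' | unexpected_running
```

An empty result means that only the agent and operator containers are still running. Any name that is printed is a candidate for the docker stop <container_name> command.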

  5. (Optional) If the Concord operator containers were deployed, pause all the Replica nodes at the same checkpoint from the Concord operator container. Check the status periodically and proceed only after the wedge status of every Replica node is true.

    If any blockchain node is in state transfer or is down for another reason, the wedge status command returns false. The command returns true after state transfer completes and all Replica nodes are healthy, which allows all Replica nodes to stop at the same checkpoint.

    The wedge command might take some time to complete. If the wedge command times out, the system operator must run it again.

    docker exec -it operator sh -c './concop wedge stop'
    docker exec -it operator sh -c './concop wedge status'
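
Checking the wedge status periodically, as step 5 describes, amounts to a retry loop. The following is a minimal sketch of such a loop. It makes no assumption about the output format of the wedge status command: the caller supplies the check as a command that succeeds only when every Replica node is ready. The function name poll_until is hypothetical.

```shell
#!/bin/sh
# Run a command repeatedly until it succeeds or the attempts run out.
# Arguments: <max_attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if it never does.
poll_until() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Wait between attempts, but not after the last one.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

For example, poll_until 20 30 all_wedged retries up to 20 times, 30 seconds apart, where all_wedged is a hypothetical function that you write to run the wedge status command and succeed only when every Replica node reports true. When docker exec runs from a script without a terminal, omit the -t option.
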
  6. SSH into the Replica node.
  7. Stop all the Replica nodes.
    curl -X POST 127.0.0.1:8546/api/node/management?action=stop
  8. Verify that all the containers except the agent and the deployed Concord operator container have stopped on the selected Replica node.
    docker ps -a

    If the docker ps -a output shows that containers other than the agent and the deployed Concord operator container are still running, rerun the stop command from step 7 or use the docker stop <container_name> command to stop them.

  9. Check that all the Replica nodes stopped in the same state.

    Comparing the LastReachableBlockID and LastBlockID values of each stopped Replica node helps determine whether any nodes are lagging. As a best practice, check that the LastReachableBlockID and LastBlockID values are the same on at least five Replica nodes and that those nodes can be sources for recovery.

    If some Replica nodes lag when you power on the Replica Network, those nodes might have to catch up in state-transfer mode. If they cannot catch up, consensus can fail and each Replica node must be restored from the latest single copy.

    image=$(docker images --format "{{.Repository}}:{{.Tag}}" | grep "concord");docker run --rm --entrypoint="" --mount type=bind,source=/mnt/data/rocksdbdata,target=/concord/rocksdbdata $image /concord/kv_blockchain_db_editor /concord/rocksdbdata getLastBlockID
    image=$(docker images --format "{{.Repository}}:{{.Tag}}" | grep "concord");docker run --rm --entrypoint="" --mount type=bind,source=/mnt/data/rocksdbdata,target=/concord/rocksdbdata $image /concord/kv_blockchain_db_editor /concord/rocksdbdata getLastReachableBlockID

    The $image variable resolves to the Concord-core image name in the blockchain, for example:

    vmwaresaas.jfrog.io/vmwblockchain/concord-core:1.8.0.0.53
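
The comparison in step 9 can be checked mechanically after you collect the values. The following is a minimal sketch. It assumes that you have recorded one line per Replica node in the form <node> <block_id>, which is a hypothetical input format and not the output of kv_blockchain_db_editor. The function name max_agreeing_nodes is also hypothetical.

```shell
#!/bin/sh
# Read "<node> <block_id>" lines on stdin and print the largest number of
# nodes that report the same block ID. Lines with fewer than two fields
# are ignored.
max_agreeing_nodes() {
    awk 'NF >= 2 { count[$2]++ }
         END {
             best = 0
             for (id in count) if (count[id] > best) best = count[id]
             print best
         }'
}

# Usage, with ids.txt as a hypothetical file of collected values:
#   [ "$(max_agreeing_nodes < ids.txt)" -ge 5 ] && echo "at least five nodes agree"
```

Run the check once for the LastBlockID values and once for the LastReachableBlockID values. A result below five indicates that fewer nodes agree than the best practice above recommends.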

  10. Remove the stale Docker containers.
    docker ps --filter status=exited -q | xargs -r docker rm
  11. Start all the applications that invoke connection requests to the Daml Ledger.
  12. Start all the Replica nodes in the Replica Network.
    curl -X POST 127.0.0.1:8546/api/node/management?action=start 
    1. (Optional) If the containers do not appear, remove the old containers.
      curl -X POST 127.0.0.1:8546/api/node/management?action=remove 
    2. Restart all the Replica nodes in the Replica Network.
      curl -X POST 127.0.0.1:8546/api/node/management?action=start 
  13. From the operator container, unwedge the system.
    # Unwedge all the Replica nodes. Expected output: {'succ': True}
    ./concop unwedge
    # Check the wedge status of the Replica nodes.
    ./concop wedge status
  14. SSH into all Replica nodes and verify that the nodes are functional.
    docker exec -it telegraf curl -s http://concord:9891/metrics | grep -ia last_block | tail -1
    docker exec -it concord sh -c './concord-ctl status get state-transfer' | grep fetchingState
    docker exec -it concord sh -c './concord-ctl status get replica' | grep -ia lastStableSeqNum
  15. Start all the Client nodes.
    curl -X POST 127.0.0.1:8546/api/node/management?action=start
    1. (Optional) If the containers do not appear, remove the old containers.
      curl -X POST 127.0.0.1:8546/api/node/management?action=remove 
    2. Restart all the Client nodes in the Client node group.
      curl -X POST 127.0.0.1:8546/api/node/management?action=start 
  16. Perform test transactions on each Client node and verify that the node is functional.
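
The stop, start, and remove calls in this procedure are the same request sent to each node in turn, so they can be generated for a list of nodes. The following is a minimal sketch that only prints the command for review and does not run it. It assumes SSH login as the vmbc user and uses the management endpoint shown in the steps above. The function name management_cmd and the sample IP address are hypothetical.

```shell
#!/bin/sh
# Print the ssh command that sends a management action to one node.
# Arguments: <node_ip> <action>, where action is stop, start, or remove.
# Assumption: the node accepts SSH logins as the vmbc user.
management_cmd() {
    node_ip=$1
    action=$2
    case "$action" in
        stop|start|remove) ;;
        *) echo "unsupported action: $action" >&2; return 1 ;;
    esac
    # The URL is quoted so that the remote shell does not treat "?" as a
    # filename wildcard.
    printf "ssh vmbc@%s 'curl -X POST \"127.0.0.1:8546/api/node/management?action=%s\"'\n" \
        "$node_ip" "$action"
}

# Usage: print the stop command for each Client node IP address.
#   for ip in 10.0.0.5; do management_cmd "$ip" stop; done
```

Printing the commands instead of running them keeps the operator in control of the order, which matters here: Client nodes stop before Replica nodes, and Replica nodes start before Client nodes.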