When a vRealize Automation appliance in a multiple-node, high availability (HA) configuration has failed, you might need to replace the faulty node.

Caution: Before proceeding, VMware recommends that you contact technical support to troubleshoot the HA issue and verify that the problem is isolated to one node.

If technical support determines that you need to replace the node, take the following steps.

  1. In vCenter, take backup snapshots of every appliance in the HA configuration.

    In the backup snapshots, don't include virtual machine memory.

  2. Shut down the faulty node.
  3. Make a note of the faulty node's vRealize Automation software build number and network settings.

    Note the FQDN, IP address, gateway, DNS servers, and especially the MAC address. Later, you assign the same values to the replacement node.

  4. Confirm that the primary database node is one of the healthy nodes. Follow these steps:
    1. Log in as root to the command line of a healthy node.
    2. Find the name of the primary database node by running the following command.

      vracli status | grep primary -B 1

      The result should be similar to this example, where postgres-1 is the primary database node.

      "Conninfo": "host=postgres-1.postgres.prelude.svc.cluster.local dbname=repmgr-db user=repmgr-db passfile=/scratch/repmgr-db.cred connect_timeout=10",
      "Role": "primary",
    3. Verify that the primary database node is healthy by running the following command.

      kubectl -n prelude get pods -o wide | grep postgres

      The result should be similar to this example, where postgres-1 is in the list as running and healthy.

      postgres-1 1/1 Running 0 39h 12.123.2.14 vc-vm-224-84.company.com <none> <none>
      postgres-2 1/1 Running 0 39h 12.123.1.14 vc-vm-224-85.company.com <none> <none>

      Important: If the primary database node is faulty, contact technical support instead of proceeding.
  5. From the root command line of the healthy node, remove the faulty node.

    vracli cluster remove faulty-node-FQDN

  6. Use vCenter to deploy a new, replacement vRealize Automation node.

    Deploy the same vRealize Automation software build, and apply the network settings that you noted from the faulty node: the FQDN, IP address, gateway, DNS servers, and especially the MAC address.

  7. Power on the replacement node.
  8. Log in as root to the command line of the replacement node.
  9. Verify that the initial boot sequence has finished by running the following command.

    vracli status first-boot

    Look for a First boot complete message.

  10. From the replacement node, join the vRealize Automation cluster.

    vracli cluster join primary-DB-node-FQDN

  11. Log in as root to the command line of the primary database node.
  12. Deploy the repaired cluster by running the following script.

    /opt/scripts/deploy.sh
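
The checks in step 4 can be sketched as two small shell helpers that read command output on standard input. This is a minimal sketch, not part of the product: the function names primary_db_host and pod_is_running are hypothetical, the parsing assumes output shaped like the examples in step 4, and treating a pod as healthy when its STATUS is Running and its READY counts match is an assumption. It does not replace the guidance to contact technical support.

```shell
#!/bin/sh
# Hypothetical helpers for the step 4 checks. They assume output that
# looks like the examples in this procedure; adjust if yours differs.

# Reads `vracli status` output on stdin and prints the host= value from
# the Conninfo entry that precedes the "Role": "primary" line.
# Returns non-zero if no primary entry is found.
primary_db_host() {
  awk '
    match($0, /host=[^ "]+/) { host = substr($0, RSTART + 5, RLENGTH - 5) }
    /"Role": *"primary"/     { if (host != "") { print host; found = 1 }; exit }
    END                      { exit(found ? 0 : 1) }
  '
}

# Reads `kubectl -n prelude get pods -o wide` rows on stdin and succeeds
# only if the pod named in $1 is listed with STATUS Running and a READY
# column such as 1/1 (assumed here to mean healthy).
pod_is_running() {
  awk -v pod="$1" '
    $1 == pod {
      split($2, ready, "/")
      ok = ($3 == "Running" && ready[1] == ready[2])
    }
    END { exit(ok ? 0 : 1) }
  '
}

# Possible use on a healthy node, as root:
#   host=$(vracli status | primary_db_host) || echo "no primary found"
#   kubectl -n prelude get pods -o wide | pod_is_running "${host%%.*}" \
#     || echo "primary database pod is not healthy; contact technical support"
```

The helpers only read text, so they can be tried against saved output before being used on a live node. The `${host%%.*}` expansion shortens the host name to the pod name, for example postgres-1.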