Your VMware Cloud Director Appliance Standby Node Becomes Unreachable

VMware Cloud Director maintains synchronous streaming replication between the nodes. If a standby node becomes unreachable, you must determine the cause and resolve the problem.

Problem

The VMware Cloud Director appliance management UI shows the cluster health as DEGRADED, and the status of one of the standby nodes is "? unreachable".

The /nodes API reports that the localClusterHealth is DEGRADED, the node status is "? unreachable", and the nodeHealth is UNHEALTHY.

For example, the /nodes API might return the following information for the node.

{
    "localClusterFailover": "MANUAL",
    "localClusterHealth": "DEGRADED",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "* running",
            "upstream": ""
        },
        {
            "connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover state unknown - unable to ssh to failed or unreachable node",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = n/a",
                    "status": "UNKNOWN"
                }
            },
            "id": unreachable_standby_node_ID,
            "location": "default",
            "name": "unreachable_standby_host_name",
            "nodeHealth": "UNHEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "? unreachable",
            "upstream": "primary_host_name"
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "primary_host_name"
        }
    ],
    "warnings": [
        "unable to connect to node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID)",
        "node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID) is registered as an active standby but is unreachable"
    ]
}
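
As a rough illustration of how a response like the one above might be inspected, the following Python sketch parses a /nodes payload and lists the standby nodes reported as unhealthy or unreachable. The sample payload is a trimmed, hypothetical version of the response with made-up node IDs and host names in place of the placeholders, and the function name find_unreachable_standbys is an assumption for this example, not part of the appliance API.

```python
import json

# Hypothetical, trimmed /nodes response. The node IDs and host names are
# illustrative values substituted for the placeholders in the sample output.
SAMPLE_RESPONSE = """
{
    "localClusterHealth": "DEGRADED",
    "localClusterState": [
        {"id": 101, "name": "vcd-primary", "nodeHealth": "HEALTHY",
         "nodeRole": "PRIMARY", "status": "* running"},
        {"id": 102, "name": "vcd-standby-1", "nodeHealth": "UNHEALTHY",
         "nodeRole": "STANDBY", "status": "? unreachable"},
        {"id": 103, "name": "vcd-standby-2", "nodeHealth": "HEALTHY",
         "nodeRole": "STANDBY", "status": "running"}
    ],
    "warnings": []
}
"""


def find_unreachable_standbys(response_text):
    """Return the names of standby nodes reported as unhealthy or unreachable."""
    data = json.loads(response_text)
    unreachable = []
    for node in data.get("localClusterState", []):
        # Only standby nodes are of interest here; skip the primary.
        if node.get("nodeRole") != "STANDBY":
            continue
        unhealthy = node.get("nodeHealth") == "UNHEALTHY"
        status_unreachable = "unreachable" in node.get("status", "")
        if unhealthy or status_unreachable:
            unreachable.append(node["name"])
    return unreachable


cluster = json.loads(SAMPLE_RESPONSE)
print("Cluster health:", cluster["localClusterHealth"])
print("Unreachable standbys:", find_unreachable_standbys(SAMPLE_RESPONSE))
```

The sketch reads only the fields that appear in the sample response (localClusterState, nodeRole, nodeHealth, status, and name). How the response is retrieved from the appliance is deliberately left out.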

Cause

To ensure data integrity, the PostgreSQL database uses Write-Ahead Logging (WAL). The primary node continuously streams the WAL to the active standby nodes for replication and recovery purposes. The standby nodes process the WAL as they receive it. If a standby node is unreachable, it stops receiving the WAL and is no longer a candidate for promotion to primary.

Solution

Verify that the virtual machine of the unreachable standby node is running.
Verify that the network connection to the standby node is working.
Verify that there is no SSH problem that might prevent the standby node from communicating with the other nodes.
Verify that the vpostgres service on the standby node is running.
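
As a first pass at the network-related checks above, a short Python sketch such as the following can test, from another node or a management host, whether the standby accepts TCP connections. It assumes that SSH listens on port 22 and PostgreSQL on port 5432 (the PostgreSQL default). Confirm both values for your environment. The function names port_is_open and check_standby are hypothetical helpers for this example.

```python
import socket

# Assumed ports: 22 for SSH and 5432 for PostgreSQL (the PostgreSQL default).
# Adjust these if your environment uses different values.
SSH_PORT = 22
POSTGRES_PORT = 5432


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_standby(host):
    """Report TCP reachability of the SSH and PostgreSQL ports on a node."""
    return {
        "ssh": port_is_open(host, SSH_PORT),
        "postgres": port_is_open(host, POSTGRES_PORT),
    }
```

For example, check_standby("unreachable_standby_host_IP") returns a dictionary with one Boolean value per port. An open port shows only that something is listening. It does not confirm that SSH authentication between the nodes works or that replication is healthy, so treat a positive result as a starting point rather than a conclusion.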

What to do next

To verify that there are no network or SSH problems, see Check the Connectivity Status of Your VMware Cloud Director Database High Availability Cluster.