VMware Cloud Director maintains synchronous streaming replication between the nodes. If a standby node becomes unreachable, you must determine the cause and resolve the problem.
Problem
The VMware Cloud Director appliance management UI shows the cluster health as DEGRADED and the status of one of the standby nodes is ? unreachable.
The /nodes API returns information that the localClusterHealth is DEGRADED, the node status is ? unreachable, and the nodeHealth is UNHEALTHY.
For example, the /nodes API might return the following information for the cluster.
{
  "localClusterFailover": "MANUAL",
  "localClusterHealth": "DEGRADED",
  "localClusterState": [
    {
      "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
      "failover": {
        "details": "failover = manual",
        "mode": "MANUAL",
        "repmgrd": {
          "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
          "status": "NOT APPLICABLE"
        }
      },
      "id": primary_node_ID,
      "location": "default",
      "name": "primary_host_name",
      "nodeHealth": "HEALTHY",
      "nodeRole": "PRIMARY",
      "role": "primary",
      "status": "* running",
      "upstream": ""
    },
    {
      "connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
      "failover": {
        "details": "failover state unknown - unable to ssh to failed or unreachable node",
        "mode": "UNKNOWN",
        "repmgrd": {
          "details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = n/a",
          "status": "UNKNOWN"
        }
      },
      "id": unreachable_standby_node_ID,
      "location": "default",
      "name": "unreachable_standby_host_name",
      "nodeHealth": "UNHEALTHY",
      "nodeRole": "STANDBY",
      "role": "standby",
      "status": "? unreachable",
      "upstream": "primary_host_name"
    },
    {
      "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
      "failover": {
        "details": "failover = manual",
        "mode": "MANUAL",
        "repmgrd": {
          "details": "On node running_standby_node_ID (running_standby_host_IP): repmgrd = not applicable",
          "status": "NOT APPLICABLE"
        }
      },
      "id": running_standby_node_ID,
      "location": "default",
      "name": "running_standby_host_name",
      "nodeHealth": "HEALTHY",
      "nodeRole": "STANDBY",
      "role": "standby",
      "status": "running",
      "upstream": "primary_host_name"
    }
  ],
  "warnings": [
    "unable to connect to node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID)",
    "node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID) is registered as an active standby but is unreachable"
  ]
}
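If you monitor the cluster with a script, the fields in the /nodes response can be inspected programmatically. The following Python sketch is illustrative only: it assumes the response has already been retrieved and parsed into a dictionary, and it relies only on the field names visible in the sample response (localClusterState, nodeRole, nodeHealth, status, name). The function name and the host names in the trimmed sample are hypothetical.

```python
def find_unreachable_standbys(nodes_response):
    """Return the names of standby nodes reported as unhealthy or unreachable.

    nodes_response is assumed to be the /nodes response body already
    parsed into a Python dict.
    """
    unreachable = []
    for node in nodes_response.get("localClusterState", []):
        is_standby = node.get("nodeRole") == "STANDBY"
        is_down = (
            node.get("nodeHealth") == "UNHEALTHY"
            or "unreachable" in node.get("status", "")
        )
        if is_standby and is_down:
            unreachable.append(node.get("name"))
    return unreachable


# Trimmed sample that mirrors the field names of the response above.
# The host names are placeholders, not real values.
sample = {
    "localClusterHealth": "DEGRADED",
    "localClusterState": [
        {"name": "primary-1", "nodeRole": "PRIMARY",
         "nodeHealth": "HEALTHY", "status": "* running"},
        {"name": "standby-1", "nodeRole": "STANDBY",
         "nodeHealth": "HEALTHY", "status": "running"},
        {"name": "standby-2", "nodeRole": "STANDBY",
         "nodeHealth": "UNHEALTHY", "status": "? unreachable"},
    ],
}

print(find_unreachable_standbys(sample))  # prints ['standby-2']
```

An empty list means that no standby node is reported as unreachable, even if localClusterHealth is DEGRADED for another reason.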
Cause
To ensure data integrity, the PostgreSQL database uses Write-Ahead Logging (WAL). The primary node streams the WAL constantly to the active standby nodes for replication and recovery purposes. The standby nodes process the WAL when they receive it. If a standby node is unreachable, it stops receiving the WAL and cannot be a candidate for promotion to become a new primary.
Solution
- Verify that the virtual machine of the unreachable standby node is running.
- Verify that the network connection to the standby node is working.
- Verify that there is no SSH problem that might prevent the standby node from communicating with the other nodes.
- Verify that the vpostgres service on the standby node is running.
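The network, SSH, and database checks above can be partly scripted from another node or from a management workstation. The following Python sketch only tests whether TCP connections can be opened to the standby node. It assumes the default ports 22 for SSH and 5432 for PostgreSQL, which might differ in your environment, and the function names are hypothetical. An open port does not prove that the vpostgres service is healthy, so you might still need to check the service status on the node itself.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_standby(host, ports=None):
    """Probe the given ports on the standby node and return a name -> bool map.

    The default ports are assumptions: 22 (SSH) and 5432 (PostgreSQL).
    """
    if ports is None:
        ports = {"ssh": 22, "postgresql": 5432}
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    # Replace 127.0.0.1 with the IP address of the unreachable standby node.
    for service, reachable in check_standby("127.0.0.1").items():
        print(f"{service}: {'reachable' if reachable else 'NOT reachable'}")
```

If both ports are unreachable, the virtual machine or the network path is the more likely cause. If only one port fails, the problem is more likely the corresponding service.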