VMware Cloud Director maintains synchronous streaming replication between the nodes. If a standby node becomes unattached, you must determine the cause and resolve the problem.
Problem
The VMware Cloud Director appliance management UI shows the cluster health as DEGRADED, the status of the unattached standby node is running, and there is an exclamation point (!) before the name of the standby node's upstream node.
On the unattached standby node, the PostgreSQL log contains repeated errors indicating that a required WAL segment is no longer available on the primary:
2020-10-08 04:10:50.064 UTC [13390] LOG: started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:50.064 UTC [13390] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000110000002100000080 has already been removed
2020-10-08 04:10:55.047 UTC [13432] LOG: started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:55.047 UTC [13432] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000110000002100000080 has already been removed
The /nodes API reports that the localClusterHealth is DEGRADED, the node status is running, and the nodeHealth is HEALTHY. There is an exclamation point (!) before the name of the upstream node for the standby, and the /nodes API returns a warning that the standby is not attached to its upstream node.
The /nodes API might return the following information for the node.
{ "localClusterFailover": "MANUAL", "localClusterHealth": "DEGRADED", "localClusterState": [ { "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2", "failover": { "details": "failover = manual", "mode": "MANUAL", "repmgrd": { "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable", "status": "NOT APPLICABLE" } }, "id": primary_node_ID, "location": "default", "name": "primary_host_name", "nodeHealth": "HEALTHY", "nodeRole": "PRIMARY", "role": "primary", "status": "* running", "upstream": "" }, { "connectionString": "host=unattached_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2", "failover": { "details": "failover = manual", "mode": "MANUAL", "repmgrd": { "details": "On node unattached_standby_node_ID (unattached_standby_host_name): repmgrd = not applicable", "status": "NOT APPLICABLE" } }, "id": unattached_standby_node_ID, "location": "default", "name": "unattached_standby_host_name", "nodeHealth": "HEALTHY", "nodeRole": "STANDBY", "role": "standby", "status": "running", "upstream": "! upstream_host_name" }, { "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2", "failover": { "details": "failover = manual", "mode": "MANUAL", "repmgrd": { "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable", "status": "NOT APPLICABLE" } }, "id": running_standby_node_ID, "location": "default", "name": "running_standby_host_name", "nodeHealth": "HEALTHY", "nodeRole": "STANDBY", "role": "standby", "status": "running", "upstream": "upstream_host_name" } ], "warnings": [ "node \"unattached_standby_host_name\" (ID: unattached_standby_node_ID) is not attached to its upstream node \"upstream_host_name\" (ID: upstream_node_id)" ] }
If a standby node becomes unattached, you must reattach it as soon as possible. If the node stays unattached for too long, it can fall so far behind the continuous stream of WAL records from the primary that it is no longer able to resume replication.
Cause
To ensure data integrity, the PostgreSQL database uses Write-Ahead Logging (WAL). The primary node streams the WAL constantly to the active standby nodes for replication and recovery purposes. The standby nodes process the WAL when they receive it. If a standby node becomes unattached, it stops receiving the WAL and cannot be a candidate for promotion to become a new primary.
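On the primary node, you can see which standbys are still streaming WAL by looking at PostgreSQL's pg_stat_replication view; a standby that has become unattached no longer appears there. The following is a minimal sketch only, and the connection parameters (the repmgr database and user over a local connection, no password) are assumptions taken from the connectionString values shown in the /nodes output above; use whatever credentials apply in your environment.

import psycopg2

def streaming_standbys():
    # Assumed connection parameters; the repmgr database and user appear in
    # the connectionString values returned by the /nodes API.
    conn = psycopg2.connect(dbname="repmgr", user="repmgr")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT application_name, client_addr, state, replay_lsn "
                "FROM pg_stat_replication"
            )
            return cur.fetchall()
    finally:
        conn.close()

# Each row corresponds to one standby that is currently attached and streaming.
for name, addr, state, replay_lsn in streaming_standbys():
    print(f"{name} ({addr}): state={state}, replay_lsn={replay_lsn}")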
Solution
- Deploy a new standby node.
- Unregister the unattached standby node.
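After you complete both steps, you can confirm that the cluster has recovered by querying the /nodes API again. The sketch below reuses the same assumed endpoint and credentials as the earlier example and simply checks that localClusterHealth is no longer DEGRADED, that no warnings are returned, and that no standby shows an exclamation point (!) before its upstream node.

import requests

APPLIANCE = "https://appliance_host:5480"   # assumed, as in the earlier sketch
AUTH = ("root", "appliance_root_password")  # assumed credentials

cluster = requests.get(f"{APPLIANCE}/api/1.0.0/nodes", auth=AUTH, verify=False).json()

recovered = (
    cluster.get("localClusterHealth") != "DEGRADED"
    and not cluster.get("warnings")
    and all(not node.get("upstream", "").startswith("!")
            for node in cluster.get("localClusterState", []))
)
print("Cluster recovered" if recovered else "Cluster still degraded")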