VMware Cloud Director maintains synchronous streaming replication between the nodes. If a standby node becomes unattached, you must determine the cause and resolve the problem.
Problem
The VMware Cloud Director appliance management UI shows the cluster health as DEGRADED, the status of the unattached standby node is running, and there is an exclamation point (!) before the name of the standby node's upstream node.
On the unattached standby node, the PostgreSQL log contains repeated errors indicating that a required WAL segment is no longer available on the primary:
2020-10-08 04:10:50.064 UTC [13390] LOG: started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:50.064 UTC [13390] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000110000002100000080 has already been removed
2020-10-08 04:10:55.047 UTC [13432] LOG: started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:55.047 UTC [13432] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000110000002100000080 has already been removed
The /nodes API reports that the localClusterHealth is DEGRADED, the node status is running, and the nodeHealth is HEALTHY. There is an exclamation point (!) before the name of the upstream node for the standby, and the /nodes API returns a warning that the standby is not attached to its upstream node.
The /nodes API might return the following information for the node.
{ "localClusterFailover": "MANUAL", "localClusterHealth": "DEGRADED", "localClusterState": [ { "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2", "failover": { "details": "failover = manual", "mode": "MANUAL", "repmgrd": { "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable", "status": "NOT APPLICABLE" } }, "id": primary_node_ID, "location": "default", "name": "primary_host_name", "nodeHealth": "HEALTHY", "nodeRole": "PRIMARY", "role": "primary", "status": "* running", "upstream": "" }, { "connectionString": "host=unattached_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2", "failover": { "details": "failover = manual", "mode": "MANUAL", "repmgrd": { "details": "On node unattached_standby_node_ID (unattached_standby_host_name): repmgrd = not applicable", "status": "NOT APPLICABLE" } }, "id": unattached_standby_node_ID, "location": "default", "name": "unattached_standby_host_name", "nodeHealth": "HEALTHY", "nodeRole": "STANDBY", "role": "standby", "status": "running", "upstream": "! upstream_host_name" }, { "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2", "failover": { "details": "failover = manual", "mode": "MANUAL", "repmgrd": { "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable", "status": "NOT APPLICABLE" } }, "id": running_standby_node_ID, "location": "default", "name": "running_standby_host_name", "nodeHealth": "HEALTHY", "nodeRole": "STANDBY", "role": "standby", "status": "running", "upstream": "upstream_host_name" } ], "warnings": [ "node \"unattached_standby_host_name\" (ID: unattached_standby_node_ID) is not attached to its upstream node \"upstream_host_name\" (ID: upstream_node_id)" ] }
If a standby node becomes unattached, you must reattach it as soon as possible. If the node stays unattached for too long, it can fall so far behind the continuous stream of WAL records from the primary that it is no longer able to resume replication.
Cause
To ensure data integrity, the PostgreSQL database uses Write-Ahead Logging (WAL). The primary node streams the WAL constantly to the active standby nodes for replication and recovery purposes. The standby nodes process the WAL when they receive it. If a standby node becomes unattached, it stops receiving the WAL and cannot be a candidate for promotion to become a new primary.
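On the primary node, you can see which standbys are still streaming WAL by looking at PostgreSQL's pg_stat_replication view; a standby that has become unattached no longer appears there. The following is a minimal sketch only, and the connection parameters (the repmgr database and user over a local connection, no password) are assumptions taken from the connectionString values shown in the /nodes output above; use whatever credentials apply in your environment.

import psycopg2

def streaming_standbys():
    # Assumed connection parameters; the repmgr database and user appear in
    # the connectionString values returned by the /nodes API.
    conn = psycopg2.connect(dbname="repmgr", user="repmgr")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT application_name, client_addr, state, replay_lsn "
                "FROM pg_stat_replication"
            )
            return cur.fetchall()
    finally:
        conn.close()

# Each row corresponds to one standby that is currently attached and streaming.
for name, addr, state, replay_lsn in streaming_standbys():
    print(f"{name} ({addr}): state={state}, replay_lsn={replay_lsn}")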
Solution
- Deploy a new standby node.
- Unregister the unattached standby node.
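After you complete both steps, you can confirm that the cluster has recovered by querying the /nodes API again. The sketch below reuses the same assumed endpoint and credentials as the earlier example and simply checks that localClusterHealth is no longer DEGRADED, that no warnings are returned, and that no standby shows an exclamation point (!) before its upstream node.

import requests

APPLIANCE = "https://appliance_host:5480"   # assumed, as in the earlier sketch
AUTH = ("root", "appliance_root_password")  # assumed credentials

cluster = requests.get(f"{APPLIANCE}/api/1.0.0/nodes", auth=AUTH, verify=False).json()

recovered = (
    cluster.get("localClusterHealth") != "DEGRADED"
    and not cluster.get("warnings")
    and all(not node.get("upstream", "").startswith("!")
            for node in cluster.get("localClusterState", []))
)
print("Cluster recovered" if recovered else "Cluster still degraded")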