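VMware Cloud Director maintains synchronous streaming replication between the nodes. If a standby node becomes unattached, you must determine the cause and resolve the problem.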

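Problem

The VMware Cloud Director appliance management UI shows the cluster health as DEGRADED, the status of one unattached standby node as running, and an exclamation point (!) before the name of the standby node's upstream node.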

The PostgreSQL log shows that the primary node has already removed a WAL segment that the standby requested.
2020-10-08 04:10:50.064 UTC [13390] LOG:  started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:50.064 UTC [13390] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000110000002100000080 has already been removed
2020-10-08 04:10:55.047 UTC [13432] LOG:  started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:55.047 UTC [13432] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000110000002100000080 has already been removed
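
To relate the segment file name in the error to the streaming position in the preceding LOG line, the following is a minimal Python sketch that decodes a WAL segment file name. It assumes the PostgreSQL default WAL segment size of 16 MB, and the function name decode_wal_segment is hypothetical, not part of any VMware or PostgreSQL tool.

```python
# Minimal sketch: decode a PostgreSQL WAL segment file name.
# Assumption: the default WAL segment size of 16 MB is in use.
DEFAULT_SEGMENT_SIZE = 16 * 1024 * 1024


def decode_wal_segment(name, segment_size=DEFAULT_SEGMENT_SIZE):
    """Return (timeline, start_lsn) for a 24-character WAL segment name."""
    if len(name) != 24:
        raise ValueError("expected a 24-character hexadecimal segment name")
    timeline = int(name[0:8], 16)    # first 8 hex digits: timeline ID
    log_id = int(name[8:16], 16)     # next 8 hex digits: high 32 bits of the LSN
    segment = int(name[16:24], 16)   # last 8 hex digits: segment number within the log
    offset = segment * segment_size  # byte offset where the segment starts
    return timeline, f"{log_id:X}/{offset:X}"


timeline, start_lsn = decode_wal_segment("000000110000002100000080")
print(timeline, start_lsn)
```

For the segment in the log above, this yields timeline 17 and start position 21/80000000, which matches the position at which the standby tried to start streaming.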

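The /nodes API returns the following information: localClusterHealth is DEGRADED, the node's status is running, and its nodeHealth is HEALTHY. There is an exclamation point (!) before the name of the standby node's upstream node, and the /nodes API returns a warning that the standby node is not attached to its upstream node.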

For example, the /nodes API might return the following node information.
{
    "localClusterFailover": "MANUAL",
    "localClusterHealth": "DEGRADED",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "* running",
            "upstream": ""
        },
        {
            "connectionString": "host=unattached_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node unattached_standby_node_ID (unattached_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": unattached_standby_node_ID,
            "location": "default",
            "name": "unattached_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "! upstream_host_name"
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "upstream_host_name"
        }
    ],
    "warnings": [
        "node \"unattached_standby_host_name\" (ID: unattached_standby_node_ID) is not attached to its upstream node \"upstream_host_name\" (ID: upstream_node_id)"
    ]
}
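
A response of this shape can be checked programmatically. The following is a minimal Python sketch that lists standby nodes whose upstream value is flagged with an exclamation point. It assumes the response has already been retrieved and parsed into a dictionary; the function name find_unattached_standbys and the host names in the sample data are hypothetical.

```python
# Minimal sketch: find unattached standby nodes in a parsed /nodes response.
# Assumption: the response is already parsed into a dict (for example with json.loads).
def find_unattached_standbys(nodes_response):
    """Return the names of standby nodes whose upstream is flagged with '!'."""
    unattached = []
    for node in nodes_response.get("localClusterState", []):
        upstream = node.get("upstream", "")
        is_standby = node.get("nodeRole") == "STANDBY"
        if is_standby and upstream.lstrip().startswith("!"):
            unattached.append(node.get("name"))
    return unattached


# Hypothetical sample data that mirrors the fields shown in the example above.
sample = {
    "localClusterHealth": "DEGRADED",
    "localClusterState": [
        {"name": "vcd-primary", "nodeRole": "PRIMARY",
         "status": "* running", "upstream": ""},
        {"name": "vcd-standby-1", "nodeRole": "STANDBY",
         "status": "running", "upstream": "! vcd-primary"},
        {"name": "vcd-standby-2", "nodeRole": "STANDBY",
         "status": "running", "upstream": "vcd-primary"},
    ],
}

print(find_unattached_standbys(sample))
```

Because an unattached standby still reports a status of running and a nodeHealth of HEALTHY, the sketch keys on the upstream flag rather than on those fields.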

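If a standby node becomes unattached, you must reattach it as soon as possible. If the node stays unattached for too long, it might fall so far behind in processing the WAL records that the primary node streams continuously that it cannot resume replication.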

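Cause

To ensure data integrity, the PostgreSQL database uses write-ahead logging (WAL). The primary node constantly streams the WAL to the active standby nodes for replication and recovery. The standby nodes process the WAL as they receive it. If a standby node becomes unattached, it stops receiving the WAL and cannot be a candidate for promotion to new primary.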

Solution

  1. Deploy a new standby node.
  2. Unregister the unattached standby node.

What to do next

See Recover from a Standby Cell Failure in a High Availability Cluster.