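VMware Cloud Director maintains synchronous streaming replication between nodes. If a standby node becomes unreachable, you must determine the cause and resolve the problem.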
Problem
The VMware Cloud Director appliance management UI shows the cluster health as DEGRADED, and one of the standby nodes has the status ? unreachable.
The /nodes API returns the following information: localClusterHealth is DEGRADED, the node's status is ? unreachable, and nodeHealth is UNHEALTHY.
For example, the /nodes API might return the following node information.
{
    "localClusterFailover": "MANUAL",
    "localClusterHealth": "DEGRADED",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "* running",
            "upstream": ""
        },
        {
            "connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover state unknown - unable to ssh to failed or unreachable node",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = n/a",
                    "status": "UNKNOWN"
                }
            },
            "id": unreachable_standby_node_ID,
            "location": "default",
            "name": "unreachable_standby_host_name",
            "nodeHealth": "UNHEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "? unreachable",
            "upstream": "primary_host_name"
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_IP): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "primary_host_name"
        }
    ],
    "warnings": [
        "unable to connect to node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID)",
        "node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID) is registered as an active standby but is unreachable"
    ]
}
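If you collect the /nodes output in a script, the affected node can be picked out from the nodeHealth field. The following is a minimal sketch, assuming the response has already been fetched and parsed into a Python dictionary with the field names shown in the sample above. The function name find_unhealthy_nodes and the host names in the trimmed sample are hypothetical and used only for illustration.

```python
from typing import Any, Dict, List


def find_unhealthy_nodes(nodes_response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return name, role, and status for every node not reported as HEALTHY.

    Field names (localClusterState, nodeHealth, nodeRole, name, status) are
    taken from the sample /nodes response; this helper itself is hypothetical.
    """
    unhealthy = []
    for node in nodes_response.get("localClusterState", []):
        if node.get("nodeHealth") != "HEALTHY":
            unhealthy.append(
                {
                    "name": str(node.get("name", "")),
                    "role": str(node.get("nodeRole", "")),
                    "status": str(node.get("status", "")),
                }
            )
    return unhealthy


# Trimmed-down response with made-up host names, for illustration only.
sample = {
    "localClusterHealth": "DEGRADED",
    "localClusterState": [
        {"name": "primary-1", "nodeHealth": "HEALTHY",
         "nodeRole": "PRIMARY", "status": "* running"},
        {"name": "standby-1", "nodeHealth": "UNHEALTHY",
         "nodeRole": "STANDBY", "status": "? unreachable"},
        {"name": "standby-2", "nodeHealth": "HEALTHY",
         "nodeRole": "STANDBY", "status": "running"},
    ],
}

for node in find_unhealthy_nodes(sample):
    print(node["name"], node["role"], node["status"])
```

The sketch deliberately leaves out how the response is retrieved, because the request URL and authentication depend on your environment.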
Cause
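To ensure data integrity, the PostgreSQL database uses write-ahead logging (WAL). The primary node continuously streams the WAL to the active standby nodes for replication and recovery. The standby nodes process the WAL as they receive it. If a standby node is unreachable, it stops receiving the WAL and cannot be a candidate for promotion to a new primary.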
Solution
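- Verify that the virtual machine of the unreachable standby node is running.
- Verify that the network connection of the standby node is working.
- Verify that no SSH problem is preventing the standby node from communicating with the other nodes.
- Verify that the vpostgres service is running on the standby node.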
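As a first pass over the network, SSH, and database checks above, you can test from another node whether the standby accepts TCP connections. The following is a minimal sketch, not an official tool. It assumes SSH listens on port 22 and PostgreSQL on port 5432 (the defaults, which your deployment might change), and the function names port_is_open and check_standby are hypothetical. The 2-second timeout mirrors the connect_timeout=2 value in the connection strings of the sample response.

```python
import socket
from typing import Dict, Optional


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_standby(host: str, ports: Optional[Dict[str, int]] = None) -> Dict[str, bool]:
    """Probe the ports a standby node is expected to expose.

    The default port numbers are assumptions (SSH 22, PostgreSQL 5432).
    """
    if ports is None:
        ports = {"ssh": 22, "postgres": 5432}
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling check_standby with the IP address of the unreachable standby returns a dictionary such as {"ssh": False, "postgres": False} when neither port answers. An open port only shows that something is listening. It does not confirm that SSH authentication works or that the vpostgres service is healthy, so those still need to be checked on the node itself.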