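VMware Cloud Director maintains synchronous streaming replication between nodes. If a standby node becomes unattached, you must determine the cause and resolve the problem.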

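Problem

The VMware Cloud Director appliance management UI shows the cluster health as DEGRADED. One standby node is unattached: its status is running, and an exclamation point (!) appears before the name of its upstream node.

The PostgreSQL log shows that the primary node has deleted a WAL segment that the standby still needs.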
2020-10-08 04:10:50.064 UTC [13390] LOG:  started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:50.064 UTC [13390] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000110000002100000080 has already been removed
2020-10-08 04:10:55.047 UTC [13432] LOG:  started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:55.047 UTC [13432] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000110000002100000080 has already been removed
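
The segment name in the error can be cross-checked against the streaming position in the preceding LOG line. The following is a minimal sketch, not part of the appliance tooling: the function names are hypothetical, and the decoding assumes the PostgreSQL default WAL segment size of 16 MB.

```python
import re

# Matches the standby-side error shown in the log excerpt above.
REMOVED_SEGMENT = re.compile(
    r"requested WAL segment ([0-9A-F]{24}) has already been removed"
)

def find_removed_segments(lines):
    """Return the distinct WAL segment names reported as already removed."""
    found = set()
    for line in lines:
        match = REMOVED_SEGMENT.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)

def decode_segment(name, segment_size=16 * 1024 * 1024):
    """Split a 24-hex-digit WAL file name into (timeline, start LSN).

    Assumption: segment_size is the PostgreSQL default of 16 MB.
    """
    timeline = int(name[0:8], 16)    # first 8 hex digits: timeline ID
    log_id = int(name[8:16], 16)     # next 8: high 32 bits of the LSN
    seg_no = int(name[16:24], 16)    # last 8: segment number within log_id
    offset = seg_no * segment_size   # byte offset where the segment starts
    return timeline, f"{log_id:X}/{offset:08X}"

# Two of the log lines from the excerpt above.
sample_log = [
    "2020-10-08 04:10:50.064 UTC [13390] LOG:  started streaming WAL from primary at 21/80000000 on timeline 17",
    "2020-10-08 04:10:50.064 UTC [13390] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000110000002100000080 has already been removed",
]

for segment in find_removed_segments(sample_log):
    print(segment, decode_segment(segment))
```

For the excerpt above, the decoded timeline (17) and start position (21/80000000) agree with the LOG line, which confirms that the standby is requesting exactly the segment the primary no longer has.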

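The /nodes API returns information that the localClusterHealth is DEGRADED, the node status is running, and the nodeHealth is HEALTHY. An exclamation point (!) appears before the name of the standby's upstream node, and the /nodes API returns a warning that the standby node is not attached to its upstream node.

For example, the /nodes API might return the following information for the nodes.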
{
    "localClusterFailover": "MANUAL",
    "localClusterHealth": "DEGRADED",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "* running",
            "upstream": ""
        },
        {
            "connectionString": "host=unattached_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node unattached_standby_node_ID (unattached_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": unattached_standby_node_ID,
            "location": "default",
            "name": "unattached_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "! upstream_host_name"
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "upstream_host_name"
        }
    ],
    "warnings": [
        "node \"unattached_standby_host_name\" (ID: unattached_standby_node_ID) is not attached to its upstream node \"upstream_host_name\" (ID: upstream_node_id)"
    ]
}
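
A response of this shape can be checked programmatically for unattached standbys. The sketch below is illustrative only: the host names are hypothetical stand-ins for the placeholders above, the helper name is an assumption, and only fields that appear in the example response are read.

```python
import json

def unattached_standbys(nodes_response):
    """Return names of standby nodes whose upstream is flagged with '!'."""
    names = []
    for node in nodes_response.get("localClusterState", []):
        upstream = node.get("upstream", "")
        if node.get("nodeRole") == "STANDBY" and upstream.startswith("!"):
            names.append(node["name"])
    return names

# Trimmed, hypothetical response; real responses carry more fields per node.
sample = json.loads("""
{
    "localClusterHealth": "DEGRADED",
    "localClusterState": [
        {"name": "vcd-primary", "nodeRole": "PRIMARY",
         "status": "* running", "upstream": ""},
        {"name": "vcd-standby-1", "nodeRole": "STANDBY",
         "status": "running", "upstream": "! vcd-primary"},
        {"name": "vcd-standby-2", "nodeRole": "STANDBY",
         "status": "running", "upstream": "vcd-primary"}
    ],
    "warnings": []
}
""")

if sample["localClusterHealth"] == "DEGRADED":
    print("Unattached standbys:", unattached_standbys(sample))
```

Note that the unattached node still reports a status of running and a nodeHealth of HEALTHY, so the check keys on the upstream marker rather than on those fields.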

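If a standby node becomes unattached, you must reattach it as soon as possible. If the node stays unattached for too long, it might fall so far behind the WAL records that the primary node continuously streams that it can no longer resume replication.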
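
To get a feel for how far behind a standby is, the gap between two WAL positions can be computed from their textual form. This is a minimal sketch under stated assumptions: the function names are hypothetical, the primary position in the example is invented for illustration, and the segment count assumes the PostgreSQL default 16 MB segment size.

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL LSN such as '21/80000000' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lag_bytes(primary_lsn, standby_lsn):
    """Bytes of WAL that the standby has not yet received."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)

def lag_segments(primary_lsn, standby_lsn, segment_size=16 * 1024 * 1024):
    """Whole WAL segments between the two positions (assumes 16 MB segments)."""
    return lag_bytes(primary_lsn, standby_lsn) // segment_size

# The standby position comes from the log excerpt earlier in this topic;
# the primary position is a hypothetical value for illustration.
standby_at = "21/80000000"
primary_at = "21/C0000000"

print(lag_bytes(primary_at, standby_at), "bytes behind")
print(lag_segments(primary_at, standby_at), "segments behind")
```

Once the primary has removed any of the segments in that gap, streaming cannot resume from the standby's position, which is the failure shown in the PostgreSQL log.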

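Cause

To ensure data integrity, the PostgreSQL database uses Write-Ahead Logging (WAL). The primary node continuously streams the WAL to the active standby nodes for replication and recovery. The standby nodes process the WAL as they receive it. If a standby node becomes unattached, it stops receiving the WAL and cannot be a candidate for promotion to the new primary node.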

Solution

  1. Deploy a new standby node.
  2. Unregister the unattached standby node.

What to do next

See Recover from a Standby Cell Failure in a High Availability Cluster.