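VMware Cloud Director maintains synchronous streaming replication between the nodes. If a standby node becomes unattached, you must determine the cause and resolve the problem.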
Problem
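The VMware Cloud Director appliance management UI shows the cluster health as DEGRADED. The status of one unattached standby node is running, and an exclamation point (!) appears before the name of that standby node's upstream node.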
The PostgreSQL log shows that the primary node deleted a WAL segment.
2020-10-08 04:10:50.064 UTC [13390] LOG: started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:50.064 UTC [13390] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000110000002100000080 has already been removed
2020-10-08 04:10:55.047 UTC [13432] LOG: started streaming WAL from primary at 21/80000000 on timeline 17
2020-10-08 04:10:55.047 UTC [13432] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000110000002100000080 has already been removed
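The segment name in the FATAL message can be related back to the streaming position in the LOG message. The following is a minimal Python sketch, not part of the product, that extracts removed segment names from log text and decodes them. It assumes the default PostgreSQL WAL segment size of 16 MiB; the function names `decode_wal_segment` and `find_removed_segments` are hypothetical helpers introduced only for illustration.

```python
import re

# Matches the message a standby logs when the primary has already removed
# the WAL segment that the standby asked for.
REMOVED_RE = re.compile(
    r"requested WAL segment ([0-9A-F]{24}) has already been removed"
)

# Assumption: the cluster uses the default 16 MiB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def decode_wal_segment(name, segment_size=WAL_SEGMENT_SIZE):
    """Split a 24-hex-digit WAL segment name into (timeline, start LSN)."""
    timeline = int(name[0:8], 16)    # first 8 hex digits: timeline ID
    log_id = int(name[8:16], 16)     # next 8: high 32 bits of the LSN
    seg_no = int(name[16:24], 16)    # last 8: segment number within log_id
    offset = seg_no * segment_size   # byte offset where the segment starts
    return timeline, f"{log_id:X}/{offset:X}"


def find_removed_segments(log_text):
    """Return distinct removed segment names in order of first appearance."""
    seen = []
    for match in REMOVED_RE.finditer(log_text):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    sample = (
        "2020-10-08 04:10:50.064 UTC [13390] FATAL: could not receive data "
        "from WAL stream: ERROR: requested WAL segment "
        "000000110000002100000080 has already been removed"
    )
    for segment in find_removed_segments(sample):
        print(segment, decode_wal_segment(segment))
```

Under that segment-size assumption, the segment in the log above decodes to timeline 17 and start position 21/80000000, which is the position the standby reports when it tries to start streaming.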
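The /nodes API returns information indicating that localClusterHealth is DEGRADED, the node status is running, and nodeHealth is HEALTHY. An exclamation point (!) appears before the name of the standby node's upstream node, and the /nodes API returns a warning that the standby node is not attached to its upstream node.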
For example, the /nodes API might return the following information for the nodes.
{
  "localClusterFailover": "MANUAL",
  "localClusterHealth": "DEGRADED",
  "localClusterState": [
    {
      "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
      "failover": {
        "details": "failover = manual",
        "mode": "MANUAL",
        "repmgrd": {
          "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
          "status": "NOT APPLICABLE"
        }
      },
      "id": primary_node_ID,
      "location": "default",
      "name": "primary_host_name",
      "nodeHealth": "HEALTHY",
      "nodeRole": "PRIMARY",
      "role": "primary",
      "status": "* running",
      "upstream": ""
    },
    {
      "connectionString": "host=unattached_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
      "failover": {
        "details": "failover = manual",
        "mode": "MANUAL",
        "repmgrd": {
          "details": "On node unattached_standby_node_ID (unattached_standby_host_name): repmgrd = not applicable",
          "status": "NOT APPLICABLE"
        }
      },
      "id": unattached_standby_node_ID,
      "location": "default",
      "name": "unattached_standby_host_name",
      "nodeHealth": "HEALTHY",
      "nodeRole": "STANDBY",
      "role": "standby",
      "status": "running",
      "upstream": "! upstream_host_name"
    },
    {
      "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
      "failover": {
        "details": "failover = manual",
        "mode": "MANUAL",
        "repmgrd": {
          "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable",
          "status": "NOT APPLICABLE"
        }
      },
      "id": running_standby_node_ID,
      "location": "default",
      "name": "running_standby_host_name",
      "nodeHealth": "HEALTHY",
      "nodeRole": "STANDBY",
      "role": "standby",
      "status": "running",
      "upstream": "upstream_host_name"
    }
  ],
  "warnings": [
    "node \"unattached_standby_host_name\" (ID: unattached_standby_node_ID) is not attached to its upstream node \"upstream_host_name\" (ID: upstream_node_id)"
  ]
}
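The check described above, a standby whose upstream name is prefixed with an exclamation point, can be sketched in Python. This is an illustrative sketch only: it assumes the /nodes response has already been retrieved and parsed into a dictionary, it does not show how to call the API, and `find_unattached_standbys` is a hypothetical helper name. The host names in the sample data are placeholders.

```python
def find_unattached_standbys(cluster):
    """Return the names of standby nodes whose upstream is flagged with '!'."""
    unattached = []
    for node in cluster.get("localClusterState", []):
        is_standby = node.get("nodeRole") == "STANDBY"
        upstream = node.get("upstream", "")
        # An unattached standby reports its upstream as "! <upstream name>".
        if is_standby and upstream.startswith("!"):
            unattached.append(node.get("name"))
    return unattached


if __name__ == "__main__":
    # Trimmed-down stand-in for a parsed /nodes response (placeholder names).
    response = {
        "localClusterHealth": "DEGRADED",
        "localClusterState": [
            {"name": "primary-host", "nodeRole": "PRIMARY", "upstream": ""},
            {"name": "standby-a", "nodeRole": "STANDBY",
             "upstream": "! primary-host"},
            {"name": "standby-b", "nodeRole": "STANDBY",
             "upstream": "primary-host"},
        ],
        "warnings": [],
    }
    if response["localClusterHealth"] == "DEGRADED":
        print("Unattached standbys:", find_unattached_standbys(response))
```

The sketch relies only on the nodeRole and upstream fields, because in this failure mode status and nodeHealth still report running and HEALTHY for the affected node.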
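If a standby node becomes unattached, you must reattach it as soon as possible. If the node stays unattached for too long, it can fall so far behind in processing the WAL records that the primary node continuously streams that it cannot resume replication.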
Cause
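To ensure data integrity, the PostgreSQL database uses write-ahead logging (WAL). The primary node continuously streams the WAL to the active standby nodes for replication and recovery. The standby nodes process the WAL as they receive it. If a standby node becomes unattached, it stops receiving the WAL and cannot be a candidate for promotion to the new primary node.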
Solution
- Deploy a new standby node.
- Unregister the unattached standby node.