In a VMware Cloud Director appliance deployment with a database HA configuration, the postgres user cannot connect over SSH to its peer database nodes.

Problem

When an SSH problem occurs between database nodes, VMware Cloud Director reports localClusterHealth as SSH_PROBLEM. This is a critical problem that must be fixed as soon as possible.

You can view localClusterHealth in the VMware Cloud Director appliance management user interface, or by running the /nodes VMware Cloud Director appliance API. See the VMware Cloud Director appliance API documentation.

When you run the /nodes API on a peer of the node that has the SSH problem, the API returns localClusterHealth as SSH_PROBLEM and localClusterFailover as INDETERMINATE. The failover mode is INDETERMINATE because the node on which you run the /nodes API cannot connect over SSH to one of its peers. For the node with the SSH problem, the "details" field in the "failover" section of the response body shows: ssh failed. command: ssh unreachable_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf

For example, if a standby node has the SSH problem and you run GET https://primary_host_IP:5480/api/1.0.0/nodes, the /nodes API might return the following information.
{
    "localClusterFailover": "INDETERMINATE",
    "localClusterHealth": "SSH_PROBLEM",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "* running",
            "upstream": ""
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "primary_host_name"
        },
        {
            "connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "ssh failed. command: ssh unreachable_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = not running",
                    "status": "NOT RUNNING"
                }
            },
            "id": unreachable_standby_node_ID,
            "location": "default",
            "name": "unreachable_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "primary_host_name"
        }
    ],
    "warnings": []
}
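
To spot the affected node quickly in a response like the one above, the check can be scripted. The following is a minimal Python sketch, not part of the appliance tooling: the function name find_ssh_failures is hypothetical, and it assumes the response has already been parsed into a dictionary with the field names shown in the sample (localClusterState, failover, details, name).

```python
from typing import Any, Dict, List


def find_ssh_failures(nodes_response: Dict[str, Any]) -> List[str]:
    """Return the names of nodes whose failover details report an SSH failure.

    Assumes the structure of the sample /nodes response shown above.
    """
    failed = []
    for node in nodes_response.get("localClusterState", []):
        details = node.get("failover", {}).get("details", "")
        # The sample output prefixes the failing command with "ssh failed".
        if details.startswith("ssh failed"):
            failed.append(node.get("name", "<unknown>"))
    return failed


# Trimmed-down version of the sample response, with placeholder values.
sample = {
    "localClusterFailover": "INDETERMINATE",
    "localClusterHealth": "SSH_PROBLEM",
    "localClusterState": [
        {"name": "primary_host_name",
         "failover": {"details": "failover = manual", "mode": "MANUAL"}},
        {"name": "unreachable_standby_host_name",
         "failover": {"details": "ssh failed. command: ssh ...", "mode": "UNKNOWN"}},
    ],
}

print(find_ssh_failures(sample))
```

When the response comes from a healthy peer, the list names the node that cannot be reached over SSH. When it comes from the untrusted node itself, every peer is likely to appear in the list, which matches the second sample in this topic.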

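If you run GET https://unreachable_standby_host_IP:5480/api/1.0.0/nodes, the localClusterFailover and localClusterState information might be incorrect because the node is untrusted. The /nodes API returns warning messages indicating that unreachable_standby_host_name cannot connect to its peer nodes.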

For example, the /nodes API might return the following information.
{
    "localClusterFailover": "MANUAL",
    "localClusterHealth": "SSH_PROBLEM",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "ssh failed. command: ssh primary_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = n/a",
                    "status": "UNKNOWN"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "UNHEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "? running",
            "upstream": ""
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "ssh failed. command: ssh running_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = n/a",
                    "status": "UNKNOWN"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "UNHEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "? running",
            "upstream": "primary_host_name"
        },
        {
            "connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": unreachable_standby_node_ID,
            "location": "default",
            "name": "unreachable_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "? primary_host_name"
        }
    ],
    "warnings": [
        "unable to connect to node \"primary_host_name\" (ID: primary_node_ID)",
        "unable to connect to node \"running_standby_host_name\" (ID: running_standby_node_ID)",
        "unable to connect to node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID)'s upstream node \"primary_host_name\" (ID: primary_node_ID)",
        "unable to determine if node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID) is attached to its upstream node \"primary_host_name\" (ID: primary_node_ID)"
    ]
}

Cause

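The SSH certificates of the VMware Cloud Director postgres user are stored in the NFS shared transfer server storage. All database nodes must have access to the shared transfer server storage. If a database node becomes untrusted, that is, the SSH certificates of the postgres user are no longer valid or can no longer be accessed, the node cannot use the SSH client to run commands on its peer nodes. The VMware Cloud Director appliance requires this capability to run correctly in HA mode.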

Solution

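  1. Determine whether there are connectivity problems between the nodes and correct them. See Checking the Connectivity Status of a VMware Cloud Director Database High Availability Cluster.
  2. Verify that the appliance-sync.timer service is running on the node that has the SSH problem by running the following command.
    systemctl status appliance-sync.timer
    For example, the command might return: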
    * appliance-sync.timer - Periodic check and sync of needed files for Cloud Appliance functionality
       Loaded: loaded (/lib/systemd/system/appliance-sync.timer; enabled; vendor preset: enabled)
       Active: active (waiting) since Sat 2020-09-05 23:22:49 UTC; 1 months 9 days ago
     
    Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
  3. If the status of the appliance-sync.timer service is not active, run the following command to start the service.
    systemctl start appliance-sync.timer
  4. Wait about 90 seconds, and then confirm that the cluster health is HEALTHY by using the VMware Cloud Director appliance management user interface or by calling the /nodes API.
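
Steps 2 through 4 can also be scripted. The following Python sketch is illustrative only and makes several assumptions: the helper names ensure_timer_active and wait_for_healthy are hypothetical, the script runs on the affected node with sufficient privileges to call systemctl, and fetch_nodes is a caller-supplied function that returns the parsed /nodes response (how it authenticates against the appliance API is left out). The timeout and polling interval are arbitrary choices, not documented values.

```python
import subprocess
import time
from typing import Any, Callable, Dict


def ensure_timer_active(unit: str = "appliance-sync.timer",
                        run: Callable[..., Any] = subprocess.run) -> bool:
    """Start the unit if systemctl reports that it is not active.

    Returns True if a start command was issued, False if the unit was
    already active.
    """
    # "systemctl is-active --quiet" exits with 0 only when the unit is active.
    result = run(["systemctl", "is-active", "--quiet", unit])
    if result.returncode == 0:
        return False
    run(["systemctl", "start", unit], check=True)
    return True


def wait_for_healthy(fetch_nodes: Callable[[], Dict[str, Any]],
                     timeout: float = 180.0,
                     interval: float = 15.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the /nodes response until localClusterHealth is HEALTHY.

    fetch_nodes is a hypothetical callable supplied by the caller.
    Returns False if the cluster is still not healthy when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        if fetch_nodes().get("localClusterHealth") == "HEALTHY":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A caller would typically run ensure_timer_active() first and then wait_for_healthy(fetch_nodes). Injecting run and sleep keeps the sketch easy to exercise without touching systemd or waiting in real time.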