VMware Cloud Director Cluster Health Indicates an SSH Problem

In a VMware Cloud Director appliance deployment that uses a database HA configuration, the postgres user cannot connect to peer database nodes over SSH.

Problem

When there is an SSH problem between database nodes, VMware Cloud Director reports localClusterHealth as SSH_PROBLEM. This is a critical problem, and you must resolve it as soon as possible.

You can view localClusterHealth in the VMware Cloud Director appliance management user interface or by running the /nodes VMware Cloud Director appliance API. See the VMware Cloud Director Appliance API documentation.
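
As a minimal sketch of calling the /nodes API from a script, the following Python builds the GET request for port 5480 and reads the JSON response. The function names, the use of HTTP Basic authentication, and the `cafile` parameter are assumptions made for illustration. Check the appliance API documentation for the authentication scheme that your version expects.

```python
import base64
import json
import ssl
import urllib.request


def build_nodes_request(host, user, password):
    """Build a GET request for https://<host>:5480/api/1.0.0/nodes."""
    url = f"https://{host}:5480/api/1.0.0/nodes"
    # Assumption: the appliance API accepts HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
        method="GET",
    )


def fetch_nodes(request, cafile=None):
    """Send the request and return the parsed JSON body as a dict.

    Assumption: cafile points to a CA bundle that trusts the appliance
    certificate; with cafile=None the system trust store is used.
    """
    context = ssl.create_default_context(cafile=cafile)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)
```

A caller would pass the result of `build_nodes_request("primary_host_IP", user, password)` to `fetch_nodes` and then read the `localClusterHealth` key from the returned dictionary.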

If you run the /nodes API on a peer of the node that has the SSH problem, the API reports that localClusterHealth is SSH_PROBLEM and localClusterFailover is INDETERMINATE. The failover mode is INDETERMINATE because the node that runs the /nodes API cannot connect to one of its peer nodes over SSH. In the response body, the "details" field in the "failover" section for the node with the SSH problem shows ssh failed. command: ssh unreachable_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf.

For example, if a standby node has an SSH problem and you run GET https://primary_host_IP:5480/api/1.0.0/nodes, the /nodes API might return the following information.

{
    "localClusterFailover": "INDETERMINATE",
    "localClusterHealth": "SSH_PROBLEM",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "* running",
            "upstream": ""
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "primary_host_name"
        },
        {
            "connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "ssh failed. command: ssh unreachable_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = not running",
                    "status": "NOT RUNNING"
                }
            },
            "id": unreachable_standby_node_ID,
            "location": "default",
            "name": "unreachable_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "primary_host_name"
        }
    ],
    "warnings": []
}
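
A response like this one can also be checked programmatically. The sketch below assumes that the JSON body has already been parsed into a Python dictionary. It lists the nodes whose "failover" "details" value starts with "ssh failed". The function name and the trimmed sample data are hypothetical.

```python
def find_ssh_failures(nodes_response):
    """Return the names of nodes whose failover details report an SSH failure."""
    failed = []
    for node in nodes_response.get("localClusterState", []):
        details = node.get("failover", {}).get("details", "")
        # The documented message begins with "ssh failed. command: ssh ...".
        if details.startswith("ssh failed"):
            failed.append(node.get("name", "<unknown>"))
    return failed


# Trimmed, hypothetical sample that mirrors the shape of the response.
sample = {
    "localClusterFailover": "INDETERMINATE",
    "localClusterHealth": "SSH_PROBLEM",
    "localClusterState": [
        {"name": "primary_host_name",
         "failover": {"details": "failover = manual", "mode": "MANUAL"}},
        {"name": "running_standby_host_name",
         "failover": {"details": "failover = manual", "mode": "MANUAL"}},
        {"name": "unreachable_standby_host_name",
         "failover": {"details": "ssh failed. command: ssh unreachable_standby_host_IP "
                                 "/usr/bin/grep failover=manual "
                                 "/opt/vmware/vpostgres/10/etc/repmgr.conf",
                      "mode": "UNKNOWN"}},
    ],
    "warnings": [],
}

if sample["localClusterHealth"] == "SSH_PROBLEM":
    print("Nodes with SSH failures:", find_ssh_failures(sample))
```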

If you run GET https://unreachable_standby_host_IP:5480/api/1.0.0/nodes, the localClusterFailover and localClusterState information might be incorrect because the node is untrusted. The /nodes API returns warning messages stating that unreachable_standby_host_name cannot connect to its peer nodes.

For example, the /nodes API might return the following information.

{
    "localClusterFailover": "MANUAL",
    "localClusterHealth": "SSH_PROBLEM",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "ssh failed. command: ssh primary_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = n/a",
                    "status": "UNKNOWN"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "UNHEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "? running",
            "upstream": ""
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "ssh failed. command: ssh running_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = n/a",
                    "status": "UNKNOWN"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "UNHEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "? running",
            "upstream": "primary_host_name"
        },
        {
            "connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": unreachable_standby_node_ID,
            "location": "default",
            "name": "unreachable_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "? primary_host_name"
        }
    ],
    "warnings": [
        "unable to connect to node \"primary_host_name\" (ID: primary_node_ID)",
        "unable to connect to node \"running_standby_host_name\" (ID: running_standby_node_ID)",
        "unable to connect to node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID)'s upstream node \"primary_host_name\" (ID: primary_node_ID)",
        "unable to determine if node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID) is attached to its upstream node \"primary_host_name\" (ID: primary_node_ID)"
    ]
}
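
The two example responses differ in a way that suggests a rough heuristic. When the node that answers the request is the untrusted one, every peer shows an SSH failure and the "warnings" list is not empty. The sketch below encodes that observation for clusters of three or more nodes. It is an illustrative guess based only on the sample outputs in this topic, not documented product behavior, and the function name is hypothetical.

```python
def likely_untrusted_node(nodes_response):
    """Guess which node cannot reach its peers, or return None if unclear.

    Heuristic (assumption): if exactly one node has no SSH failure, all
    others do, and warnings are present, the responding node is the suspect.
    """
    failed, ok = [], []
    for node in nodes_response.get("localClusterState", []):
        details = node.get("failover", {}).get("details", "")
        (failed if details.startswith("ssh failed") else ok).append(node)
    total = len(failed) + len(ok)
    if total >= 3 and len(ok) == 1 and nodes_response.get("warnings"):
        return ok[0].get("name")
    return None
```

A result of None means only that the pattern did not match. Confirm the result by running the /nodes API on another node.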

Cause

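VMware Cloud Director stores the SSH certificates of the postgres user on the NFS shared transfer server storage. All database nodes must have access to the shared transfer server storage. If a database node is untrusted, that is, if the SSH certificates of the postgres user are no longer valid or accessible, that node cannot use an SSH client to run commands on its peer nodes. The VMware Cloud Director appliance requires this capability to function properly in HA mode.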

Solution

Verify whether there are connectivity problems between the nodes, and resolve any that you find. See Check the Connectivity Status of a VMware Cloud Director Database High Availability Cluster.

Run the following command to verify that the appliance-sync.timer service is running on the node that has the SSH problem.

systemctl status appliance-sync.timer

For example, the command might return the following output.

* appliance-sync.timer - Periodic check and sync of needed files for Cloud Appliance functionality
   Loaded: loaded (/lib/systemd/system/appliance-sync.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Sat 2020-09-05 23:22:49 UTC; 1 months 9 days ago
 
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
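
If you collect this output in a script, the "Active:" line is the part that matters. The sketch below extracts the state and sub-state from captured `systemctl status` text. The function name is hypothetical, and the parsing assumes the output layout shown in this topic.

```python
import re


def parse_active_state(status_output):
    """Return (state, sub_state) from the 'Active:' line, or None if absent."""
    match = re.search(
        r"^\s*Active:\s+(\S+)(?:\s+\(([^)]*)\))?",
        status_output,
        re.MULTILINE,
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


captured = """\
* appliance-sync.timer - Periodic check and sync of needed files for Cloud Appliance functionality
   Loaded: loaded (/lib/systemd/system/appliance-sync.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Sat 2020-09-05 23:22:49 UTC; 1 months 9 days ago
"""

state = parse_active_state(captured)
if state is None or state[0] != "active":
    print("appliance-sync.timer is not active")
```

On the appliance itself, `systemctl is-active appliance-sync.timer` is a shorter check that prints only the state.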

If the appliance-sync.timer service is not active, run the following command to start it.
```
systemctl start appliance-sync.timer
```
Wait approximately 90 seconds, and then use the VMware Cloud Director management UI or call the /nodes API to verify that the cluster health is healthy.
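
The wait-and-verify step can be sketched as a small polling loop. The code below waits the initial 90 seconds and then checks whether localClusterHealth is still SSH_PROBLEM. The `fetch_nodes` callable is supplied by the caller and is expected to return the parsed /nodes response. The function name, the retry count, and the retry interval are hypothetical choices, not values from the product documentation.

```python
import time


def wait_until_ssh_problem_clears(fetch_nodes, initial_wait=90.0, retries=3,
                                  interval=30.0, sleep=time.sleep):
    """Return the first localClusterHealth value that is not SSH_PROBLEM.

    fetch_nodes: zero-argument callable returning the parsed /nodes response.
    Returns "SSH_PROBLEM" if the state has not changed after all retries.
    """
    sleep(initial_wait)  # the procedure asks for roughly 90 seconds
    for attempt in range(retries):
        health = fetch_nodes().get("localClusterHealth")
        if health != "SSH_PROBLEM":
            return health
        if attempt < retries - 1:
            sleep(interval)  # hypothetical back-off between re-checks
    return "SSH_PROBLEM"
```

Because the `sleep` function is a parameter, you can exercise the loop without waiting by passing a stub. If the function still returns SSH_PROBLEM, go back to the connectivity checks earlier in this procedure.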