The Cluster Health Indicates an SSH Problem

In a VMware Cloud Director appliance deployment with database HA configuration, the postgres user cannot connect to its peer database nodes over SSH.

Problem

When there is an SSH problem between the database nodes, VMware Cloud Director shows the localClusterHealth as SSH_PROBLEM. You must fix this critical problem as soon as possible.

You can view the localClusterHealth by using the VMware Cloud Director appliance management user interface or by running the /nodes VMware Cloud Director appliance API. See the VMware Cloud Director Appliance API documentation.

When you run the /nodes API on a peer of the node with the SSH problem, the response shows that the localClusterHealth is SSH_PROBLEM and the localClusterFailover is INDETERMINATE. The failover mode is INDETERMINATE because the node on which you run the /nodes API cannot connect to one of its peer nodes over SSH. In the response body, the "details" field in the "failover" section for the node with the SSH problem displays: ssh failed. command: ssh unreachable_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf.

For example, if a standby node has an SSH problem and you run GET https://primary_host_IP:5480/api/1.0.0/nodes, the /nodes API might return the following information.

{
    "localClusterFailover": "INDETERMINATE",
    "localClusterHealth": "SSH_PROBLEM",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "* running",
            "upstream": ""
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "primary_host_name"
        },
        {
            "connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "ssh failed. command: ssh unreachable_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = not running",
                    "status": "NOT RUNNING"
                }
            },
            "id": unreachable_standby_node_ID,
            "location": "default",
            "name": "unreachable_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "primary_host_name"
        }
    ],
    "warnings": []
}
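As a minimal sketch of how to read a response like the one above, the following Python snippet lists the nodes whose "failover" details report a failed SSH check. The function name nodes_with_ssh_failure, the host names, and the IP address are hypothetical, and the sample is a trimmed response with the same shape as the example, not real appliance output.

```python
import json


def nodes_with_ssh_failure(response):
    """Return the names of nodes whose failover details start with 'ssh failed'."""
    failed = []
    for node in response.get("localClusterState", []):
        details = node.get("failover", {}).get("details", "")
        if details.startswith("ssh failed"):
            failed.append(node.get("name"))
    return failed


# Trimmed, hypothetical response; names and the IP address are placeholders.
sample = json.loads("""
{
  "localClusterFailover": "INDETERMINATE",
  "localClusterHealth": "SSH_PROBLEM",
  "localClusterState": [
    {"name": "primary-01",
     "failover": {"details": "failover = manual", "mode": "MANUAL"}},
    {"name": "standby-01",
     "failover": {"details": "failover = manual", "mode": "MANUAL"}},
    {"name": "standby-02",
     "failover": {"details": "ssh failed. command: ssh 10.0.0.13 /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                  "mode": "UNKNOWN"}}
  ],
  "warnings": []
}
""")

print(nodes_with_ssh_failure(sample))  # ['standby-02']
```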

If you run GET https://unreachable_standby_host_IP:5480/api/1.0.0/nodes, because the node is untrusted, the localClusterFailover and localClusterState information might not be correct. The /nodes API returns warning messages that the unreachable_standby_host_name is unable to connect to its peer nodes.

For example, the /nodes API might return the following information.

{
    "localClusterFailover": "MANUAL",
    "localClusterHealth": "SSH_PROBLEM",
    "localClusterState": [
        {
            "connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "ssh failed. command: ssh primary_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node primary_node_ID (primary_host_name): repmgrd = n/a",
                    "status": "UNKNOWN"
                }
            },
            "id": primary_node_ID,
            "location": "default",
            "name": "primary_host_name",
            "nodeHealth": "UNHEALTHY",
            "nodeRole": "PRIMARY",
            "role": "primary",
            "status": "? running",
            "upstream": ""
        },
        {
            "connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "ssh failed. command: ssh running_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
                "mode": "UNKNOWN",
                "repmgrd": {
                    "details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = n/a",
                    "status": "UNKNOWN"
                }
            },
            "id": running_standby_node_ID,
            "location": "default",
            "name": "running_standby_host_name",
            "nodeHealth": "UNHEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "? running",
            "upstream": "primary_host_name"
        },
        {
            "connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
            "failover": {
                "details": "failover = manual",
                "mode": "MANUAL",
                "repmgrd": {
                    "details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = not applicable",
                    "status": "NOT APPLICABLE"
                }
            },
            "id": unreachable_standby_node_ID,
            "location": "default",
            "name": "unreachable_standby_host_name",
            "nodeHealth": "HEALTHY",
            "nodeRole": "STANDBY",
            "role": "standby",
            "status": "running",
            "upstream": "? primary_host_name"
        }
    ],
    "warnings": [
        "unable to connect to node \"primary_host_name\" (ID: primary_node_ID)",
        "unable to connect to node \"running_standby_host_name\" (ID: running_standby_node_ID)",
        "unable to connect to node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID)'s upstream node \"primary_host_name\" (ID: primary_node_ID)",
        "unable to determine if node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID) is attached to its upstream node \"primary_host_name\" (ID: primary_node_ID)"
    ]
}

Cause

VMware Cloud Director stores the SSH certificates of the postgres user on the NFS shared transfer server storage. All database nodes must have access to the shared transfer server storage. If a database node becomes untrusted, that is, the SSH certificates of the postgres user are no longer valid or no longer accessible, that node cannot run commands on its peer nodes by using an SSH client. The VMware Cloud Director appliance requires this capability to operate properly in HA mode.

Solution

Determine whether there is a connectivity problem between the nodes and correct the problem. See Check the Connectivity Status of a Database High Availability Cluster.

Verify that the appliance-sync.timer service is running on the nodes that have the SSH problem by running the following command.

systemctl status appliance-sync.timer

For example, the command might return:

* appliance-sync.timer - Periodic check and sync of needed files for Cloud Appliance functionality
   Loaded: loaded (/lib/systemd/system/appliance-sync.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Sat 2020-09-05 23:22:49 UTC; 1 months 9 days ago
 
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
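If you check several nodes from a script, you can read the state from the Active: line of the captured output. The following Python sketch is one possible way to do that. The function name timer_active_state is hypothetical, and the sample text is a shortened copy of the output above.

```python
import re


def timer_active_state(status_output):
    """Return the first word after 'Active:' (for example 'active'), or None."""
    match = re.search(r"^\s*Active:\s*(\S+)", status_output, re.MULTILINE)
    return match.group(1) if match else None


# Shortened copy of the example output shown above.
sample_status = """\
* appliance-sync.timer - Periodic check and sync of needed files for Cloud Appliance functionality
   Loaded: loaded (/lib/systemd/system/appliance-sync.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Sat 2020-09-05 23:22:49 UTC; 1 months 9 days ago
"""

print(timer_active_state(sample_status))  # active
```

On the appliance itself, systemctl is-active appliance-sync.timer prints the same state directly and avoids parsing.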

If the status of the appliance-sync.timer service is not active, start the service by running the following command.
```
systemctl start appliance-sync.timer
```
Wait for approximately 90 seconds and verify that the cluster health is HEALTHY by using the VMware Cloud Director appliance management UI or by calling the /nodes API.
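The final verification can also be scripted as a polling loop. The following Python sketch assumes a caller-supplied function, here called fetch_nodes, that performs the GET request against the /nodes API and returns the parsed JSON body. The function names, the timeout, and the polling interval are assumptions chosen for illustration. Authentication and TLS handling are left to that function.

```python
import time


def wait_for_healthy(fetch_nodes, timeout=180.0, interval=15.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until localClusterHealth is HEALTHY or the timeout expires.

    fetch_nodes is a hypothetical callable that returns the parsed
    /nodes response as a dict. Returns True on HEALTHY, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        response = fetch_nodes()
        if response.get("localClusterHealth") == "HEALTHY":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stub that stands in for real API calls: two unhealthy reads, then healthy.
_responses = iter([
    {"localClusterHealth": "SSH_PROBLEM"},
    {"localClusterHealth": "SSH_PROBLEM"},
    {"localClusterHealth": "HEALTHY"},
])

# The no-op sleep keeps the demonstration from waiting between polls.
print(wait_for_healthy(lambda: next(_responses), sleep=lambda seconds: None))  # True
```

The default timeout of 180 seconds is twice the wait time suggested above. Adjust it to suit your environment.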