在具有数据库 HA 配置的 VMware Cloud Director 设备部署中,postgres 用户无法通过 SSH 连接到其对等数据库节点。
问题
当数据库节点之间存在 SSH 问题时,VMware Cloud Director 会将 localClusterHealth 显示为 SSH_PROBLEM。必须尽快修复此严重问题。
可以使用 VMware Cloud Director 设备管理用户界面查看 localClusterHealth,也可以运行 /nodes VMware Cloud Director 设备 API 进行查看。请参见 VMware Cloud Director 设备 API 文档。
在存在 SSH 问题的节点的对等节点上运行 /nodes API 时,/nodes API 会返回以下信息:localClusterHealth 为 SSH_PROBLEM,localClusterFailover 为 INDETERMINATE。故障切换模式为 INDETERMINATE 是因为,运行 /nodes API 的节点无法通过 SSH 连接到其对等节点之一。存在 SSH 问题的节点响应正文的 "failover" 输出部分中的 "details" 将显示:ssh failed.command: ssh unreachable_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf.
GET https://primary_host_IP:5480/api/1.0.0/nodes 时,
/nodes API 可能会返回以下信息。
{
"localClusterFailover": "INDETERMINATE",
"localClusterHealth": "SSH_PROBLEM",
"localClusterState": [
{
"connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
"failover": {
"details": "failover = manual",
"mode": "MANUAL",
"repmgrd": {
"details": "On node primary_node_ID (primary_host_name): repmgrd = not applicable",
"status": "NOT APPLICABLE"
}
},
"id": primary_node_ID,
"location": "default",
"name": "primary_host_name",
"nodeHealth": "HEALTHY",
"nodeRole": "PRIMARY",
"role": "primary",
"status": "* running",
"upstream": ""
},
{
"connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
"failover": {
"details": "failover = manual",
"mode": "MANUAL",
"repmgrd": {
"details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = not applicable",
"status": "NOT APPLICABLE"
}
},
"id": running_standby_node_ID,
"location": "default",
"name": "running_standby_host_name",
"nodeHealth": "HEALTHY",
"nodeRole": "STANDBY",
"role": "standby",
"status": "running",
"upstream": "primary_host_name"
},
{
"connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
"failover": {
"details": "ssh failed. command: ssh unreachable_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
"mode": "UNKNOWN",
"repmgrd": {
"details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = not running",
"status": "NOT RUNNING"
}
},
"id": unreachable_standby_node_ID,
"location": "default",
"name": "unreachable_standby_host_name",
"nodeHealth": "HEALTHY",
"nodeRole": "STANDBY",
"role": "standby",
"status": "running",
"upstream": "primary_host_name"
}
],
"warnings": []
}
如果运行 GET https://unreachable_standby_host_IP:5480/api/1.0.0/nodes,由于节点不受信任,localClusterFailover 和 localClusterState 信息可能会不正确。/nodes API 会返回警告消息,指出 unreachable_standby_host_name 无法连接到其对等节点。
/nodes API 可能会返回以下信息。
{
"localClusterFailover": "MANUAL",
"localClusterHealth": "SSH_PROBLEM",
"localClusterState": [
{
"connectionString": "host=primary_host_IP user=repmgr dbname=repmgr connect_timeout=2",
"failover": {
"details": "ssh failed. command: ssh primary_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
"mode": "UNKNOWN",
"repmgrd": {
"details": "On node primary_node_ID (primary_host_name): repmgrd = n/a",
"status": "UNKNOWN"
}
},
"id": primary_node_ID,
"location": "default",
"name": "primary_host_name",
"nodeHealth": "UNHEALTHY",
"nodeRole": "PRIMARY",
"role": "primary",
"status": "? running",
"upstream": ""
},
{
"connectionString": "host=running_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
"failover": {
"details": "ssh failed. command: ssh running_standby_host_IP /usr/bin/grep failover=manual /opt/vmware/vpostgres/10/etc/repmgr.conf",
"mode": "UNKNOWN",
"repmgrd": {
"details": "On node running_standby_node_ID (running_standby_host_name): repmgrd = n/a",
"status": "UNKNOWN"
}
},
"id": running_standby_node_ID,
"location": "default",
"name": "running_standby_host_name",
"nodeHealth": "UNHEALTHY",
"nodeRole": "STANDBY",
"role": "standby",
"status": "? running",
"upstream": "primary_host_name"
},
{
"connectionString": "host=unreachable_standby_host_IP user=repmgr dbname=repmgr connect_timeout=2",
"failover": {
"details": "failover = manual",
"mode": "MANUAL",
"repmgrd": {
"details": "On node unreachable_standby_node_ID (unreachable_standby_host_name): repmgrd = not applicable",
"status": "NOT APPLICABLE"
}
},
"id": unreachable_standby_node_ID,
"location": "default",
"name": "unreachable_standby_host_name",
"nodeHealth": "HEALTHY",
"nodeRole": "STANDBY",
"role": "standby",
"status": "running",
"upstream": "? primary_host_name"
}
],
"warnings": [
"unable to connect to node \"primary_host_name\" (ID: primary_node_ID)",
"unable to connect to node \"running_standby_host_name\" (ID: running_standby_node_ID)",
"unable to connect to node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID)'s upstream node \"primary_host_name\" (ID: primary_node_ID)",
"unable to determine if node \"unreachable_standby_host_name\" (ID: unreachable_standby_node_ID) is attached to its upstream node \"primary_host_name\" (ID: primary_node_ID)"
]
}
原因
VMware Cloud Director 将 postgres 用户的 SSH 证书存储在 NFS 共享传输服务器存储上。所有数据库节点都必须能够访问共享传输服务器存储。如果数据库节点变得不受信任(即 postgres 用户的 SSH 证书不再有效或无法访问),则该节点将无法通过使用 SSH 客户端在其对等节点上运行命令。VMware Cloud Director 设备必须具有此功能才能在 HA 模式下正常执行。