What Is a Healthy Cluster

Learn how Tanzu Mission Control determines the health of attached and provisioned clusters.

The cluster detail page for each cluster in the Tanzu Mission Control console shows the current overall health of the cluster at the top of the page. This health status is also displayed in the list of all clusters on the Clusters page. Further down the cluster detail page, the overall health is broken out into more detailed aspects. Tanzu Mission Control continuously monitors each cluster and updates the console with changes.

The cluster agent extensions that are deployed on your cluster (both provisioned and attached) send two kinds of events to Tanzu Mission Control: change events from nodes and ports as they occur, and a regularly occurring component status event for each component. These events are regarded collectively as the heartbeat, which Tanzu Mission Control uses to determine the health of the cluster.

Cluster Health

The overall health of the cluster is an aggregation of the health of the components and nodes in the cluster. The health status of the cluster can be one of the following values.

HEALTHY
A cluster is healthy when all nodes and components are healthy, and a heartbeat for the cluster is received every minute.
UNHEALTHY
A cluster is unhealthy if either of the following is reported as unhealthy:
- one or more of the cluster's control plane nodes
- one or more of the cluster's components
WARNING
A cluster can have a warning status if any of its worker nodes are in an unhealthy or unknown state.
A cluster can also have a warning status if any nodes (worker or control plane) are in a warning state.
UNKNOWN
The health status of a cluster is unknown if either of the following is reported as unknown:
- one or more of the cluster's control plane nodes
- one or more of the cluster's components
DISCONNECTED
A cluster is considered disconnected if no heartbeat is received from the cluster for more than three minutes.
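
The cluster status rules above can be sketched as a small function. This is only an illustration, not the Tanzu Mission Control implementation: the function name, its parameters, and especially the order in which the rules are checked (DISCONNECTED first, then UNHEALTHY, UNKNOWN, WARNING) are assumptions, because this page does not say which status wins when several rules match.

```python
# Illustrative sketch only; names and rule precedence are assumptions.
DISCONNECT_AFTER_SECONDS = 180  # no heartbeat for more than three minutes


def cluster_health(control_plane, workers, components, seconds_since_heartbeat):
    """Aggregate node and component statuses into one cluster status.

    control_plane, workers, and components are lists of status strings
    such as "HEALTHY", "UNHEALTHY", "WARNING", or "UNKNOWN".
    """
    if seconds_since_heartbeat > DISCONNECT_AFTER_SECONDS:
        return "DISCONNECTED"

    # Control plane nodes and components decide UNHEALTHY and UNKNOWN.
    critical = list(control_plane) + list(components)
    if "UNHEALTHY" in critical:
        return "UNHEALTHY"
    if "UNKNOWN" in critical:
        return "UNKNOWN"

    # Worker node problems only downgrade the cluster to WARNING.
    if any(status in ("UNHEALTHY", "UNKNOWN") for status in workers):
        return "WARNING"
    # A warning on any node (worker or control plane) is also a WARNING.
    if "WARNING" in list(control_plane) + list(workers):
        return "WARNING"

    return "HEALTHY"
```

For example, under these assumptions a cluster with a healthy control plane, healthy components, and one unhealthy worker node evaluates to WARNING rather than UNHEALTHY.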

Node Health

The title of the Worker nodes section shows you how many worker nodes you have in the cluster, and below that the number of worker nodes that are healthy. To see all the nodes (including the control plane), click the Nodes tab, which shows the health of each individual node in the Status column.

The information received in the change event for a node consists of the NodeReady condition, and a number of other conditions like MemoryPressure, DiskPressure, and OutOfDisk. Tanzu Mission Control uses the value of these conditions to determine the health of the node. The status of each condition is assumed to be unchanged until the node reports a change. Based on the reported conditions, the health status of a node can be one of the following:

HEALTHY
If NodeReady is True, and all other conditions are healthy, then the node is healthy.
UNHEALTHY
The node is unhealthy if NodeReady is False. The node is also unhealthy if NodeReady is True and more than half of the other conditions are in an unhealthy state.
WARNING
The warning status indicates that NodeReady is True, but some (fewer than half) of the other conditions are in an unhealthy state.
UNKNOWN
If NodeReady has any value other than True or False, the health status of the node is unknown. The node can also have an unknown status if no heartbeat has been received from the cluster for more than three minutes.
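
As a rough sketch of the node rules above, the following function derives a node status from its reported conditions. Several details are assumptions rather than documented behavior: the function and parameter names are hypothetical, a pressure-style condition is counted as unhealthy only when its status is "True", the heartbeat timeout is checked before the conditions, and the case where exactly half of the other conditions are unhealthy (which the rules above do not cover) is treated as WARNING.

```python
# Illustrative sketch only; names, precedence, and tie handling are assumptions.
HEARTBEAT_TIMEOUT_SECONDS = 180  # more than three minutes without a heartbeat


def node_health(node_ready, other_conditions, seconds_since_heartbeat=0):
    """Derive a node status from NodeReady and the other conditions.

    node_ready is the reported status string ("True", "False", "Unknown", ...).
    other_conditions maps condition names such as "MemoryPressure" to their
    status strings; "True" is taken to mean the condition is unhealthy.
    """
    if seconds_since_heartbeat > HEARTBEAT_TIMEOUT_SECONDS:
        return "UNKNOWN"
    if node_ready == "False":
        return "UNHEALTHY"
    if node_ready != "True":
        return "UNKNOWN"

    total = len(other_conditions)
    unhealthy = sum(1 for status in other_conditions.values() if status == "True")

    if unhealthy == 0:
        return "HEALTHY"
    if unhealthy * 2 > total:  # more than half of the other conditions
        return "UNHEALTHY"
    return "WARNING"  # fewer than half, or exactly half (assumed)


# A ready node with one of three other conditions unhealthy.
status = node_health(
    "True",
    {"MemoryPressure": "True", "DiskPressure": "False", "OutOfDisk": "False"},
)
print(status)  # WARNING
```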

Component Health

Tanzu Mission Control monitors the health of the following components running in the cluster:

kube-apiserver
scheduler
controller-manager
one or more etcd components (etcd-0, etcd-1, etcd-2, and so on)

The component status event reports the Healthy condition for each of these components every 45 seconds. The health status of each component can be one of the following:

HEALTHY
If the last reported value of the Healthy condition of the component is True, then the component is healthy.
UNHEALTHY
If the last reported value of the Healthy condition of the component is False, then the component is unhealthy.
UNKNOWN
If the last reported value of the Healthy condition of the component is Unknown, or it is something other than True or False, then the health status of the component is unknown. The component can also be in this state if no heartbeat has been received from the cluster for more than three minutes.
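
The component rules above reduce to a short mapping from the last reported value of the Healthy condition to a status. The sketch below is illustrative only: the function name and parameters are hypothetical, and checking the heartbeat timeout before the reported value is an assumption.

```python
# Illustrative sketch only; names and check order are assumptions.
HEARTBEAT_TIMEOUT_SECONDS = 180  # more than three minutes without a heartbeat


def component_health(last_healthy_value, seconds_since_heartbeat=0):
    """Map the last reported Healthy condition value to a component status."""
    if seconds_since_heartbeat > HEARTBEAT_TIMEOUT_SECONDS:
        return "UNKNOWN"
    if last_healthy_value == "True":
        return "HEALTHY"
    if last_healthy_value == "False":
        return "UNHEALTHY"
    # "Unknown", a missing value, or anything else that is not True or False.
    return "UNKNOWN"


# Hypothetical last reported values for the monitored components.
reported = {
    "kube-apiserver": "True",
    "scheduler": "True",
    "controller-manager": "False",
    "etcd-0": "Unknown",
}
statuses = {name: component_health(value) for name, value in reported.items()}
print(statuses)
```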