When you configure machine health checks on your management cluster, Cluster API detects unhealthy machines in the specified workload cluster and remediates them. You must configure health checks separately for each workload cluster.
  • If the feature is not configured for a cluster and a node running StatefulSet pods becomes unresponsive, those pods remain stuck in the Terminating phase and are not rescheduled to a different node.
  • If the feature is configured for the cluster, Cluster API ignores the non-terminated pods and re-creates the failed machine. The StatefulSet pods then start on the new machine.
The following code is an example of a MachineHealthCheck API object:
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: capi-quickstart-node-unhealthy-5m
spec:
  # clusterName is required to associate this MachineHealthCheck with a particular cluster
  clusterName: capi-quickstart
  # (Optional) maxUnhealthy prevents further remediation if the cluster is already partially unhealthy.
  # With 40%, remediation proceeds only while at most 40% of the selected Machines are unhealthy.
  maxUnhealthy: 40%
  # (Optional) nodeStartupTimeout determines how long a MachineHealthCheck should wait for
  # a Node to join the cluster, before considering a Machine unhealthy.
  # Defaults to 10 minutes if not specified.
  # Set to 0 to disable the node startup timeout.
  # Disabling this timeout will prevent a Machine from being considered unhealthy when
  # the Node it created has not yet registered with the cluster. This can be useful when
  # Nodes take a long time to start up or when you only want condition based checks for
  # Machine health.
  nodeStartupTimeout: 10m
  # selector is used to determine which Machines should be health checked
  selector:
    matchLabels:
      nodepool: nodepool-0
  # Conditions to check on Nodes for matched Machines. If any condition is matched for the
  # duration of its timeout, the Machine is considered unhealthy.
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
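
Because the configuration must be repeated for each workload cluster, a second cluster needs its own MachineHealthCheck object. The following is a minimal sketch, assuming a hypothetical second workload cluster named capi-quickstart-2 whose machines carry the label nodepool: nodepool-0. Both the cluster name and the object name are illustrative. The optional fields are omitted, so nodeStartupTimeout falls back to its 10-minute default.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  # Hypothetical name; choose one that identifies the target cluster
  name: capi-quickstart-2-node-unhealthy-5m
spec:
  # Hypothetical second workload cluster (assumption, not part of the example above)
  clusterName: capi-quickstart-2
  # Assumes the machines in this cluster are labeled nodepool: nodepool-0
  selector:
    matchLabels:
      nodepool: nodepool-0
  # Same condition checks as the first example: Ready is Unknown or "False" for 300s
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
```

Only clusterName and the object name differ from the first example, which keeps the health policy the same across clusters while the objects stay independent.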

For more information about Machine Health Check, see Healthchecking in The Cluster API Book.