Machine Health Check

Machine Health Check is a controller that provides node health monitoring and node auto-repair for Tanzu Kubernetes clusters.

You can enable Machine Health Check and define the unhealthy conditions that the controller monitors when you create a Workload cluster template. You can also edit the Machine Health Check conditions on an existing node pool of a Workload cluster. Machine Health Check monitors the node pools for unhealthy nodes and tries to remediate them by recreating them. For example, you can set the maximum duration that a node can remain in the not ready state to 15 minutes, after which the Machine Health Check controller triggers remediation. For more details on Machine Health Check, see https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html.
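
The 15-minute example can be pictured as a simple rule check. The following Python sketch is only an illustration of that logic, not the controller's actual implementation: the names UnhealthyCondition and needs_remediation are hypothetical, and the comparison (a condition held for at least its timeout) is an assumption about how the timeout is evaluated.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class UnhealthyCondition:
    """Hypothetical rule: a node condition type, the status treated as unhealthy, and a timeout."""
    type: str
    status: str
    timeout: timedelta


def needs_remediation(node_conditions, rules):
    """Return True if any observed condition has matched an unhealthy rule for at least its timeout.

    node_conditions maps a condition type to a (status, time spent in that status) tuple.
    """
    for rule in rules:
        observed = node_conditions.get(rule.type)
        if observed is None:
            continue  # the node does not report this condition type
        status, duration = observed
        if status == rule.status and duration >= rule.timeout:
            return True
    return False


# Assumed rules mirroring the example: not ready (or unknown) for 15 minutes.
rules = [
    UnhealthyCondition("Ready", "False", timedelta(minutes=15)),
    UnhealthyCondition("Ready", "Unknown", timedelta(minutes=15)),
]

# A node that has been not ready for 10 minutes is left alone.
print(needs_remediation({"Ready": ("False", timedelta(minutes=10))}, rules))
# A node that has been not ready for 16 minutes qualifies for remediation.
print(needs_remediation({"Ready": ("False", timedelta(minutes=16))}, rules))
```

In this sketch, a node that recovers before the timeout elapses never triggers remediation, which is why the timeout is best chosen to be longer than routine transient outages.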

By default, the Machine Health Check controller is disabled.

For steps to enable and configure Machine Health Check when creating a Workload cluster template, see Create a Workload Cluster Template.

For steps to configure Machine Health Check on an existing node pool, see Edit a Kubernetes Cluster Node Pool.

There may be instances when you want to power down a virtual machine to perform maintenance activities. To prevent Machine Health Check from remediating nodes during the downtime, you can place the node pool in Maintenance Mode. For steps to place worker nodes in Maintenance Mode, see Place Nodes in Maintenance Mode.