Manage Machine Health Checks for Workload Clusters

This topic explains how to use the Tanzu Command Line Interface (CLI) to create, update, retrieve, and delete MachineHealthCheck objects for workload clusters created by Tanzu Kubernetes Grid.

For more information, see tanzu cluster machinehealthcheck in the Tanzu CLI Command Reference.

Note

To support machine health checks for both control plane and worker nodes, Tanzu CLI v1.6 and later replace the tanzu cluster machinehealthcheck set/get/delete commands with the tanzu cluster machinehealthcheck control-plane set/get/delete and tanzu cluster machinehealthcheck node set/get/delete commands. The tanzu cluster machinehealthcheck set/get/delete commands are deprecated and will be removed in a future release.

About MachineHealthCheck

MachineHealthCheck is a controller that provides health monitoring and auto-repair for machines. It is automatically enabled in all management and workload clusters, for both control plane and worker nodes. If the controller is enabled when you deploy a cluster, Tanzu Kubernetes Grid creates two default MachineHealthCheck objects in the cluster, one for the control plane nodes and one for the worker nodes. These objects are created in the same namespace as the cluster.

If you deactivate the controller, you can re-enable it by using the commands documented in Create or Update a MachineHealthCheck Object below. You can also use these commands to update existing MachineHealthCheck objects.

Create or Update a MachineHealthCheck Object

To create a default MachineHealthCheck object:

  • For the control plane of a cluster, run:

    tanzu cluster machinehealthcheck control-plane set CLUSTER-NAME --mhc-name MHC-NAME
    
  • For the worker nodes of a cluster, run:

    tanzu cluster machinehealthcheck node set CLUSTER-NAME --mhc-name MHC-NAME
    

Where:

  • CLUSTER-NAME is the name of the target cluster.
  • MHC-NAME is a name you choose for the MachineHealthCheck object. If you omit --mhc-name, the name defaults to CLUSTER-NAME. If you run both of these commands for the same cluster, you must specify --mhc-name so that the two objects receive distinct names.
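As an illustration of giving the two objects distinct names, the following minimal sketch prints (but does not run) both set commands for one cluster. The function name mhc_set_commands and the -control-plane-mhc and -worker-mhc suffixes are assumptions made for this example, not a convention defined by Tanzu Kubernetes Grid; only the commands and the --mhc-name flag come from this topic.

```shell
#!/bin/sh
# Hypothetical helper: print the two "set" commands for one cluster,
# deriving a distinct --mhc-name for each from the cluster name.
# The name suffixes are illustrative assumptions, not a TKG convention.
mhc_set_commands() {
  cluster="$1"
  printf 'tanzu cluster machinehealthcheck control-plane set %s --mhc-name %s-control-plane-mhc\n' \
    "$cluster" "$cluster"
  printf 'tanzu cluster machinehealthcheck node set %s --mhc-name %s-worker-mhc\n' \
    "$cluster" "$cluster"
}

# Dry run: review the printed commands before running them yourself.
mhc_set_commands my-cluster
```

Printing the commands rather than executing them lets you check the chosen names first; you can then paste or pipe the lines into a shell once they look right.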

You can also use the above commands to create customized MachineHealthCheck objects or to update existing MachineHealthCheck objects. To customize or update a MachineHealthCheck object, you can specify one or more of the following flags:

  • --match-labels: This option filters machines by label keys and values. You can specify one or more label constraints. The MachineHealthCheck object is applied to all machines that satisfy the specified constraints. Format the key-value pairs as follows:

    tanzu cluster machinehealthcheck control-plane set CLUSTER-NAME --mhc-name MHC-NAME --match-labels "key1:value1,key2:value2"
    tanzu cluster machinehealthcheck node set CLUSTER-NAME --mhc-name MHC-NAME --match-labels "key1:value1,key2:value2"
    
  • --node-startup-timeout: This option controls the amount of time that the MachineHealthCheck controller waits for a machine to join the cluster before considering the machine unhealthy. For example, the commands below set the --node-startup-timeout option to 21m:

    tanzu cluster machinehealthcheck control-plane set my-cluster --mhc-name my-control-plane-mhc --node-startup-timeout 21m
    tanzu cluster machinehealthcheck node set my-cluster --mhc-name my-worker-mhc --node-startup-timeout 21m
    

    If a machine fails to join the cluster within the specified amount of time, the MachineHealthCheck controller recreates the machine.

  • --unhealthy-conditions: This option specifies the node conditions that the MachineHealthCheck controller uses to monitor the health of your control plane and worker nodes. You can set the Ready, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable conditions. For the status of a condition, use True, False, or Unknown. For example:

    tanzu cluster machinehealthcheck control-plane set my-cluster --mhc-name my-control-plane-mhc --unhealthy-conditions "Ready:False:5m,Ready:Unknown:5m"
    tanzu cluster machinehealthcheck node set my-cluster --mhc-name my-worker-mhc --unhealthy-conditions "Ready:False:5m,Ready:Unknown:5m"
    

    The example above sets two entries for the Ready condition, False:5m and Unknown:5m. If the Ready condition of a machine remains False or Unknown for longer than 5 minutes, the MachineHealthCheck controller considers the machine unhealthy and recreates it.
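The --unhealthy-conditions value in the examples above appears to be a comma-separated list of TYPE:STATUS:TIMEOUT entries. The sketch below assembles such a value and rejects condition types or statuses other than those listed in this topic. The function name build_unhealthy_conditions is hypothetical, the entry format is inferred from the examples rather than from a formal specification, and the timeout itself is not validated.

```shell
#!/bin/sh
# Hypothetical helper: join TYPE:STATUS:TIMEOUT entries with commas after
# checking TYPE and STATUS against the values named in this topic.
build_unhealthy_conditions() {
  result=""
  for entry in "$@"; do
    cond_type=${entry%%:*}     # text before the first colon
    rest=${entry#*:}           # text after the first colon
    cond_status=${rest%%:*}    # text between the first and second colon
    case "$cond_type" in
      Ready|MemoryPressure|DiskPressure|PIDPressure|NetworkUnavailable) ;;
      *) echo "unknown condition type in: $entry" >&2; return 1 ;;
    esac
    case "$cond_status" in
      True|False|Unknown) ;;
      *) echo "invalid status in: $entry" >&2; return 1 ;;
    esac
    case "$rest" in
      *:?*) ;;                 # a non-empty timeout follows the status
      *) echo "missing timeout in: $entry" >&2; return 1 ;;
    esac
    result="${result:+$result,}$entry"
  done
  printf '%s\n' "$result"
}

# Build the value used in the examples above and print it.
build_unhealthy_conditions Ready:False:5m Ready:Unknown:5m
```

You could pass the printed value, in quotes, to --unhealthy-conditions on either the control-plane set or node set command.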

Retrieve a MachineHealthCheck Object

To retrieve a MachineHealthCheck object:

  • For the control plane of the target cluster, run:

    tanzu cluster machinehealthcheck control-plane get CLUSTER-NAME --mhc-name MHC-NAME
    

    You can omit the --mhc-name flag if the object was created with the default name.

  • For the worker nodes of the target cluster, run:

    tanzu cluster machinehealthcheck node get CLUSTER-NAME --mhc-name MHC-NAME
    

    You can omit the --mhc-name flag if the object was created with the default name.

Delete a MachineHealthCheck Object

To delete a MachineHealthCheck object:

  • For the control plane of the target cluster, run:

    tanzu cluster machinehealthcheck control-plane delete CLUSTER-NAME --mhc-name MHC-NAME
    

    You can omit the --mhc-name flag if the object was created with the default name.

  • For the worker nodes of the target cluster, run:

    tanzu cluster machinehealthcheck node delete CLUSTER-NAME --mhc-name MHC-NAME
    

    You can omit the --mhc-name flag if the object was created with the default name.
