Manage Machine Health Checks for Workload Clusters

This topic explains how to use the Tanzu Command Line Interface (CLI) to create, update, retrieve, and delete MachineHealthCheck objects for control plane and worker nodes.

For more information, see tanzu cluster machinehealthcheck in the Tanzu CLI Command Reference.

About MachineHealthCheck

MachineHealthCheck is a controller that provides health monitoring and auto-repair for machines. It is automatically enabled in all management and workload clusters, for both control plane and worker nodes. If the controller is enabled when you deploy a class-based cluster with a single machine deployment or a legacy cluster, Tanzu Kubernetes Grid creates two default MachineHealthCheck objects in the cluster, one for the control plane nodes and one for the worker nodes. For class-based clusters with multiple machine deployments, Tanzu Kubernetes Grid creates one MachineHealthCheck object for the control plane and one for each machine deployment. These objects are created in the same namespace as the cluster.

If you deactivate the controller, you can re-enable it by using the commands documented in Create or Update a MachineHealthCheck Object below. You can also use these commands to update existing MachineHealthCheck objects.

Create or Update a MachineHealthCheck Object

Follow the steps below to create or update MachineHealthCheck objects for your clusters.

  • Class-based clusters:

    • To create the default MachineHealthCheck object for the control plane of a class-based cluster:

      tanzu cluster machinehealthcheck control-plane set CLUSTER-NAME
      
    • To create the default MachineHealthCheck object for the worker nodes of a class-based cluster:

      • If the cluster has a single machine deployment, run:

        tanzu cluster machinehealthcheck node set CLUSTER-NAME
        
      • If the cluster has multiple machine deployments, run the following command once for each machine deployment. Each run creates the default MachineHealthCheck object for that machine deployment.

        tanzu cluster machinehealthcheck node set CLUSTER-NAME --machine-deployment MACHINE-DEPLOYMENT-NAME
        

    Where:

    • CLUSTER-NAME is the name of the target cluster.
    • MACHINE-DEPLOYMENT-NAME is the name of the machine deployment, for example, md-0. To retrieve the machine deployment name, run kubectl get cluster CLUSTER-NAME -o yaml and then locate the name field of each entry under spec.topology.workers.machineDeployments in the output.
  • Legacy clusters:

    • To create the default MachineHealthCheck object for the control plane of a legacy cluster:

      tanzu cluster machinehealthcheck control-plane set CLUSTER-NAME --mhc-name MHC-NAME
      
    • To create the default MachineHealthCheck object for the worker nodes of a legacy cluster:

      tanzu cluster machinehealthcheck node set CLUSTER-NAME --mhc-name MHC-NAME
      

    Where:

    • CLUSTER-NAME is the name of the target cluster.
    • MHC-NAME is a name you choose for the MachineHealthCheck object. If you omit --mhc-name, the name is set to CLUSTER-NAME. If you run both of these commands for the same cluster, you must specify --mhc-name, because both objects would otherwise default to the same name. The --mhc-name flag is ignored for class-based clusters.
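
For a class-based cluster with several machine deployments, the per-deployment command can be generated in a loop. The following is a minimal sketch, not an official workflow: the helper name print_node_set_cmds, the cluster name my-cluster, and the deployment names md-0 and md-1 are illustrative assumptions. The helper only prints the commands so that you can review them before running them.

```shell
#!/bin/sh
# Print one "node set" command per machine deployment of a class-based cluster.
# Hypothetical helper: the function name and the sample names are illustrative.
# Usage: print_node_set_cmds CLUSTER-NAME MACHINE-DEPLOYMENT-NAME...
print_node_set_cmds() {
  cluster="$1"
  shift
  for md in "$@"; do
    printf 'tanzu cluster machinehealthcheck node set %s --machine-deployment %s\n' "$cluster" "$md"
  done
}

# The deployment names could also be read from the Cluster object, for example:
#   kubectl get cluster my-cluster -o jsonpath='{.spec.topology.workers.machineDeployments[*].name}'
print_node_set_cmds my-cluster md-0 md-1
```

After reviewing the printed lines, you could run them one by one or pipe the output to sh.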

You can also use the commands above to create customized MachineHealthCheck objects or to update existing MachineHealthCheck objects. To do so, specify one or more of the flags below.

Note

These examples assume that you are customizing or updating your MachineHealthCheck settings for a class-based cluster with a single machine deployment. When customizing or updating the MachineHealthCheck object for the worker nodes of a class-based cluster with multiple machine deployments, you must specify the --machine-deployment flag. For legacy clusters, specify --mhc-name as described above.

  • --match-labels: This option filters machines by label keys and values. You can specify one or more label constraints. The MachineHealthCheck object is applied to all machines that satisfy the specified constraints. Format the key-value pairs as follows:

    tanzu cluster machinehealthcheck control-plane set CLUSTER-NAME --match-labels "key1:value1,key2:value2"
    tanzu cluster machinehealthcheck node set CLUSTER-NAME --match-labels "key1:value1,key2:value2"
    
  • --max-unhealthy: If the number of unhealthy machines exceeds the value you set using this flag, the MachineHealthCheck controller does not perform remediation. You can specify either an absolute number or a percentage. The default is 100%. For example, the commands below set --max-unhealthy to 60%:

    tanzu cluster machinehealthcheck control-plane set CLUSTER-NAME --max-unhealthy "60%"
    tanzu cluster machinehealthcheck node set CLUSTER-NAME --max-unhealthy "60%"
    
  • --node-startup-timeout: This option controls the amount of time that the MachineHealthCheck controller waits for a machine to join the cluster before considering the machine unhealthy. For example, the commands below set the --node-startup-timeout option to 21m:

    tanzu cluster machinehealthcheck control-plane set my-cluster --node-startup-timeout 21m
    tanzu cluster machinehealthcheck node set my-cluster --node-startup-timeout 21m
    

    If a machine fails to join the cluster within the specified amount of time, the MachineHealthCheck controller recreates the machine.

  • --unhealthy-conditions: This option defines the node conditions that mark a machine as unhealthy. You can set the Ready, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable conditions. The MachineHealthCheck controller uses the conditions that you set to monitor the health of your control plane and worker nodes. For each condition, specify a status of True, False, or Unknown, followed by a timeout. For example:

    tanzu cluster machinehealthcheck control-plane set my-cluster --unhealthy-conditions "Ready:False:5m,Ready:Unknown:5m"
    tanzu cluster machinehealthcheck node set my-cluster --unhealthy-conditions "Ready:False:5m,Ready:Unknown:5m"
    

    The example above sets two entries for the Ready condition: False with a timeout of 5m and Unknown with a timeout of 5m. If the Ready condition of a machine remains False or Unknown for longer than 5 minutes, the MachineHealthCheck controller considers the machine unhealthy and recreates it.
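
The effect of --max-unhealthy can be worked through with simple arithmetic. The sketch below is illustrative only: remediation_allowed is a hypothetical helper, not part of the Tanzu CLI, and it assumes that a percentage is scaled against the total number of machines and rounded down, which may differ from the controller's exact behavior in your version.

```shell
#!/bin/sh
# Decide whether remediation would proceed for a given --max-unhealthy value.
# Hypothetical helper for illustration; not part of the Tanzu CLI.
# Assumption: a percentage is scaled against the machine count and rounded down.
# Usage: remediation_allowed TOTAL UNHEALTHY MAX_UNHEALTHY   (for example "2" or "60%")
remediation_allowed() {
  total="$1"
  unhealthy="$2"
  max="$3"
  case "$max" in
    *%) limit=$(( total * ${max%\%} / 100 )) ;;  # integer division rounds down
    *)  limit="$max" ;;                          # absolute number of machines
  esac
  # Remediation is skipped only when the unhealthy count exceeds the limit.
  if [ "$unhealthy" -le "$limit" ]; then
    echo "remediate"
  else
    echo "skip"
  fi
}

remediation_allowed 5 3 "60%"   # limit is 3; 3 unhealthy machines do not exceed it
remediation_allowed 5 4 "60%"   # 4 unhealthy machines exceed the limit of 3
```

Under these assumptions, a cluster of five machines with --max-unhealthy "60%" is still remediated when three machines are unhealthy, but not when four are.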

Retrieve a MachineHealthCheck Object

Follow the steps below to retrieve MachineHealthCheck objects for your clusters. The --mhc-name flag is ignored for class-based clusters.

  • To retrieve the MachineHealthCheck object for the control plane of the target cluster, run:

    tanzu cluster machinehealthcheck control-plane get CLUSTER-NAME --mhc-name MHC-NAME
    

    Omit the --mhc-name flag if the object was created with the default name or if you are targeting a class-based cluster.

  • To retrieve the MachineHealthCheck object for the worker nodes of the target cluster, run:

    tanzu cluster machinehealthcheck node get CLUSTER-NAME --mhc-name MHC-NAME
    

    Omit the --mhc-name flag if the object was created with the default name or if you are targeting a class-based cluster.
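
The rule for when to pass --mhc-name can be captured in a small wrapper. This is a minimal sketch under stated assumptions: print_mhc_get_cmd is a hypothetical helper, and the cluster and object names are placeholders. It prints the command instead of running it, and it adds --mhc-name only when a name is supplied.

```shell
#!/bin/sh
# Print the "get" command for one role, adding --mhc-name only when a name is given.
# Hypothetical helper; ROLE is "control-plane" or "node".
# Usage: print_mhc_get_cmd ROLE CLUSTER-NAME [MHC-NAME]
print_mhc_get_cmd() {
  role="$1"
  cluster="$2"
  mhc_name="${3:-}"
  cmd="tanzu cluster machinehealthcheck $role get $cluster"
  if [ -n "$mhc_name" ]; then
    cmd="$cmd --mhc-name $mhc_name"
  fi
  printf '%s\n' "$cmd"
}

# Class-based cluster, or a legacy cluster whose object has the default name:
print_mhc_get_cmd control-plane my-cluster
# Legacy cluster whose worker node object was created with a custom name:
print_mhc_get_cmd node my-legacy-cluster my-worker-mhc
```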

Delete a MachineHealthCheck Object

Follow the steps below to delete MachineHealthCheck objects for your clusters.

  • Class-based clusters:

    • To delete the MachineHealthCheck object for the control plane of a class-based cluster:

      tanzu cluster machinehealthcheck control-plane delete CLUSTER-NAME
      
    • To delete the MachineHealthCheck object or objects for the worker nodes of a class-based cluster:

      • If the cluster has a single machine deployment, run:

        tanzu cluster machinehealthcheck node delete CLUSTER-NAME
        
      • If the cluster has multiple machine deployments, run the following command for each machine deployment:

        tanzu cluster machinehealthcheck node delete CLUSTER-NAME --machine-deployment MACHINE-DEPLOYMENT-NAME
        
  • Legacy clusters:

    • To delete the MachineHealthCheck object for the control plane of a legacy cluster:

      tanzu cluster machinehealthcheck control-plane delete CLUSTER-NAME --mhc-name MHC-NAME
      

      Omit the --mhc-name flag if the object was created with the default name.

    • To delete the MachineHealthCheck object for the worker nodes of a legacy cluster:

      tanzu cluster machinehealthcheck node delete CLUSTER-NAME --mhc-name MHC-NAME
      

      Omit the --mhc-name flag if the object was created with the default name.
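
To remove every MachineHealthCheck object of a class-based cluster, the control plane command and the per-deployment worker commands can be generated together. The sketch below is one possible way to do this, not an official procedure: print_mhc_delete_cmds is a hypothetical helper, the names are placeholders, and the helper treats a call with no deployment names as a cluster with a single machine deployment. It prints the commands for review instead of running them.

```shell
#!/bin/sh
# Print the delete commands for all MachineHealthCheck objects of a class-based cluster.
# Hypothetical helper: with no deployment names, a single machine deployment is assumed.
# Usage: print_mhc_delete_cmds CLUSTER-NAME [MACHINE-DEPLOYMENT-NAME...]
print_mhc_delete_cmds() {
  cluster="$1"
  shift
  printf 'tanzu cluster machinehealthcheck control-plane delete %s\n' "$cluster"
  if [ "$#" -eq 0 ]; then
    # Single machine deployment: no --machine-deployment flag is needed.
    printf 'tanzu cluster machinehealthcheck node delete %s\n' "$cluster"
  else
    # Multiple machine deployments: one command per deployment.
    for md in "$@"; do
      printf 'tanzu cluster machinehealthcheck node delete %s --machine-deployment %s\n' "$cluster" "$md"
    done
  fi
}

print_mhc_delete_cmds my-cluster md-0 md-1
```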
