Configure MachineHealthCheck for v1beta1 Clusters

This topic describes how to configure MachineHealthCheck for TKG Service clusters provisioned using the v1beta1 API.

MachineHealthCheck for v1beta1 Clusters

MachineHealthCheck is a Kubernetes Cluster API resource that defines conditions for remediating unhealthy machines. In Kubernetes a machine is a custom resource that can run kubelet. In the vSphere IaaS control plane, a Kubernetes machine resource is backed by a vSphere virtual machine. For more information, refer to the upstream documentation.

When you provision a cluster using the TKG Service, the system creates default MachineHealthCheck objects: one for the control plane nodes and one for each machine deployment. Starting with vSphere 8 Update 3, machine health checks are configurable for v1beta1 clusters. Supported settings include the following:

maxUnhealthy
nodeStartupTimeout
unhealthyConditions
unhealthyRange

The following table describes the supported machine health check fields.

Table 1. Machine Health Checks
Field	Value	Description
`maxUnhealthy`	string. Absolute number or a percentage.	Remediation is not performed when the number of unhealthy machines exceeds this value.
`nodeStartupTimeout`	string. Duration in the form `XhXmXs` (hours, minutes, seconds).	Any machine being created that takes longer than this duration to join the cluster is considered failed and is remediated.
`unhealthyConditions`	array of unhealthyConditions types. Available condition types: [`Ready`, `MemoryPressure`, `DiskPressure`, `PIDPressure`, `NetworkUnavailable`]. Available condition statuses: [`True`, `False`, `Unknown`].	List of conditions that determine whether a node is considered unhealthy.
`unhealthyRange`	string. Range in the form `[min-max]`.	Further remediation is allowed only if the number of machines selected by "selector" as not healthy is within the `unhealthyRange`. Takes precedence over `maxUnhealthy`. For example, "[3-5]" means that remediation is allowed only when there are at least 3 and at most 5 unhealthy machines.
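
To make the interaction between `maxUnhealthy` and `unhealthyRange` concrete, the following Python sketch models the gating rule described in the table. It is an illustration only, not the controller's implementation. The function names are hypothetical, and two behaviors are assumptions: percentages are rounded down, and remediation is allowed when neither field is set.

```python
import re


def parse_max_unhealthy(value, total):
    """Convert an absolute number or a percentage string into a machine count."""
    text = str(value).strip()
    if text.endswith("%"):
        # Assumption: a percentage of the total is rounded down.
        return int(text[:-1]) * total // 100
    return int(text)


def parse_unhealthy_range(value):
    """Parse a range such as "[3-5]" into a (low, high) tuple."""
    match = re.fullmatch(r"\[(\d+)-(\d+)\]", value.strip())
    if match is None:
        raise ValueError(f"not a valid range: {value!r}")
    low, high = int(match.group(1)), int(match.group(2))
    if low > high:
        raise ValueError(f"range minimum exceeds maximum: {value!r}")
    return low, high


def remediation_allowed(total, unhealthy, max_unhealthy=None, unhealthy_range=None):
    """Return True if remediation may proceed for the given machine counts."""
    if unhealthy_range is not None:
        # unhealthyRange takes precedence over maxUnhealthy.
        low, high = parse_unhealthy_range(unhealthy_range)
        return low <= unhealthy <= high
    if max_unhealthy is not None:
        # Remediation stops once the unhealthy count exceeds the limit.
        return unhealthy <= parse_max_unhealthy(max_unhealthy, total)
    # Assumption: with neither field set, remediation is not restricted.
    return True


print(remediation_allowed(10, 4, unhealthy_range="[3-5]"))
print(remediation_allowed(10, 5, max_unhealthy="40%"))
```

With 10 machines, the first call falls inside the range `[3-5]`, while the second exceeds the limit of 4 machines that `40%` resolves to under the rounding assumption.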

Note: MachineHealthCheck objects are deployed for v1alpha3 clusters, but they are not configurable. For details, see Check TKG Cluster Machine Health Using Kubectl.

MachineHealthCheck Example

The following example configures a machineHealthCheck for the control plane and for a given machineDeployment.

...
  topology:
    class: tanzukubernetescluster
    version: v1.28.8---vmware.1-fips.1-tkg.2
    controlPlane:
      machineHealthCheck:
        enable: true
        maxUnhealthy: 100%
        nodeStartupTimeout: 4h0m0s
        unhealthyConditions:
        - status: Unknown
          timeout: 5m0s
          type: Ready
        - status: "False"
          timeout: 12m0s
          type: Ready
      ...
    workers:
      machineDeployments:
      - class: node-pool
        failureDomain: np1
        machineHealthCheck:
          enable: true
          maxUnhealthy: 100%
          nodeStartupTimeout: 4h0m0s
          unhealthyConditions:
          - status: Unknown
            timeout: 5m0s
            type: Ready
          - status: "False"
            timeout: 12m0s
            type: Ready
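
Before applying a specification like the one above, it can help to check the values against the supported types, statuses, and duration format listed in Table 1. The following Python sketch does this for a machineHealthCheck block that has already been loaded into a dictionary. The function names are hypothetical, the check is not part of any VMware tool, and it covers only the fields shown here.

```python
import re

ALLOWED_TYPES = {"Ready", "MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}
ALLOWED_STATUSES = {"True", "False", "Unknown"}
DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def duration_seconds(text):
    """Convert a duration in the XhXmXs form, such as "4h0m0s", into seconds."""
    match = DURATION.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def _check_duration(label, value, problems):
    """Append a message to problems if value is not a valid duration."""
    try:
        duration_seconds(str(value))
    except ValueError:
        problems.append(f"{label}: invalid duration {value!r}")


def check_machine_health_check(mhc):
    """Return a list of problems found in a machineHealthCheck dictionary."""
    problems = []
    if "nodeStartupTimeout" in mhc:
        _check_duration("nodeStartupTimeout", mhc["nodeStartupTimeout"], problems)
    for i, condition in enumerate(mhc.get("unhealthyConditions", [])):
        label = f"unhealthyConditions[{i}]"
        if condition.get("type") not in ALLOWED_TYPES:
            problems.append(f"{label}: unsupported type {condition.get('type')!r}")
        # Status must be a string. An unquoted False in YAML loads as a
        # boolean, which is why the example quotes "False".
        if condition.get("status") not in ALLOWED_STATUSES:
            problems.append(f"{label}: unsupported status {condition.get('status')!r}")
        _check_duration(f"{label}.timeout", condition.get("timeout", ""), problems)
    return problems


example = {
    "enable": True,
    "maxUnhealthy": "100%",
    "nodeStartupTimeout": "4h0m0s",
    "unhealthyConditions": [
        {"status": "Unknown", "timeout": "5m0s", "type": "Ready"},
        {"status": "False", "timeout": "12m0s", "type": "Ready"},
    ],
}
print(check_machine_health_check(example))
```

An empty list means that no problems were found in the fields the sketch inspects.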

Patch MachineHealthCheck Using Kubectl

To update the MachineHealthCheck for a v1beta1 cluster after it has been provisioned, use the patch method.

Caution: These instructions provide general guidance on patching an existing cluster. The values you use depend on your environment and the deployed cluster you are patching. Consider using the Tanzu CLI to patch the MachineHealthCheck for an existing cluster.

Get the machineDeployment from the cluster resource definition.
```
kubectl get cluster CLUSTER_NAME -o yaml
```
In the `spec.topology.workers.machineDeployments` section, each list entry identifies a machineDeployment. The `<index>` used in the patch paths that follow is the zero-based position of the entry in this list.
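
Counting list entries by eye is error-prone when a cluster has several machine deployments. The following Python sketch looks up the position of a machine deployment in the cluster definition, assuming the definition was retrieved in JSON form (`-o json` instead of `-o yaml`) and that each entry carries a `name` field. The function name and the sample data are hypothetical.

```python
import json


def machine_deployment_index(cluster, name):
    """Return the zero-based position of the named machine deployment."""
    deployments = cluster["spec"]["topology"]["workers"]["machineDeployments"]
    for index, deployment in enumerate(deployments):
        if deployment.get("name") == name:
            return index
    raise KeyError(f"machine deployment not found: {name}")


# Hypothetical, heavily trimmed output of: kubectl get cluster CLUSTER_NAME -o json
sample = """
{
  "spec": {
    "topology": {
      "workers": {
        "machineDeployments": [
          {"class": "node-pool", "name": "node-pool-1"},
          {"class": "node-pool", "name": "node-pool-2"}
        ]
      }
    }
  }
}
"""

cluster = json.loads(sample)
print(machine_deployment_index(cluster, "node-pool-2"))
```

The returned number is the value to substitute for `<index>` in a JSON Patch path.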

Delete the worker node MachineHealthCheck by setting `enable` to `false`.

kubectl patch cluster <cluster-name> -n <cluster-namespace> --type json -p='[{"op": "replace", "path": "/spec/topology/workers/machineDeployments/<index>/machineHealthCheck", "value":{"enable":false}}]'

Delete the control plane MachineHealthCheck by setting `enable` to `false`.

kubectl patch cluster <cluster-name> -n <cluster-namespace> --type json -p='[{"op": "replace", "path": "/spec/topology/controlPlane/machineHealthCheck", "value":{"enable":false}}]'

Create or update the control plane MachineHealthCheck with the desired settings.

kubectl patch cluster <cluster-name> -n <cluster-namespace> --type json -p='[{"op": "replace", "path": "/spec/topology/controlPlane/machineHealthCheck", "value":{"enable":true,"nodeStartupTimeout":"1h58m","unhealthyConditions":[{"status":"Unknown","timeout":"5m10s","type":"Unknown"},{"status":"Unknown","timeout":"5m0s","type":"Ready"}],"maxUnhealthy":"100%"}}]'

Create or update the worker node MachineHealthCheck with the desired settings.

kubectl patch cluster <cluster-name> -n <cluster-namespace> --type json -p='[{"op": "replace", "path": "/spec/topology/workers/machineDeployments/<index>/machineHealthCheck", "value":{"enable":true,"nodeStartupTimeout":"1h58m","unhealthyConditions":[{"status":"Unknown","timeout":"5m10s","type":"Unknown"},{"status":"Unknown","timeout":"5m0s","type":"Ready"}],"maxUnhealthy":"100%"}}]'
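
Hand-writing the JSON Patch argument for the commands above invites quoting and bracket mistakes, because a JSON Patch document must be an array of operations. The following Python sketch generates the `-p` payload from a dictionary of settings. The helper names are hypothetical, and the settings shown are placeholders to replace with values suited to your environment.

```python
import json

CONTROL_PLANE_MHC_PATH = "/spec/topology/controlPlane/machineHealthCheck"


def worker_mhc_path(index):
    """Return the patch path for the machine deployment at the given index."""
    return f"/spec/topology/workers/machineDeployments/{index}/machineHealthCheck"


def mhc_patch(path, settings):
    """Return a JSON Patch document (an array with one replace operation)."""
    operations = [{"op": "replace", "path": path, "value": settings}]
    return json.dumps(operations, separators=(",", ":"))


# Placeholder settings for illustration only.
settings = {
    "enable": True,
    "nodeStartupTimeout": "1h58m",
    "maxUnhealthy": "100%",
    "unhealthyConditions": [
        {"status": "Unknown", "timeout": "5m0s", "type": "Ready"},
    ],
}

patch = mhc_patch(worker_mhc_path(0), settings)
print(f"kubectl patch cluster <cluster-name> -n <cluster-namespace> --type json -p='{patch}'")
```

Under RFC 6902, a `replace` operation requires the target location to exist, whereas `add` on an object member creates the member or overwrites it. If the `machineHealthCheck` field is not yet present in the cluster specification, an `add` operation on the same path may be the safer choice.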

Configure MachineHealthCheck Using Tanzu CLI

You can use the Tanzu CLI to configure MachineHealthCheck for a v1beta1 cluster.

For example, run the following command to create or update the control plane MachineHealthCheck settings.

tanzu cluster mhc control-plane set <cluster-name> --node-startup-timeout 2h7m10s

Run the following command to verify that the setting was updated and has not been reverted by reconciliation.

tanzu cluster mhc control-plane get <cluster-name>

Run the following command to create or update the machine deployment MachineHealthCheck settings.

tanzu cluster mhc node set <cluster-name> --machine-deployment node-pool-1 --node-startup-timeout 1h59m0s

Run the following command to verify that the setting was updated and has not been reverted by reconciliation.

tanzu cluster mhc node get <cluster-name> -m <cluster-name>-node-pool-1-nr7r5

Besides get and set, the system supports the delete operation. For example:

For the control plane, you can use the following command:

tanzu cluster mhc control-plane delete <cluster-name>

For a machine deployment, you can use the following command:

tanzu cluster mhc node delete <cluster-name> --machine-deployment <machine-deployment-name>