You can configure, activate, and deactivate the Node Health Check parameters in Tanzu Kubernetes Grid clusters through the Kubernetes Container Clusters 4.1 UI plug-in.

The Node Health Check feature comprises two parts:
  • Detection
  • Remediation
Note: Node Health Check and Auto Repair on Errors differ in functionality. Node Health Check detects and remediates unhealthy nodes only after the cluster reaches the Available status, whereas Auto Repair on Errors reattempts cluster creation if the cluster enters an error state before its status becomes Available.
Note: Node Health Check is deactivated by default in VMware Cloud Director Container Service Extension 4.1.

Node Failure Detection

VMware Cloud Director Container Service Extension 4.1 can detect when a node in a Tanzu Kubernetes Grid cluster becomes unhealthy. When a node is unhealthy, the Kubernetes Container Clusters 4.1 UI plug-in shows the available and desired node counts on the cluster information page, and the failure appears in the Events section of the same page.

A node can become unhealthy for reasons that include, but are not limited to, the following:
  • Network outages
  • Power interruptions
  • Slow node performance due to high memory, CPU, or disk utilization
  • Node startup failure
  • Failure to join the cluster

Node Remediation

Starting with VMware Cloud Director Container Service Extension 4.1, the Node Health Check feature detects node failures in Tanzu Kubernetes Grid clusters and automatically replaces unhealthy Kubernetes nodes with new nodes. The Node Health Check parameters are required global settings in the VMware Cloud Director Container Service Extension server setup and server update workflows. The Kubernetes Container Clusters UI plug-in uses these settings when it creates clusters or updates cluster settings in all organizations. For more information, see Update the VMware Cloud Director Container Service Extension Server. Service providers can return to the Update Server tab at any time to reconfigure the Node Health Check parameters. If service providers do not explicitly configure the Node Health Check parameters, the following default values are set:
Table 1. Node Health Check Configuration
Node Health Check Parameter | Default Value | Description
Max Unhealthy Nodes | 100% | Remediation is suspended when the percentage of unhealthy nodes exceeds this value. A value of 100% means the cluster is always remediated. A value of 0% means the cluster is never remediated.
Node Startup Timeout | 900 seconds | If a node does not start within this time frame, it is considered unhealthy and is remediated. For a given VMware Cloud Director environment, service providers are advised to set this parameter to at least twice the time it takes for a VM to be created and bootstrapped.
Node Status "Not Ready" Timeout | 300 seconds | If a newly joined node cannot host workloads for longer than this timeout, it is considered unhealthy and is remediated.
Node Status "Unknown" Timeout | 300 seconds | If a healthy node is unreachable for longer than this timeout, it is considered unhealthy and is remediated.
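The way the parameters in Table 1 interact can be illustrated with a short Python sketch. This is only an illustrative model of the rules described in the table, not the extension's implementation. The class names, field names, and status labels are hypothetical, and the sketch assumes that the unhealthy percentage is computed over all nodes passed in.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeHealthCheckSettings:
    # Default values copied from Table 1; the field names are hypothetical.
    max_unhealthy_percent: float = 100.0
    node_startup_timeout_s: int = 900
    not_ready_timeout_s: int = 300
    unknown_timeout_s: int = 300


@dataclass(frozen=True)
class Node:
    name: str
    status: str             # "Starting", "Ready", "NotReady", or "Unknown" (illustrative labels)
    seconds_in_status: int  # how long the node has been in that status


def is_unhealthy(node: Node, settings: NodeHealthCheckSettings) -> bool:
    """A node is unhealthy once it stays in a non-ready status longer than the matching timeout."""
    timeouts = {
        "Starting": settings.node_startup_timeout_s,
        "NotReady": settings.not_ready_timeout_s,
        "Unknown": settings.unknown_timeout_s,
    }
    timeout = timeouts.get(node.status)
    return timeout is not None and node.seconds_in_status > timeout


def nodes_to_remediate(nodes: List[Node], settings: NodeHealthCheckSettings) -> List[Node]:
    """Return the nodes to replace, or an empty list when remediation is suspended."""
    unhealthy = [n for n in nodes if is_unhealthy(n, settings)]
    if not unhealthy:
        return []
    unhealthy_percent = 100.0 * len(unhealthy) / len(nodes)
    # Remediation is suspended when the unhealthy percentage exceeds Max Unhealthy Nodes.
    if unhealthy_percent > settings.max_unhealthy_percent:
        return []
    return unhealthy


def recommended_min_startup_timeout_s(vm_create_and_bootstrap_s: int) -> int:
    """At least twice the time it takes for a VM to be created and bootstrapped."""
    return 2 * vm_create_and_bootstrap_s
```

In this model, with the default settings, a four-node cluster in which one node has been "NotReady" for 400 seconds returns that node for remediation. With `max_unhealthy_percent` set to 0, the same cluster returns an empty list, because any unhealthy node exceeds the threshold.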
Tenant users use the Node Health Check parameters set by the service provider for their organization when they create clusters. For more information, see Create a Tanzu Kubernetes Grid Cluster.
Note: When service providers update the Node Health Check parameters, the existing Node Health Check parameters on the Tanzu Kubernetes Grid clusters that are already deployed are not modified.
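The behavior described in this note can be pictured with a small Python sketch. It assumes, purely for illustration, that each cluster keeps its own copy of the parameters taken at creation time. The `Server` and `Cluster` classes and their methods are hypothetical and do not represent the extension's actual data model or API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HealthCheckParams:
    # Default values from Table 1; the field names are hypothetical.
    max_unhealthy_percent: int = 100
    node_startup_timeout_s: int = 900
    not_ready_timeout_s: int = 300
    unknown_timeout_s: int = 300


@dataclass
class Cluster:
    name: str
    params: HealthCheckParams  # parameters captured when the cluster was created


class Server:
    """Hypothetical stand-in for the server-wide settings managed by the service provider."""

    def __init__(self) -> None:
        self.params = HealthCheckParams()

    def update_params(self, **changes) -> None:
        # Builds a new immutable settings object; clusters that already exist
        # keep referring to the object they were created with.
        self.params = replace(self.params, **changes)

    def create_cluster(self, name: str) -> Cluster:
        return Cluster(name=name, params=self.params)


server = Server()
existing = server.create_cluster("existing-cluster")
server.update_params(node_startup_timeout_s=1800)
created_later = server.create_cluster("new-cluster")
```

In this model, `existing.params.node_startup_timeout_s` remains 900 after the update, while `created_later.params.node_startup_timeout_s` is 1800.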

Activate or Deactivate Node Health Check in a VMware Cloud Director Container Service Extension 4.0.x Cluster

Tenant users can also activate or deactivate Node Health Check on clusters that were created in VMware Cloud Director Container Service Extension 4.0.x.

The following steps outline how tenant users can perform this action:

  1. Log in to the VMware Cloud Director portal, and from the top navigation bar, select More > Kubernetes Container Clusters.
  2. Click the cluster name, and in the cluster information page, click Settings.
  3. Turn the Node Health Check toggle on or off, and click Save.