Troubleshooting Workload Cluster Issues

This section includes tips to help you troubleshoot workload clusters.

For information about troubleshooting standalone management cluster deployments, see Troubleshooting Management Cluster Issues. You can find additional workarounds for known issues in this release in the Release Notes or in Tanzu Knowledge Base articles.

Common Tasks

Delete Users, Contexts, and Clusters with `kubectl`

To clean up your kubectl state by deleting some or all of its users, contexts, and clusters:

Open your ~/.kube/config file.
For the user objects that you want to delete, run:
```
kubectl config unset users.USER-NAME
```
Where USER-NAME is the name property of each top-level user object, as listed in the config files.
For the context objects that you want to delete, run:
```
kubectl config unset contexts.CONTEXT-NAME
```
Where CONTEXT-NAME is the name property of each top-level context object, as listed in the config files, typically of the form contexts.mycontext-admin@mycontext.
For the cluster objects that you want to delete, run:
```
kubectl config unset clusters.CLUSTER-NAME
```
Where CLUSTER-NAME is the name property of each top-level cluster object, as listed in the config files.
If the config files list the current context as a cluster that you deleted, unset the context:
```
kubectl config unset current-context
```
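
The steps above can be combined into a small helper. This is a minimal sketch, not part of kubectl: the function name `cleanup_kubeconfig_entries` and its argument order are illustrative assumptions, and it only wraps the `kubectl config` commands shown above.

```shell
# Sketch: remove one user, one context, and one cluster from the active kubeconfig.
# The function name and argument order are hypothetical, chosen for illustration.
cleanup_kubeconfig_entries() {
  user_name=$1
  context_name=$2
  cluster_name=$3

  # Record the active context first; the command fails if none is set.
  current=$(kubectl config current-context 2>/dev/null || true)

  kubectl config unset "users.${user_name}"
  kubectl config unset "contexts.${context_name}"
  kubectl config unset "clusters.${cluster_name}"

  # If the deleted context was the active one, clear current-context too.
  if [ "$current" = "$context_name" ]; then
    kubectl config unset current-context
  fi
}

# Example call (names are placeholders):
#   cleanup_kubeconfig_entries mycontext-admin mycontext-admin@mycontext mycontext
```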

Packages

Secret not created when installing Grafana from default YAML file

Problem

If you attempt to install Grafana by generating a default Grafana configuration file, the installation fails with error: Secret in version "v1" cannot be handled as a Secret: illegal base64 data at input byte 4 (reason: BadRequest).

Solution

Create the secret manually and use the same YAML file without the secret tag to install Grafana.

Perform the steps in Deploy Grafana into the Workload Cluster to create the configuration file for your Grafana configuration.
Remove the grafana.secret.* entries from the generated configuration file.

Create a secret manually.

kubectl create secret generic grafana -n tanzu-system-dashboards --from-literal=admin=admin

Deploy the package.

tanzu package install grafana \
--package grafana.tanzu.vmware.com \
--version AVAILABLE-PACKAGE-VERSION \
--values-file grafana-data-values.yaml \
--namespace TARGET-NAMESPACE

Perform the remaining steps in Deploy Grafana into the Workload Cluster.
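
If you repeat the workaround, the manual secret step can be made safe to re-run. The following is a rough sketch under the assumption that the secret lives in the tanzu-system-dashboards namespace, as in the command above; the function name `ensure_grafana_secret` is hypothetical.

```shell
# Sketch: create the grafana secret only if it does not already exist.
# The namespace default and the admin=admin literal are copied from the
# manual command in this section; the function name is illustrative.
ensure_grafana_secret() {
  namespace=${1:-tanzu-system-dashboards}
  if kubectl get secret grafana -n "$namespace" >/dev/null 2>&1; then
    echo "secret grafana already exists in $namespace"
  else
    kubectl create secret generic grafana -n "$namespace" --from-literal=admin=admin
  fi
}

# Example call:
#   ensure_grafana_secret
```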

Error when running `tanzu package repository`

Problem

The tanzu package repository command fails with an error.

Solution

Run kubectl get pkgr REPOSITORY-NAME -n NAMESPACE -o yaml to get details on the error.

Where:

REPOSITORY-NAME is the name of the package repository.
NAMESPACE is the target namespace of the package repository.

The tanzu package repository command can fail with an error similar to the following:

Error	Description	Solution
`NOT_FOUND`	The repository URL path is invalid.	Ensure that the URL of the package repository is reachable from your cluster.
`UNKNOWN` or `UNAUTHORIZED`	This error can occur when attempting to connect to the repository.
`Ownership`	A repository with the same package repository URL is already installed in the cluster.	Do one of the following: Run `tanzu package available list -n NAMESPACE` to see if the package you want to install is already available for installation. To revert the current failed attempt to add the repository, run `tanzu package repository delete REPOSITORY-NAME -n NAMESPACE`. Run `tanzu package repository list -A` to retrieve an existing package repository with the same URL. If you retrieve the package repository, you can proceed with deleting it at your own risk.
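
As a quick triage aid, the table above can be expressed as a lookup on the error text. This is only a sketch: the function name `suggest_repository_fix` is hypothetical, the matching is a plain substring check, and the hints paraphrase the table rather than add new guidance.

```shell
# Sketch: map an error message from `tanzu package repository` to the
# matching row of the table above. Matching is a simple substring test.
suggest_repository_fix() {
  case "$1" in
    *NOT_FOUND*)
      echo "Check that the package repository URL is valid and reachable from the cluster." ;;
    *UNKNOWN*|*UNAUTHORIZED*)
      echo "The cluster could not connect to the repository." ;;
    *Ownership*)
      echo "A repository with the same URL is already installed. Run: tanzu package repository list -A" ;;
    *)
      echo "No matching row. Inspect the output of kubectl get pkgr for details." ;;
  esac
}

# Example call (the message text is a placeholder):
#   suggest_repository_fix "NOT_FOUND: repository path is invalid"
```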

Error when running `tanzu package installed`

Problem

The tanzu package installed command fails with an error.

Solution

Run kubectl get pkgi INSTALLED-PACKAGE-NAME -n NAMESPACE -o yaml to get details on the error.

Where:

INSTALLED-PACKAGE-NAME is the name of the installed package.
NAMESPACE is the namespace of the installed package.

The tanzu package installed command can fail with an error similar to the following:

Error	Description	Solution
`Ownership`	A package with the same name is already installed in the cluster.	Run `tanzu package installed list -A` to check if the package you want to install is already installed. If so, you might want to use the already installed package, update its version, or delete the package to be able to proceed with the installation.
`Evaluating starlark template`	This error can occur when the listed configuration value is missing.	Run `tanzu package available get AVAILABLE-PACKAGE-NAME/VERSION -n NAMESPACE --values-schema` to see all available configuration values and provide the required configuration values to the `tanzu package install` command.
`Failed to find a package with name PACKAGE-NAME in namespace NAMESPACE`	The specified package and package metadata are not available in the target namespace.	Ensure the specified package is listed in the output of `tanzu package available list AVAILABLE-PACKAGE-NAME -n NAMESPACE`. If not, add the package repository that contains the package to the target namespace.
`Namespace NAMESPACE not found`	The namespace in which you want to install the package does not exist.	In TKG v2.1 and later, `tanzu package` commands are based on `kctrl` and no longer support the `--create-namespace` flag. Before you install a package or package repository, the target namespace must already exist.
`Provided service account SERVICE-ACCOUNT-NAME is already used by another package in namespace NAMESPACE`	The service account provided with the `service-account-name` flag is already used by another installed package.	Either allow the package plugin to create the service account for you or choose another service account name.
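
For the `Namespace NAMESPACE not found` row, the target namespace has to exist before the install. A minimal sketch of a pre-install check follows; the function name `ensure_namespace` is hypothetical and the namespace in the example is a placeholder.

```shell
# Sketch: create the target namespace only when it is missing, so that a
# later `tanzu package install` does not fail with "Namespace ... not found".
ensure_namespace() {
  namespace=$1
  if ! kubectl get namespace "$namespace" >/dev/null 2>&1; then
    kubectl create namespace "$namespace"
  fi
}

# Example call (placeholder namespace):
#   ensure_namespace my-packages
```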

Pods

Pods Are Stuck in Pending on Cluster Due to vCenter Connectivity

Problem

When you run kubectl get pods -A on the created cluster, some pods remain in the Pending state.

You run kubectl describe pod -n POD-NAMESPACE POD-NAME on an affected pod and see the following event:

n node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate

Solution

Ensure that connectivity and firewall rules are in place to allow communication between the cluster and vCenter. For firewall ports and protocols requirements, see the vSphere 8 listings in VMware Ports and Protocols.
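
To see how widespread the problem is, you can list the Pending pods alongside the taint keys on each node. This is a sketch only; the function name `show_pending_pods_and_taints` is hypothetical, and it uses standard `kubectl get` options.

```shell
# Sketch: show Pending pods and the taint keys set on each node.
show_pending_pods_and_taints() {
  # Pods whose phase is Pending, across all namespaces.
  kubectl get pods -A --field-selector=status.phase=Pending
  # Taint keys per node; look for node.cloudprovider.kubernetes.io/uninitialized.
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Example call:
#   show_pending_pods_and_taints
```

Nodes that still carry the uninitialized taint have not yet been initialized by the cloud provider, which is consistent with the vCenter connectivity problem described here.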

Storage

Changing default `StorageClass` object causes reconcile failure in workload clusters

Problem

Modifying the properties of a default StorageClass object included in TKG causes a package reconcile failure in workload clusters that use the storage class.

Solution

To customize a storage class, create a new StorageClass definition with a different name instead of modifying the default object definition, and reconfigure the cluster to use the new storage class.
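
As an illustration, a new StorageClass definition can be generated and applied without touching the default object. The sketch below is an assumption-laden example: the function name `print_custom_storage_class`, the class name, and the choice of provisioner are placeholders, and a real definition usually also needs provider-specific parameters that are not shown.

```shell
# Sketch: print a minimal StorageClass manifest with a new name.
# Both arguments are supplied by the caller; nothing here modifies the
# default StorageClass object that ships with TKG.
print_custom_storage_class() {
  name=$1
  provisioner=$2
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ${name}
provisioner: ${provisioner}
EOF
}

# Example call (placeholder name; csi.vsphere.vmware.com assumes vSphere CSI):
#   print_custom_storage_class my-custom-sc csi.vsphere.vmware.com | kubectl apply -f -
```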

Workload Clusters

Deploying a Cluster Times Out, but the Cluster Is Created

Problem

Running tanzu cluster create fails with a timeout error similar to the following:

I0317 11:11:16.658433 clusterclient.go:341] Waiting for resource my-cluster of type *v1beta1.Cluster to be up and running
E0317 11:26:16.932833 common.go:29]
Error: unable to wait for cluster and get the cluster kubeconfig: error waiting for cluster to be created (this may take a few minutes): cluster control plane is still being initialized
E0317 11:26:16.933251 common.go:33]

Solution

Use the --timeout flag to specify the time to wait for the cluster creation to complete. The default waiting time is 30 minutes.

tanzu cluster create --timeout TIME

Where TIME is the length of time, in minutes, to wait for the completion of cluster creation. For example, 60m.
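
A small wrapper can make the longer timeout the default for your own scripts. This is a sketch under stated assumptions: the function name `create_cluster_with_timeout` and the 60m default are illustrative, and the cluster configuration file path is a placeholder.

```shell
# Sketch: run cluster creation with an explicit timeout.
# The 60m default mirrors the example value above, not a product default.
create_cluster_with_timeout() {
  config_file=$1
  timeout=${2:-60m}
  tanzu cluster create --file "$config_file" --timeout "$timeout"
}

# Example call (placeholder file name):
#   create_cluster_with_timeout my-cluster-config.yaml 90m
```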

Workload Cluster is Stuck in Deletion

Problem

Running tanzu cluster delete fails to delete the workload cluster.

To delete the cluster manually, see the two solutions below.

Solution 1

On the target cluster, delete the StatefulSet object for AKO, which runs in the avi-system namespace:
```
kubectl delete sts ako -n avi-system
```

Solution 2

Log in to the cluster and delete the worker machines:
```
kubectl delete machine worker1 worker2
```
From vCenter, power off and delete the worker node VMs.
Edit the control plane machines and remove the following finalizer:
```
finalizers:
 - machine.cluster.x-k8s.io
```

Delete the control plane machines:

kubectl delete machine controlplane1 controlplane2

From vCenter, power off and delete the control plane VMs.
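
The finalizer edit and the delete in Solution 2 can also be done non-interactively. The following is a sketch, not the documented procedure: the function name `force_delete_machine` is hypothetical, the namespace default is an assumption, and the patch clears all finalizers on the Machine object rather than only machine.cluster.x-k8s.io.

```shell
# Sketch: clear finalizers on a Machine object, then delete it.
# Assumption: the Machine lives in the namespace passed as the second
# argument (default used here is "default").
force_delete_machine() {
  machine=$1
  namespace=${2:-default}
  # A merge patch that sets finalizers to null removes every finalizer.
  kubectl patch machine "$machine" -n "$namespace" --type merge -p '{"metadata":{"finalizers":null}}'
  kubectl delete machine "$machine" -n "$namespace"
}

# Example call (placeholder machine name):
#   force_delete_machine controlplane1
```

Removing finalizers skips the cleanup that the controller would normally perform, which is why the VMs still have to be powered off and deleted from vCenter.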

Cluster Worker Nodes in NotReady Status Due to Mismatched MTUs

Problem

Different Maximum Transmission Unit (MTU) settings on the worker nodes in a cluster result in a TLS handshake timeout.

Logs from journalctl -u kubelet on a node show communication failure with the API server. Running kubectl get nodes shows that worker nodes have moved to the NotReady status.

You can reconfirm the issue by doing the following:

On the control plane node and the worker node machines, run ip link and compare the MTU values of the eth0 interface. A mismatch indicates this issue.
Run Crash Diagnostics (Crashd) and review the kubelet logs to confirm that the connections time out or that the worker nodes are in the NotReady status. For more information on running Crashd, see Troubleshooting Workload Clusters with Crash Diagnostics.
Confirm that the following commands fail when you run them on a machine that is in the NotReady node status:
- openssl s_client -connect IP:PORT
- curl -k https://IP:PORT/healthz
Where IP and PORT are the IP address and port number of the Kubernetes API server control plane endpoint. By default, PORT is set to 6443.
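
To compare MTU values across nodes without reading the full ip link output by eye, you can extract just the number. This is a minimal sketch; the function name `mtu_from_ip_link` is hypothetical, and it assumes the usual one-line-per-interface header format that ip link prints.

```shell
# Sketch: print the MTU of one interface from `ip link` output on stdin.
# Matches header lines such as "2: eth0: <...> mtu 1500 qdisc ...".
mtu_from_ip_link() {
  iface=$1
  awk -v iface="$iface" '
    $2 == iface ":" {
      for (i = 3; i < NF; i++) {
        if ($i == "mtu") { print $(i + 1); exit }
      }
    }'
}

# Example call, run on each node and compared by hand:
#   ip link | mtu_from_ip_link eth0
```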

Solution

Review the privileged daemonsets deployed on the cluster, and review any daemonsets from third-party vendors that might modify the network configurations of the host operating system. You might need to consult the software vendor to find this out. Daemonsets that can modify the host operating system have .spec.template.spec.hostNetwork: true, or have privileged: true or NET_ADMIN in the capabilities field of a container security context.
If you want to configure large MTU settings, provision the cluster with a control plane that has a higher MTU value.
Ensure that the cluster network either allows Path MTU discovery or has TCP MSS clamping in place to allow correct MTU sizing to external services, such as vCenter or container registries.
Ensure that you configure the same MTU settings for all the nodes in a cluster.
The network firewall settings must allow for packets of the configured MTU size.
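
To find the daemonsets worth reviewing, you can print the relevant security fields for every daemonset in one table. This is a sketch under the assumption that the fields named in the solution above are enough for a first pass; the function name `list_daemonset_host_access` and the column names are illustrative.

```shell
# Sketch: list daemonsets with the fields that indicate host-level access.
# Empty or <none> values mean the field is not set on that daemonset.
list_daemonset_host_access() {
  columns='NAMESPACE:.metadata.namespace'
  columns="$columns,NAME:.metadata.name"
  columns="$columns,HOSTNETWORK:.spec.template.spec.hostNetwork"
  columns="$columns,PRIVILEGED:.spec.template.spec.containers[*].securityContext.privileged"
  columns="$columns,ADDED_CAPS:.spec.template.spec.containers[*].securityContext.capabilities.add"
  kubectl get daemonsets -A -o custom-columns="$columns"
}

# Example call:
#   list_daemonset_host_access
```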

With NSX ALB, you cannot create clusters with identical names

Problem

If you are using NSX Advanced Load Balancer for workloads (AVI_ENABLE) or the control plane (AVI_CONTROL_PLANE_HA_PROVIDER), the Avi Controller may fail to distinguish between identically named clusters.

Solution

Set a unique CLUSTER_NAME value for each cluster. Do not create multiple workload clusters that have the same CLUSTER_NAME and are also in the same management cluster namespace, as set by their NAMESPACE value.

vSphere CSI volume deletion may fail on AVS

Problem

On Azure VMware Solution (AVS), deletion of vSphere CSI Persistent Volumes (PVs) may fail. Deleting a PV requires the cns.searchable permission. The default admin account for AVS is not created with this permission. For more information, see vSphere Roles and Privileges.

Solution

To delete a vSphere CSI PV on AVS, contact Azure support.

Deleting cluster on AWS fails if cluster uses networking resources not deployed with Tanzu Kubernetes Grid

Problem

The tanzu cluster delete and tanzu management-cluster delete commands may hang with clusters that use networking resources created by the AWS Cloud Controller Manager independently from the Tanzu Kubernetes Grid deployment process. Such resources may include load balancers and other networking services, as listed in The Service Controller in the Kubernetes AWS Cloud Provider documentation.

For more information, see the Cluster API issue Drain workload clusters of service Type=Loadbalancer on teardown.

Solution

Use kubectl delete to delete services of type LoadBalancer from the cluster. If that fails, use the AWS console to manually delete any LoadBalancer and SecurityGroup objects created for this service by the Cloud Controller Manager.

Caution
Do not delete load balancers or security groups managed by Tanzu, which have the tags key: sigs.k8s.io/cluster-api-provider-aws/cluster/CLUSTER-NAME, value: owned.
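
The kubectl delete step in the solution above could be scripted roughly as follows. This is a sketch, not a supported tool: the function names are hypothetical, and it deletes every service of type LoadBalancer in the cluster, so review the list before running the delete.

```shell
# Sketch: list services of type LoadBalancer as "NAMESPACE NAME" pairs.
list_loadbalancer_services() {
  kubectl get services -A --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,TYPE:.spec.type' |
    awk '$3 == "LoadBalancer" { print $1, $2 }'
}

# Sketch: delete each service returned by the listing function.
delete_loadbalancer_services() {
  list_loadbalancer_services | while read -r namespace name; do
    kubectl delete service "$name" -n "$namespace" </dev/null
  done
}

# Example calls:
#   list_loadbalancer_services      # review first
#   delete_loadbalancer_services    # then delete
```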

Cluster delete fails when storage volume uses account with private endpoint

Problem

With an Azure workload cluster in an unmanaged resource group, when the Azure CSI driver creates a persistent volume (PV) that uses a storage account with a private endpoint, it creates privateEndpoint and vNet resources that are not deleted when the PV is deleted. As a result, deleting the cluster fails with an error like subnets failed to delete. err: failed to delete resource ... Subnet management-cluster-node-subnet is in use.

Solution

Before deleting the Azure cluster, manually delete the network interface for the storage account private endpoint:

From a browser, log in to Azure Resource Explorer.
Click subscriptions at left, and expand your subscription.
Under your subscription, expand resourceGroups at left, and expand your TKG deployment’s resource group.
Under the resource group, expand providers > Microsoft.Network > networkinterfaces.
Under networkinterfaces, select the NIC resource that is failing to delete.
Click the Read/Write button at the top, and then the Actions (POST, DELETE) tab just underneath.
Click Delete.
Once the NIC is deleted, delete the Azure cluster.
