This section includes tips to help you troubleshoot workload clusters.
For information about troubleshooting standalone management cluster deployments, see Troubleshooting Management Cluster Issues. You can find additional workarounds for known issues in this release in the Release Notes or in Tanzu Knowledge Base articles.
kubectl
To clean up your kubectl state by deleting some or all of its users, contexts, and clusters:
Open your ~/.kube/config file.
For the user objects that you want to delete, run:
kubectl config unset users.USER-NAME
Where USER-NAME is the name property of each top-level user object, as listed in the config files.
For the context objects that you want to delete, run:
kubectl config unset contexts.CONTEXT-NAME
Where CONTEXT-NAME is the name property of each top-level context object, as listed in the config files, typically of the form contexts.mycontext-admin@mycontext.
For the cluster objects that you want to delete, run:
kubectl config unset clusters.CLUSTER-NAME
Where CLUSTER-NAME is the name property of each top-level cluster object, as listed in the config files.
If the config files list the current context as a cluster that you deleted, unset the context:
kubectl config unset current-context
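The cleanup steps above can also be scripted. The following is a minimal sketch rather than part of the documented procedure: the function name cleanup_kubeconfig_entries and its argument order are hypothetical, and it assumes that the names you pass match the name properties in your config file.

```shell
# Sketch: remove one user, one context, and one cluster entry from kubeconfig.
# The function name and argument order are illustrative assumptions.
cleanup_kubeconfig_entries() {
  user_name="$1"
  context_name="$2"
  cluster_name="$3"

  # Record the current context before anything is removed.
  current="$(kubectl config current-context 2>/dev/null || true)"

  kubectl config unset "users.${user_name}"
  kubectl config unset "contexts.${context_name}"
  kubectl config unset "clusters.${cluster_name}"

  # Clear current-context only if it pointed at the context just removed.
  if [ "${current}" = "${context_name}" ]; then
    kubectl config unset current-context
  fi
}

# Example call (placeholder names):
# cleanup_kubeconfig_entries mycontext-admin mycontext-admin@mycontext mycontext
```

If you prefer not to open the file, kubectl config get-contexts lists the context names that are currently defined.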
Problem
If you attempt to install Grafana by generating a default Grafana configuration file, the installation fails with error: Secret in version "v1" cannot be handled as a Secret: illegal base64 data at input byte 4 (reason: BadRequest).
Solution
Create the secret manually and use the same YAML file without the secret tag to install Grafana.
Remove the grafana.secret.* entries from the generated configuration file.
Create a secret manually.
kubectl create secret generic grafana -n tanzu-system-dashboards --from-literal=admin=admin
Deploy the package.
tanzu package install grafana \
--package grafana.tanzu.vmware.com \
--version AVAILABLE-PACKAGE-VERSION \
--values-file grafana-data-values.yaml \
--namespace TARGET-NAMESPACE
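The error message indicates that a value in the data field of the Secret was not valid base64. As a hedged illustration, and not part of the documented workaround, the sketch below shows what an encoded value looks like and how to check whether a string decodes. The helper names encode_secret_value and is_base64 are hypothetical, and the sketch assumes a base64 utility that accepts -d, as GNU coreutils does.

```shell
# Encode a literal the way values under a Secret's "data" field are stored.
encode_secret_value() {
  printf '%s' "$1" | base64
}

# Return success if the argument decodes as base64 (hypothetical helper).
is_base64() {
  printf '%s' "$1" | base64 -d >/dev/null 2>&1
}

# Example:
# encode_secret_value admin
# is_base64 'YWRtaW4=' && echo "valid base64"
```

When you create the secret with kubectl create secret generic and --from-literal, as in the step above, kubectl performs the encoding for you.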
tanzu package repository
Problem
The tanzu package repository command fails with an error.
Solution
Run kubectl get pkgr REPOSITORY-NAME -n NAMESPACE -o yaml to get details on the error.
Where:
REPOSITORY-NAME is the name of the package repository.
NAMESPACE is the target namespace of the package repository.
The tanzu package repository command can fail with an error similar to the following:
Error | Description | Solution |
---|---|---|
NOT_FOUND | The repository URL path is invalid. | Ensure that the URL of the package repository is reachable from your cluster. |
UNKNOWN or UNAUTHORIZED | This error can occur when attempting to connect to the repository. | |
Ownership | A repository with the same package repository URL is already installed in the cluster. | Do one of the following: |
tanzu package installed
Problem
The tanzu package installed command fails with an error.
Solution
Run kubectl get pkgi INSTALLED-PACKAGE-NAME -n NAMESPACE -o yaml to get details on the error.
Where:
INSTALLED-PACKAGE-NAME is the name of the installed package.
NAMESPACE is the namespace of the installed package.
The tanzu package installed command can fail with an error similar to the following:
Error | Description | Solution |
---|---|---|
Ownership | A package with the same name is already installed in the cluster. | Run tanzu package installed list -A to check if the package you want to install is already installed. If so, you might want to use the already installed package, update its version, or delete the package to be able to proceed with the installation. |
Evaluating starlark template | This error can occur when the listed configuration value is missing. | Run tanzu package available get AVAILABLE-PACKAGE-NAME/VERSION -n NAMESPACE --values-schema to see all available configuration values and provide the required configuration values to the tanzu package install command. |
Failed to find a package with name PACKAGE-NAME in namespace NAMESPACE | The specified package and package metadata are not available in the target namespace. | Ensure the specified package is listed in the output of tanzu package available list AVAILABLE-PACKAGE-NAME -n NAMESPACE. If not, add the package repository that contains the package to the target namespace. |
Namespace NAMESPACE not found | The namespace in which you want to install the package does not exist. | In TKG v2.1 and later, tanzu package commands are based on kctrl and no longer support the --create-namespace flag. Before you install a package or package repository, the target namespace must already exist. |
Provided service account SERVICE-ACCOUNT-NAME is already used by another package in namespace NAMESPACE | The service account provided with the service-account-name flag is already used by another installed package. | Either allow the package plugin to create the service account for you or choose another service account name. |
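Because the target namespace must already exist before you install a package, it can help to create it up front. The following is a minimal sketch under that assumption; the helper name ensure_namespace and the example namespace are hypothetical.

```shell
# Sketch: create the target namespace only if it does not exist yet.
# ensure_namespace is a hypothetical helper name.
ensure_namespace() {
  ns="$1"
  if kubectl get namespace "${ns}" >/dev/null 2>&1; then
    echo "namespace ${ns} already exists"
  else
    kubectl create namespace "${ns}"
  fi
}

# Example call (placeholder namespace):
# ensure_namespace my-packages
```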
Problem
When you run kubectl get pods -A on the created cluster, some pods remain in pending state.
You run kubectl describe pod -n pod-namespace pod-name on an affected pod and see the following event:
n node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate
Solution
Ensure that connectivity and firewall rules are in place to allow communication between the cluster and vCenter. For firewall ports and protocols requirements, see the vSphere 8 listings in VMware Ports and Protocols.
StorageClass object causes reconcile failure in workload clusters
Problem
Modifying the properties of a default StorageClass object included in TKG causes a package reconcile failure in workload clusters that use the storage class.
Solution
To customize a storage class, create a new StorageClass definition with a different name instead of modifying the default object definition, and reconfigure the cluster to use the new storage class.
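A new StorageClass definition might look like the one written by the sketch below. This is an illustration only: it assumes a vSphere cluster that uses the vSphere CSI provisioner, and the names custom-storage and my-storage-policy, as well as the helper write_custom_storageclass, are placeholders that do not come from this documentation.

```shell
# Sketch: write a new StorageClass manifest instead of editing the default one.
# Assumptions: vSphere CSI provisioner; "custom-storage" and "my-storage-policy"
# are placeholder names that you must replace with your own values.
write_custom_storageclass() {
  out_file="${1:-custom-storageclass.yaml}"
  cat > "${out_file}" <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: custom-storage
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: my-storage-policy
reclaimPolicy: Delete
EOF
}

# Example:
# write_custom_storageclass custom-storageclass.yaml
# kubectl apply -f custom-storageclass.yaml
```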
Problem
Running tanzu cluster create fails with a timeout error similar to the following:
I0317 11:11:16.658433 clusterclient.go:341] Waiting for resource my-cluster of type *v1beta1.Cluster to be up and running
E0317 11:26:16.932833 common.go:29]
Error: unable to wait for cluster and get the cluster kubeconfig: error waiting for cluster to be created (this may take a few minutes): cluster control plane is still being initialized
E0317 11:26:16.933251 common.go:33]
Solution
Use the --timeout flag to specify the time to wait for the cluster creation to complete. The default waiting time is 30 minutes.
tanzu cluster create --timeout TIME
Where TIME is the length of time, in minutes, to wait for the completion of cluster creation. For example, 60m.
Problem
tanzu cluster delete fails to delete the workload cluster.
To delete the cluster manually, see the two solutions below.
Solution 1
On the target cluster, delete the StatefulSet object for AKO, which runs in the avi-system namespace:
kubectl delete sts ako -n avi-system
Solution 2
Log in to the cluster and delete the worker machines:
kubectl delete machine worker1 worker2
From vCenter, power off and delete the worker node VMs.
Edit control plane machines and remove the finalizer link:
finalizers:
- machine.cluster.x-k8s.io
Delete the control plane machines:
kubectl delete machine controlplane1 controlplane2
From vCenter, power off and delete the control plane VMs.
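As an alternative to editing each control plane machine by hand, the finalizer removal and deletion in Solution 2 can be sketched with kubectl patch. The helper name force_delete_machine is hypothetical, and like the commands above it operates in the current namespace of your kubectl context. Note that the patch clears every finalizer on the object, not only machine.cluster.x-k8s.io, so treat it as a last resort for machines whose VMs you are removing manually.

```shell
# Sketch: clear finalizers on a Machine object, then delete it.
# force_delete_machine is a hypothetical helper name.
force_delete_machine() {
  machine="$1"
  # A JSON merge patch that sets finalizers to null removes the whole list.
  kubectl patch machine "${machine}" --type merge \
    -p '{"metadata":{"finalizers":null}}'
  kubectl delete machine "${machine}"
}

# Example call (machine name taken from the steps above):
# force_delete_machine controlplane1
```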
Problem
Different Maximum Transmission Unit (MTU) settings on the worker nodes in a cluster result in a TLS handshake timeout.
Logs from journalctl -u kubelet on a node show communication failure with the API server. Running kubectl get nodes shows that worker nodes have moved to the NotReady status.
You can reconfirm the issue by doing the following:
On each node, run ip link and compare the MTU values of the eth0 interface. If they do not match, it is indicative of this issue.
Confirm that the following commands fail when you run them on a machine that is in the NotReady node status:
openssl s_client -connect IP:PORT
curl -k https://IP:PORT/healthz
Where IP and PORT are the IP address and port number of the Kubernetes API server control plane endpoint. By default, PORT is set to 6443.
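To make the MTU comparison less error-prone, the value can be extracted from the ip link output instead of read by eye. The following is a minimal sketch; the helper name mtu_from_ip_link is hypothetical, and it assumes the usual ip link layout in which the interface name is followed by a colon and the keyword mtu precedes the value.

```shell
# Sketch: extract the MTU of one interface from `ip link` output on stdin.
# mtu_from_ip_link is a hypothetical helper name.
mtu_from_ip_link() {
  awk -v ifc="$1" '$2 == (ifc ":") {
    # Scan the matching line for the "mtu" keyword and print the next field.
    for (i = 3; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit }
  }'
}

# Example, run on each node and compare the printed values:
# ip link | mtu_from_ip_link eth0
```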
Solution
Review the privileged daemonsets deployed on the cluster, and review any daemonsets from third-party vendors that might modify the network configurations of the host operating system. You might need to consult the software vendor to find this out. Daemonsets that can modify the host operating system have .spec.template.spec.hostNetwork: true, or have privileged: true or the NET_ADMIN capability in a container security context.
If you want to configure large MTU settings, provision the cluster with a control plane that uses a higher MTU value.
Ensure that the cluster network either allows Path MTU discovery or has TCP MSS clamping in place to allow correct MTU sizing to external services, such as vCenter or container registries.
Ensure that you configure the same MTU settings for all the nodes in a cluster.
The network firewall settings must allow for packets of the configured MTU size.
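As a starting point for the daemonset review, the following sketch lists daemonsets whose pods use the host network. It covers only the hostNetwork check; privileged and NET_ADMIN settings still need a manual look at each container security context. The helper name list_hostnetwork_daemonsets is hypothetical.

```shell
# Sketch: list daemonsets whose pod template sets hostNetwork: true.
# Output format is NAMESPACE/NAME, one per line.
list_hostnetwork_daemonsets() {
  kubectl get daemonsets --all-namespaces \
    -o jsonpath='{range .items[?(@.spec.template.spec.hostNetwork==true)]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
}

# Example call:
# list_hostnetwork_daemonsets
```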
Problem
If you are using NSX Advanced Load Balancer for workloads (AVI_ENABLE) or the control plane (AVI_CONTROL_PLANE_HA_PROVIDER), the Avi Controller may fail to distinguish between identically named clusters.
Solution:
Set a unique CLUSTER_NAME value for each cluster. Do not create multiple workload clusters that have the same CLUSTER_NAME and are also in the same management cluster namespace, as set by their NAMESPACE value.
Problem
On Azure VMware Solution (AVS), deletion of vSphere CSI Persistent Volumes (PVs) may fail. Deleting a PV requires the cns.searchable permission. The default admin account for AVS, <[email protected]>, is not created with this permission. For more information, see vSphere Roles and Privileges.
Solution
To delete a vSphere CSI PV on AVS, contact Azure support.
Problem
The tanzu cluster delete and tanzu management-cluster delete commands may hang with clusters that use networking resources created by the AWS Cloud Controller Manager independently from the Tanzu Kubernetes Grid deployment process. Such resources may include load balancers and other networking services, as listed in The Service Controller in the Kubernetes AWS Cloud Provider documentation.
For more information, see the Cluster API issue Drain workload clusters of service Type=Loadbalancer on teardown.
Solution
Use kubectl delete to delete services of type LoadBalancer from the cluster. If that fails, use the AWS console to manually delete any LoadBalancer and SecurityGroup objects created for this service by the Cloud Controller Manager.
Caution: Do not delete load balancers or security groups managed by Tanzu, which have the tags key: sigs.k8s.io/cluster-api-provider-aws/cluster/CLUSTER-NAME, value: owned.
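Before deleting the cluster, it can help to find the services of type LoadBalancer that the solution above refers to. The following is a minimal sketch; the helper name list_loadbalancer_services is hypothetical, and the output is one NAMESPACE NAME pair per line.

```shell
# Sketch: list every Service of type LoadBalancer in the cluster.
list_loadbalancer_services() {
  kubectl get services --all-namespaces \
    -o jsonpath='{range .items[?(@.spec.type=="LoadBalancer")]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}'
}

# Example: review the list, then delete each service, for instance:
# kubectl delete service SERVICE-NAME -n NAMESPACE
```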
Problem
With an Azure workload cluster in an unmanaged resource group, when the Azure CSI driver creates a persistent volume (PV) that uses a storage account with a private endpoint, it creates privateEndpoint and vNet resources that are not deleted when the PV is deleted. As a result, deleting the cluster fails with an error like subnets failed to delete. err: failed to delete resource ... Subnet management-cluster-node-subnet is in use.
Solution:
Before deleting the Azure cluster, manually delete the network interface for the storage account private endpoint:
networkinterfaces, select the NIC resource that is failing to delete.