This topic includes tips to help you troubleshoot standalone management cluster deployments.
For information about troubleshooting workload clusters, see Troubleshooting Workload Cluster Issues. You can find additional workarounds for known issues in this release in the Release Notes or in Tanzu Knowledge Base articles.
Some of the procedures below use the kind CLI. To install kind, see Installation in the kind documentation.
You can use SSH to connect to individual nodes of standalone management clusters or workload clusters. To do so, the SSH key pair that you created when you deployed the management cluster must be available on the machine on which you run the SSH command. Consequently, you must run ssh commands on the machine on which you run tanzu commands.
The SSH keys that you register with the management cluster and that are used by any workload clusters that you deploy from the management cluster are associated with the capv user account, for nodes running on both Photon OS and Ubuntu.
To connect to a node by using SSH, run the following command from the machine that you use as the bootstrap machine:
ssh capv@NODE_ADDRESS
Because the SSH key is present on the system on which you are running the ssh command, no password is required.
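For example, to connect to a node with the hypothetical address 10.90.110.91:
ssh capv@10.90.110.91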
kubectl
To clean up your kubectl state by deleting some or all of its users, contexts, and clusters:
Open your ~/.kube-tkg/config file.
For the user objects that you want to delete, run:
kubectl config unset users.USERNAME --kubeconfig ~/.kube-tkg/config
Where USERNAME is the name property of each top-level user object, as listed in the config file.
For the context objects that you want to delete, run:
kubectl config unset contexts.CONTEXT-NAME --kubeconfig ~/.kube-tkg/config
Where CONTEXT-NAME is the name property of each top-level context object, as listed in the config file, typically of the form contexts.mycontext-admin@mycontext.
For the cluster objects that you want to delete, run:
kubectl config unset clusters.CLUSTER-NAME --kubeconfig ~/.kube-tkg/config
Where CLUSTER-NAME is the name property of each top-level cluster object, as listed in the config file.
If the config files list the current context as a cluster that you deleted, unset the context:
kubectl config unset current-context --kubeconfig ~/.kube-tkg/config
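For example, the following sequence removes all kubectl state for a hypothetical management cluster whose context is named my-mgmt-cluster-admin@my-mgmt-cluster (names are illustrative):
kubectl config unset users.my-mgmt-cluster-admin --kubeconfig ~/.kube-tkg/config
kubectl config unset contexts.my-mgmt-cluster-admin@my-mgmt-cluster --kubeconfig ~/.kube-tkg/config
kubectl config unset clusters.my-mgmt-cluster --kubeconfig ~/.kube-tkg/config
kubectl config unset current-context --kubeconfig ~/.kube-tkg/config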
If you deleted management clusters that are tracked by the tanzu CLI, delete them from the tanzu CLI's state by running tanzu context delete, as described in Delete Management Clusters from Your Tanzu CLI Configuration.
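For example, assuming a management cluster context named my-mgmt-cluster (name is illustrative):
tanzu context delete my-mgmt-cluster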
Deactivate nfs-utils on Photon OS Nodes
Problem
In Tanzu Kubernetes Grid v1.1.2 and later, nfs-utils is enabled by default. If you do not require nfs-utils, you can remove it from cluster node VMs.
Solution
To deactivate nfs-utils on clusters that you deploy with Tanzu Kubernetes Grid v1.1.2 or later, use SSH to log in to the cluster node VMs and run the following command:
tdnf erase nfs-utils
Problem
An unsuccessful attempt to deploy a standalone management cluster leaves orphaned objects in your cloud infrastructure and on your bootstrap machine.
Solution
Monitor the tanzu mc create command output, either in the terminal or in the Tanzu Kubernetes Grid installer interface. If the command fails, it prints a help message that includes the following: “Failure while deploying management cluster… To clean up the resources created by the management cluster: tkg delete mc….”
Run tanzu mc delete YOUR-CLUSTER-NAME. This command removes the objects that it created in your infrastructure and locally.
You can also use the alternative methods described below:
Bootstrap machine cleanup:
To remove a kind cluster, use the kind CLI. For example:
kind get clusters
kind delete cluster --name tkg-kind-example1234567abcdef
To remove Docker objects, use the docker CLI. For example, docker rm, docker rmi, and docker system prune -a --volumes.
Caution: If you are running Docker processes that are not related to Tanzu Kubernetes Grid on your system, remove unneeded Docker objects individually.
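For example, to find and remove only the containers that belong to the kind bootstrap cluster, you can filter by the tkg-kind name prefix; the container name shown is illustrative:
docker ps -a --filter "name=tkg-kind"
docker rm -v tkg-kind-example1234567abcdef-control-plane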
Problem
Running tanzu mc delete removes the management cluster, but fails to delete the local kind cluster from the bootstrap machine.
Solution
List all running kind clusters and remove the one that looks like tkg-kind-unique_ID:
kind delete cluster --name tkg-kind-unique_ID
List all running containers and identify the kind cluster.
docker ps -a
Copy the container ID of the kind cluster and remove it.
docker kill container-ID
Problem
Your standalone management cluster fails to deploy because machines are stuck, waiting for remediation.
Solution
For a management cluster that you deployed with the dev plan, which has only one control plane node, you must redeploy the management cluster. For management clusters with more than one control plane node, you can identify and delete the stuck machines.
Retrieve the status of the management cluster. For example:
kubectl -n tkg-system get cluster my-mgmt-cluster -o yaml
Find the names of the stuck machines from the output of the previous step. A stuck machine is marked as WaitingForRemediation. For example, the name of the stuck machine is my-mgmt-cluster-zpc7t in the following output:
status:
  conditions:
  - lastTransitionTime: "2021-08-25T15:44:23Z"
    message: KCP can't remediate if current replicas are less or equal then 1
    reason: WaitingForRemediation @ Machine/my-mgmt-cluster-zpc7t
    severity: Warning
    status: "False"
    type: Ready
Increase the machine health check timeout values for the control plane nodes to greater than the default, 5m. For example:
tanzu cluster machinehealthcheck control-plane set my-cluster --mhc-name my-control-plane-mhc --unhealthy-conditions "Ready:False:10m,Ready:Unknown:10m"
For more information about updating a MachineHealthCheck object, see Create or Update a MachineHealthCheck Object in Configure Machine Health Checks for Workload Clusters.
Set kubectl to the context of your management cluster. For example:
kubectl config use-context mgmt-cluster-admin@mgmt-cluster
Delete the stuck machines.
kubectl delete machine MACHINE-NAME
Where MACHINE-NAME is the name of the machine you located in an earlier step.
Wait for the KubeadmControlPlane controller to redeploy the machine.
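To watch the replacement machine come up, you can list the Machine objects in the management cluster's namespace; this example assumes the tkg-system namespace used above:
kubectl get machines -n tkg-system -w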
Restore ~/.config/tanzu Directory
Problem
The ~/.config/tanzu directory on the bootstrap machine has been accidentally deleted or corrupted. The Tanzu CLI creates and uses this directory, and cannot function without it.
Solution
To restore the contents of the ~/.config/tanzu directory:
To identify existing Tanzu Kubernetes Grid management clusters, run:
kubectl --kubeconfig ~/.kube-tkg/config config get-contexts
The command output lists names and contexts of all management clusters created or added by the Tanzu CLI.
For each management cluster listed in the output, restore it to the ~/.config/tanzu directory and CLI by running:
tanzu context create --management-cluster --kubeconfig ~/.kube-tkg/config --context MGMT-CLUSTER-CONTEXT --name MGMT-CLUSTER
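For example, for a hypothetical management cluster named mgmt-cluster whose kubeconfig context is mgmt-cluster-admin@mgmt-cluster:
tanzu context create --management-cluster --kubeconfig ~/.kube-tkg/config --context mgmt-cluster-admin@mgmt-cluster --name mgmt-cluster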
tanzu management-cluster create on macOS Results in kubectl Version Error
Problem
If you run the tanzu management-cluster create or tanzu mc create command on macOS with the latest stable version of Docker Desktop, it fails with an error message like:
Error: : kubectl prerequisites validation failed: kubectl client version v1.26.5 is less than minimum supported kubectl client version 1.28.11
This happens because Docker Desktop symlinks an older version of kubectl into the path.
Solution
Place a newer supported version of kubectl in the path before Docker's version.
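To confirm which kubectl binaries are on your path and which version is resolved first, you can run the following; the first entry that which lists is normally the one that tanzu commands use:
which -a kubectl
kubectl version --client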
If you have lost the credentials for a standalone management cluster, for example, by inadvertently deleting the .kube-tkg/config file on the system on which you run tanzu commands, you can recover the credentials from the management cluster control plane node.
Run tanzu mc create to recreate the .kube-tkg/config file.
Use SSH to log in to the management cluster control plane node.
See Connect to Cluster Nodes with SSH above for the credentials to use for each target platform.
Access the admin.conf file for the management cluster.
sudo vi /etc/kubernetes/admin.conf
The admin.conf file contains the cluster name, the cluster user name, the cluster context, and the client certificate data.
Copy this information into the .kube-tkg/config file on the system on which you run tanzu commands.
Adding external identity management may require a dummy VSPHERE_CONTROL_PLANE_ENDPOINT value
Problem
Integrating an external identity provider with an existing TKG deployment may require setting a dummy VSPHERE_CONTROL_PLANE_ENDPOINT value in the management cluster configuration file used to create the add-on secret, as described in Generate the Pinniped Add-on Secret for the Management Cluster.
Deactivating Pinniped requires manual Secret delete on legacy clusters
Problem
When you deactivate external identity management on a management cluster, the unused Pinniped Secret object remains present on legacy workload clusters.
If a user then tries to access the cluster using an old kubeconfig, a login popup appears and the login fails.
Workaround
Manually delete the legacy cluster’s Pinniped Secret as described in Deactivate Identity Management.
Problem
Upgrading to Tanzu Kubernetes Grid v2.5.x returns an error similar to the following:
Operation cannot be fulfilled on certificates.cert-manager.io "pinniped-cert": the object has been modified; please apply your changes to the latest version and try again
Solution
This error may occur if the Pinniped post-deploy job conflicts with a component’s upgrade process. Follow these steps to delete and redeploy the job.
Delete the Pinniped post-deploy job.
kubectl delete jobs.batch -n pinniped-supervisor pinniped-post-deploy-job
Wait about 5 minutes for kapp-controller to redeploy the post-deploy job.
Check the status of the Pinniped app.
kubectl get app -n tkg-system pinniped
NAME       DESCRIPTION           SINCE-DEPLOY   AGE
pinniped   Reconcile succeeded   5s             49m
If the DESCRIPTION shows Reconciling, wait a few minutes, then check again. Once it shows Reconcile succeeded, continue to the next step.
Check the status of the Pinniped post-deploy job.
kubectl get jobs -n pinniped-supervisor
NAME                             COMPLETIONS   DURATION   AGE
pinniped-post-deploy-job-ver-1   1/1           9s         62s
Problem
You recently upgraded your management cluster. When attempting to authenticate to a workload cluster associated with this management cluster, you receive an error message similar to the following:
Error: could not complete Pinniped login: could not perform OIDC discovery for "https://IP:PORT": Get "https://IP:PORT/.well-known/openid-configuration": x509: certificate signed by unknown authority
Solution
This happens because the copy of the Pinniped supervisor CA bundle that the workload cluster is using is out of date. To update the supervisor CA bundle, follow the steps below:
Set the kubectl context to the management cluster.
Obtain the base64-encoded CA bundle and the issuer endpoint from the pinniped-info ConfigMap:
kubectl get configmap pinniped-info -n kube-public -o jsonpath={.data.issuer_ca_bundle_data} > /tmp/ca-bundle && kubectl get configmap pinniped-info -n kube-public -o jsonpath={.data.issuer} > /tmp/supervisor-endpoint
Obtain the values.yaml section from the Pinniped add-on secret for the workload cluster:
kubectl get secret WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE -o jsonpath="{.data.values\.yaml}" | base64 -d > values.yaml
This secret is located on the management cluster.
In the values.yaml file created above, update the supervisor_ca_bundle_data key to match the CA bundle from the pinniped-info ConfigMap. Additionally, ensure that the supervisor_svc_endpoint matches the issuer endpoint.
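As a quick sanity check before and after editing, you can print the two keys from the file and compare them with the values saved in /tmp/ca-bundle and /tmp/supervisor-endpoint:
grep -E "supervisor_ca_bundle_data|supervisor_svc_endpoint" values.yaml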
Apply your update by base64-encoding the edited values.yaml file and replacing it in the workload cluster secret. This command differs depending on the OS of your environment. For example:
Linux:
kubectl patch secret/WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE -p "{\"data\":{\"values.yaml\":\"$(base64 -w 0 < values.yaml)\"}}" --type=merge
macOS:
kubectl patch secret/WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE -p "{\"data\":{\"values.yaml\":\"$(base64 < values.yaml)\"}}" --type=merge
On the workload cluster, confirm that the Pinniped app successfully reconciled the changes:
kubectl get app pinniped -n tkg-system
Authenticate to the cluster. For example:
kubectl get pods -A --kubeconfig my-cluster-credentials
Problem
When you run kubectl get pods -A on the created cluster, some pods remain in pending state.
You run kubectl describe pod -n pod-namespace pod-name on an affected pod and see the following event:
n node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate
Solution
Ensure that connectivity and firewall rules are in place to allow communication between the cluster and vCenter. For firewall port and protocol requirements, see the vSphere listings in VMware Ports and Protocols.
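To confirm which nodes still carry the uninitialized taint, you can list node taints, for example:
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints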
Problem
Running tanzu CLI commands returns an error similar to the following:
Failed to invoke API on cluster : the server has asked for the client to provide credentials, retrying
Solution
See Update Management Cluster Certificate in Your Tanzu CLI Configuration and Cannot access the clusters using tkg/tanzu cli commands in Tanzu Kubernetes Grid.
Problem
In the Windows command prompt (CMD), Tanzu CLI command output that is formatted in columns includes extraneous characters in column headings. The issue does not occur in Windows Terminal or PowerShell.
Solution
On Windows bootstrap machines, run the Tanzu CLI from Windows Terminal.
Problem
When machine health checks (MHCs) are deactivated, Tanzu CLI commands such as tanzu cluster status may not report up-to-date node state while infrastructure is being recreated.
Solution
None.
Problem
When you run the tanzu management-cluster create --ui or tanzu mc create --ui command on a Windows system, the UI opens in your default browser, but the graphics and styling are not applied. This happens because a Windows registry value is set to application/x-css.
Solution
Run regedit to open the Registry Editor utility.
Expand HKEY_CLASSES_ROOT and select .js.
Change the Content Type value to application/javascript and click OK.
Run the tanzu mc create --ui command again to relaunch the UI.
Problem
If you are using NSX Advanced Load Balancer for workloads (AVI_ENABLE) or the control plane (AVI_CONTROL_PLANE_HA_PROVIDER), the Avi Controller may fail to distinguish between identically named clusters.
Solution
Set a unique CLUSTER_NAME value for each cluster. Do not create multiple management clusters with the same CLUSTER_NAME value, even from different bootstrap machines.
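For example, in each cluster configuration file, give the cluster a distinct name (the name shown is illustrative):
CLUSTER_NAME: mgmt-cluster-east-01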
Problem
If the total number of LoadBalancer-type Services is large, and if all of the Service Engines are deployed in the same L2 network, requests to the NSX Advanced Load Balancer VIP can fail with the message no route to host.
This occurs because the default ARP rate limit on Service Engines is 100.
Solution
Set the ARP rate limit to a larger number. This parameter is not tunable in NSX Advanced Load Balancer Essentials, but it is tunable in NSX Advanced Load Balancer Enterprise Edition.
AKODeploymentConfig error during management cluster creation
Problem
Running tanzu management-cluster create to create a management cluster with NSX ALB outputs the error no matches for kind AKODeploymentConfig in version networking.tkg.tanzu.vmware.com/v1alpha1.
Solution
The error can be ignored. For more information, see this KB article.
Multus CNI fails on medium and smaller pods with NSX Advanced Load Balancer
Problem
On vSphere, workload clusters with medium or smaller worker nodes running the Multus CNI package with NSX ALB can fail with Insufficient CPU or other errors.
Solution
To use Multus CNI with NSX ALB, deploy workload clusters with worker nodes of size large
or extra-large
.
Problem
In TKG v2.5, the components that turn a generic cluster into a TKG standalone management cluster are packaged in a Carvel package, tkg-pkg. Standalone management clusters that were originally created in TKG v1.3 or earlier lack a configuration secret that the upgrade process requires in order to install tkg-pkg, causing the upgrade to fail.
Solution
Perform the additional steps listed in Upgrade Standalone Management Clusters for standalone management clusters created in TKG v1.3 or earlier.
goss test failures during image-build process
Problem
When you run Kubernetes Image Builder to create a custom Linux machine image, the goss tests python-netifaces, python-requests, and ebtables fail. Command output reports the failures.
Solution
The errors can be ignored; they do not prevent a successful image build.