This topic includes tips to help you troubleshoot standalone management cluster deployments. Some of the procedures below use the kind CLI. To install kind, see Installation in the kind documentation. For information about troubleshooting workload clusters, see Troubleshooting Workload Cluster Issues.
You can use SSH to connect to individual nodes of standalone management clusters or workload clusters. To do so, the SSH key pair that you created when you deployed the management cluster must be available on the machine on which you run the SSH command. Consequently, you must run ssh commands on the machine on which you run tanzu commands.
The SSH keys that you register with the management cluster and that are used by any workload clusters that you deploy from the management cluster are associated with the following user accounts:
capv, ubuntu, ec2-user, and capi.

To connect to a node by using SSH, run one of the following commands from the machine that you use as the bootstrap machine:
ssh capv@node-address
ssh ubuntu@node-address
ssh ec2-user@node-address
ssh capi@node-address

Because the SSH key is present on the system on which you are running the ssh command, no password is required.
To clean up your kubectl state by deleting some or all of its users, contexts, and clusters:
Open your ~/.kube-tkg/config file.
For the user objects that you want to delete, run:
kubectl config unset users.USERNAME --kubeconfig ~/.kube-tkg/config
Where USERNAME is the name property of each top-level user object, as listed in the config file.
For the context objects that you want to delete, run:
kubectl config unset contexts.CONTEXT-NAME --kubeconfig ~/.kube-tkg/config
Where CONTEXT-NAME is the name property of each top-level context object, as listed in the config file, typically of the form contexts.mycontext-admin@mycontext.
For the cluster objects that you want to delete, run:
kubectl config unset clusters.CLUSTER-NAME --kubeconfig ~/.kube-tkg/config
Where CLUSTER-NAME is the name property of each top-level cluster object, as listed in the config file.
If the config files list the current context as a cluster that you deleted, unset the context:
kubectl config unset current-context --kubeconfig ~/.kube-tkg/config
If you deleted management clusters that are tracked by the tanzu CLI, delete them from the tanzu CLI’s state by running tanzu config server delete as described in Delete Management Clusters from Your Tanzu CLI Configuration.
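If the config file contains many entries, the individual unset commands above can be scripted. The following is a minimal sketch, assuming kubectl v1.19 or later is installed; the `clean_kubeconfig` helper name is hypothetical:

```shell
# Hypothetical helper: remove every user, context, and cluster entry
# from a kubeconfig file, then clear the current context.
# Assumes `kubectl config get-users` is available (kubectl >= 1.19).
clean_kubeconfig() {
  local cfg="$1"
  # `tail -n +2` skips the NAME header line that get-users/get-clusters print.
  for u in $(kubectl config get-users --kubeconfig "$cfg" | tail -n +2); do
    kubectl config unset "users.$u" --kubeconfig "$cfg"
  done
  for ctx in $(kubectl config get-contexts -o name --kubeconfig "$cfg"); do
    kubectl config unset "contexts.$ctx" --kubeconfig "$cfg"
  done
  for cl in $(kubectl config get-clusters --kubeconfig "$cfg" | tail -n +2); do
    kubectl config unset "clusters.$cl" --kubeconfig "$cfg"
  done
  kubectl config unset current-context --kubeconfig "$cfg"
}

# Usage: clean_kubeconfig ~/.kube-tkg/config
```

This removes all entries; if you only want to delete some of them, use the individual commands above instead.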
nfs-utils on Photon OS Nodes

Problem
In Tanzu Kubernetes Grid v1.1.2 and later, nfs-utils is enabled by default. If you do not require nfs-utils, you can remove it from cluster node VMs.
Solution
To deactivate nfs-utils on clusters that you deploy with Tanzu Kubernetes Grid v1.1.2 or later, use SSH to log in to the cluster node VMs and run the following command:
tdnf erase nfs-utils
Problem
Running tanzu management-cluster create or tanzu mc create fails with an error similar to the following:
Validating the pre-requisites...
Looking for AWS credentials in the default credentials provider chain
Error: : Tkg configuration validation failed: failed to get AWS client: NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: AWS_ACCESS_KEY_ID or AWS_ACCESS_KEY not found in environment
SharedCredsLoad: failed to load shared credentials file
caused by: FailedRead: unable to open file
caused by: open /root/.aws/credentials: no such file or directory
EC2RoleRequestError: no EC2 instance role found
caused by: EC2MetadataError: failed to make EC2Metadata request
Solution
Tanzu Kubernetes Grid uses the default AWS credentials provider chain. Before creating a management or a workload cluster on AWS, you must configure your AWS account credentials as described in Configure AWS Credentials.
Problem
Before creating a standalone management or workload cluster on Azure, you must accept the legal terms that cover the VM image used by cluster nodes. Running tanzu mc create or tanzu cluster create without having accepted the license fails with an error like:
User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '*********' for this plan. Before the subscription can be used, you need to accept the legal terms of the image.
Solution
If this happens, accept the legal terms and try again:
Problem
An unsuccessful attempt to deploy a standalone management cluster leaves orphaned objects in your cloud infrastructure and on your bootstrap machine.
Solution
Monitor the tanzu mc create command output, either in the terminal or in the Tanzu Kubernetes Grid installer interface. If the command fails, it prints a help message that includes the following: “Failure while deploying management cluster… To clean up the resources created by the management cluster: tkg delete mc….”

Run tanzu mc delete YOUR-CLUSTER-NAME. This command removes the objects that it created in your infrastructure and locally.

You can also use the alternative methods described below:
Bootstrap machine cleanup:
To remove a kind cluster, use the kind CLI. For example:
kind get clusters
kind delete cluster --name tkg-kind-example1234567abcdef
To remove Docker objects, use the docker CLI. For example, docker rm, docker rmi, and docker system prune -a --volumes.
CautionIf you are running Docker processes that are not related to Tanzu Kubernetes Grid on your system, remove unneeded Docker objects individually.
Target platform cleanup:
In the Azure Portal, navigate to your resource group, AZURE_RESOURCE_GROUP. Use checkboxes to select and Delete the resources that were created by Tanzu Kubernetes Grid, which contain a timestamp in their names.

Problem
After running tanzu mc delete on AWS, tanzu mc get and other Tanzu CLI commands no longer list the deleted management cluster, but:
The cluster infrastructure objects remain in your AWS account, and logs report errors such as cluster infrastructure is still being provisioned: VpcReconciliationFailed.

Solution
This behavior occurs when TKG uses expired or otherwise invalid AWS account credentials. To prevent or recover from this situation:
Prevent: Update AWS account credentials as described in Configure AWS Account Credentials using either AWS Credential Profiles or local, static environment variables.
Recover using the EC2 Dashboard: Delete the management cluster nodes from the EC2 dashboard manually.
Recover using the CLI:
In the kind cluster that remains on the bootstrap machine due to failed management cluster deletion, correct the AWS credential secret:
kubectl get secret capa-manager-bootstrap-credentials -n capa-system -o jsonpath="{.data.credentials}" | base64 -d
Edit the secret to include the AWS credentials:
[default]
aws_access_key_id = <key id>
aws_secret_access_key = <access_key>
region = <region>
Run tanzu mc delete again.
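The secret stores the credentials file base64-encoded, so any edit must be re-encoded before it is written back. A sketch of the round trip, using placeholder key and region values (the patch command at the end assumes the same secret name and namespace shown in the steps above):

```shell
# Build the INI-format AWS credentials payload the
# capa-manager-bootstrap-credentials secret expects.
# The key ID, secret key, and region below are placeholders.
creds=$(printf '[default]\naws_access_key_id = AKIA_EXAMPLE\naws_secret_access_key = EXAMPLE_SECRET\nregion = us-west-2\n')

# Base64-encode it without line wrapping, as Kubernetes secret data requires.
encoded=$(printf '%s' "$creds" | base64 | tr -d '\n')

# Verify the round trip before patching the secret:
printf '%s' "$encoded" | base64 -d

# Then apply it (hedged example; verify the secret name in your cluster):
# kubectl patch secret capa-manager-bootstrap-credentials -n capa-system \
#   --type=merge -p "{\"data\":{\"credentials\":\"$encoded\"}}"
```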
Problem
Running tanzu mc delete removes the management cluster, but fails to delete the local kind cluster from the bootstrap machine.
Solution
List all running kind clusters and remove the one that looks like tkg-kind-unique_ID:
kind delete cluster --name tkg-kind-unique_ID
List all running clusters and identify the kind cluster.
docker ps -a
Copy the container ID of the kind cluster and remove it.
docker kill container-ID
Problem
Your standalone management cluster fails to deploy because machines are stuck, waiting for remediation.
Solution
For a management cluster that you deployed with the dev plan, which has only one control plane node, you must redeploy the management cluster. For management clusters with more than one control plane node, you can identify and delete the stuck machines.
Retrieve the status of the management cluster. For example:
kubectl -n tkg-system get cluster my-mgmt-cluster -o yaml
Find the names of the stuck machines from the output of the previous step. A stuck machine is marked as WaitingForRemediation. For example, the name of the stuck machine is my-mgmt-cluster-zpc7t in the following output:
status:
conditions:
- lastTransitionTime: "2021-08-25T15:44:23Z"
message: KCP can't remediate if current replicas are less or equal then 1
reason: WaitingForRemediation @ Machine/my-mgmt-cluster-zpc7t
severity: Warning
status: "False"
type: Ready
Increase the machine health check timeout values for the control plane nodes to greater than the default, 5m. For example:
tanzu cluster machinehealthcheck control-plane set my-cluster --mhc-name my-control-plane-mhc --unhealthy-conditions "Ready:False:10m,Ready:Unknown:10m"
For more information about updating a MachineHealthCheck object, see Create or Update a MachineHealthCheck Object in Configure Machine Health Checks for Workload Clusters.
Set kubectl to the context of your management cluster. For example:
kubectl config use-context mgmt-cluster-admin@mgmt-cluster
Delete the stuck machines.
kubectl delete machine MACHINE-NAME
Where MACHINE-NAME is the name of the machine you located in an earlier step.
Wait for the KubeadmControlPlane controller to redeploy the machine.
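Rather than reading the raw YAML by eye, the stuck machines can be listed with a short filter. This is a sketch under the assumption that the affected Machine objects carry a condition whose reason is WaitingForRemediation, and that jq is installed; the `stuck_machines` helper name is hypothetical:

```shell
# Hypothetical helper: print the names of cluster-api Machine objects
# that have a condition with reason "WaitingForRemediation".
# Assumes kubectl is set to the management cluster context and jq is installed.
stuck_machines() {
  local ns="${1:-tkg-system}"
  kubectl get machines -n "$ns" -o json |
    jq -r '.items[]
           | select(any(.status.conditions[]?;
                        .reason == "WaitingForRemediation"))
           | .metadata.name'
}

# Usage: stuck_machines tkg-system
# Then delete each reported machine with:
#   kubectl delete machine -n tkg-system MACHINE-NAME
```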
~/.config/tanzu Directory

Problem
The ~/.config/tanzu directory on the bootstrap machine has been accidentally deleted or corrupted. The Tanzu CLI creates and uses this directory, and cannot function without it.
Solution
To restore the contents of the ~/.config/tanzu directory:
To identify existing Tanzu Kubernetes Grid management clusters, run:
kubectl --kubeconfig ~/.kube-tkg/config config get-contexts
The command output lists names and contexts of all management clusters created or added by the Tanzu CLI.
For each management cluster listed in the output, restore it to the ~/.config/tanzu directory and CLI by running:
tanzu login --kubeconfig ~/.kube-tkg/config --context MGMT-CLUSTER-CONTEXT --name MGMT-CLUSTER
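If several management clusters are listed, the tanzu login step can be repeated in a loop. A sketch, assuming context names follow the usual NAME-admin@NAME pattern so that the cluster name is the part after the "@"; the `restore_mgmt_clusters` helper name is hypothetical:

```shell
# Hypothetical helper: re-register with the Tanzu CLI every management
# cluster context found in the given kubeconfig file.
# Assumes kubectl and the tanzu CLI are installed.
restore_mgmt_clusters() {
  local cfg="${1:-$HOME/.kube-tkg/config}"
  for ctx in $(kubectl config get-contexts -o name --kubeconfig "$cfg"); do
    # Derive the cluster name from the context name, e.g.
    # "my-mgmt-admin@my-mgmt" -> "my-mgmt".
    tanzu login --kubeconfig "$cfg" --context "$ctx" --name "${ctx##*@}"
  done
}

# Usage: restore_mgmt_clusters ~/.kube-tkg/config
```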
tanzu management-cluster create on macOS Results in kubectl Version Error

Problem
If you run the tanzu management-cluster create or tanzu mc create command on macOS with the latest stable version of Docker Desktop, it fails with an error message like:
Error: : kubectl prerequisites validation failed: kubectl client version v1.15.5 is less than minimum supported kubectl client version 1.24.10
This happens because Docker Desktop symlinks an older version of kubectl into the path.
Solution
Place a newer supported version of kubectl in the path before Docker’s version.
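One way to confirm which kubectl wins on the PATH and whether it meets the minimum version is a quick shell check. This is a sketch; the /usr/local/bin location is an assumption and should be adjusted to wherever your supported kubectl lives:

```shell
# Extract the client version of whichever kubectl resolves first on the PATH.
min=1.24.10
have=$(kubectl version --client -o json 2>/dev/null | sed -n 's/.*"gitVersion": *"v\([^"]*\)".*/\1/p')

version_ok() {
  # True when version $1 >= $2 under natural version ordering (sort -V).
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if ! version_ok "${have:-0}" "$min"; then
  # Put the directory holding the newer kubectl ahead of Docker's symlink.
  # /usr/local/bin is an assumed location; change it as needed.
  export PATH="/usr/local/bin:$PATH"
fi
```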
If you have lost the credentials for a standalone management cluster, for example, by inadvertently deleting the .kube-tkg/config file on the system on which you run tanzu commands, you can recover the credentials from the management cluster control plane node.
Run tanzu mc create to recreate the .kube-tkg/config file.

Use SSH to log in to the management cluster control plane node.
See Connect to Cluster Nodes with SSH above for the credentials to use for each target platform.
Access the admin.conf file for the management cluster.
sudo vi /etc/kubernetes/admin.conf
The admin.conf file contains the cluster name, the cluster user name, the cluster context, and the client certificate data.
Copy the cluster name, the cluster user name, the cluster context, and the client certificate data from admin.conf to the .kube-tkg/config file on the system on which you run tanzu commands.

Problem
Upgrading to Tanzu Kubernetes Grid v2.1.1 returns an error similar to the following:
Operation cannot be fulfilled on certificates.cert-manager.io "pinniped-cert": the object has been modified; please apply your changes to the latest version and try again
Solution
This error may occur if the Pinniped post-deploy job conflicts with a component’s upgrade process. Follow these steps to delete and redeploy the job.
Delete the Pinniped post-deploy job.
kubectl delete jobs.batch -n pinniped-supervisor pinniped-post-deploy-job
Wait about 5 minutes for kapp-controller to redeploy the post-deploy job.
Check the status of the Pinniped app.
kubectl get app -n tkg-system pinniped
NAME DESCRIPTION SINCE-DEPLOY AGE
pinniped Reconcile succeeded 5s 49m
If the DESCRIPTION shows Reconciling, wait a few minutes, then check again. Once it shows Reconcile succeeded, continue to the next step.
Check the status of the Pinniped post-deploy job.
kubectl get jobs -n pinniped-supervisor
NAME COMPLETIONS DURATION AGE
pinniped-post-deploy-job-ver-1 1/1 9s 62s
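Instead of re-running the status check by hand, the wait can be scripted as a polling loop. A sketch, assuming kubectl access to the cluster; the `wait_for_pinniped` helper name and the five-minute timeout are illustrative choices:

```shell
# Hypothetical helper: poll the Pinniped app until kapp-controller
# reports "Reconcile succeeded", giving up after about 5 minutes.
wait_for_pinniped() {
  for _ in $(seq 1 30); do
    desc=$(kubectl get app pinniped -n tkg-system \
             -o jsonpath='{.status.friendlyDescription}' 2>/dev/null)
    case "$desc" in
      "Reconcile succeeded"*) echo "pinniped reconciled"; return 0 ;;
    esac
    sleep 10
  done
  echo "timed out waiting for pinniped" >&2
  return 1
}

# Usage: wait_for_pinniped && kubectl get jobs -n pinniped-supervisor
```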
Problem
You recently upgraded your management cluster. When attempting to authenticate to a workload cluster associated with this management cluster, you receive an error message similar to the following:
Error: could not complete Pinniped login: could not perform OIDC discovery for "https://IP:PORT": Get "https://IP:PORT/.well-known/openid-configuration": x509: certificate signed by unknown authority
Solution
This happens because the copy of the Pinniped supervisor CA bundle that the workload cluster is using is out of date. To update the supervisor CA bundle, follow the steps below:
Set the kubectl context to the management cluster.
Obtain the base64-encoded CA bundle and the issuer endpoint from the pinniped-info ConfigMap:
kubectl get configmap pinniped-info -n kube-public -o jsonpath={.data.issuer_ca_bundle_data} > /tmp/ca-bundle
kubectl get configmap pinniped-info -n kube-public -o jsonpath={.data.issuer} > /tmp/supervisor-endpoint
Obtain the values.yaml section from the Pinniped add-on secret for the workload cluster:
kubectl get secret WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE -o jsonpath="{.data.values\.yaml}" | base64 -d > values.yaml
This secret is located on the management cluster.
In the values.yaml file created above, update the supervisor_ca_bundle_data key to match the CA bundle from the pinniped-info ConfigMap. Additionally, ensure that the supervisor_svc_endpoint matches the issuer endpoint.
Apply your update by base64 encoding the edited values.yaml file and replacing it in the workload cluster secret. This command differs depending on the OS of your environment. For example:
Linux:
kubectl patch secret/WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE -p "{\"data\":{\"values.yaml\":\"$(base64 -w 0 < values.yaml)\"}}" --type=merge
macOS:
kubectl patch secret/WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE -p "{\"data\":{\"values.yaml\":\"$(base64 < values.yaml)\"}}" --type=merge
On the workload cluster, confirm that the Pinniped app successfully reconciled the changes:
kubectl get app pinniped -n tkg-system
Authenticate to the cluster. For example:
kubectl get pods -A --kubeconfig my-cluster-credentials
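The only difference between the Linux and macOS patch commands above is line wrapping: GNU base64 needs -w 0 to suppress it, while the macOS base64 does not wrap by default. Piping through `tr -d '\n'` works on both. A sketch with placeholder file content:

```shell
# Encode a file portably (no line wrapping on either Linux or macOS).
tmp=$(mktemp)
printf 'supervisor_ca_bundle_data: example\n' > "$tmp"
encoded=$(base64 < "$tmp" | tr -d '\n')

# The resulting payload can then be used in the same patch on either OS
# (placeholder names as in the steps above):
# kubectl patch secret/WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE \
#   -p "{\"data\":{\"values.yaml\":\"$encoded\"}}" --type=merge

# Round-trip check of the encoding:
printf '%s' "$encoded" | base64 -d
rm -f "$tmp"
```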
Problem
When you run kubectl get pods -A on the created cluster, some pods remain in a Pending state.
You run kubectl describe pod -n pod-namespace pod-name on an affected pod and see the following event:
n node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate
Solution
Ensure that connectivity and firewall rules are in place to allow communication between the cluster and vCenter. For firewall port and protocol requirements, see the vSphere listings in VMware Ports and Protocols.
Problem
Running tanzu CLI commands returns an error similar to the following:
Failed to invoke API on cluster : the server has asked for the client to provide credentials, retrying
Solution
This is a known issue that affects TKG v1.2 and later. For a workaround, see Update Management Cluster Certificate in Your Tanzu CLI Configuration and Cannot access the clusters using tkg/tanzu cli commands in Tanzu Kubernetes Grid.
Problem
When you run the tanzu management-cluster create --ui or tanzu mc create --ui command on a Windows system, the UI opens in your default browser, but the graphics and styling are not applied. This happens because a Windows registry entry sets the content type for JavaScript files to application/x-css.
Solution
Run regedit to open the Registry Editor utility.

Expand HKEY_CLASSES_ROOT and select .js.

Change the Content Type value to application/javascript and click OK.

Run the tanzu mc create --ui command again to relaunch the UI.

Problem
If the total number of LoadBalancer-type Services is large, and all of the Service Engines are deployed in the same L2 network, requests to the NSX Advanced Load Balancer VIP can fail with the message no route to host.
This occurs because the default ARP rate limit on Service Engines is 100.
Solution
Set the ARP rate limit to a larger number. This parameter is not tunable in NSX Advanced Load Balancer Essentials, but it is tunable in NSX Advanced Load Balancer Enterprise Edition.