Troubleshooting Tanzu Kubernetes Grid Issues

This section includes tips to help you troubleshoot common problems that you might encounter when installing Tanzu Kubernetes Grid and deploying workload clusters.

Many of these procedures use the kind CLI on your bootstrap machine. To install kind, see Installation in the kind documentation.

Common Tasks

Connect to Cluster Nodes with SSH

You can use SSH to connect to individual nodes of management clusters or workload clusters. To do so, the SSH key pair that you created when you deployed the management cluster must be available on the machine on which you run the SSH command. Consequently, you must run ssh commands on the machine on which you run tanzu commands.

The SSH keys that you register with the management cluster, and consequently that are used by any workload clusters that you deploy from the management cluster, are associated with the following user accounts:

  • vSphere management cluster and Tanzu Kubernetes nodes running on both Photon OS and Ubuntu: capv
  • AWS bastion nodes: ubuntu
  • AWS management cluster and Tanzu Kubernetes nodes running on Ubuntu: ubuntu
  • AWS management cluster and Tanzu Kubernetes nodes running on Amazon Linux: ec2-user
  • Azure management cluster and Tanzu Kubernetes nodes (always Ubuntu): capi

To connect to a node by using SSH, run one of the following commands from the machine that you use as the bootstrap machine:

  • vSphere nodes: ssh capv@node_address
  • AWS bastion nodes and management cluster and workload nodes on Ubuntu: ssh ubuntu@node_address
  • AWS management cluster and Tanzu Kubernetes nodes running on Amazon Linux: ssh ec2-user@node_address
  • Azure nodes: ssh capi@node_address

Because the SSH key is present on the system on which you are running the ssh command, no password is required.
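
The account mapping above can be captured in a small shell helper. This is a minimal sketch: the function name tkg_ssh_user and its provider and OS keywords are hypothetical and are not part of the tanzu CLI.

```shell
# Hypothetical helper: print the SSH user name for a node, based on the
# infrastructure provider and, on AWS, the node operating system.
tkg_ssh_user() {
  provider="$1"
  os="${2:-ubuntu}"   # assumed default when no OS keyword is given
  case "$provider" in
    vsphere) echo "capv" ;;
    aws)
      case "$os" in
        amazon-linux) echo "ec2-user" ;;
        *) echo "ubuntu" ;;
      esac ;;
    azure) echo "capi" ;;
    *) echo "unknown provider: $provider" >&2; return 1 ;;
  esac
}

# Example (node_address is a placeholder, as in the commands above):
#   ssh "$(tkg_ssh_user aws amazon-linux)@node_address"
```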

Delete Users, Contexts, and Clusters with kubectl

To clean up your kubectl state by deleting some or all of its users, contexts, and clusters:

  1. Open your ~/.kube/config and ~/.kube-tkg/config files.

  2. For the user objects that you want to delete, run:

    kubectl config unset users.USER-NAME
    kubectl config unset users.USER-NAME --kubeconfig ~/.kube-tkg/config
    

    Where USER-NAME is the name property of each top-level user object, as listed in the config files.

  3. For the context objects that you want to delete, run:

    kubectl config unset contexts.CONTEXT-NAME
    kubectl config unset contexts.CONTEXT-NAME --kubeconfig ~/.kube-tkg/config
    

    Where CONTEXT-NAME is the name property of each top-level context object, as listed in the config files, typically of the form contexts.mycontext-admin@mycontext.

  4. For the cluster objects that you want to delete, run:

    kubectl config unset clusters.CLUSTER-NAME
    kubectl config unset clusters.CLUSTER-NAME --kubeconfig ~/.kube-tkg/config
    

    Where CLUSTER-NAME is the name property of each top-level cluster object, as listed in the config files.

  5. If the config files list the current context as a cluster that you deleted, unset the context:

    kubectl config unset current-context
    kubectl config unset current-context --kubeconfig ~/.kube-tkg/config
    
  6. If you deleted management clusters that are tracked by the tanzu CLI, delete them from the tanzu CLI’s state by running tanzu config server delete as described in Delete Management Clusters from Your Tanzu CLI Configuration.

    • To see the management clusters that the tanzu CLI tracks, run tanzu login.
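
Because each unset command in the steps above is run twice, once per kubeconfig file, a small wrapper can reduce repetition. The following is a sketch only: the function name unset_in_both_kubeconfigs and the KUBECTL override variable are hypothetical conveniences, not part of kubectl or the tanzu CLI.

```shell
# KUBECTL can be overridden, for example KUBECTL=echo, to preview the
# commands without changing any file.
KUBECTL="${KUBECTL:-kubectl}"

# Unset one kubeconfig property in both ~/.kube/config (the default file)
# and ~/.kube-tkg/config.
# $1 is a dotted property name such as users.USER-NAME,
# contexts.CONTEXT-NAME, clusters.CLUSTER-NAME, or current-context.
unset_in_both_kubeconfigs() {
  "$KUBECTL" config unset "$1"
  "$KUBECTL" config unset "$1" --kubeconfig "$HOME/.kube-tkg/config"
}

# Example:
#   unset_in_both_kubeconfigs users.USER-NAME
```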

Disable nfs-utils on Photon OS Nodes

Problem

In Tanzu Kubernetes Grid v1.1.2 and later, nfs-utils is enabled by default. If you do not require nfs-utils, you can remove it from cluster node VMs.

Solution

To disable nfs-utils on clusters that you deploy with Tanzu Kubernetes Grid v1.1.2 or later, use SSH to log in to the cluster node VMs and run the following command:

tdnf erase nfs-utils

Infrastructure Provider

Failed Validation, Credentials Error on AWS

Problem

Running tanzu management-cluster create or tanzu mc create fails with an error similar to the following:

Validating the pre-requisites...
Looking for AWS credentials in the default credentials provider chain

Error: : Tkg configuration validation failed: failed to get AWS client: NoCredentialProviders: no valid providers in chain
caused by: EnvAccessKeyNotFound: AWS_ACCESS_KEY_ID or AWS_ACCESS_KEY not found in environment
SharedCredsLoad: failed to load shared credentials file
caused by: FailedRead: unable to open file
caused by: open /root/.aws/credentials: no such file or directory
EC2RoleRequestError: no EC2 instance role found
caused by: EC2MetadataError: failed to make EC2Metadata request

Solution

Tanzu Kubernetes Grid uses the default AWS credentials provider chain. Before creating a management or a workload cluster on AWS, you must configure your AWS account credentials as described in Configure AWS Credentials.

Failed Validation, Legal Terms Error on Azure

Problem

Before creating a management or workload cluster on Azure, you must accept the legal terms that cover the VM image used by cluster nodes. Running tanzu mc create or tanzu cluster create without having accepted the license fails with an error like:

User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '*********' for this plan. Before the subscription can be used, you need to accept the legal terms of the image.

Solution

If this happens, accept the legal terms and try again.

Management Cluster

Clean Up After an Unsuccessful Management Cluster Deployment

Problem

An unsuccessful attempt to deploy a Tanzu Kubernetes Grid management cluster leaves orphaned objects in your cloud infrastructure and on your bootstrap machine.

Solution

  1. Monitor your tanzu mc create command output either in the terminal or Tanzu Kubernetes Grid installer interface. If the command fails, it prints a help message that includes the following: “Failure while deploying management cluster… To clean up the resources created by the management cluster: tkg delete mc….”
  2. Run tanzu mc delete YOUR-CLUSTER-NAME. This command removes the objects that it created in your infrastructure and locally.

You can also use the alternative methods described below:

  • Bootstrap machine cleanup:

    • To remove a kind cluster, use the kind CLI. For example:

      kind get clusters
      kind delete cluster --name tkg-kind-example1234567abcdef
      
    • To remove Docker objects, use the docker CLI. For example, docker rm, docker rmi, and docker system prune.

      Caution: If you are running Docker processes that are not related to Tanzu Kubernetes Grid on your system, remove unneeded Docker objects individually.

  • Infrastructure provider cleanup:

    • vSphere: Locate, power off, and delete the VMs and other resources that were created by Tanzu Kubernetes Grid.
    • AWS: Log in to your Amazon EC2 dashboard and delete the resources manually or use an automated solution.
    • Azure: In Resource Groups, open your AZURE_RESOURCE_GROUP. Use checkboxes to select and Delete the resources that were created by Tanzu Kubernetes Grid, which contain a timestamp in their names.

Kind Cluster Remains after Deleting Management Cluster

Problem

Running tanzu mc delete removes the management cluster, but fails to delete the local kind cluster from the bootstrap machine.

Solution

  1. List all running kind clusters and remove the one whose name looks like tkg-kind-unique_ID:

    kind get clusters
    kind delete cluster --name tkg-kind-unique_ID
    
  2. List all containers and identify the one that runs the kind cluster:

    docker ps -a
    
  3. Copy the container ID of the kind cluster and kill the container:

    docker kill container_ID
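
When several kind clusters exist on the bootstrap machine, the bootstrap clusters can be picked out by their name prefix. The sketch below assumes that bootstrap clusters created by Tanzu Kubernetes Grid always start with tkg-kind-, as in the examples in this section; the function name is hypothetical.

```shell
# Read cluster names from standard input, one per line (the format printed
# by `kind get clusters`), and keep only those with the tkg-kind- prefix.
# `|| true` keeps the exit status at 0 when nothing matches.
filter_tkg_kind_clusters() {
  grep '^tkg-kind-' || true
}

# Usage on a bootstrap machine (not run here):
#   kind get clusters | filter_tkg_kind_clusters | while read -r name; do
#     kind delete cluster --name "$name"
#   done
```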
    

Machines Stuck After Management Cluster Fails to Deploy

Problem

Your management cluster fails to deploy because machines are stuck, waiting for remediation.

Solution

For a management cluster that you deployed with the dev plan, which has only one control plane node, you must redeploy the management cluster. For management clusters with more than one control plane node, you can identify and delete the stuck machines.

  1. Retrieve the status of the management cluster. For example:

    kubectl -n tkg-system get cluster my-mgmt-cluster -o yaml
    
  2. Find the names of the stuck machines from the output of the previous step. A stuck machine is marked as WaitingForRemediation. For example, the name of the stuck machine is my-mgmt-cluster-zpc7t in the following output:

    status:
      conditions:
      - lastTransitionTime: "2021-08-25T15:44:23Z"
        message: KCP can't remediate if current replicas are less or equal then 1
        reason: WaitingForRemediation @ Machine/my-mgmt-cluster-zpc7t
        severity: Warning
        status: "False"
        type: Ready

  3. Increase the machine health check (MHC) timeout values for the control plane nodes to greater than the default, 5m. For example:

    tanzu cluster machinehealthcheck control-plane set my-cluster --mhc-name my-control-plane-mhc --unhealthy-conditions "Ready:False:10m,Ready:Unknown:10m"

    For more information about updating a MachineHealthCheck object, see Create or Update a MachineHealthCheck Object in Configure Machine Health Checks for Workload Clusters.

  4. Set kubectl to the context of your management cluster. For example:

    kubectl config use-context mgmt-cluster-admin@mgmt-cluster

  5. Delete the stuck machines.

    kubectl delete machine MACHINE-NAME

    Where MACHINE-NAME is the name of the machine you located in an earlier step.

  6. Wait for the KubeadmControlPlane controller to redeploy the machine.
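
To avoid reading the cluster YAML by eye, the stuck machine names can be extracted from the reason field. This is a sketch that assumes the reason line has the form shown in the example output above (WaitingForRemediation @ Machine/NAME); the function name is hypothetical.

```shell
# Read cluster status YAML from standard input and print the name of each
# machine that is reported as WaitingForRemediation, one per line.
stuck_machines() {
  sed -n 's|.*WaitingForRemediation @ Machine/\([^[:space:]]*\).*|\1|p' | sort -u
}

# Usage against a management cluster (not run here):
#   kubectl -n tkg-system get cluster my-mgmt-cluster -o yaml | stuck_machines
```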

Restore ~/.config/tanzu Directory

Problem

The ~/.config/tanzu directory on the bootstrap machine has been accidentally deleted or corrupted. The tanzu CLI creates and uses this directory, and cannot function without it.

Solution

To restore the contents of the ~/.config/tanzu directory:

  1. To identify existing Tanzu Kubernetes Grid management clusters, run:

    kubectl --kubeconfig ~/.kube-tkg/config config get-contexts
    

    The command output lists names and contexts of all management clusters created or added by the tkg (v1.2) or tanzu CLI.

  2. For each management cluster listed in the output, restore it to the ~/.config/tanzu directory and CLI by running:

    tanzu login --kubeconfig ~/.kube-tkg/config --context MGMT-CLUSTER-CONTEXT --name MGMT-CLUSTER

Running tanzu management-cluster create on macOS Results in kubectl Version Error

Problem

If you run the tanzu management-cluster create or tanzu mc create command on macOS with the latest stable version of Docker Desktop, it fails with the error message:

Error: : kubectl prerequisites validation failed: kubectl client version v1.15.5 is less than minimum supported kubectl client version 1.17.0

This happens because Docker Desktop symlinks kubectl 1.15 into the path.

Solution

Place a newer supported version of kubectl in the path before Docker’s version.
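
Before rerunning the command, it can help to check that a candidate kubectl version meets the minimum from the error message. The following is a sketch under stated assumptions: the function name version_ge is hypothetical, and both versions are assumed to have exactly three numeric components.

```shell
# Return success (exit status 0) if version $1 >= version $2.
# Both arguments are assumed to be MAJOR.MINOR.PATCH, optionally prefixed
# with "v", as in the error message above.
version_ge() {
  a="${1#v}"; b="${2#v}"
  a_major="${a%%.*}"; a_rest="${a#*.}"; a_minor="${a_rest%%.*}"; a_patch="${a_rest#*.}"
  b_major="${b%%.*}"; b_rest="${b#*.}"; b_minor="${b_rest%%.*}"; b_patch="${b_rest#*.}"
  if [ "$a_major" -ne "$b_major" ]; then [ "$a_major" -gt "$b_major" ]; return; fi
  if [ "$a_minor" -ne "$b_minor" ]; then [ "$a_minor" -gt "$b_minor" ]; return; fi
  [ "$a_patch" -ge "$b_patch" ]
}

# Example: the version that Docker Desktop links is too old.
#   version_ge v1.15.5 1.17.0 || echo "kubectl is older than 1.17.0"
# To see which kubectl binary is found first in the path:
#   command -v kubectl
```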

Recover Management Cluster Credentials

If you have lost the credentials for a management cluster, for example by inadvertently deleting the .kube-tkg/config file on the system on which you run tanzu commands, you can recover the credentials from the management cluster control plane node.

  1. Run tanzu mc create to recreate the .kube-tkg/config file.
  2. Obtain the public IP address of the management cluster control plane node, from vSphere, AWS, or Azure.
  3. Use SSH to log in to the management cluster control plane node.

    See Connect to Cluster Nodes with SSH above for the credentials to use for each infrastructure provider.

  4. Access the admin.conf file for the management cluster.

    sudo vi /etc/kubernetes/admin.conf
    

    The admin.conf file contains the cluster name, the cluster user name, the cluster context, and the client certificate data.

  5. Copy the cluster name, the cluster user name, the cluster context, and the client certificate data into the .kube-tkg/config file on the system on which you run tanzu commands.

NSX Advanced Load Balancer

Requests to NSX Advanced Load Balancer VIP fail with the message no route to host

Problem

If the total number of Services of type LoadBalancer is large, and if all of the Service Engines are deployed in the same L2 network, requests to the NSX Advanced Load Balancer VIP can fail with the message no route to host.

This occurs because the default ARP rate limit on Service Engines is 100.

Solution

Set the ARP rate limit to a larger number. This parameter is not tunable in NSX Advanced Load Balancer Essentials, but it is tunable in NSX Advanced Load Balancer Enterprise Edition.

Packages

Error when running the tanzu package repository command

Problem

The tanzu package repository command sometimes fails with an error.

Solution

Run kubectl get pkgr REPOSITORY-NAME -n NAMESPACE -o yaml to get the detailed reason for the error.

Where:

  • REPOSITORY-NAME is the name of the package repository.
  • NAMESPACE is the target namespace of the package repository.

The tanzu package repository command can fail with an error similar to the following:

  • NOT_FOUND
    Description: The repository URL path is invalid.
    Solution: Ensure that the URL of the package repository is reachable from your cluster.
  • UNKNOWN or UNAUTHORIZED
    Description: This error can occur when attempting to connect to the repository.
  • Ownership
    Description: A repository with the same package repository URL is already installed in the cluster.
    Solution: Do one of the following:
      • Run tanzu package available list -n NAMESPACE to see if the package you want to install is already available for installation. If it is, you do not need to add the package repository. You can skip repository installation and proceed with package installation. To revert the current failed attempt at repository creation, run tanzu package repository delete REPOSITORY-NAME -n NAMESPACE.
      • Run tanzu package repository list -A to retrieve an existing package repository with the same URL. If you retrieve the package repository, you can proceed with deleting it at your own risk.

Error when running the tanzu package installed command

Problem

The tanzu package installed command sometimes fails with an error.

Solution

Run kubectl get pkgi INSTALLED-PACKAGE-NAME -n NAMESPACE -o yaml to get the detailed reason for the error.

Where:

  • INSTALLED-PACKAGE-NAME is the name of the installed package.
  • NAMESPACE is the namespace of the installed package.

The tanzu package installed command can fail with an error similar to the following:


  • Ownership
    Description: A package with the same name is already installed in the cluster.
    Solution: Run tanzu package installed list -A to check whether the package you want to install is already installed. If it is, you can use the already installed package, update its version, or delete it to proceed with a new installation.
  • Evaluating starlark template
    Description: This error can occur when the mentioned configuration value is missing.
    Solution: Run tanzu package available get AVAILABLE-PACKAGE-NAME -n NAMESPACE --values-schema to find the available configuration values, and provide the required configuration values by using the values-file flag when you run the tanzu package installed command.
    Note: The ytt directives are not supported in the values.yaml file.
  • Failed to find a package with name PACKAGE-NAME in namespace NAMESPACE
    Description: The provided package and package metadata are not available in the given namespace.
    Solution: Ensure that you can see the provided package in the output of tanzu package available list AVAILABLE-PACKAGE-NAME -n NAMESPACE. Otherwise, add the package repository to the target namespace, or use a global-scoped repository. Also, ensure that the output from tanzu package available list AVAILABLE-PACKAGE-NAME matches the spec.packageRef.refName from kubectl get pkgi INSTALLED-PACKAGE-NAME -n NAMESPACE -o yaml. If the output does not match, you installed the package previously and then deleted the package resources and repository. In this case, delete the package and re-install it.
  • Namespaces NAMESPACE not found
    Description: The namespace in which you want to install the package does not exist.
    Solution: To create the namespace in which you want to install the package, use the create-namespace flag during package installation.
  • Provided service account SERVICE-ACCOUNT-NAME is already used by another package in namespace NAMESPACE
    Description: The service account provided with the service-account-name flag is already used by another installed package.
    Solution: Either let the package plugin create the service account for you or choose another service account name.

Pinniped

Post-Deploy Pinniped Job Fails

Problem

Upgrading to Tanzu Kubernetes Grid v1.6.0 returns an error similar to the following:

 Operation cannot be fulfilled on certificates.cert-manager.io "pinniped-cert": the object has been modified; please apply your changes to the latest version and try again

Solution

This error may occur if the Pinniped post-deploy job conflicts with a component’s upgrade process. Follow these steps to delete and redeploy the job.

  1. Delete the Pinniped post-deploy job.

    kubectl delete jobs.batch -n pinniped-supervisor pinniped-post-deploy-job
    
  2. Wait about 5 minutes for kapp-controller to redeploy the post-deploy job.

  3. Check the status of the Pinniped app.

    kubectl get app -n tkg-system pinniped
    NAME       DESCRIPTION           SINCE-DEPLOY   AGE
    pinniped   Reconcile succeeded   5s             49m
    

    If the DESCRIPTION shows Reconciling, wait a few minutes, then check again. Once it shows Reconcile succeeded, continue to the next step.

  4. Check the status of the Pinniped post-deploy job.

    kubectl get jobs -n pinniped-supervisor
    NAME                             COMPLETIONS   DURATION   AGE
    pinniped-post-deploy-job-ver-1   1/1           9s         62s
    

Pinniped Authentication Error on Workload Cluster After Management Cluster Upgrade

Problem

You recently upgraded your management cluster. When attempting to authenticate to a workload cluster associated with this management cluster, you receive an error message similar to the following:

Error: could not complete Pinniped login: could not perform OIDC discovery for "https://IP:PORT": Get "https://IP:PORT/.well-known/openid-configuration": x509: certificate signed by unknown authority

Solution

This happens because the copy of the Pinniped supervisor CA bundle that the workload cluster is using is out of date. To update the supervisor CA bundle, follow the steps below:

  1. Set the kubectl context to the management cluster.

  2. Obtain the base64-encoded CA bundle and the issuer endpoint from the pinniped-info ConfigMap:

    kubectl get configmap pinniped-info -n kube-public -o jsonpath='{.data.issuer_ca_bundle_data}' > /tmp/ca-bundle && kubectl get configmap pinniped-info -n kube-public -o jsonpath='{.data.issuer}' > /tmp/supervisor-endpoint
    
  3. Obtain the values.yaml section from the Pinniped add-on secret for the workload cluster:

    kubectl get secret WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE -o jsonpath="{.data.values\.yaml}" | base64 -d > values.yaml
    

    This secret is located on the management cluster.

  4. In the values.yaml file created above, update the supervisor_ca_bundle_data key to match the CA bundle from the pinniped-info ConfigMap. Additionally, ensure that the supervisor_svc_endpoint matches the issuer endpoint.

  5. Apply your update by base64 encoding the edited values.yaml file and replacing it in the workload cluster secret. This command differs depending on the OS of your environment. For example:

    Linux:

    kubectl patch secret/WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE -p "{\"data\":{\"values.yaml\":\"$(base64 -w 0 < values.yaml)\"}}" --type=merge
    

    macOS:

    kubectl patch secret/WORKLOAD-CLUSTER-NAME-pinniped-addon -n WORKLOAD-CLUSTER-NAMESPACE -p "{\"data\":{\"values.yaml\":\"$(base64 < values.yaml)\"}}" --type=merge
    
  6. On the workload cluster, confirm that the Pinniped app successfully reconciled the changes:

    kubectl get app pinniped -n tkg-system
    
  7. Authenticate to the cluster. For example:

    kubectl get pods -A --kubeconfig my-cluster-credentials
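
The Linux and macOS commands in step 5 differ only in how base64 line wrapping is turned off. One possible way to use a single command on both systems is to strip newlines with tr instead of relying on the -w flag. The sketch below takes that approach; the function name values_patch_payload is hypothetical.

```shell
# Print a merge-patch payload that sets the values.yaml key of a secret to
# the base64-encoded content of the file given as $1.
# tr removes the line wraps that some base64 implementations add by
# default, so the -w 0 flag is not needed.
values_patch_payload() {
  encoded="$(base64 < "$1" | tr -d '\n')"
  printf '{"data":{"values.yaml":"%s"}}' "$encoded"
}

# Usage on the management cluster (not run here):
#   kubectl patch secret/WORKLOAD-CLUSTER-NAME-pinniped-addon \
#     -n WORKLOAD-CLUSTER-NAMESPACE --type=merge \
#     -p "$(values_patch_payload values.yaml)"
```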
    

Pods

Pods Are Stuck in Pending on Cluster Due to vCenter Connectivity

Problem

When you run kubectl get pods -A on the created cluster, some pods remain in the Pending state.

When you run kubectl describe pod -n pod-namespace pod-name on an affected pod, the events include the following:

n node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate

Solution

Ensure that network connectivity and firewall rules allow communication between the cluster and vCenter. For firewall ports and protocols requirements, see Ports and Protocols.

Tanzu CLI

Tanzu CLI Cannot Reach Management Cluster

Problem

Running tkg or tanzu CLI commands against a Tanzu Kubernetes Grid (TKG) management cluster fails with the following message. The failure is seen on an installation that is a year old and was upgraded at least once since initial installation.

Failed to invoke API on cluster : the server has asked for the client to provide credentials, retrying

This failure occurs because the relevant certificate has expired.

Solution

This is a known issue that affects TKG 1.2 and later. For a workaround, see Update Management Cluster Certificate in Your Tanzu CLI Configuration and Cannot access the clusters using tkg/tanzu cli commands in Tanzu Kubernetes Grid.

Workload Clusters

Deploying a Workload Cluster Times Out, but the Cluster Is Created

Problem

Running tanzu cluster create fails with a timeout error similar to the following:

I0317 11:11:16.658433 clusterclient.go:341] Waiting for resource my-cluster of type *v1beta1.Cluster to be up and running
E0317 11:26:16.932833 common.go:29]
Error: unable to wait for cluster and get the cluster kubeconfig: error waiting for cluster to be provisioned (this may take a few minutes): cluster control plane is still being initialized
E0317 11:26:16.933251 common.go:33]

Solution

Use the --timeout flag to specify the time to wait for the cluster provisioning to complete. The default waiting time is 30 minutes.

tanzu cluster create --timeout TIME

TIME is the length of time, in minutes, to wait for the completion of cluster provisioning. For example:

--timeout 60m

Note: The time required may depend on the region.

Workload Cluster is Stuck in Deletion

Problem

Running tanzu cluster delete fails to delete the workload cluster.

To delete the cluster manually, see the two solutions below.

Solution 1

  1. On the target cluster, delete the StatefulSet object for AKO, which runs in the avi-system namespace:

    kubectl delete sts ako -n avi-system
    

Solution 2

  1. Log in to the cluster and delete the worker machines:

    kubectl delete machine worker1 worker2
    
  2. From vCenter, power off and delete the worker node VMs.

  3. Edit the control plane machines and remove the finalizer entry:

    finalizers:
     - machine.cluster.x-k8s.io
    
  4. Delete the control plane machines:

    kubectl delete machine controlplane1 controlplane2
    
  5. From vCenter, power off and delete the control plane VMs.

Cluster Worker Nodes in NotReady Status Due to Mismatched MTUs

Problem

Different Maximum Transmission Unit (MTU) settings on the worker nodes in a cluster result in TLS handshake timeout.

Logs from journalctl -u kubelet on a node show communication failure with the API server. Running kubectl get nodes shows that worker nodes have moved to the NotReady status.

You can confirm the issue by doing the following:

  1. On the control plane node and the worker node machines, run ip link and compare the MTU values of the eth0 interface. If they do not match, it is indicative of this issue.
  2. Run Crash Diagnostics (Crashd) and review the kubelet logs to determine whether the connections time out or the worker nodes are in the NotReady status. For more information on running Crashd, see Troubleshooting Workload Clusters with Crash Diagnostics.
  3. Confirm that the following commands fail when you run them on a machine whose node is in the NotReady status:

    • openssl s_client -connect IP:PORT

    • curl -k https://IP:PORT/healthz

    Where IP and PORT are the IP address and port number of the Kubernetes API server control plane endpoint. By default, PORT is set to 6443.
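
To compare MTU values across nodes without reading the full ip link output, the value for one interface can be extracted with awk. This is a sketch that assumes the usual ip link line format, in which the interface name is the second field followed by a colon and the MTU follows the keyword mtu; the function name iface_mtu is hypothetical.

```shell
# Read `ip link` output from standard input and print the MTU of the
# interface named in $1.
iface_mtu() {
  awk -v dev="$1:" '
    $2 == dev {
      for (i = 3; i < NF; i++) {
        if ($i == "mtu") { print $(i + 1); exit }
      }
    }'
}

# Usage on each node (not run here):
#   ip link | iface_mtu eth0
```

Run the same command on the control plane node and on each worker node, then compare the printed values.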

Solution

  1. Review the privileged daemonsets deployed on the cluster, and review any daemonsets from third-party vendors that might modify the network configurations of the host operating system. You might need to consult the software vendor to find this out. The daemonsets that can modify the host operating system will either have .spec.template.spec.hostNetwork: true or have either privileged: true or NET_ADMIN in the capabilities field of any container security context.

  2. If you want to configure large MTU settings, provision the cluster with a control plane that has a higher MTU value.

  3. Ensure that the cluster network either allows Path MTU discovery or has TCP MSS clamping in place to allow correct MTU sizing to external services, such as vCenter or container registries.

  4. Ensure that you configure the same MTU settings for all the nodes in a cluster.

  5. The network firewall settings must allow for packets of the configured MTU size.

For more information about troubleshooting the configuration of auto-managed packages, see View and Update Configuration Information for Auto-Managed Packages.

Tanzu Kubernetes Grid Installer Interface

Tanzu Kubernetes Grid UI Does Not Display Correctly on Windows

Problem

When you run the tanzu management-cluster create --ui or tanzu mc create --ui command on a Windows system, the UI opens in your default browser, but the graphics and styling are not applied. This happens because the Content Type value of the .js key in the Windows registry is set to application/x-css.

Solution

  1. In Windows search, enter regedit to open the Registry Editor utility.
  2. Expand HKEY_CLASSES_ROOT and select .js.
  3. Right-click Content Type and select Modify.
  4. Set the Value to application/javascript and click OK.
  5. Run the tanzu mc create --ui command again to relaunch the UI.