This topic assists with diagnosing and troubleshooting issues when installing or using VMware Tanzu Kubernetes Grid Integrated Edition (TKGI).

Overview

Refer to the following for troubleshooting assistance:


The Fluent Bit Pod Restarts Due to Out-of-Memory Issue

Symptom

When the LogSink feature is enabled, the Fluent Bit Pod can experience an out-of-memory issue during high memory utilization. The Fluent Bit Pod logs an OOMKilled Kubernetes exit code 137 error, and restarts.

Explanation

The Fluent Bit Pod has insufficient memory for your environment’s utilization.

Solution

In TKGI v1.16.1 and later, increase the Fluent Bit Pod memory limit. For more information, see Log Sink Resources in the Installing Tanzu Kubernetes Grid Integrated Edition topic for your IaaS.


TKGI API is Slow or Times Out

Symptom

When you run TKGI CLI commands, the TKGI API times out or is slow to respond.

Explanation

The TKGI API VM requires more resources.

Solution

  1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.

  2. Select the Tanzu Kubernetes Grid Integrated Edition tile.

  3. Select the Resource Config page.

  4. For the TKGI API job, select a VM Type with greater CPU and memory resources.

  5. Click Save.

  6. Click the Installation Dashboard link to return to the Installation Dashboard.

  7. Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.

  8. Click Apply Changes.


All Cluster Operations Fail

Symptom

All TKGI CLI cluster operations fail including attempts to create or delete clusters with tkgi create-cluster and tkgi delete-cluster.

The output of tkgi cluster CLUSTER-NAME contains Last Action State: error, and the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms indicates that the Process State of at least one deployed node is failing.

Explanation

If any deployed control plane or worker nodes run out of disk space in /var/vcap/store, all cluster operations, such as the creation or deletion of clusters, will fail.
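To confirm disk exhaustion directly, you can SSH onto a node with bosh ssh and inspect usage of /var/vcap/store with df. The sketch below parses a sample df -h line (hypothetical device and values) to flag near-full mounts:

```shell
# Sample `df -h /var/vcap/store` output; on a real node, run df via `bosh ssh`.
cat > /tmp/df-store.txt <<'EOF'
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc1        50G   50G     0 100% /var/vcap/store
EOF
# Flag any mount that is 90% full or more.
awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 >= 90) print $6 " is " $5 "% full" }' /tmp/df-store.txt
# prints: /var/vcap/store is 100% full
```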

Diagnostics

To confirm that there is a disk space issue, check recent BOSH activity for any disk space error messages.

  1. Log in to the BOSH Director and run bosh tasks. The output from bosh tasks provides details about the tasks that the BOSH Director has run. See Using BOSH Diagnostic Commands in Tanzu Kubernetes Grid Integrated Edition for more information about logging in to the BOSH Director.

  2. In the BOSH command output, locate a task that attempted to perform a cluster operation, such as cluster creation or deletion.

  3. To retrieve more information about the task, run the following command:

    bosh -e MY-ENVIRONMENT task TASK-NUMBER
    

    Where:

    • MY-ENVIRONMENT is the name of your BOSH environment.
    • TASK-NUMBER is the number of the task that attempted to create the cluster.

    For example:

    $ bosh -e tkgi task 23
    
  4. In the output, look for the following text string:

    no space left on device
    
  5. Check the health of your deployed Kubernetes clusters by following the procedure in Verifying Deployment Health.

  6. In the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms, look for any nodes that display failing as their Process State. For example:

    Instance                                     Process State  AZ       IPs         VM CID                                   VM Type  Active  
    master/3a3adc92-14ce-4cd4-a12c-6b5eb03e33d6  failing        az-1     10.0.11.10  vm-09027f0e-dac5-498e-474e-b47f2cda614d  small    true  
    
  7. Make a note of the plan assigned to the failing node.
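The search in step 4 above can be scripted. The sketch below greps a saved copy of the task output (hypothetical file name and log contents) for the disk-space error string:

```shell
# Save the task output first, for example: bosh -e tkgi task 23 > /tmp/task-23.log
# (sample contents below are hypothetical)
cat > /tmp/task-23.log <<'EOF'
Error: Action Failed get_task: Compressing log files: no space left on device
EOF
# A non-zero count confirms the disk-space issue.
grep -c "no space left on device" /tmp/task-23.log
# prints: 1
```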

Solution

  1. In the Tanzu Kubernetes Grid Integrated Edition tile, locate the plan assigned to the failing node.

  2. In the plan configuration, select a larger VM type for the plan’s control plane or worker nodes or both.

    For more information about scaling existing clusters by changing the VM types, see Scale Vertically by Changing Cluster Node VM Sizes in the TKGI Tile.


Cluster Creation Fails

Symptom

When creating a cluster, you run tkgi cluster CLUSTER-NAME to monitor the cluster creation status. In the command output, the value for Last Action State is error.

Explanation

There was an error creating the cluster.

Diagnostics

  1. Log in to the BOSH Director and run bosh tasks. The output from bosh tasks provides details about the tasks that the BOSH Director has run. See Using BOSH Diagnostic Commands in Tanzu Kubernetes Grid Integrated Edition for more information about logging in to the BOSH Director.

  2. In the BOSH command output, locate the task that attempted to create the cluster.

  3. To retrieve more information about the task, run the following command:

    bosh -e MY-ENVIRONMENT task TASK-NUMBER
    

    Where:

    • MY-ENVIRONMENT is the name of your BOSH environment.
    • TASK-NUMBER is the number of the task that attempted to create the cluster.

    For example:

    $ bosh -e tkgi task 23
    

BOSH logs are used for error diagnostics. If the issue you see in the BOSH logs is related to using or managing Kubernetes, consult the Kubernetes Documentation for troubleshooting that issue.

For troubleshooting failed BOSH tasks, see the BOSH documentation.


Cluster Deletion Fails

Symptom

When you delete a cluster in a large-scale NSX-T environment, tkgi delete-cluster hangs.

Explanation

A TKGI-API process has timed out and cluster deletion is stuck.

Solution

To avoid the TKGI-API process time out, increase the TKGI Operation Timeout:

  1. SSH to the TKGI Control Plane VM.
  2. Change directory to /var/vcap/jobs/pks-nsx-t-osb-proxy.
  3. Run the following command:

    time ./bin/ncp_cleanup test-read  ROUTER-ID
    

    Where ROUTER-ID is your NSX-T Tier-0 Router ID.

    For example:

    pivotal-container-service/88d4bf76-d3967d53b4c4:/var/vcap/jobs/pks-nsx-t-osb-proxy# time ./bin/ncp_cleanup test-read 8dc31113-64e8-40bb-83fb-1af75857d5ae
    real 1m28.057s
    user 0m13.121s
    sys 0m0.629s
    
  4. Collect the returned real value.
  5. Add 30 seconds to the real value, convert the sum from minutes and seconds to seconds, and round up. For example, 1m28.057s plus 30 seconds is 1m58.057s, which converts to 118.057 seconds and rounds up to 120.
  6. Convert the summed value to milliseconds. This is your calculated Operation Timeout value.
  7. Configure the TKGI Operation Timeout field on the TKGI Tile with your calculated Operation Timeout value. For more information on configuring the TKGI Operation Timeout field, see Networking in Installing TKGI on vSphere with NSX-T.
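Steps 4 through 6 above can be sketched in shell. The 1m28.057s input and the round-up margin mirror the example in step 5:

```shell
REAL="1m28.057s"                 # the `real` value reported by `time` in step 4
MIN=${REAL%m*}                   # minutes: 1
SEC=${REAL#*m}; SEC=${SEC%s}     # seconds: 28.057
# Add the 30-second margin, round up to the next 10 seconds, convert to milliseconds.
MS=$(awk -v m="$MIN" -v s="$SEC" 'BEGIN {
  t = m * 60 + s + 30            # 118.057 seconds
  u = int(t / 10); if (u * 10 < t) u++
  print u * 10 * 1000
}')
echo "$MS"                       # prints: 120000
```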


Cannot Re-Create a Cluster that Failed to Deploy

Symptom

After cluster creation fails, you cannot re-run tkgi create-cluster to attempt creating the cluster again.

Explanation

Tanzu Kubernetes Grid Integrated Edition does not automatically clean up the failed BOSH deployment. Running tkgi create-cluster using the same cluster name creates a name clash error in BOSH.

Solution

Log in to the BOSH Director and delete the BOSH deployment manually, then retry the tkgi delete-cluster operation. After cluster deletion succeeds, re-create the cluster.

  1. Log in to the BOSH Director and obtain the deployment name for cluster you want to delete. For instructions, see Using BOSH Diagnostic Commands in Tanzu Kubernetes Grid Integrated Edition.

  2. Run the following BOSH command:

    bosh -e MY-ENVIRONMENT delete-deployment -d DEPLOYMENT-NAME
    

    Where:

    • MY-ENVIRONMENT is the name of your BOSH environment.
    • DEPLOYMENT-NAME is the name of your BOSH deployment.

    Note: If necessary, you can append the --force flag to delete the deployment.

  3. Run the following TKGI command:

    tkgi delete-cluster CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your Tanzu Kubernetes Grid Integrated Edition cluster.

    Note: Use only lowercase characters in your TKGI-provisioned Kubernetes cluster names if you manage your clusters with Tanzu Mission Control (TMC). Clusters with names that include an uppercase character cannot be attached to TMC.

  4. To re-create the cluster, run the following TKGI command:

    tkgi create-cluster CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your Tanzu Kubernetes Grid Integrated Edition cluster.

    Note: Use only lowercase characters when naming your cluster if you manage your clusters with Tanzu Mission Control (TMC). Clusters with names that include an uppercase character cannot be attached to TMC.


Windows Stemcell for vSphere Creation Fails with Login Issue

Symptom

The stembuild construct command fails with error: Cannot complete login due to an incorrect user name or password.

Explanation

Your vCenter login contains special characters, or you have GOVC environment variables set locally.

Solution

For special characters, see Authentication Error with Special Characters in stembuild Commands, in the TAS for VMs [Windows] documentation.

For GOVC variables, follow the steps to unset the variables in Step 4: Construct the BOSH Stemcell, in the TAS for VMs [Windows] documentation.


Cannot Access Add-On Features or Functions

Symptom

You cannot access a feature or function provided by a Kubernetes add-on.

For example, pods cannot resolve DNS names, and error messages report the service CoreDNS is invalid. If CoreDNS is not deployed, the cluster typically fails to start.

Explanation

Kubernetes features and functions are provided by Tanzu Kubernetes Grid Integrated Edition add-ons. DNS resolution, for example, is provided by the CoreDNS service.

To activate these add-ons, Ops Manager must run scripts after deploying Tanzu Kubernetes Grid Integrated Edition. You must configure Ops Manager to automatically run these post-deploy scripts.

Solution

Perform the following steps to configure Ops Manager to run post-deploy scripts to deploy the missing add-ons to your cluster.

  1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.

  2. Click the BOSH Director tile.

  3. Select Director Config.

  4. Select Enable Post Deploy Scripts.

    Note: This setting activates post-deploy scripts for all tiles in your Ops Manager installation.

  5. Click Save.

  6. Click the Installation Dashboard link to return to the Installation Dashboard.

  7. Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.

  8. Click Apply Changes.

  9. After Ops Manager finishes applying changes, enter tkgi delete-cluster on the command line to delete the cluster. For more information, see Deleting Clusters.

  10. On the command line, enter tkgi create-cluster to recreate the cluster. For more information, see Creating Clusters.


Resurrecting VMs Causes Incorrect Permissions in vSphere HA

Symptoms

Output resulting from the bosh vms command alternates between showing that the VMs are failing and showing that the VMs are running. The operator must run the bosh vms command multiple times to see this cycle.

Explanation

The VMs’ permissions are altered when a VM restarts, so operators must reset the permissions every time a VM reboots or is redeployed.

VMs cannot be successfully resurrected if the resurrection state of your VM is set to off or if vSphere HA restarts the VM before BOSH is aware that the VM is down. For more information about VM resurrection, see Resurrection in the BOSH documentation.

Solution

Run the following command on all of your control plane and worker VMs:

bosh -environment BOSH-DIRECTOR-NAME -deployment DEPLOYMENT-NAME ssh INSTANCE-GROUP-NAME -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"

Where:

  • BOSH-DIRECTOR-NAME is your BOSH Director name.
  • DEPLOYMENT-NAME is the name of your BOSH deployment.
  • INSTANCE-GROUP-NAME is the name of the BOSH instance group you are referencing.

The above command, when applied to each VM, gives your VMs the correct permissions.


Worker Node Hangs Indefinitely

Symptoms

After you select the Upgrade all clusters errand, a worker node might hang indefinitely. For more information about monitoring the Upgrade all clusters errand using the BOSH CLI, see Upgrade the TKGI Tile in Upgrading Tanzu Kubernetes Grid Integrated Edition (Flannel Networking).

Explanation

During the Tanzu Kubernetes Grid Integrated Edition tile upgrade process, worker nodes are cordoned and drained. This drain is dependent on Kubernetes being able to unschedule all pods. If Kubernetes is unable to unschedule a pod, then the drain hangs indefinitely. Kubernetes might be unable to unschedule the node if the PodDisruptionBudget object has been configured to permit zero disruptions and only a single instance of the pod has been scheduled.

In your spec file, the .spec.replicas configuration sets the total number of replicas of your app. PodDisruptionBudget objects specify the number of replicas, proportional to the total, that must remain available in your app, regardless of downtime. Operators can configure PodDisruptionBudget objects for each app using its spec file.

Some apps deployed using Helm charts might have a default PodDisruptionBudget set. For more information on configuring PodDisruptionBudget objects using a spec file, see Specifying a PodDisruptionBudget in the Kubernetes documentation.

If .spec.replicas is configured correctly, you can also configure the default node drain behavior to prevent cluster upgrades from hanging or failing.
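One quick check for the condition described above is to list PodDisruptionBudget objects whose ALLOWED DISRUPTIONS value is 0; pods covered by such budgets cannot be evicted, so a node drain hangs on them. The sketch below parses sample kubectl get pdb --all-namespaces output (hypothetical names and values):

```shell
# Sample `kubectl get pdb --all-namespaces` output (hypothetical).
cat > /tmp/pdb.txt <<'EOF'
NAMESPACE   NAME      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
default     web-pdb   1               N/A               0                     4d
default     api-pdb   1               N/A               1                     4d
EOF
# Print the name of every PDB that permits zero disruptions.
awk 'NR > 1 && $5 == 0 { print $2 }' /tmp/pdb.txt   # prints: web-pdb
```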

Solution

To resolve this issue, do one of the following:

  • Configure .spec.replicas to be greater than the number of replicas required by the PodDisruptionBudget object.

    When the number of replicas configured in .spec.replicas is greater than the number of replicas set in the PodDisruptionBudget object, disruptions can occur.

    For more information, see How Disruption Budgets Work in the Kubernetes documentation.
    For more information about workload capacity and uptime requirements in Tanzu Kubernetes Grid Integrated Edition, see Prepare to Upgrade in Upgrading Tanzu Kubernetes Grid Integrated Edition (Antrea and Flannel Networking).

  • Configure the default node drain behavior by doing the following:

    1. Navigate to Ops Manager Installation > Tanzu Kubernetes Grid Integrated Edition > Plans.
    2. Set the default node drain behavior by configuring the following fields:

      • Node Drain Timeout: Enter a timeout in minutes for the node to drain pods. You must enter a valid integer between 0 and 1440. If you set this value to 0, the node drain does not terminate.
      • Pod Shutdown Grace: Enter a timeout in seconds for the node to wait before it forces the pod to terminate. You must enter a valid integer between -1 and 86400. If you set this value to -1, the timeout is set to the default timeout specified by the pod.
      • Force node to drain even if it has running pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: If you activate this configuration, the node still drains when pods are not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet.
      • Force node to drain even if it has running DaemonSet-managed pods: If you activate this configuration, the node still drains when pods are managed by a DaemonSet.
      • Force node to drain even if it has running pods using emptyDir: If you activate this configuration, the node still drains when pods are using an emptyDir volume.
      • Force node to drain even if pods are still running after timeout: If you activate this configuration and pods fail to drain on the worker node within the timeout, the node forces running pods to terminate and the upgrade or scale continues.

      Warning: If you select Force node to drain even if pods are still running after timeout, the node halts all running workloads on pods. Before enabling this configuration, set Node Drain Timeout to greater than 0.

      Warning: If you deselect Force node to drain even if it has running DaemonSet-managed pods with Enable Metric Sink Resources, Enable Log Sink Resources, or Enable Node Exporter selected, the upgrade will fail as all options deploy a DaemonSet in the pks-system namespace.

    3. Navigate to Ops Manager Installation Dashboard > Review Pending Changes, select Upgrade all clusters errand, and Apply Changes. The new behavior takes effect during the next upgrade, not immediately after applying your changes.

    Note: You can also use the TKGI CLI to configure node drain behavior. To configure the default node drain behavior with the TKGI CLI, run tkgi update-cluster with an action flag. You can view the current node drain behavior with tkgi cluster --details. For more information, see Configure Node Drain Behavior in Upgrade Preparation Checklist for Tanzu Kubernetes Grid Integrated Edition v1.9.

    Warning: Do not use tkgi update-cluster on clusters configured with a network profile CNI configuration.


Cannot Authenticate to an OpenID Connect-Enabled Cluster

Symptom

When you authenticate to an OpenID Connect-enabled cluster using an existing kubeconfig file, you see an authentication or authorization error.

Explanation

users.user.auth-provider.config.id-token and users.user.auth-provider.config.refresh-token contained in the kubeconfig file for the cluster might have expired.
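You can confirm an expired id-token by decoding the exp claim in the JWT's payload, which is the second dot-separated, base64-encoded segment of the token. The sketch below builds a sample token with a hypothetical past expiry so the decoding step is self-contained; on a real cluster, the token comes from the kubeconfig (for example, via kubectl config view --raw):

```shell
# Build a sample JWT with a past expiry (hypothetical; real tokens come from the kubeconfig).
PAYLOAD=$(printf '{"exp":1600000000}' | base64 | tr -d '\n')
TOKEN="header.${PAYLOAD}.signature"
# Decode the second dot-separated segment and extract the numeric exp claim.
EXP=$(printf '%s' "$TOKEN" | cut -d. -f2 | base64 -d | tr -cd '0-9')
if [ "$EXP" -lt "$(date +%s)" ]; then
  echo "token expired; refresh your credentials"
fi
```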

Solution

  1. Upgrade the TKGI CLI to v1.2.0 or later.

    To download the TKGI CLI, navigate to Broadcom Support. For more information, see Installing the TKGI CLI.

  2. Obtain a kubeconfig file that contains the new tokens by running the following command:

    tkgi get-credentials CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your cluster.

    For example:

    $ tkgi get-credentials tkgi-example-cluster
    
    Fetching credentials for cluster tkgi-example-cluster.
    Context set for cluster tkgi-example-cluster.
    
    You can now switch between clusters by using:
    $kubectl config use-context <cluster-name>
    

    Note: If your operator has configured Tanzu Kubernetes Grid Integrated Edition to use a SAML identity provider, you must include an additional SSO flag to use the above command. For information about the SSO flags, see the section for the above command in TKGI CLI. For information about configuring SAML, see Connecting Tanzu Kubernetes Grid Integrated Edition to a SAML Identity Provider.

  3. Connect to the cluster using kubectl.

If you continue to see an authentication or authorization error, verify that you have sufficient access permissions for the cluster.


Cannot Access Apps Deployed to Clusters That Utilize Websocket

Symptom

Your NSX-T LB disconnects the sessions for your apps deployed to clusters that use WebSocket. These apps are inaccessible or non-functional.

Explanation

Tanzu Kubernetes Grid Integrated Edition on vSphere with NSX-T fully supports WebSocket. The most likely cause of this behavior is a connectivity issue that specifically affects WebSocket traffic.

Solution

Review your configuration to find the source of the connectivity issue:

  1. Review the connectivity to the NSX-T LB instance.
  2. Confirm that the devices between your NSX-T LB and your app are not blocking WebSocket traffic.


Login Failed Error: Credentials were rejected

Symptom

The tkgi login command fails with the error “Credentials were rejected, please try again.”

Explanation

You might experience this issue when a large number of pods are running continuously in your Tanzu Kubernetes Grid Integrated Edition deployment. As a result, the persistent disk on the TKGI Database VM runs out of space.

Solution

  1. Check the total number of pods in your Tanzu Kubernetes Grid Integrated Edition deployments.
  2. If there is a large number of pods, such as over 1,000, check the amount of available persistent disk space on the TKGI Database VM.
  3. If available disk space is low, increase the amount of persistent disk storage on the TKGI Database VM depending on the number of pods in your Tanzu Kubernetes Grid Integrated Edition deployment. Refer to the table in the following section.
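Step 1 above can be sketched with kubectl; here, sample --no-headers output (hypothetical pods) stands in so the count is reproducible:

```shell
# On a live cluster: kubectl get pods --all-namespaces --no-headers | wc -l
cat > /tmp/pods.txt <<'EOF'
kube-system   coredns-5d4b   Running
default       web-0          Running
default       web-1          Running
EOF
# Each line is one pod; count them.
wc -l < /tmp/pods.txt   # prints: 3
```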


Storage Requirements for Large Numbers of Pods

If you expect the cluster workload to run a large number of pods continuously, then increase the size of persistent disk storage allocated to the TKGI Database VM as follows:

Number of Pods    Persistent Disk Requirements (GB)
1,000             20
5,000             100
10,000            200
50,000            1,000


Login Failed Errors Due to Server State

Symptom

You encounter an error similar to one of the following when running a kubectl or cluster command:

  • Error: You must be logged in to the server (Unauthorized)
  • Error: You are not currently authenticated. Please log in to continue

Explanation

You might experience this issue when your authentication server or a host has the incorrect time.

Workaround

  1. To refresh your credentials, run the following command:

    tkgi get-credentials CLUSTER-NAME
    

    Where CLUSTER-NAME is the name of your cluster.

Solution

  1. To resolve the problem permanently, correct the time on the server with the incorrect time.


Error: Failed Jobs

Symptom

In stdout or log files, you see an error message referencing post-start scripts failed or Failed Jobs.

Explanation

After deploying Tanzu Kubernetes Grid Integrated Edition, Ops Manager runs scripts to start a number of jobs. You must configure Ops Manager to automatically run these post-deploy scripts.

Solution

Perform the following steps to configure Ops Manager to run post-deploy scripts.

  1. Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.

  2. Click the BOSH Director tile.

  3. Select Director Config.

  4. Select Enable Post Deploy Scripts.

    Note: This setting activates post-deploy scripts for all tiles in your Ops Manager installation.

  5. Click Save.

  6. Click the Installation Dashboard link to return to the Installation Dashboard.

  7. Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.

  8. Click Apply Changes.

  9. (Optional) If it is a new deployment of Tanzu Kubernetes Grid Integrated Edition, follow the steps below:

    1. On the command line, enter tkgi delete-cluster to delete the cluster. For more information, see Deleting Clusters.
    2. Enter tkgi create-cluster to recreate the cluster. For more information, see Creating Clusters.


Error: No Such Host

Symptom

In stdout or log files, you see an error message that includes lookup vm-WORKER-NODE-GUID on IP-ADDRESS: no such host.

Explanation

This error occurs on GCP when the Ops Manager Director tile uses 8.8.8.8 as the DNS server. When this DNS server is in use, the control plane node cannot locate the route to the worker nodes.

Solution

Use the Google internal DNS range, 169.254.169.254, as the DNS server.


Error: FailedMount

Symptom

In Kubernetes log files, you see a Warning event from kubelet with FailedMount as the reason.

Explanation

A persistent volume fails to connect to the Kubernetes cluster worker VM.

Diagnostics

  • In your cloud provider console, verify that volumes are being created and attached to nodes.
  • From the Kubernetes cluster control plane node, check the controller manager logs for errors attaching persistent volumes.
  • From the Kubernetes cluster worker node, check kubelet for errors attaching persistent volumes.


Error: Plan Not Found

Symptom

You receive a Plan not found error after deactivating a plan that is used by an active cluster.

Explanation

You might receive the error “plan UUID not found” if, after creating a cluster using a plan (such as Plan 1), you then deactivate the plan (Plan 1) from the TKGI Tile in Ops Manager and then Save and Apply Changes with the Upgrade all clusters errand selected.

Ops Manager does not have the capability to check which clusters are using a particular plan. Only when you save the plan does the deployment process check whether the plan can be deactivated. The “plan UUID not found” error message is displayed in the Ops Manager logs.

Solution

  1. Run the command tkgi cluster my-cluster --details to view which plan the cluster is using.
  2. Do not deactivate a plan that is in use by one or more clusters.


Failed to Allocate FIP from Pool

Symptom

The pre-start script for tkgi create-cluster fails and logs floating IP pool allocation errors in the pre-start.stderr.log similar to the following:

level=error msg="operation failed with [POST /pools/ip-pools/{pool-id}][409] allocateOrReleaseFromIpPoolConflict  &{RelatedAPIError:{Details: ErrorCode:5141 ErrorData:<nil> ErrorMessage:Requested IP Address ... is already allocated. ModuleName:id-allocation service} RelatedErrors:[]}\n"

level=warning msg="failed to allocate FIP from (pool: ...: [POST /pools/ip-pools/{pool-id}][409] allocateOrReleaseFromIpPoolConflict  &{RelatedAPIError:{Details: ErrorCode:5141 ErrorData:<nil> ErrorMessage:Requested IP Address ... is already allocated. ModuleName:id-allocation service} RelatedErrors:[]}\n"

Error: an error occurred during FIP allocation

Explanation

TKGI administrators can allocate floating IP pool IP Addresses in a Network Profile configuration. The TKGI control plane allocates IP Addresses from the floating IP pool without accounting for the IPs allocated using Network Profiles.

Workaround

TKGI allocates IP addresses starting from the beginning of a floating IP pool range. When configuring a Network Profile, allocate IP Addresses starting at the end of the floating IP pool range instead of those at the beginning.

