This topic assists with diagnosing and troubleshooting issues when installing or using VMware Tanzu Kubernetes Grid Integrated Edition (TKGI).
Refer to the following for troubleshooting assistance:
Symptom
When you run TKGI CLI commands, the TKGI API times out or is slow to respond.
Explanation
The TKGI API VM requires more resources.
Solution
Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
Select the Tanzu Kubernetes Grid Integrated Edition tile.
Select the Resource Config page.
For the TKGI API job, select a VM Type with greater CPU and memory resources.
Click Save.
Click the Installation Dashboard link to return to the Installation Dashboard.
Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.
Click Apply Changes.
Symptom
All TKGI CLI cluster operations fail, including attempts to create or delete clusters with tkgi create-cluster and tkgi delete-cluster.
The output of tkgi cluster CLUSTER-NAME contains Last Action State: error, and the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms indicates that the Process State of at least one deployed node is failing.
Explanation
If any deployed control plane or worker node runs out of disk space in /var/vcap/store, all cluster operations, such as the creation or deletion of clusters, fail.
Diagnostics
To confirm that there is a disk space issue, check recent BOSH activity for any disk space error messages.
Log in to the BOSH Director and run bosh tasks. The output from bosh tasks provides details about the tasks that the BOSH Director has run. See Using BOSH Diagnostic Commands in Tanzu Kubernetes Grid Integrated Edition for more information about logging in to the BOSH Director.
In the BOSH command output, locate a task that attempted to perform a cluster operation, such as cluster creation or deletion.
To retrieve more information about the task, run the following command:
bosh -e MY-ENVIRONMENT task TASK-NUMBER
Where:
MY-ENVIRONMENT is the name of your BOSH environment.
TASK-NUMBER is the number of the task that attempted to create the cluster.
For example:
$ bosh -e tkgi task 23
In the output, look for the following text string:
no space left on device
Check the health of your deployed Kubernetes clusters by following the procedure in Verifying Deployment Health.
In the output of bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms, look for any nodes that display failing as their Process State. For example:
Instance Process State AZ IPs VM CID VM Type Active
master/3a3adc92-14ce-4cd4-a12c-6b5eb03e33d6 failing az-1 10.0.11.10 vm-09027f0e-dac5-498e-474e-b47f2cda614d small true
Make a note of the plan assigned to the failing node.
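The two checks in the diagnostic steps above can be sketched as small shell filters. This is a minimal sketch, not part of the BOSH or TKGI CLIs: the function names has_disk_space_error and failing_instances are hypothetical, and the column layout is assumed to match the sample row shown above (Instance first, Process State second).

```shell
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical helpers.

# Succeeds if the task output read from stdin contains the disk-full message.
has_disk_space_error() {
  grep -q 'no space left on device'
}

# Prints the Instance column of every row whose Process State is "failing".
# Assumes whitespace-separated columns with Instance first and Process State second.
failing_instances() {
  awk '$2 == "failing" { print $1 }'
}

# Demonstration using the sample row from this topic.
sample='Instance Process State AZ IPs VM CID VM Type Active
master/3a3adc92-14ce-4cd4-a12c-6b5eb03e33d6 failing az-1 10.0.11.10 vm-09027f0e-dac5-498e-474e-b47f2cda614d small true'
printf '%s\n' "$sample" | failing_instances
```

Against a live environment, the same filters could be fed from the commands in this procedure, for example bosh -e MY-ENVIRONMENT task TASK-NUMBER piped into has_disk_space_error, or bosh -e ENV-ALIAS -d SERVICE-INSTANCE vms piped into failing_instances.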
Solution
In the Tanzu Kubernetes Grid Integrated Edition tile, locate the plan assigned to the failing node.
In the plan configuration, select a larger VM type for the plan’s control plane or worker nodes or both.
For more information about scaling existing clusters by changing the VM types, see Scale Vertically by Changing Cluster Node VM Sizes in the TKGI Tile.
Symptom
When creating a cluster, you run tkgi cluster CLUSTER-NAME to monitor the cluster creation status. In the command output, the value for Last Action State is error.
Explanation
There was an error creating the cluster.
Diagnostics
Log in to the BOSH Director and run bosh tasks. The output from bosh tasks provides details about the tasks that the BOSH Director has run. See Using BOSH Diagnostic Commands in Tanzu Kubernetes Grid Integrated Edition for more information about logging in to the BOSH Director.
In the BOSH command output, locate the task that attempted to create the cluster.
To retrieve more information about the task, run the following command:
bosh -e MY-ENVIRONMENT task TASK-NUMBER
Where:
MY-ENVIRONMENT is the name of your BOSH environment.
TASK-NUMBER is the number of the task that attempted to create the cluster.
For example:
$ bosh -e tkgi task 23
BOSH logs are used for error diagnostics. If the issue you see in the BOSH logs is related to using or managing Kubernetes, consult the Kubernetes Documentation for troubleshooting that issue.
For troubleshooting failed BOSH tasks, see the BOSH documentation.
Symptom
When deleting a cluster in a large-scale NSX-T environment, tkgi delete-cluster becomes stuck.
Explanation
A TKGI-API process has timed out and cluster deletion is stuck.
Solution
To avoid the TKGI-API process timeout, increase the TKGI Operation Timeout:
On the TKGI API VM, change to the /var/vcap/jobs/pks-nsx-t-osb-proxy directory.
Run the following command:
time ./bin/ncp_cleanup test-read ROUTER-ID
Where ROUTER-ID is your NSX-T Tier-0 Router ID.
For example:
pivotal-container-service/88d4bf76-d3967d53b4c4:/var/vcap/jobs/pks-nsx-t-osb-proxy# time ./bin/ncp_cleanup test-read 8dc31113-64e8-40bb-83fb-1af75857d5ae
real 1m28.057s
user 0m13.121s
sys 0m0.629s
Note the real value.
... real value and convert the sum from minutes-seconds to seconds, rounding up. For example, sum, convert, and round 1m28.057s to 120.
Symptom
After cluster creation fails, you cannot re-run tkgi create-cluster to attempt creating the cluster again.
Explanation
Tanzu Kubernetes Grid Integrated Edition does not automatically clean up the failed BOSH deployment. Running tkgi create-cluster with the same cluster name causes a name clash error in BOSH.
Solution
Log in to the BOSH Director and delete the BOSH deployment manually, then retry the tkgi delete-cluster operation. After cluster deletion succeeds, re-create the cluster.
Log in to the BOSH Director and obtain the deployment name for the cluster you want to delete. For instructions, see Using BOSH Diagnostic Commands in Tanzu Kubernetes Grid Integrated Edition.
Run the following BOSH command:
bosh -e MY-ENVIRONMENT delete-deployment -d DEPLOYMENT-NAME
Where:
MY-ENVIRONMENT is the name of your BOSH environment.
DEPLOYMENT-NAME is the name of your BOSH deployment.
Note: If necessary, you can append the --force flag to delete the deployment.
Run the following TKGI command:
tkgi delete-cluster CLUSTER-NAME
Where CLUSTER-NAME is the name of your Tanzu Kubernetes Grid Integrated Edition cluster.
Note: Use only lowercase characters in your TKGI-provisioned Kubernetes cluster names if you manage your clusters with Tanzu Mission Control (TMC). Clusters with names that include an uppercase character cannot be attached to TMC.
To re-create the cluster, run the following TKGI command:
tkgi create-cluster CLUSTER-NAME
Where CLUSTER-NAME is the name of your Tanzu Kubernetes Grid Integrated Edition cluster.
Note: Use only lowercase characters when naming your cluster if you manage your clusters with Tanzu Mission Control (TMC). Clusters with names that include an uppercase character cannot be attached to TMC.
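The lowercase naming rule in the notes above can be checked before you run tkgi create-cluster. The following is a minimal sketch; the function name is_tmc_safe_name is hypothetical, and it tests only for uppercase characters, not for any other naming constraint that TKGI or TMC might apply.

```shell
#!/usr/bin/env bash
# Sketch only: is_tmc_safe_name is a hypothetical helper, not a TKGI command.

# Succeeds if the name contains no uppercase characters.
is_tmc_safe_name() {
  case "$1" in
    *[[:upper:]]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Demonstration with two example names.
for name in my-cluster My-Cluster; do
  if is_tmc_safe_name "$name"; then
    echo "$name: ok"
  else
    echo "$name: contains uppercase characters"
  fi
done
```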
Symptom
The stembuild construct command fails with the error: Cannot complete login due to an incorrect user name or password.
Explanation
Your vCenter login contains special characters, or you have GOVC environment variables set locally.
Solution
For special characters, see Authentication Error with Special Characters in stembuild Commands, in the TAS for VMs [Windows] documentation.
For GOVC variables, follow the steps to unset the variables in Step 4: Construct the BOSH Stemcell, in the TAS for VMs [Windows] documentation.
Symptom
You cannot access a feature or function provided by a Kubernetes add-on.
For example, pods cannot resolve DNS names, and error messages report that the service CoreDNS is invalid. If CoreDNS is not deployed, the cluster typically fails to start.
Explanation
Kubernetes features and functions are provided by Tanzu Kubernetes Grid Integrated Edition add-ons. DNS resolution, for example, is provided by the CoreDNS service.
To activate these add-ons, Ops Manager must run scripts after deploying Tanzu Kubernetes Grid Integrated Edition. You must configure Ops Manager to automatically run these post-deploy scripts.
Solution
Perform the following steps to configure Ops Manager to run post-deploy scripts to deploy the missing add-ons to your cluster.
Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
Click the BOSH Director tile.
Select Director Config.
Select Enable Post Deploy Scripts.
Note: This setting activates post-deploy scripts for all tiles in your Ops Manager installation.
Click Save.
Click the Installation Dashboard link to return to the Installation Dashboard.
Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.
Click Apply Changes.
After Ops Manager finishes applying changes, enter tkgi delete-cluster on the command line to delete the cluster. For more information, see Deleting Clusters.
On the command line, enter tkgi create-cluster to recreate the cluster. For more information, see Creating Clusters.
Symptoms
Output from the bosh vms command alternates between showing that the VMs are failing and showing that the VMs are running. The operator must run the bosh vms command multiple times to see this cycle.
Explanation
VM permissions are altered when a VM restarts, so operators must reset the permissions every time the VM reboots or is redeployed.
VMs cannot be successfully resurrected if the resurrection state of your VM is set to off or if vSphere HA restarts the VM before BOSH is aware that the VM is down. For more information about VM resurrection, see Resurrection in the BOSH documentation.
Solution
Run the following command on all of your control plane and worker VMs:
bosh -e BOSH-DIRECTOR-NAME -d DEPLOYMENT-NAME ssh INSTANCE-GROUP-NAME -c "sudo /var/vcap/jobs/kube-controller-manager/bin/pre-start; sudo /var/vcap/jobs/kube-apiserver/bin/post-start"
Where:
BOSH-DIRECTOR-NAME is your BOSH Director name.
DEPLOYMENT-NAME is the name of your BOSH deployment.
INSTANCE-GROUP-NAME is the name of the BOSH instance group you are referencing.
The above command, when applied to each VM, gives your VMs the correct permissions.
Symptoms
After making your selection in the Upgrade all clusters errand section, the worker node might hang indefinitely. For more information about monitoring the Upgrade all clusters errand using the BOSH CLI, see Upgrade the TKGI Tile in Upgrading Tanzu Kubernetes Grid Integrated Edition (Flannel Networking).
Explanation
During the Tanzu Kubernetes Grid Integrated Edition tile upgrade process, worker nodes are cordoned and drained. This drain depends on Kubernetes being able to unschedule all pods. If Kubernetes is unable to unschedule a pod, then the drain hangs indefinitely. Kubernetes might be unable to unschedule a pod if the PodDisruptionBudget object has been configured to permit zero disruptions and only a single instance of the pod has been scheduled.
In your spec file, the .spec.replicas configuration sets the total number of replicas that are available in your app. PodDisruptionBudget objects specify the number of replicas, proportional to the total, that must be available in your app, regardless of downtime. Operators can configure PodDisruptionBudget objects for each app using their spec file.
Some apps deployed using Helm charts might have a default PodDisruptionBudget set. For more information on configuring PodDisruptionBudget objects using a spec file, see Specifying a PodDisruptionBudget in the Kubernetes documentation.
If .spec.replicas is configured correctly, you can also configure the default node drain behavior to prevent cluster upgrades from hanging or failing.
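The relationship between .spec.replicas and the PodDisruptionBudget can be illustrated with a one-line comparison. This is a sketch under stated assumptions: the function name pdb_allows_disruption is hypothetical, the budget is assumed to be expressed as an absolute number of replicas that must stay available (not a percentage), and all replicas are assumed to be healthy.

```shell
#!/usr/bin/env bash
# Sketch only: pdb_allows_disruption is a hypothetical helper.
# $1 = value of .spec.replicas
# $2 = number of replicas the PodDisruptionBudget requires to stay available
# Succeeds when at least one pod can be evicted, so a node drain can make progress.
pdb_allows_disruption() {
  [ "$1" -gt "$2" ]
}

# A single replica with a budget of one permits zero disruptions.
pdb_allows_disruption 1 1 || echo "replicas=1, budget=1: drain would hang"

# Three replicas with a budget of two leave room for one eviction.
pdb_allows_disruption 3 2 && echo "replicas=3, budget=2: drain can proceed"
```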
Solution
To resolve this issue, do one of the following:
Configure .spec.replicas to be greater than the number of replicas set in the PodDisruptionBudget object.
When the number of replicas configured in .spec.replicas is greater than the number of replicas set in the PodDisruptionBudget object, disruptions can occur.
For more information, see How Disruption Budgets Work in the Kubernetes documentation.
For more information about workload capacity and uptime requirements in Tanzu Kubernetes Grid Integrated Edition, see Prepare to Upgrade in Upgrading Tanzu Kubernetes Grid Integrated Edition (Antrea and Flannel Networking).
Configure the default node drain behavior by doing the following:
Set the default node drain behavior by configuring the following fields:
Field | Instructions |
---|---|
Node Drain Timeout | Enter a timeout in minutes for the node to drain pods. You must enter a valid integer between 0 and 1440. If you set this value to 0, the node drain does not terminate. |
Pod Shutdown Grace | Enter a timeout in seconds for the node to wait before it forces the pod to terminate. You must enter a valid integer between -1 and 86400. If you set this value to -1, the timeout is set to the default timeout specified by the pod. |
Force node to drain even if it has running pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet. | If you activate this configuration, the node still drains when pods are not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet. |
Force node to drain even if it has running DaemonSet-managed pods. | If you activate this configuration, the node still drains when pods are managed by a DaemonSet. |
Force node to drain even if it has running pods using emptyDir. | If you activate this configuration, the node still drains when pods are using an emptyDir volume. |
Force node to drain even if pods are still running after timeout. | If you activate this configuration and pods fail to drain on the worker node before the timeout expires, the node forces running pods to terminate and the upgrade or scale continues. |
Warning: If you select Force node to drain even if pods are still running after timeout, the node halts all running workloads on pods. Before enabling this configuration, set Node Drain Timeout to a value greater than 0.
Warning: If you deselect Force node to drain even if it has running DaemonSet-managed pods with Enable Metric Sink Resources, Enable Log Sink Resources, or Enable Node Exporter selected, the upgrade fails because each of these options deploys a DaemonSet in the pks-system namespace.
Note: You can also use the TKGI CLI to configure node drain behavior. To configure the default node drain behavior with the TKGI CLI, run tkgi update-cluster with an action flag. You can view the current node drain behavior with tkgi cluster CLUSTER-NAME --details. For more information, see Configure Node Drain Behavior in Upgrade Preparation Checklist for Tanzu Kubernetes Grid Integrated Edition v1.9.
Warning: Do not use tkgi update-cluster on clusters configured with a network profile CNI configuration.
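The numeric limits for Node Drain Timeout and Pod Shutdown Grace given in the table above can be checked before you enter the values. The following is a minimal sketch; the function names in_range, valid_node_drain_timeout, and valid_pod_shutdown_grace are hypothetical and are not part of the TKGI CLI or the tile.

```shell
#!/usr/bin/env bash
# Sketch only: all function names here are hypothetical helpers.

# in_range VALUE MIN MAX: succeeds if VALUE is an integer within [MIN, MAX].
in_range() {
  case "$1" in
    ''|-|*[!0-9-]*|?*-*) return 1 ;;  # reject anything that is not an integer
  esac
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Node Drain Timeout: minutes, 0 to 1440.
valid_node_drain_timeout() { in_range "$1" 0 1440; }

# Pod Shutdown Grace: seconds, -1 to 86400.
valid_pod_shutdown_grace() { in_range "$1" -1 86400; }

valid_node_drain_timeout 1441 || echo "1441 is out of range for Node Drain Timeout"
valid_pod_shutdown_grace -1 && echo "-1 is accepted for Pod Shutdown Grace"
```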
Symptom
When you authenticate to an OpenID Connect-enabled cluster using an existing kubeconfig file, you see an authentication or authorization error.
Explanation
The users.user.auth-provider.config.id-token and users.user.auth-provider.config.refresh-token values contained in the kubeconfig file for the cluster might have expired.
Solution
Upgrade the TKGI CLI to v1.2.0 or later.
To download the TKGI CLI, navigate to Broadcom Support. For more information, see Installing the TKGI CLI.
Obtain a kubeconfig file that contains the new tokens by running the following command:
tkgi get-credentials CLUSTER-NAME
Where CLUSTER-NAME is the name of your cluster.
For example:
$ tkgi get-credentials tkgi-example-cluster
Fetching credentials for cluster tkgi-example-cluster.
Context set for cluster tkgi-example-cluster.
You can now switch between clusters by using:
$kubectl config use-context <cluster-name>
Note: If your operator has configured Tanzu Kubernetes Grid Integrated Edition to use a SAML identity provider, you must include an additional SSO flag to use the above command. For information about the SSO flags, see the section for the above command in TKGI CLI. For information about configuring SAML, see Connecting Tanzu Kubernetes Grid Integrated Edition to a SAML Identity Provider.
Connect to the cluster using kubectl.
If you continue to see an authentication or authorization error, verify that you have sufficient access permissions for the cluster.
Symptom
Your NSX-T LB disconnects the sessions for your apps deployed to clusters utilizing websocket. These apps are inaccessible or non-functional.
Explanation
Tanzu Kubernetes Grid Integrated Edition on vSphere with NSX-T fully supports websocket. The most likely cause for this behavior is a connectivity issue specific to supporting websocket.
Solution
Review your configuration for a source for the connectivity issues:
Symptom
The tkgi login command fails with the error “Credentials were rejected, please try again.”
Explanation
You might experience this issue when a large number of pods are running continuously in your Tanzu Kubernetes Grid Integrated Edition deployment. As a result, the persistent disk on the TKGI Database VM runs out of space.
Solution
If you expect the cluster workload to run a large number of pods continuously, then increase the size of persistent disk storage allocated to the TKGI Database VM as follows:
Number of Pods | Persistent Disk Requirements (GB) |
---|---|
1,000 pods | 20 |
5,000 pods | 100 |
10,000 pods | 200 |
50,000 pods | 1,000 |
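The table above can be turned into a simple lookup. This is a sketch, not an official sizing tool: the function name tkgi_db_disk_gb is hypothetical, and rounding a pod count up to the next row of the table is an assumption, because the table lists only four data points.

```shell
#!/usr/bin/env bash
# Sketch only: tkgi_db_disk_gb is a hypothetical helper.
# Tiers are copied from the table above as "maximum pods:disk in GB".
tkgi_db_disk_gb() {
  for tier in 1000:20 5000:100 10000:200 50000:1000; do
    if [ "$1" -le "${tier%%:*}" ]; then
      echo "${tier##*:}"
      return 0
    fi
  done
  echo "no row in the table covers $1 pods" >&2
  return 1
}

# 7,500 pods falls between the 5,000 and 10,000 rows, so the larger row applies.
tkgi_db_disk_gb 7500
```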
Symptom
You encounter an error similar to one of the following when running a kubectl or cluster command:
Explanation
You might experience this issue when your authentication server or a host has the incorrect time.
Workaround
To refresh your credentials, run the following:
tkgi get-credentials CLUSTER-NAME
Solution
Symptom
In stdout or log files, you see an error message referencing post-start scripts failed or Failed Jobs.
Explanation
After deploying Tanzu Kubernetes Grid Integrated Edition, Ops Manager runs scripts to start a number of jobs. You must configure Ops Manager to automatically run these post-deploy scripts.
Solution
Perform the following steps to configure Ops Manager to run post-deploy scripts.
Navigate to https://YOUR-OPS-MANAGER-FQDN/ in a browser to log in to the Ops Manager Installation Dashboard.
Click the BOSH Director tile.
Select Director Config.
Select Enable Post Deploy Scripts.
Note: This setting activates post-deploy scripts for all tiles in your Ops Manager installation.
Click Save.
Click the Installation Dashboard link to return to the Installation Dashboard.
Click Review Pending Changes. Review the changes that you made. For more information, see Reviewing Pending Product Changes.
Click Apply Changes.
(Optional) If it is a new deployment of Tanzu Kubernetes Grid Integrated Edition, follow the steps below:
On the command line, enter tkgi delete-cluster to delete the cluster. For more information, see Deleting Clusters.
On the command line, enter tkgi create-cluster to recreate the cluster. For more information, see Creating Clusters.
Symptom
In stdout or log files, you see an error message that includes lookup vm-WORKER-NODE-GUID on IP-ADDRESS: no such host.
Explanation
This error occurs on GCP when the Ops Manager Director tile uses 8.8.8.8 as the DNS server. When this IP range is in use, the control plane node cannot locate the route to the worker nodes.
Solution
Use the Google internal DNS range, 169.254.169.254, as the DNS server.
Symptom
In Kubernetes log files, you see a Warning event from kubelet with FailedMount as the reason.
Explanation
A persistent volume fails to connect to the Kubernetes cluster worker VM.
Diagnostics
Symptom
A plan not found error occurs when an active plan is deactivated.
Explanation
You might receive the error “plan UUID not found” if, after creating a cluster using a plan (such as Plan 1), you deactivate that plan in the TKGI tile in Ops Manager and then click Save and Apply Changes with the Upgrade all clusters errand selected.
Ops Manager does not have the capability to check which clusters are using a particular plan. The deployment process checks whether a plan can be deactivated only when the user saves the plan. The error message “plan UUID not found” is displayed in the Ops Manager logs.
Solution
Run tkgi cluster my-cluster --details to view what plan the cluster is using.
Symptom
The pre-start script for tkgi create-cluster fails and logs floating IP pool allocation errors in the pre-start.stderr.log similar to the following:
level=error msg="operation failed with [POST /pools/ip-pools/{pool-id}][409] allocateOrReleaseFromIpPoolConflict &{RelatedAPIError:{Details: ErrorCode:5141 ErrorData:<nil> ErrorMessage:Requested IP Address ... is already allocated. ModuleName:id-allocation service} RelatedErrors:[]}\n"
level=warning msg="failed to allocate FIP from (pool: ...: [POST /pools/ip-pools/{pool-id}][409] allocateOrReleaseFromIpPoolConflict &{RelatedAPIError:{Details: ErrorCode:5141 ErrorData:<nil> ErrorMessage:Requested IP Address ... is already allocated. ModuleName:id-allocation service} RelatedErrors:[]}\n"
Error: an error occurred during FIP allocation
Explanation
TKGI administrators can allocate floating IP pool IP Addresses in a Network Profile configuration. The TKGI control plane allocates IP Addresses from the floating IP pool without accounting for the IPs allocated using Network Profiles.
Workaround
TKGI allocates IP addresses starting from the beginning of a floating IP pool range. When configuring a Network Profile, allocate IP Addresses starting at the end of the floating IP pool range instead of those at the beginning.
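To pick addresses from the end of a pool, it can help to enumerate the last few addresses of the range. The following is a minimal sketch for IPv4 only; the function names ip_to_int, int_to_ip, and last_ips are hypothetical, and the range 10.40.14.10 to 10.40.14.20 is an invented example, not a value from any TKGI configuration.

```shell
#!/usr/bin/env bash
# Sketch only: all function names and the example range are hypothetical.

# Converts a dotted IPv4 address to its integer value.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Converts an integer back to a dotted IPv4 address.
int_to_ip() {
  echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# last_ips RANGE-START RANGE-END COUNT
# Prints the last COUNT addresses of the range, never going below RANGE-START.
last_ips() {
  start=$(ip_to_int "$1")
  end=$(ip_to_int "$2")
  first=$(( end - $3 + 1 ))
  if [ "$first" -lt "$start" ]; then
    first=$start
  fi
  i=$first
  while [ "$i" -le "$end" ]; do
    int_to_ip "$i"
    i=$(( i + 1 ))
  done
}

# Prints the last three addresses of the example range, one per line.
last_ips 10.40.14.10 10.40.14.20 3
```

Because TKGI allocates from the beginning of the pool, addresses chosen this way are the last ones the control plane would reach.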