If you are unable to provision a TKGS cluster, review this list of common errors to troubleshoot.
Check Cluster API Logs
If you cannot create a TKG cluster, check that CAPW/V is functioning.
The CAPW/V controller is the infrastructure-specific implementation of the Cluster API. CAPW/V is enabled through Supervisor. CAPW/V is a component of TKG and is responsible for managing the life cycle of TKG clusters.
CAPW/V is responsible for creating and updating the VirtualNetwork. Only if the VirtualNetwork is ready can the creation of the cluster nodes move forward. Did the cluster creation workflow pass this phase?
CAPW/V is responsible for creating and updating the VirtualMachineService. Did the VirtualMachineService get created successfully? Did it get the external IP? Did the cluster creation workflow pass this phase?
kubectl config use-context tkg-cluster-ns
kubectl get pods -n vmware-system-capw | grep capv-controller
kubectl logs -n vmware-system-capw -c manager capv-controller-manager-...
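To answer the workflow questions above, you can also query the objects directly. A minimal sketch, reusing the tkg-cluster-ns context from the commands above:
# Is the VirtualNetwork ready?
kubectl get virtualnetwork -n tkg-cluster-ns
# Was the VirtualMachineService created, and did it get an external IP?
kubectl describe virtualmachineservices -n tkg-cluster-ns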
Cluster Spec Validation Error
According to the YAML specification, the space character is permitted in key names: a key containing a space is a plain scalar string and does not require quotes.
However, TKGS validation does not allow the space character in key names. In TKGS, a valid key name must consist only of alphanumeric characters, dashes (such as key-name), underscores (such as KEY_NAME), or dots (such as key.name).
If you use the space character in a key name in the cluster spec, the TKGS cluster is not deployed. The vmware-system-tkg-controller-manager log shows the following error message:
Invalid value: \"Key Name\": a valid config key must consist of alphanumeric characters, '-', '_' or '.' (e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')"
To fix the error, remove the space character entirely, or replace it with a supported character.
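For example, a minimal sketch of a metadata labels block (a hypothetical placement; the same rule applies to any key name the spec validates):
metadata:
  labels:
    # Invalid: "Key Name: value" fails TKGS validation because of the space
    # Valid: alphanumeric characters plus '-', '_', or '.'
    key-name: value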
Errors When Applying the TKG Cluster YAML
- Cluster network is not in the correct state
Understand the TKG cluster provisioning workflow:
- CAPV creates a VirtualNetwork object for each TKG cluster network.
- If Supervisor is configured with NSX networking, NCP watches for VirtualNetwork objects and creates an NSX Tier-1 router and a NSX segment for each VirtualNetwork.
- CAPV checks the status of the VirtualNetwork and, when it is ready, proceeds to the next step in the workflow.
The VM Service controller watches for the custom objects created by CAPV and uses those specifications to create and configure the VMs that make up the TKG cluster.
NSX Container Plugin (NCP) is a controller that watches for network resources added to etcd through the Kubernetes API and orchestrates the creation of corresponding objects in NSX.
Each of these controllers runs as a Kubernetes pod on the Supervisor control plane. To troubleshoot network issues, check the CAPV controller log, the VM Service log, and the NCP log.
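A minimal sketch of where each log lives, assuming the default Supervisor namespaces for these components (vmware-system-capw, vmware-system-vmop, and vmware-system-nsx; names can vary by release):
# CAPV controller log
kubectl logs -n vmware-system-capw -c manager <capv-controller-manager-pod>
# VM Service (VM Operator) log
kubectl logs -n vmware-system-vmop <vm-operator-controller-pod>
# NCP log
kubectl logs -n vmware-system-nsx <nsx-ncp-pod>
Use kubectl get pods -n <namespace> to find the exact pod names.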
- Invalid control plane node count
A TKG cluster on Supervisor supports exactly 1 or 3 control plane nodes. If you enter a different number of replicas, cluster provisioning fails.
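For example, a minimal excerpt, assuming the v1alpha3 TanzuKubernetesCluster API (field paths differ in older API versions):
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
spec:
  topology:
    controlPlane:
      replicas: 3  # must be 1 or 3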
- Invalid storage class for control plane/worker VM
Run the following command:
kubectl describe ns <tkg-cluster-namespace>
Make sure that a storage class has been assigned to the namespace in which you're attempting to create the TKG cluster. There needs to be a ResourceQuota in the vSphere Namespace referencing that storage class, and the storage class needs to exist in Supervisor.
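One way to confirm the quota reference is to inspect the ResourceQuota objects in the namespace; storage class quotas use the standard Kubernetes key format shown in the comment:
kubectl get resourcequota -n <tkg-cluster-namespace> -o yaml
# Look for an entry of the form:
#   <storage-class-name>.storageclass.storage.k8s.io/requests.storage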
Make sure that the name matches the storage class present in Supervisor. Run
kubectl get storageclasses
on Supervisor as a vSphere administrator. WCP may transform the name when applying the storage profile to Supervisor (for example, hyphens become underscores).
- Invalid VM class
Ensure the value supplied in the cluster YAML matches one of the VM classes returned by
kubectl get virtualmachineclass
Only bound classes can be used by a TKG cluster. A VM class is bound when you add it to the vSphere Namespace. The command
kubectl get virtualmachineclasses
returns all VM classes on Supervisor, but only those that are bound can be used.
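To see which classes are bound to your namespace, you can also list the bindings directly. A sketch, assuming the VirtualMachineClassBinding API is present on your Supervisor release:
kubectl get virtualmachineclassbindings -n <tkg-cluster-namespace>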
- Unable to find TKR distributions
If you see an error similar to the following:
“Error from server (unable to find Kubernetes distributions): admission webhook “version.mutating.tanzukubernetescluster.run.tanzu.vmware.com” denied the request: unable to find Kubernetes distributions”
This is likely because of a content library issue. To list what you have available, use the command
kubectl get virtualmachineimages -A
The result shows what is available and synchronized or uploaded in your content library. For TKG on Supervisor, there are new TKR names that are compatible with the new TKR API. Be sure to name each TKR correctly in the content library.
Name in content library:
photon-3-amd64-vmi-k8s-v1.23.8---vmware.2-tkg.1-zshippable
Corresponding name in the TKG cluster spec:
version: v1.23.8+vmware.2-tkg.1-zshippable
TKG YAML Is Applied But No VMs Are Created
- Check CAPI/CAPV resources
Check if TKG created the CAPI/CAPV level resources (see the command sketch after this list).
- Check to see if CAPV created the VirtualMachine resources.
- Check the VM Operator logs to see why the VM was not created; for example, OVF deployment may have failed due to insufficient resources on the ESXi host.
- Check the CAPV and VM Operator logs.
- Check the NCP logs. NCP is responsible for realizing the VirtualNetwork, VirtualNetworkInterface, and LoadBalancer for the control plane. An error related to any of those resources can block cluster creation.
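A minimal command sketch for walking this checklist, assuming your vSphere Namespace name in place of <namespace>:
# CAPI/CAPV level resources for the cluster
kubectl get cluster,machines -n <namespace>
# Did CAPV create the VirtualMachine resources?
kubectl get virtualmachines -n <namespace>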
- Virtual Machine Services Error
- Run
kubectl get virtualmachineservices
in your namespace. Was a VirtualMachineService created?
- Run
kubectl describe virtualmachineservices
in your namespace. Are there errors reported on the VirtualMachineService?
- Virtual Network Error
Run
kubectl get virtualnetwork
in your namespace. Is the VirtualNetwork created for this cluster?
Run
kubectl describe virtualnetwork
in your namespace. Is the VirtualNetworkInterface created for the VM?
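If the describe output is inconclusive, dump the objects and inspect the status conditions that NCP sets. A sketch; object names vary per cluster, so list first:
kubectl get virtualnetwork -n <namespace> <virtualnetwork-name> -o yaml
kubectl get virtualnetworkinterfaces -n <namespace>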
TKG Cluster Control Plane Is Not Running
- Check if resources were ready when the error happened
In addition to reviewing logs, checking the status of related objects, such as the ControlPlaneLoadBalancer, helps you understand whether resources were ready when the error happened. See network troubleshooting.
- Is it a join node or the init node that is not up?
Node joins sometimes do not work properly. Look at the node logs for the particular VM. The cluster might be missing worker and control plane nodes if the init node does not come up successfully.
- Provider ID is not set in the node object
If the VM was created, check whether it has IPs, then look into the cloud-init logs to confirm that the kubeadm commands executed properly.
Check the CAPI controller logs to see if there is any issue. You can verify by running
kubectl get nodes
on the TKG cluster and then confirming that the provider ID exists on the node object.
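A quick way to check the provider ID, assuming you have already extracted the TKG cluster kubeconfig (see the secret-based extraction later in this article):
kubectl --kubeconfig tkg-cluster-kubeconfig get nodes -o custom-columns=NAME:.metadata.name,PROVIDER-ID:.spec.providerID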
TKG Worker Nodes Are Not Created
kubectl describe cluster CLUSTER-NAME
Check for virtual machine resources in the namespace. Were any others created?
If not, check the CAPV logs to see why it is not creating the other virtual machine objects (for example, bootstrap data not available).
If CAPI cannot talk to the TKG cluster control plane via the load balancer (either NSX with a node VM IP, or VDS with an external load balancer), get the TKG cluster kubeconfig using the secret in the namespace:
kubectl get secret -n <namespace> <tkg-cluster-name>-kubeconfig -o jsonpath='{.data.value}' | base64 -d > tkg-cluster-kubeconfig; kubectl --kubeconfig tkg-cluster-kubeconfig get pods -A
If this fails with 'connection refused', your control plane probably did not initialize properly. If there is an I/O timeout, verify connectivity to the IP address in the kubeconfig.
NSX with embedded load balancer:
- Verify control plane LB is up and reachable.
- If the LB does not have an IP, check the NCP logs and check the NSX-T UI to see whether the related components are in the correct state (the NSX-T LB, VirtualServer, and ServerPool should all be healthy).
- If the LB has an IP but is not reachable,
curl -k https://<LB-VIP>:6443/healthz
should return an unauthorized error.
- If a LoadBalancer type Service External IP is in a "pending" state, check that the TKG cluster can communicate with the Supervisor Kubernetes API via the Supervisor LB VIP. Make sure there is no IP address overlap.
- Check if TKG cluster control plane reports any error (like cannot make node with provider ID).
- If the TKG cluster cloud provider did not mark the node with the correct provider ID, CAPI cannot compare the provider ID on the guest cluster node with the machine resource in the Supervisor cluster to verify it.
kubectl get po -n vmware-system-cloud-provider
kubectl logs -n vmware-system-cloud-provider <pod name>
If VM Operator (VMOP) did not reconcile the VirtualMachineService successfully, check the VM Operator log.
If NCP had issues creating NSX-T resources, check the NCP log.
To inspect a control plane VM directly, get its manifest, SSH to it, and review the cloud-init output:
kubectl get virtualmachine -n <namespace> <TKC-name>-control-plane-0 -o yaml
ssh vmware-system-user@<vm-ip> -i tkc-cluster-ssh
cat /var/log/cloud-init-output.log | less
Provisioned TKG Cluster Stuck in "Creating" Phase
Run the following commands to check the cluster status.
kubectl get tkc -n <namespace>
kubectl get cluster -n <namespace>
kubectl get machines -n <namespace>
If the KubeadmConfig was present but CAPI could not find it, check whether the token in vmware-system-capv has the right permissions to query kubeadmconfig:
kubectl --token=__TOKEN__ auth can-i get kubeadmconfig
The expected response is yes. It is also possible that the controller-runtime cache was not updated: the CAPI watch caches may be stale and not picking up the new objects. If necessary, restart the capi-controller-manager to resolve the issue.
kubectl rollout restart deployment capi-controller-manager -n vmware-system-capv
vSphere Namespace Stuck in "Terminating" Phase
Verify that the TKR, Supervisor, and vCenter are in sync from a version compatibility perspective.
kubectl describe namespace NAME
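To see what is blocking deletion, a generic Kubernetes check (nothing TKGS-specific assumed):
# Conditions report which resources are still being finalized
kubectl get namespace NAME -o yaml
# Any finalizers remaining on the namespace
kubectl get namespace NAME -o jsonpath='{.spec.finalizers}'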
If the versions are out of sync, you might see the following error: “Error from server (unable to find Kubernetes distributions): admission webhook “version.mutating.tanzukubernetescluster.run.tanzu.vmware.com” denied the request: unable to find Kubernetes distributions”
kubectl get virtualmachineimages -A