If you are unable to provision a TKGS cluster, review this list of common errors to troubleshoot.

Check Cluster API Logs

If you cannot create a TKG cluster, check that CAPW/V is functioning.

The CAPW/V controller is the infrastructure-specific implementation of the Cluster API. CAPW/V is enabled through Supervisor and is the TKG component responsible for managing the life cycle of TKG clusters.

CAPW/V is responsible for creating and updating the VirtualNetwork. Only if the VirtualNetwork is ready can the creation of the cluster nodes move forward. Did the cluster creation workflow pass this phase?

CAPW/V is responsible for creating and updating the VirtualMachineService. Did the VirtualMachineService get created successfully? Did it get the external IP? Did the cluster creation workflow pass this phase?

To answer these questions, check the Cluster API log as follows:
kubectl config use-context tkg-cluster-ns
kubectl get pods -n vmware-system-capw  | grep capv-controller
kubectl logs -n vmware-system-capw -c manager capv-controller-manager-...
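
You can also inspect the relevant objects directly before digging through the log (a quick check; replace tkg-cluster-ns with your vSphere Namespace):
kubectl get virtualnetwork -n tkg-cluster-ns
kubectl get virtualmachineservices -n tkg-cluster-ns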

Cluster Spec Validation Error

According to the YAML specification, the space character is allowed in a key name: such a key is a scalar string containing a space and does not require quotes.

However, TKGS validation does not allow the use of the space character in key names. In TKGS, a valid key name must consist only of alphanumeric characters, a dash (such as key-name), an underscore (such as KEY_NAME) or a dot (such as key.name).

If you use the space character in a key name in the cluster spec, the TKGS cluster is not deployed. The vmware-system-tkg-controller-manager log shows the following error message:

Invalid value: \"Key Name\": a valid config key must consist of alphanumeric characters, '-', '_' or '.' (e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')"

To fix the error, remove the space character entirely, or replace it with a supported character.
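
For example, with a hypothetical key in the cluster spec:
node label: value     # rejected: the key name contains a space
node-label: value     # valid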

Errors When Applying the TKG Cluster YAML

If you receive errors when applying the TKG cluster YAML, troubleshoot as follows.
Cluster network is not in correct state
Understand the TKG cluster provisioning workflow:
  • CAPV creates a VirtualNetwork object for each TKG cluster network.
  • If Supervisor is configured with NSX networking, NCP watches for VirtualNetwork objects and creates an NSX Tier-1 router and an NSX segment for each VirtualNetwork.
  • CAPV checks the status of the VirtualNetwork and, when it is ready, proceeds to the next step in the workflow.

The VM Service controller watches for custom objects created by CAPV and uses those specifications to create and configure the VMs that make up the TKG cluster.

NSX Container Plugin (NCP) is a controller that watches for network resources added to etcd through the Kubernetes API and orchestrates the creation of corresponding objects in NSX.

Each of these controllers runs as a Kubernetes pod on the Supervisor control plane. To troubleshoot network issues, check the CAPV controller log, the VM Service log, and the NCP log.

Check the container logs, where name-XXXXX is the unique pod name returned when you run:
kubectl get pods -A
kubectl logs pod/name-XXXXX -c pod-name -n namespace
Invalid control plane node count

TKG cluster on Supervisor supports 1 or 3 control plane nodes. If you enter a different number of replicas, cluster provisioning fails.
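
For example, the replica count is set in the control plane topology of the cluster spec (a minimal excerpt; the field path is assumed from the TanzuKubernetesCluster API):
spec:
  topology:
    controlPlane:
      replicas: 3   # must be 1 or 3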

Invalid storage class for control plane/worker VM
Run the following command:
kubectl describe ns <tkg-cluster-namespace>

Make sure that a storage class has been assigned to the namespace in which you're attempting to create the TKG cluster. There needs to be a ResourceQuota in the vSphere Namespace referencing that storage class, and the storage class needs to exist in Supervisor.

Make sure that the name matches the storage class present in Supervisor. Run kubectl get storageclasses on Supervisor as a vSphere administrator. WCP may transform the name when applying the storage profile to Supervisor (for example, hyphens become underscores).
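
For example, to cross-check both sides (the grep is a convenience; it assumes the namespace quota entries reference the storage class by name):
kubectl get storageclasses
kubectl get resourcequota -n <tkg-cluster-namespace> -o yaml | grep storageclass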

Invalid VM class

Ensure the value supplied in the cluster YAML matches one of the VM classes returned by kubectl get virtualmachineclass. Only bound classes can be used by a TKG cluster. A VM class is bound when you add it to the vSphere Namespace.

The command kubectl get virtualmachineclasses returns all VM classes on Supervisor, but only those that are bound can be used.
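
To see which classes are bound to a given namespace, list the class bindings (VirtualMachineClassBinding objects are created when you add a class to the vSphere Namespace; the resource name is assumed from the VM Operator API):
kubectl get virtualmachineclassbindings -n <namespace>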

Unable to find TKR distributions
If you see an error similar to the following:
Error from server (unable to find Kubernetes distributions): admission webhook 'version.mutating.tanzukubernetescluster.run.tanzu.vmware.com' denied the request: unable to find Kubernetes distributions

This is likely because of a content library issue. To list the images you have available, use the command kubectl get virtualmachineimages -A. The result shows what is available and synchronized or uploaded in your content library.

For TKG on Supervisor, there are new TKR names that are compatible with the new TKR API. Make sure each TKR is named correctly in the content library.

Name in content library: photon-3-amd64-vmi-k8s-v1.23.8---vmware.2-tkg.1-zshippable

Corresponding name in the TKG cluster spec: version: v1.23.8+vmware.2-tkg.1-zshippable
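
You can also list the TanzuKubernetesRelease objects to confirm which TKRs the Supervisor recognizes and whether they are compatible and ready:
kubectl get tanzukubernetesreleases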

TKG YAML Is Applied But No VMs Are Created

If the TKG 2.0 cluster YAML is valid and applied but the node VMs are not created, troubleshoot as follows.
Check CAPI/CAPV resources

Check if TKG created the CAPI/CAPV level resources.

  • Check to see if CAPV created the VirtualMachine resources.
  • Check the VM Operator logs to see why the VM was not created; for example, OVF deployment may have failed due to insufficient resources on the ESXi host.
  • Check the CAPV and VM Operator logs.
  • Check the NCP logs. NCP is responsible for realizing the VirtualNetwork, VirtualNetworkInterface, and LoadBalancer for the control plane. Any error related to those resources can indicate the issue.
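
To check these resources from the Supervisor, list the CAPI machines and the VM Operator virtual machines in your vSphere Namespace:
kubectl get machines -n <namespace>
kubectl get virtualmachines -n <namespace>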
Virtual Machine Services Error
  • Run kubectl get virtualmachineservices in your namespace
  • Was a Virtual Machine Service created?
  • Run kubectl describe virtualmachineservices in your namespace
  • Are there errors reported on the Virtual Machine Service?
Virtual Network Error

Run kubectl get virtualnetwork in your namespace.

Is the Virtual Network created for this cluster?

Run kubectl describe virtualnetwork in your namespace.

Is the Virtual Network Interface created for the VM?
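
NCP creates a VirtualNetworkInterface object for each node VM; you can list these directly (the resource name is assumed from the NCP CRDs):
kubectl get virtualnetworkinterfaces -n <namespace>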

TKG Cluster Control Plane Is Not Running

If the TKG control plane is not running, check whether resources were ready when the error happened, whether it is a join node or the init node that is not up, and whether the provider ID is set in the node object.
Check if resources were ready when the error happened

Besides looking at logs, checking the status of related objects, such as the ControlPlaneLoadBalancer, helps you understand whether resources were ready when the error happened. See network troubleshooting.

Is it a join node or the init node of the control plane that is not up?

Node joins sometimes do not work properly. Look at the node logs for the particular VM. If the init node does not come up successfully, the cluster might be missing both worker and control plane nodes.

Provider ID is not set in the node object

If the VM was created, check whether it has IP addresses, then look into the cloud-init logs to verify that the kubeadm commands executed properly.

Check the CAPI controller logs to see if there is any issue. You can also run kubectl get nodes on the TKG cluster and check whether the provider ID exists on the node object.
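
For example, the following custom-columns query against the TKG cluster shows the provider ID per node; an empty PROVIDER-ID column means the cloud provider has not set it:
kubectl get nodes -o custom-columns='NAME:.metadata.name,PROVIDER-ID:.spec.providerID'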

TKG Worker Nodes Are Not Created

If the TKG cluster and control plane VMs are created, but no workers or other virtual machine objects are created, try the following:
kubectl describe cluster CLUSTER-NAME

Check for virtual machine resources in the namespace. Were any others created?

If not, check the CAPV logs to see why it is not creating the other virtual machine objects, for example, because bootstrap data is not available.

If CAPI cannot talk to the TKG cluster control plane via the load balancer (either NSX with the node VM IP, or VDS with an external load balancer), get the TKG cluster kubeconfig using the secret in the namespace:
kubectl get secret -n <namespace> <tkg-cluster-name>-kubeconfig -o jsonpath='{.data.value}' | base64 -d > tkg-cluster-kubeconfig
kubectl --kubeconfig tkg-cluster-kubeconfig get pods -A

If this fails with 'connection refused', your control plane probably did not initialize properly. If there is an I/O timeout, verify connectivity to the IP address in the kubeconfig.

NSX with embedded load balancer:

  • Verify that the control plane LB is up and reachable.
  • If the LB does not have an IP, check the NCP logs and check the NSX-T UI to see whether the related components are in correct states (the NSX-T LB, VirtualServer, and ServerPool should all be in a healthy state).
  • If the LB has an IP but is not reachable (when the control plane is up, curl -k https://<LB-VIP>:6443/healthz should return an unauthorized error):

    If LoadBalancer Type of Service External IP is in a "pending" state, check that the TKG cluster can communicate with the Supervisor Kubernetes API via the Supervisor LB VIP. Make sure there is no IP address overlap.
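
To see whether the control plane Service was assigned an external IP, list the services in your vSphere Namespace (the VirtualMachineService is realized as a LoadBalancer-type Service):
kubectl get virtualmachineservices -n <namespace>
kubectl get services -n <namespace>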

Check if TKG control plane nodes are in healthy state:
  • Check if the TKG cluster control plane reports any error (for example, cannot make node with provider ID).
  • Check whether the TKG cluster cloud provider marked the node with the correct provider ID; without it, CAPI cannot compare the provider ID on the guest cluster node with the machine resource in the Supervisor cluster to verify the node.
SSH into the control plane VM, or use the TKG cluster kubeconfig, to check whether the TKG cloud provider pod is running and whether it logged any errors. See Connecting to TKG Service Clusters as a Kubernetes Administrator and System User.
kubectl get po -n vmware-system-cloud-provider
kubectl logs -n vmware-system-cloud-provider <pod name>

If VM Operator (VMOP) did not reconcile the VirtualMachineService successfully, check the VM Operator log.

If NCP had issues creating NSX-T resources, check the NCP log.

If the control plane did not initialize properly, determine the VM IP. The virtual machine status should contain it.
kubectl get virtualmachine -n <namespace> <TKC-name>-control-plane-0 -o yaml
ssh vmware-system-user@<vm-ip> -i tkc-cluster-ssh
Check to see if kubeadm logged any errors.
cat /var/log/cloud-init-output.log | less

Provisioned TKG Cluster Stuck in "Creating" Phase

Run the following commands to check the cluster status.

kubectl get tkc -n <namespace>
kubectl get cluster -n <namespace> 
kubectl get machines -n <namespace>
If the KubeadmConfig is present but CAPI is not able to find it, check whether the token in vmware-system-capv has the right permissions to query kubeadmconfig:
kubectl --token=__TOKEN__ auth can-i get kubeadmconfig
yes
It is possible that the controller-runtime cache was not updated: the CAPI watch caches may be stale and not picking up the new objects. If necessary, restart the capi-controller-manager to resolve the issue.
kubectl rollout restart deployment capi-controller-manager -n vmware-system-capv

vSphere Namespace Stuck in "Terminating" Phase

Verify that the TKR, Supervisor, and vCenter are in sync from a version compatibility perspective.

A namespace can only be deleted when all the resources under it are, in turn, deleted.
kubectl describe namespace NAME
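
To enumerate the namespaced resources that still exist under the namespace (a common idiom; it iterates every listable resource type and can take a while):
kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n NAME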

The following error was found: Error from server (unable to find Kubernetes distributions): admission webhook 'version.mutating.tanzukubernetescluster.run.tanzu.vmware.com' denied the request: unable to find Kubernetes distributions

Check the virtual machine images on vCenter.
kubectl get virtualmachineimages -A