Refer to the tips in this section to troubleshoot TKGS cluster networking errors.

Check Node Networking

Each TKG cluster should have the following network resources.
VirtualNetwork
  • Network resources: Tier-1 router and linked segment
  • Description: Node network for the cluster
  • Troubleshoot: Make sure the SNAT IP is assigned.
  • Command: kubectl get virtualnetwork -n NS-NAME

VirtualNetworkInterface
  • Network resources: Logical port on the segment
  • Description: Node network interface for cluster nodes
  • Troubleshoot: Make sure each VirtualMachine has an IP address.
  • Command: kubectl get virtualmachines -n NS-NAME NODE-NAME
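One way to script the per-node IP check is to dump each VirtualMachine's name and address, then flag any node that has no address yet. This is a sketch: it assumes the VM operator publishes the node address in `.status.vmIp` (the field name may differ across API versions), and `NS-NAME` is a placeholder as in the table above.

```shell
#!/bin/sh
# Print the name of every VirtualMachine that has no IP address yet.
# Input format (one VM per line): "<name> <ip-or-empty>", as produced by e.g.:
#   kubectl get virtualmachines -n NS-NAME \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.vmIp}{"\n"}{end}'
vms_without_ip() {
    # A line with fewer than two fields has a name but no address.
    awk 'NF < 2 { print $1 }'
}

# Example run against captured output (the second node has no address):
printf 'tkg-cluster-cp-abc 10.244.0.34\ntkg-cluster-w-xyz \n' | vms_without_ip
# -> tkg-cluster-w-xyz
```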

Check the Load Balancer for the Control Plane

The load balancer for the TKG cluster control plane provides access to the Kubernetes API server. This load balancer is automatically provisioned by the system during cluster creation. It should have the following resources.

Checking the status of the control plane load balancer can help you understand whether resources were ready when errors occurred. Find this load balancer by running the following command against the Supervisor: kubectl get services -A | grep control-plane-service
VirtualMachineService
  • Network resources: N/A
  • Description: A VirtualMachineService is created and translated to a Kubernetes Service.
  • Troubleshoot: Make sure its status is updated and includes the load balancer virtual IP (VIP).
  • Command: kubectl get virtualmachineservices -n NS-NAME SERVICE-NAME

Service
  • Network resources: Load balancer server with a VirtualServer instance and an associated server pool (member pool)
  • Description: A Kubernetes Service of type LoadBalancer is created for access to the TKG cluster API server.
  • Troubleshoot: Make sure an external IP is assigned, and make sure you can access the TKG cluster API through the external IP of the load balancer Service.
  • Command (Supervisor namespace): kubectl get services -A | grep control-plane-service
  • Command (cluster namespace): kubectl get services -n NS-NAME
  • Command (either namespace): curl -k https://EXTERNAL-IP:PORT/healthz

Endpoints
  • Network resources: Member pool
  • Description: An endpoint is created that includes all the TKG cluster control plane nodes.
  • Troubleshoot: Make sure the endpoint members (the TKG cluster control plane nodes) are in the member pool.
  • Command: kubectl get endpoints -n NS-NAME SERVICE-NAME
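You can extend the endpoint check with a quick count of the member pool. The jsonpath below reads the standard Kubernetes Endpoints structure (`.subsets[*].addresses[*].ip`); the `count_endpoints` helper name is illustrative, and `NS-NAME`/`SERVICE-NAME` are placeholders as in the table above.

```shell
#!/bin/sh
# Count the member IPs behind the control plane Service.
# The IP list comes from:
#   kubectl get endpoints -n NS-NAME SERVICE-NAME \
#     -o jsonpath='{.subsets[*].addresses[*].ip}'
count_endpoints() {
    # Counts whitespace-separated IPs on stdin; prints 0 for empty input.
    wc -w | tr -d ' \t'
}

# Example with captured output for a three-node control plane:
n=$(printf '172.16.10.2 172.16.10.3 172.16.10.4' | count_endpoints)
echo "$n endpoint members"
# -> 3 endpoint members
```

A count of 0 means no control plane node has joined the member pool, which explains a hanging `curl .../healthz`.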

Check Load Balancer Services on Worker Nodes

A load balancer instance for the TKG cluster worker nodes is created by the user when a Kubernetes service of type LoadBalancer is created.

The first step is to make sure the cloud provider is running on the TKG cluster.
kubectl get pods -n vmware-system-cloud-provider
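A quick way to spot an unhealthy cloud provider pod is to compare each pod's phase against the expected `Running` (or `Succeeded`) state. This is a sketch; the `unhealthy_pods` helper name is illustrative.

```shell
#!/bin/sh
# List any cloud provider pod whose phase is not Running or Succeeded.
# Input format (one pod per line): "<name> <phase>", as produced by e.g.:
#   kubectl get pods -n vmware-system-cloud-provider \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}'
unhealthy_pods() {
    awk '$2 != "Running" && $2 != "Succeeded" { print $1, $2 }'
}

# Example with captured output (the second pod is stuck in Pending):
printf 'cloud-provider-0 Running\ncloud-provider-1 Pending\n' | unhealthy_pods
# -> cloud-provider-1 Pending
```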
Then verify that the related Kubernetes objects have been created and are in the correct state.
VirtualMachineService in Supervisor
  • Network resources: N/A
  • Description: A VirtualMachineService is created in the Supervisor and translated to a Kubernetes Service in the Supervisor.
  • Command: kubectl get virtualmachineservice -n NS-NAME SVC-NAME

Load balancer Service in Supervisor
  • Network resources: VirtualServer in the TKG cluster load balancer and an associated member pool
  • Description: A load balancer Service is created in the Supervisor for access to this LoadBalancer-type Service.
  • Command: kubectl get services -n NS-NAME SVC-NAME

Endpoints in Supervisor
  • Network resources: Member pool in NSX
  • Description: An endpoint is created that includes all the TKG cluster worker nodes.
  • Troubleshoot: Make sure the endpoint members (the TKG cluster worker nodes) are in the member pool in NSX.
  • Command: kubectl get endpoints -n NS-NAME SVC-NAME

Load balancer Service in TKG cluster
  • Network resources: N/A
  • Description: The LoadBalancer Service deployed by the user in the TKG cluster should have its status updated with the load balancer IP.
  • Command: kubectl get services
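To check from inside the TKG cluster whether the VIP has been assigned, you can read the Service status directly: `kubectl get service SVC-NAME -o jsonpath='{.status.loadBalancer.ingress[0].ip}'` prints the assigned IP, or nothing while assignment is still pending. A minimal sketch (the `check_vip` helper name is illustrative):

```shell
#!/bin/sh
# Report whether a LoadBalancer Service received its external IP.
# kubectl shows "<pending>" in the EXTERNAL-IP column until an IP is assigned;
# with the jsonpath query above, the output is simply empty while pending.
check_vip() {
    ip=$(cat)
    if [ -z "$ip" ] || [ "$ip" = "<pending>" ]; then
        echo "no external IP assigned yet"
    else
        echo "external IP: $ip"
    fi
}

# Example with a captured jsonpath result:
printf '192.168.123.4' | check_vip
# -> external IP: 192.168.123.4
```

An empty result here, combined with missing endpoint members in the Supervisor, points back at the cloud provider or NCP rather than at the workload itself.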

Check Supervisor NSX Networking Stack

The Kubernetes API server, the NCP pod, and the manager container that runs in any controller pod are the primary starting points for checking infrastructure networking issues.

The following error message can indicate a routing or MTU issue at any point in the network fabric, including the physical port group that the ESXi host NICs are connected to:
{"log":"I0126 19:40:15.347154 1 log.go:172] http: TLS handshake error from 
100.64.128.1:4102: EOF\n","stream":"stderr","time":"2021-01-26T19:40:15.347256146Z"}
To troubleshoot, SSH to the ESXi host and run the following command:
esxcli network ip interface ipv4 get

This command lists all VMkernel interfaces on the host. If you have a single TEP interface, it is always vmk10. A second or third TEP interface is vmk11, vmk12, and so on. The number of TEP interfaces created depends on how many uplinks you assigned to the TEP in the uplink profile: if you selected load sharing for the TEPs across uplinks, one TEP interface is created per uplink.

The main TEP to TEP ping command has the following syntax:
vmkping ++netstack=vxlan -s 1572 -d -I vmk10 10.218.60.66 
Where:
  • -s is the packet size
  • -d means don't fragment
  • -I means source the ping from interface vmk10
  • The IP address is a TEP interface on another ESXi host or NSX Edge that you are pinging
If the MTU is set to 1600, a packet size over 1572 should fail (overlay traffic requires an MTU above 1500). If the MTU is set to 1500, anything over 1472 should fail. Change vmk10 to vmk11 (or higher) if you have additional TEP interfaces from which you want to source the ping.
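The payload limits above follow from simple arithmetic: the ICMP payload plus the 20-byte IPv4 header and the 8-byte ICMP header must fit within the interface MTU. A quick sketch of the calculation:

```shell
#!/bin/sh
# Largest vmkping -s payload that fits in a given MTU:
# payload = MTU - 20 (IPv4 header) - 8 (ICMP header).
# The Geneve overlay adds its own overhead outside this calculation,
# which is why the TEP MTU must exceed 1500 in the first place.
max_payload() {
    echo $(( $1 - 20 - 8 ))
}

echo "MTU 1600 -> vmkping -s $(max_payload 1600)"
# -> MTU 1600 -> vmkping -s 1572
echo "MTU 1500 -> vmkping -s $(max_payload 1500)"
# -> MTU 1500 -> vmkping -s 1472
```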