After the system experiences a very high rate of events in a short period of time, some of the pods in the TKG Cluster on Supervisor or upstream Kubernetes cluster are stuck in the Terminating status.

Problem

After the system recovers from a very high rate of events, the System > NSX Application Platform UI shows the NSX Application Platform in a Degraded status. In addition, some of the pods in the TKG Cluster on Supervisor or upstream Kubernetes cluster remain stuck in the Terminating status for a few minutes or longer.

Cause

Because of Kubernetes infrastructure issues, some of the pods cannot be deleted correctly for one of the following reasons.
  • A finalizer associated with the stuck pod is not able to complete.
  • The stuck pod is not responding to the termination signals.
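To tell which of the two causes applies to a given pod, an administrator might inspect the pod's metadata before deleting it. The following is a minimal sketch, assuming kubectl is already configured for the cluster. The function name show_pod_deletion_state is hypothetical and not part of any product tooling. It prints the pod's deletionTimestamp followed by its finalizers list. A deletion timestamp together with a non-empty finalizers list suggests that a finalizer has not completed.

```shell
# Hypothetical helper (illustration only): print the deletion timestamp and
# the finalizers of one pod, each on its own line.
# Usage: show_pod_deletion_state <pod-name> <namespace>
show_pod_deletion_state() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
}
```

If the finalizers line is empty, the pod is more likely not responding to the termination signals.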

Solution

Ask your infrastructure administrator to use the following information to manually delete the pods that are stuck in the Terminating status.
  1. Log in to the control node for your TKG Cluster on Supervisor or upstream Kubernetes cluster.
  2. Use the following command to find all of the pods that are in the Terminating status.
    kubectl get pod -A | grep Terminating
  3. Force delete the pods with the Terminating status, using the following command.
    kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
  4. Run the following command again and verify that the stuck pods have been deleted successfully. If necessary, repeat step 3 for any pods that are still in the Terminating status.
    kubectl get pod -A | grep Terminating
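When many pods are stuck, steps 2 and 3 can be combined so that each pod does not have to be typed by hand. The following is a minimal sketch, not an official procedure. The function names list_terminating and force_delete_listed and the DRY_RUN variable are hypothetical, and the sketch assumes the default column layout of kubectl get pod -A, in which STATUS is the fourth column.

```shell
# Hypothetical helpers (illustration only).

# Read `kubectl get pod -A` output on stdin and print "<namespace> <pod-name>"
# for every pod whose STATUS column (assumed to be column 4) is Terminating.
list_terminating() {
  awk '$4 == "Terminating" { print $1, $2 }'
}

# Read "<namespace> <pod-name>" pairs on stdin and force delete each pod.
# With DRY_RUN=1 the delete commands are printed instead of executed.
force_delete_listed() {
  while read -r ns pod; do
    # Skip blank or incomplete lines.
    [ -n "$pod" ] || continue
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "kubectl delete pod $pod -n $ns --force --grace-period=0"
    else
      kubectl delete pod "$pod" -n "$ns" --force --grace-period=0
    fi
  done
}
```

For example, running kubectl get pod -A --no-headers | list_terminating | DRY_RUN=1 force_delete_listed prints the delete commands for review. Running the same pipeline without DRY_RUN=1 executes them.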