After running vkube cluster update, the cluster fails to update and appears to be in an ERROR state.

Cause

Master nodes or worker nodes in the cluster have failed.

Solution

To troubleshoot cluster failure, refresh the cluster infrastructure. The same troubleshooting procedure applies when:

  • The cluster dashboard is unresponsive or unreachable

  • kubectl fails to connect to the cluster

Perform each subsequent step in the procedure only if the previous step fails to bring the cluster out of its ERROR state.
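
The escalation rule above can be sketched as a small shell helper. This is only an illustration: escalate and the commands passed to it are hypothetical names, not part of vkube. The procedure does not say which command reports the cluster state, so the state check is left as a command that you supply.

```shell
#!/bin/sh
# escalate: run remediation steps in order and stop as soon as the
# cluster is no longer in the ERROR state.
#   $1    a command that prints the current cluster state (placeholder,
#         supplied by you; not defined by this procedure)
#   $2..  remediation commands, tried one at a time
escalate() {
  state_cmd=$1
  shift
  for step in "$@"; do
    # Run the next remediation step.
    $step
    # Move on to the following step only if the cluster is still in ERROR.
    if [ "$($state_cmd)" != "ERROR" ]; then
      echo "recovered after: $step"
      return 0
    fi
  done
  echo "still in ERROR after all steps"
  return 1
}
```

Each argument after the first would wrap one numbered step of the procedure below, so a later, more disruptive step runs only when the earlier ones have not cleared the ERROR state.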

Procedure

  1. Run the vkube cluster update command to refresh the cluster infrastructure.
  2. If the cluster remains in an ERROR state, refresh the cluster infrastructure according to the instructions specific to the provider.
    • For a cluster using an OpenStack provider:

      1. Use the OpenStack client to delete all failing nodes that are instances in OpenStack.

      2. To refresh the cluster infrastructure, run: vkube cluster update

    • For a cluster using an SDDC provider:

      1. Using SSH, log in to the VMware Integrated OpenStack with Kubernetes VM.

      2. Run the command: docker exec -it toolbox bash

      3. Run the command: source ~/cloudadmin.rc

      4. Run the command: export OS_PASSWORD=<password>

      5. Use the OpenStack client to delete all failing nodes.

      6. To refresh the cluster infrastructure, run: vkube cluster update

  3. If the cluster remains in an ERROR state, refresh the cluster infrastructure and restore all application configurations.
    1. Stop kubelet on all worker nodes. See Stop Kubelet in Worker Nodes.
    2. Use the OpenStack client to delete all master nodes.
    3. To refresh the cluster infrastructure, run: vkube cluster update
    4. To restore the configurations of applications running on the cluster, run: vkube job cluster recover
    5. For applications running on Kubernetes that are deployed using a daemon set with a service account, perform the following additional steps:
      1. Delete the service account and service.

      2. Redeploy the service account and service.

      3. If you are using local storage, restore your application data.
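
The "delete all failing nodes" steps in the procedure above can be sketched with the OpenStack client as follows. This is a minimal sketch under stated assumptions: it treats a node as failing when its OpenStack instance status is ERROR, which the procedure does not define, and it assumes that every instance in the listing belongs to the cluster. The function names failed_node_ids and delete_nodes and the DRY_RUN variable are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# failed_node_ids: read "<ID> <Status>" rows on stdin and print the IDs
# of the instances whose status is ERROR (assumed meaning of "failing").
failed_node_ids() {
  awk '$NF == "ERROR" { print $1 }'
}

# delete_nodes: read instance IDs on stdin and delete each instance.
# With DRY_RUN=1 the commands are printed instead of executed, so the
# list can be reviewed before anything is removed.
delete_nodes() {
  while read -r id; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "openstack server delete --wait $id"
    else
      openstack server delete --wait "$id"
    fi
  done
}

# Example pipeline (not run automatically):
#   openstack server list -f value -c ID -c Status | failed_node_ids | delete_nodes
```

Running the pipeline once with DRY_RUN=1 shows which instances would be deleted. After the real deletion completes, vkube cluster update is run as described in the procedure.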