After running vkube cluster update, the cluster fails to update and appears to be in an ERROR state.

Cause

One or more master nodes or worker nodes in the cluster have failed.

Solution

To troubleshoot a cluster failure, refresh the cluster infrastructure. The same troubleshooting procedure applies when:

  • The cluster dashboard is unresponsive or unreachable.

  • kubectl fails to connect to the cluster.

Perform each subsequent step in the procedure only if the previous step fails to bring the cluster out of the ERROR state.

Procedure

  1. Refresh the cluster infrastructure.
    vkube cluster heal <cluster_id>
  2. If the cluster remains in an ERROR state with the error "Recreation of 2 master node(s) exceeded the maximum of 1 out of 3 master node(s) allowed", verify that you have a recent backup of the cluster ready.
  3. List all the nodes in the cluster.
    vkube cluster show <cluster_id>

    The output is a comma-separated list of nodes in the cluster.

  4. Refresh the master nodes in the cluster.
    vkube cluster heal <cluster_id> --nodes <k8s-master_node1, k8s-master_node2, ...>

    Here, k8s-master_node1, k8s-master_node2, ... are the nodes from the comma-separated list in step 3 whose names begin with k8s-master.

  5. Restore the configurations of applications running on the cluster.
    vkube job cluster recover
    Note:

    For applications running on Kubernetes that are deployed using a daemon set with a service account, perform the following additional steps:

    1. Delete the service account and service.

    2. Redeploy the service account and service.

    3. If you are using local storage, restore your application data.
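
Steps 3 and 4 of the procedure require picking the master nodes out of the comma-separated node list. The following is a minimal sketch of that filtering in POSIX shell. The function name filter_master_nodes and the node names in the usage lines are hypothetical and used only for illustration. The sketch assumes you have already copied the node list from the vkube cluster show output into a variable; it does not parse the command output itself, and you should confirm the exact list format that --nodes accepts before passing the result to vkube cluster heal.

```shell
#!/bin/sh
# Hypothetical helper (not part of vkube): given a comma-separated node list,
# print only the entries whose names begin with "k8s-master", joined by commas.
filter_master_nodes() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep '^k8s-master' \
    | paste -sd, -
}

# Usage with placeholder node names (assumed for illustration only).
nodes="k8s-master-0, k8s-worker-0,k8s-master-1"
masters=$(filter_master_nodes "$nodes")
echo "$masters"
```

The function trims whitespace around each entry, so it tolerates lists written with or without a space after each comma. If no entry begins with k8s-master, the result is empty, which is worth checking before you run the heal command in step 4.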