After you run vkube cluster update, the cluster fails to update and appears to be in an ERROR state.

To troubleshoot a cluster failure, refresh the cluster infrastructure. The same troubleshooting procedure applies when:
  • The cluster dashboard is unresponsive or unreachable.
  • kubectl fails to connect to the cluster.

Perform each subsequent step in the procedure only if the previous step fails to bring the cluster out of its error state.

Cause

Primary nodes or worker nodes in the cluster have failed.

Solution

  1. Refresh the cluster infrastructure.
    vkube cluster heal <cluster_id>
  2. If the cluster remains in an ERROR state and reports the error "Recreation of 2 master node(s) exceeded the maximum of 1 out of 3 master node(s) allowed", verify that you have a recent backup of the cluster ready.
  3. List all the nodes in the cluster.
    vkube cluster show <cluster_id>
    The output is a comma-separated list of nodes in the cluster.
  4. Refresh the primary nodes in the cluster.
    vkube cluster heal <cluster_id> --nodes <k8s-master_node1, k8s-master_node2, ...>
    Where k8s-master_node1, k8s-master_node2, ... are nodes in the comma-separated list that begin with k8s-master.
  5. Restore the configurations of applications running on the cluster.
    vkube job cluster recover
    Note: For applications running on Kubernetes that are deployed using a daemon set with a service account, perform the following additional steps:
    1. Delete the service account and service.
    2. Redeploy the service account and service.
    3. If you are using local storage, restore your application data.
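
Picking the primary nodes out of the comma-separated list from step 3 can be error-prone on larger clusters. The following is a minimal sketch of one way to do it in a POSIX shell. The helper name filter_master_nodes and the sample node names are hypothetical and are not part of vkube. The sketch assumes you have copied the node list into a shell variable yourself, and it does not parse the output of vkube cluster show.

```shell
# Hypothetical helper, not part of vkube: keep only the node names that
# begin with k8s-master from a comma-separated list.
filter_master_nodes() {
  printf '%s\n' "$1" \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep '^k8s-master' \
    | paste -sd, -
}

# Hypothetical node names; replace with the list reported in step 3.
node_list="k8s-master-a1, k8s-worker-b2, k8s-master-c3, k8s-worker-d4"

master_nodes=$(filter_master_nodes "$node_list")
echo "$master_nodes"
```

The resulting list could then be supplied to the --nodes option in step 4. The sketch joins the names with commas and no spaces, so check that this matches the form your version of the CLI expects before using it.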