If the update of a TKG cluster fails, you can restart the update job and try the update again.

Problem

The update of a TKG cluster fails and leaves the cluster in the upgradefailed status.

Cause

There can be several reasons for a failed cluster update, such as insufficient storage. To restart a failed update job and try the cluster update again, complete the following procedure.

Solution

  1. Log in to the Supervisor as an administrator.
  2. Look up the update_job_name, substituting your cluster's namespace and name for ${cluster_namespace} and ${cluster_name}.
    kubectl get jobs -n vmware-system-tkg -l "run.tanzu.vmware.com/cluster-namespace=${cluster_namespace},cluster.x-k8s.io/cluster-name=${cluster_name}"
  3. Run kubectl proxy so that curl can be used to issue requests.
    kubectl proxy &
    You should see Starting to serve on 127.0.0.1:8001.
    Note: You cannot use kubectl to patch or update the .status of a resource.
  4. Using curl, issue the following patch command to raise the .spec.backoffLimit so that the Job controller can retry the update.
    curl -H "Accept: application/json" -H "Content-Type: application/json-patch+json" \
    --request PATCH --data '[{"op": "replace", "path": "/spec/backoffLimit", "value": 8}]' \
    http://127.0.0.1:8001/apis/batch/v1/namespaces/vmware-system-tkg/jobs/${update_job_name}
  5. Using curl, issue the following patch command to clear the .status.conditions so that the Job controller can create new pods.
    curl -H "Accept: application/json" -H "Content-Type: application/json-patch+json" \
    --request PATCH --data '[{"op": "remove", "path": "/status/conditions"}]' \
    http://127.0.0.1:8001/apis/batch/v1/namespaces/vmware-system-tkg/jobs/${update_job_name}/status
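To see what steps 4 and 5 actually send over the wire, the following Python sketch builds the same two JSON Patch (RFC 6902) requests that the curl commands issue against the kubectl proxy. The build_patch_request helper and the my-update-job name are illustrative placeholders, not part of any VMware or Kubernetes tooling.

```python
import json

API_BASE = "http://127.0.0.1:8001"  # default address served by `kubectl proxy`
NAMESPACE = "vmware-system-tkg"

def build_patch_request(job_name, patch_ops, subresource=None):
    """Return the (url, headers, body) triple that the curl commands send."""
    url = f"{API_BASE}/apis/batch/v1/namespaces/{NAMESPACE}/jobs/{job_name}"
    if subresource:
        # Status lives behind the /status subresource, which is why step 5
        # patches a different URL than step 4.
        url += f"/{subresource}"
    headers = {
        "Accept": "application/json",
        # JSON Patch media type; tells the API server to interpret the body
        # as an RFC 6902 list of operations.
        "Content-Type": "application/json-patch+json",
    }
    return url, headers, json.dumps(patch_ops)

# Step 4: raise the retry budget on the Job spec.
spec_url, _, spec_body = build_patch_request(
    "my-update-job",  # placeholder for ${update_job_name}
    [{"op": "replace", "path": "/spec/backoffLimit", "value": 8}],
)

# Step 5: drop the recorded conditions via the status subresource.
status_url, _, status_body = build_patch_request(
    "my-update-job",
    [{"op": "remove", "path": "/status/conditions"}],
    subresource="status",
)
```

Sending these payloads with an HTTP client against the proxy is equivalent to the curl commands above; the sketch only constructs them so the URLs and patch bodies are easy to inspect.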