This topic serves as a checklist for preparing to upgrade VMware Tanzu Kubernetes Grid Integrated Edition from v1.11 to v1.12.
This topic describes the steps that you must follow before beginning your TKGI upgrade.
Warning: Failure to follow these instructions may jeopardize your existing deployment data and cause the TKGI upgrade to fail.
To prepare for a TKGI upgrade:
After completing the steps in this topic, continue to Upgrading Tanzu Kubernetes Grid Integrated Edition (Antrea and Flannel Networking) or Upgrading Tanzu Kubernetes Grid Integrated Edition (NSX-T Networking).
VMware recommends backing up your Tanzu Kubernetes Grid Integrated Edition deployment and workloads before upgrading. To back up Tanzu Kubernetes Grid Integrated Edition, see Backing Up and Restoring Tanzu Kubernetes Grid Integrated Edition.
If you have not already done so, review About Tanzu Kubernetes Grid Integrated Edition Upgrades.
Plan your upgrade based on your workload capacity and uptime requirements.
Review the Release Notes for Tanzu Kubernetes Grid Integrated Edition v1.12.
To determine the upgrade order for your Tanzu Kubernetes Grid Integrated Edition environment, review Upgrade Order for Tanzu Kubernetes Grid Integrated Edition Environments on vSphere.
Coordinate the Tanzu Kubernetes Grid Integrated Edition upgrade with cluster admins and users. During the upgrade:
Their workloads will remain active and accessible.
They will be unable to perform cluster management functions, including creating, resizing, updating, and deleting clusters.
They will be unable to log in to TKGI or use the TKGI CLI and other TKGI control plane services.
Note: Cluster admins should not start any cluster management tasks right before an upgrade. Wait for cluster operations to complete before upgrading.
The built-in Clair container image scanner is deprecated in favor of Trivy. If you have enabled Clair, do one of the following before upgrading to Tanzu Kubernetes Grid Integrated Edition v1.12:
Install the Harbor tile v2.2.1 and select the Trivy scanner as the default in “Interrogation Service”. For more information about the Harbor tile, see the VMware Harbor Registry documentation.
Install the Clair scanner outside of the Harbor tile VM and configure the Clair scanner as the default scanner in “Interrogation Service”. For more information, see Getting Started With Clair in the Clair documentation.
Tanzu Kubernetes Grid Integrated Edition v1.12 does not support clusters running versions of TKGI earlier than v1.11.
Before you upgrade from Tanzu Kubernetes Grid Integrated Edition v1.11 to v1.12, you must upgrade all of your TKGI-provisioned clusters to v1.11.
To upgrade TKGI-provisioned clusters:
Check the version of your clusters:
tkgi clusters
If one or more of your clusters are running a version of TKGI earlier than v1.11, upgrade the clusters. For instructions, see Upgrading Clusters.
Note: When upgrading TKGI to mitigate the Apache Log4j vulnerability, you must also upgrade all TKGI clusters.
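As a rough aid for the version check above, the following shell sketch filters tkgi clusters output for clusters below a given TKGI version. It assumes that each data row starts with the TKGI version followed by the cluster name; that column layout is an assumption, so confirm it against the output of your own CLI. The helper name clusters_below is hypothetical and is not part of the TKGI CLI.

```shell
# Sketch only: assumes each data row of "tkgi clusters" starts with the
# TKGI version (for example 1.10.1-build.7) followed by the cluster name.
# clusters_below is a hypothetical helper, not part of the TKGI CLI.
clusters_below() {
  # $1: minimum required version as MAJOR.MINOR, for example 1.11
  awk -v min="$1" '
    BEGIN { split(min, want, ".") }
    # Skip the header and any line that does not begin with a version.
    $1 !~ /^[0-9]+\.[0-9]+/ { next }
    {
      split($1, have, ".")
      if (have[1] + 0 < want[1] + 0 ||
          (have[1] + 0 == want[1] + 0 && have[2] + 0 < want[2] + 0))
        print $2, $1
    }'
}

# Usage (needs a logged-in TKGI CLI):
#   tkgi clusters | clusters_below 1.11
```

Each line printed is a cluster name and its version. Empty output suggests that no cluster is below the given version, provided the column assumption holds.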
It is critical that you confirm that a cluster’s resource usage is within the recommended maximum limits before upgrading the cluster.
VMware Tanzu Kubernetes Grid Integrated Edition upgrades a cluster by upgrading control plane and worker nodes individually. The upgrade processes a control plane node by redistributing the node’s workload, stopping the node, upgrading it and restoring its workload. This redistribution of a node’s workloads increases the resource usage on the remaining nodes during the upgrade process.
If a Kubernetes cluster control plane VM is operating too close to capacity, the upgrade can fail.
Warning: Downtime is required to repair a cluster failure resulting from upgrading an overloaded Kubernetes cluster control plane VM.
To prevent workload downtime during a cluster upgrade, complete the following before upgrading a cluster:
Ensure none of the control plane VMs being upgraded will become overloaded during the cluster upgrade. See Control Plane Node VM Size for more information.
Review the cluster’s workload resource usage.
Scale up the cluster if it is near capacity on its existing infrastructure. To scale up, either run the command below or create a new cluster using a larger plan. For more information, see Changing Cluster Configurations.
tkgi update-cluster CLUSTER-NAME --num-nodes NUMBER-OF-WORKER-NODES
Where:
CLUSTER-NAME is the name of your cluster.
NUMBER-OF-WORKER-NODES is the number of worker nodes that you want to set for the cluster.
Note: VMware recommends that you avoid using the tkgi resize command to perform resizing operations.
Run the cluster’s workloads on at least three worker VMs using multiple replicas of your workloads spread across those VMs. For more information, see Maintaining Workload Uptime.
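One possible way to review resource usage before an upgrade is to scan kubectl top nodes output for nodes that are already running hot. The sketch below assumes that the Kubernetes metrics API is available in the cluster and that the output uses the NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY% columns. The helper name nodes_over and the 80 percent limit are hypothetical choices, not VMware recommendations. It covers only the nodes that kubectl reports, so control plane VM sizing still requires the Control Plane Node VM Size guidance.

```shell
# Sketch only: reads "kubectl top nodes" output on stdin and prints the
# nodes whose CPU% or MEMORY% exceeds the given limit.
# nodes_over and the 80 percent limit in the usage line are hypothetical.
nodes_over() {
  # $1: usage limit in percent
  awk -v limit="$1" '
    $1 == "NAME" { next }   # skip the header row
    NF < 5 { next }         # skip incomplete rows
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)
      if (cpu + 0 > limit + 0 || mem + 0 > limit + 0)
        print $1, $3, $5
    }'
}

# Usage (needs cluster access and the metrics API):
#   kubectl top nodes | nodes_over 80
```

Any node listed is a candidate for scaling up the cluster before the upgrade redistributes workloads onto it.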
Verify that your Kubernetes environment is healthy. To verify the health of your Kubernetes environment, see Verifying Deployment Health.
If you are upgrading Tanzu Kubernetes Grid Integrated Edition, verify that the configuration of your environment supports the TKGI version you are installing:
If you are using Flannel networking, this verification step is unnecessary.
If you are upgrading Tanzu Kubernetes Grid Integrated Edition for environments using vSphere with NSX-T, perform the following steps:
Note: Workloads in your Kubernetes cluster are unavailable while the NSX Edge nodes run the upgrade unless you configure NSX Edge for high availability. For more information, see the Configure NSX Edge for High Availability (HA) section of Preparing NSX-T Before Deploying Tanzu Kubernetes Grid Integrated Edition.
If you are upgrading Tanzu Kubernetes Grid Integrated Edition in an environment using Antrea networking, perform the following steps:
Note: Port 6081 must be open on all of the worker node VMs and port 8091 must be open on all control plane node VMs in the clusters you create in an Antrea networking environment.
Clean up or fix any previous failed attempts to create TKGI clusters with the TKGI Command Line Interface (TKGI CLI) by performing the following steps:
View your deployed clusters by running the following command:
tkgi clusters
If the Status of any cluster displays as FAILED, continue to the next step. If no cluster displays as FAILED, no action is required. Continue to the next section.
To troubleshoot and fix failed clusters, perform the procedure in Cluster Creation Fails.
To clean up failed BOSH deployments related to failed clusters, perform the procedure in Cannot Re-Create a Cluster that Failed to Deploy.
After fixing and cleaning up any failed clusters, view your deployed clusters again by running tkgi clusters.
For more information about troubleshooting and fixing failed clusters, see the Knowledge Base.
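When many clusters are deployed, the check for failed clusters can be scripted. The following is a minimal sketch that prints every row of tkgi clusters output in which some column equals FAILED, ignoring case. It makes no assumption about column order. The helper name failed_rows is hypothetical.

```shell
# Sketch only: prints every row of "tkgi clusters" output in which some
# column equals "failed", ignoring case.
# failed_rows is a hypothetical helper, not part of the TKGI CLI.
failed_rows() {
  awk '{
    for (i = 1; i <= NF; i++)
      if (tolower($i) == "failed") { print; next }
  }'
}

# Usage (needs a logged-in TKGI CLI):
#   tkgi clusters | failed_rows
```

Empty output suggests that no cluster is in a failed state. Any row printed identifies a cluster to fix or clean up before upgrading.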
Verify that existing Kubernetes clusters have unique external hostnames by checking for multiple Kubernetes clusters with the same external hostname. Perform the following steps:
Log in to the TKGI CLI. For more information, see Logging in to Tanzu Kubernetes Grid Integrated Edition. You must log in with an account that has the UAA scope of pks.clusters.admin. For more information about UAA scopes, see Managing Tanzu Kubernetes Grid Integrated Edition Users with UAA.
View your deployed TKGI clusters by running the following command:
tkgi clusters
For each deployed cluster, run tkgi cluster CLUSTER-NAME to view the details of the cluster. For example:
$ tkgi cluster my-cluster
Examine the output to verify that the Kubernetes Master Host is unique for each cluster.
Verify your current TKGI proxy configuration by performing the following steps:
Check whether an existing proxy is enabled:
If the existing No Proxy field contains any of the following values, or you plan to add any of the following values, contact Support:
localhost
my-host.mydomain.com
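For the external hostname check described earlier in this topic, the comparison across clusters can be sketched in shell. The helper below reads the combined output of several tkgi cluster CLUSTER-NAME commands and prints any Kubernetes Master Host value that appears more than once. It relies on the Kubernetes Master Host: label that the command prints. The helper name duplicate_hosts and the cluster names in the usage line are hypothetical.

```shell
# Sketch only: reads one or more "tkgi cluster NAME" outputs on stdin and
# prints each Kubernetes Master Host value that occurs more than once.
# duplicate_hosts is a hypothetical helper, not part of the TKGI CLI.
duplicate_hosts() {
  awk '/Kubernetes Master Host:/ { print $NF }' | sort | uniq -d
}

# Usage (needs a logged-in TKGI CLI; cluster names are placeholders):
#   for c in cluster-a cluster-b cluster-c; do tkgi cluster "$c"; done | duplicate_hosts
```

Empty output suggests that every cluster has a unique external hostname.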
Tanzu Kubernetes Grid Integrated Edition upgrades can run without ever completing if any Kubernetes app has a PodDisruptionBudget with maxUnavailable set to 0.
To ensure that no apps have a PodDisruptionBudget with maxUnavailable set to 0:
Run the following kubectl command to verify the PodDisruptionBudget as the cluster administrator:
kubectl get poddisruptionbudgets --all-namespaces
Examine the output to verify that no app displays 0 in the MAX UNAVAILABLE column.
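In clusters with many PodDisruptionBudgets, the MAX UNAVAILABLE check can be automated. The sketch below assumes that the input has three whitespace-separated columns (namespace, name, and maxUnavailable), as produced by the kubectl custom-columns output format shown in the usage comment. The helper name blocking_pdbs is hypothetical. It treats 0% the same as 0, which is an interpretation rather than something the steps above state.

```shell
# Sketch only: reads "NAMESPACE NAME MAXUNAVAILABLE" rows on stdin and
# prints NAMESPACE/NAME for each budget whose maxUnavailable is 0 or 0%.
# blocking_pdbs is a hypothetical helper name.
blocking_pdbs() {
  awk '$3 == "0" || $3 == "0%" { print $1 "/" $2 }'
}

# Usage (needs cluster administrator access):
#   kubectl get poddisruptionbudgets --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,MAX:.spec.maxUnavailable \
#     | blocking_pdbs
```

Any budget printed should be adjusted before the upgrade, because it can keep a node drain from completing.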
During the Tanzu Kubernetes Grid Integrated Edition upgrade process, worker nodes are cordoned and drained. Workloads can prevent worker nodes from draining and cause the upgrade to fail or hang.
To prevent hanging cluster upgrades, you can configure default node drain behavior using the following methods:
The new default behavior takes effect during the next upgrade, not immediately after configuring the behavior.
To configure node drain behavior in the Tanzu Kubernetes Grid Integrated Edition tile, see Worker Node Hangs Indefinitely in Troubleshooting.
To configure default node drain behavior with the TKGI CLI:
View the current node drain behavior by running the following command:
tkgi cluster CLUSTER-NAME --details
Where CLUSTER-NAME is the name of your cluster.
For example:
$ tkgi cluster my-cluster --details
Name: my-cluster
Plan Name: small
UUID: f55ed6c4-c0a7-451d-b735-56c89fdb2ad7
Last Action: CREATE
Last Action State: succeeded
Last Action Description: Instance provisioning completed
Kubernetes Master Host: my-cluster.tkgi.local
Kubernetes Master Port: 8443
Worker Nodes: 3
Kubernetes Master IP(s): 10.196.219.88
Network Profile Name:
Kubernetes Settings Details:
Set by Cluster:
Kubelet Node Drain timeout (mins) (kubelet-drain-timeout): 10
Kubelet Node Drain grace-period (mins) (kubelet-drain-grace-period): 10
Kubelet Node Drain force (kubelet-drain-force): true
Set by Plan:
Kubelet Node Drain force-node (kubelet-drain-force-node): true
Kubelet Node Drain ignore-daemonsets (kubelet-drain-ignore-daemonsets): true
Kubelet Node Drain delete-local-data (kubelet-drain-delete-local-data): true
Configure the default node drain behavior by running the following command:
tkgi update-cluster CLUSTER-NAME FLAG
Where:
CLUSTER-NAME is the name of your cluster.
FLAG is an action flag for updating the node drain behavior.
For example:
$ tkgi update-cluster my-cluster --kubelet-drain-timeout 1 --kubelet-drain-grace-period 5
Update summary for cluster my-cluster:
Kubelet Drain Timeout: 1
Kubelet Drain Grace Period: 5
Are you sure you want to continue? (y/n): y
Use 'tkgi cluster my-cluster' to monitor the state of your cluster
For a list of the available action flags for setting node drain behavior, see tkgi update-cluster in TKGI CLI.
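To compare node drain settings across clusters, the relevant lines of the tkgi cluster CLUSTER-NAME --details output can be reduced to key=value pairs. The following sketch relies on the line format shown in the example output above, where each setting name appears in parentheses before the colon. The helper name drain_settings is hypothetical.

```shell
# Sketch only: reads "tkgi cluster NAME --details" output on stdin and
# prints each node drain setting as key=value, for example
# kubelet-drain-timeout=10.
# drain_settings is a hypothetical helper, not part of the TKGI CLI.
drain_settings() {
  sed -n 's/.*(\(kubelet-drain-[a-z-]*\)): *\([^ ]*\).*/\1=\2/p'
}

# Usage (needs a logged-in TKGI CLI):
#   tkgi cluster my-cluster --details | drain_settings
```

Running this before and after tkgi update-cluster gives a quick way to confirm which values changed.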