Upgrade Preparation Checklist for Tanzu Kubernetes Grid Integrated Edition

This topic describes the preparation steps to complete before upgrading VMware Tanzu Kubernetes Grid Integrated Edition (TKGI) from v1.13 to v1.14.

Overview

The following are the procedures that you must complete before beginning your TKGI upgrade.

Warning: Failure to follow these instructions might jeopardize your existing deployment data and cause the TKGI upgrade to fail.

To prepare for a TKGI upgrade:

Back Up Your TKGI Deployment
Review What Happens During TKGI Upgrades
Review Changes in TKGI
Determine Upgrade Order (vSphere Only)
Set User Expectations and Restrict Cluster Access
Upgrade All Clusters
Verify Your Clusters Support Upgrading
Switch to the Auto-Deployed vSphere CSI Driver Before Upgrading
(Optional) Customize Cluster Container Runtimes Before Upgrading
Verify Health of Kubernetes Environment
Verify Your Environment Configuration
Clean Up or Fix Failed Kubernetes Clusters
Verify Kubernetes Clusters Have Unique External Hostnames
Verify TKGI Proxy Configuration
Check PodDisruptionBudget Value
(Optional) Configure Node Drain Behavior

After completing the steps in this topic, continue to Upgrading Tanzu Kubernetes Grid Integrated Edition (Antrea and Flannel Networking) or Upgrading Tanzu Kubernetes Grid Integrated Edition (NSX-T Networking).

Back Up Your Tanzu Kubernetes Grid Integrated Edition Deployment

VMware recommends backing up your Tanzu Kubernetes Grid Integrated Edition deployment and workloads before upgrading. To back up Tanzu Kubernetes Grid Integrated Edition, see Backing Up and Restoring Tanzu Kubernetes Grid Integrated Edition.

Review What Happens During Tanzu Kubernetes Grid Integrated Edition Upgrades

If you have not already done so, review About Tanzu Kubernetes Grid Integrated Edition Upgrades.

Plan your upgrade based on your workload capacity and uptime requirements.

Review Changes in Tanzu Kubernetes Grid Integrated Edition v1.14

Review the Release Notes for Tanzu Kubernetes Grid Integrated Edition v1.14.

Determine Upgrade Order (vSphere Only)

To determine the upgrade order for your Tanzu Kubernetes Grid Integrated Edition environment, review Upgrade Order for Tanzu Kubernetes Grid Integrated Edition Environments on vSphere.

Set User Expectations and Restrict Cluster Access

Coordinate the Tanzu Kubernetes Grid Integrated Edition upgrade with cluster admins and users. During the upgrade:

Their workloads will remain active and accessible.
They will be unable to perform cluster management functions, including creating, resizing, updating, and deleting clusters.
They will be unable to log in to TKGI or use the TKGI CLI and other TKGI control plane services.

Note: Do not start any cluster management tasks right before an upgrade. Wait for cluster operations to complete before upgrading.

Upgrade All Clusters

Tanzu Kubernetes Grid Integrated Edition v1.14 does not support clusters running versions of TKGI earlier than v1.13.

Before you upgrade from Tanzu Kubernetes Grid Integrated Edition v1.13 to v1.14, you must upgrade all of your TKGI-provisioned clusters to v1.13.

To upgrade TKGI-provisioned clusters:

Check the version of your clusters:
```
tkgi clusters
```
If one or more of your clusters are running a version of TKGI earlier than v1.13:
1. Verify these clusters support being upgraded to TKGI v1.13. For more information, see Verify Your Clusters Support Upgrading below.
2. Upgrade the clusters to TKGI v1.13. For instructions, see the Upgrading Clusters topic in the TKGI v1.13 documentation.
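With many clusters, the version check above can be scripted. The following is a minimal sketch, not an official procedure. It assumes that each data row printed by tkgi clusters starts with the cluster's TKGI version followed by the cluster name; confirm this against the output of your own CLI version. The function name clusters_not_on_version is hypothetical.

```shell
# Sketch: print "NAME VERSION" for every cluster whose TKGI version does not
# start with the given prefix.
# Assumption: field 1 of each data row is the version and field 2 is the name.
# Rows whose first field does not look like a version (headers, notices) are skipped.
clusters_not_on_version() {
  awk -v want="$1" '
    $1 ~ /^[0-9]+\.[0-9]+\./ && index($1, want) != 1 { print $2, $1 }
  '
}

# Usage: tkgi clusters | clusters_not_on_version 1.13.
```

If the function prints nothing, every listed cluster already reports a version with the requested prefix.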

Note: When upgrading TKGI to mitigate the Apache Log4j vulnerability, you must also upgrade all TKGI clusters.

Verify Your Clusters Support Upgrading

Before upgrading a cluster, confirm that the cluster’s resource usage is within the recommended maximum limits.

Tanzu Kubernetes Grid Integrated Edition upgrades a cluster by upgrading its control plane and worker nodes one at a time. For each node, the upgrade redistributes the node’s workload, stops the node, upgrades it, and then restores its workload. This redistribution increases the resource usage on the remaining nodes for the duration of the upgrade.

If a Kubernetes cluster control plane VM is operating too close to capacity, the upgrade can fail.

Warning: Downtime is required to repair a cluster failure resulting from upgrading an overloaded Kubernetes cluster control plane VM.

To prevent workload downtime during a cluster upgrade, complete the following before upgrading a cluster:

Ensure none of the control plane VMs being upgraded will become overloaded during the cluster upgrade. See Control Plane Node VM Size for more information.
Review the cluster’s workload resource usage.
Scale up the cluster if it is near capacity on its existing infrastructure. Scale up your cluster by running the command below or create a new cluster using a larger plan. For more information, see Changing Cluster Configurations.
```
tkgi update-cluster CLUSTER-NAME --num-nodes NUMBER-OF-WORKER-NODES
```
Where:
- CLUSTER-NAME is the name of your cluster.
- NUMBER-OF-WORKER-NODES is the number of worker nodes that you want to set for the cluster.
Note: VMware recommends that you avoid using the tkgi resize command to perform resizing operations.
Run the cluster’s workloads on at least three worker VMs using multiple replicas of your workloads spread across those VMs. For more information, see Maintaining Workload Uptime.
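One way to review worker node usage before deciding whether to scale up is to filter the output of kubectl top nodes. The following is a minimal sketch; it assumes the Metrics API is available in the cluster, and both the function name nodes_over_threshold and the threshold you pass to it are illustrative choices rather than TKGI guidance.

```shell
# Sketch: list nodes whose CPU% or MEMORY% is at or above a threshold.
# Input: the table printed by `kubectl top nodes`
# (columns: NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).
nodes_over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)
      if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0) print $1, $3, $5
    }
  '
}

# Usage: kubectl top nodes | nodes_over_threshold 80
```

Nodes listed by the function are candidates for closer review, because the upgrade temporarily moves workloads onto the remaining nodes.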

Switch to the Auto-Deployed vSphere CSI Driver Before Upgrading

TKGI v1.14 and later do not support the manually installed vSphere CSI driver.

If you have manually installed the vSphere CSI driver on your clusters, you must switch your clusters to use the automatically installed CSI driver before upgrading the clusters to TKGI v1.14.

For more information, see Switch From the Manually Installed vSphere CSI Driver to the Automatic CSI Driver in Deploying and Managing Cloud Native Storage (CNS) on vSphere.

(Optional) Customize Cluster Container Runtimes Before Upgrading

Containerd is the default container runtime for newly created clusters.

Note: All clusters that use the Docker container runtime must be switched to the containerd runtime before upgrading to TKGI v1.15.

By default, the TKGI v1.14 upgrade-cluster errand will switch a cluster’s container runtime from Docker to containerd.

Warning: During a TKGI upgrade, cluster workloads will experience downtime while the cluster switches from using the Docker container runtime to containerd. For more information, see Cluster Workloads Experience Downtime While Upgrading and Switching Container Runtimes in the Release Notes.

To avoid workload downtime, VMware recommends that, before upgrading to TKGI v1.14, you either switch your clusters to the containerd container runtime or “lock” your clusters to the Docker container runtime:

Switch a Cluster to a Different Container Runtime
Lock a Cluster to the Docker Container Runtime

Warning: The default value for lock_container_runtime is false. The upgrade to TKGI v1.14.0 will switch a “locked” cluster to using the containerd runtime if, between locking and upgrading, you ran tkgi update-cluster without including the lock_container_runtime: true parameter in your configuration.

Switch a Cluster to a Different Container Runtime

You can switch an existing cluster from using a Docker container runtime to a containerd container runtime.

Warning: During a TKGI upgrade, cluster workloads will experience downtime while the cluster switches from using the Docker container runtime to containerd. To avoid workload downtime, VMware recommends that you switch your clusters to the containerd container runtime before upgrading to TKGI v1.14. For more information, see Cluster Workloads Experience Downtime While Upgrading and Switching Container Runtimes in the Release Notes.

To switch an existing cluster to a different container runtime:

Identify which of your existing clusters use the Docker container runtime by running the following command against each cluster and checking the CONTAINER-RUNTIME column:
```
kubectl get nodes -o wide
```
Create either a JSON or YAML cluster configuration file containing the following content:
- JSON formatted configuration file:
```
{
    "runtime": "RUNTIME-NAME"
}
```
- YAML formatted configuration file:
```
---
runtime: RUNTIME-NAME
```
Where RUNTIME-NAME specifies either docker or containerd as the container runtime to switch to.
To update your cluster with your configuration settings, run the following command:
```
tkgi update-cluster CLUSTER-NAME --config-file CONFIG-FILE-NAME
```
Where:
- CLUSTER-NAME is the name of your cluster.
- CONFIG-FILE-NAME is the cluster configuration file you created above.
WARNING: Update the configuration file only on a TKGI cluster that has been upgraded to the current TKGI version. For more information, see Tasks Supported Following a TKGI Control Plane Upgrade in About Tanzu Kubernetes Grid Integrated Edition Upgrades.
Verify that your cluster now uses the container runtime you specified, for example by running kubectl get nodes -o wide again.
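Both the identification step and the final verification come down to reading the runtime reported for each node. In the kubectl get nodes -o wide table, the runtime appears in the last column (CONTAINER-RUNTIME) as a value such as docker://VERSION or containerd://VERSION. The following minimal sketch filters that table for Docker nodes; the function name docker_nodes is hypothetical.

```shell
# Sketch: print the name and runtime of every node still running Docker.
# Input: the table printed by `kubectl get nodes -o wide`.
# The runtime is read from the last field because the OS-IMAGE column
# can contain spaces, which shifts the earlier field positions.
docker_nodes() {
  awk 'NR > 1 && $NF ~ /^docker:\/\// { print $1, $NF }'
}

# Usage: kubectl get nodes -o wide | docker_nodes
```

Empty output after a switch to containerd suggests that no node in the current cluster context still reports a Docker runtime.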

Lock a Cluster to the Docker Container Runtime

If you want an existing cluster to continue using the Docker container runtime after it has been upgraded to TKGI v1.14, you must lock the cluster’s container runtime before upgrading.

Note: If you want to lock a cluster to the Docker container runtime, you must lock the container runtime prior to upgrading to TKGI v1.14.

To lock an existing cluster to its current container runtime:

Identify which of your existing clusters use the Docker container runtime by running the following command against each cluster and checking the CONTAINER-RUNTIME column:
```
kubectl get nodes -o wide
```
Create either a JSON or YAML cluster configuration file containing the following content:
- JSON formatted configuration file:
```
{
    "lock_container_runtime": true
}
```
- YAML formatted configuration file:
```
---
lock_container_runtime: true
```
To update your cluster with your configuration settings, run the following command:
```
tkgi update-cluster CLUSTER-NAME --config-file CONFIG-FILE-NAME
```
Where:
- CLUSTER-NAME is the name of your cluster.
- CONFIG-FILE-NAME is the configuration file to use to lock the container runtime.

WARNING: Update the configuration file only on a TKGI cluster that has been upgraded to the current TKGI version. For more information, see Tasks Supported Following a TKGI Control Plane Upgrade in About Tanzu Kubernetes Grid Integrated Edition Upgrades.
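Because running tkgi update-cluster with a configuration file that omits lock_container_runtime: true can undo the lock, it may help to check a configuration file before using it. The following is a minimal sketch that accepts either the JSON or the YAML form shown above; the function name has_runtime_lock is hypothetical, and the pattern match is a simple text check rather than a full JSON or YAML parser.

```shell
# Sketch: succeed only if the file sets lock_container_runtime to true.
# Matches both `lock_container_runtime: true` (YAML)
# and `"lock_container_runtime": true` (JSON) at the start of a line.
has_runtime_lock() {
  grep -Eq '^[[:space:]]*"?lock_container_runtime"?[[:space:]]*:[[:space:]]*true' "$1"
}

# Usage (CONFIG-FILE-NAME and CLUSTER-NAME are placeholders):
#   has_runtime_lock CONFIG-FILE-NAME &&
#     tkgi update-cluster CLUSTER-NAME --config-file CONFIG-FILE-NAME
```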

Verify Health of Kubernetes Environment

Verify that your Kubernetes environment is healthy. To verify the health of your Kubernetes environment, see Verifying Deployment Health.

Verify Your Environment Configuration

If you are upgrading Tanzu Kubernetes Grid Integrated Edition, verify the configuration of your environment supports the TKGI version you are installing:

Verify Your vSphere with NSX-T Configuration
Verify Your Antrea Environment Configuration

If you are using Flannel networking, this verification step is unnecessary.

Verify Your vSphere with NSX-T Configuration

If you are upgrading Tanzu Kubernetes Grid Integrated Edition for environments using vSphere with NSX-T, perform the following steps:

Verify that the vSphere datastores have enough space.
Verify that the vSphere hosts have enough memory.
Verify that there are no alarms in vSphere.
Verify that the vSphere hosts are in a good state.
Verify that NSX Edge is configured for high availability.
Note: Workloads in your Kubernetes cluster are unavailable while the NSX Edge nodes run the upgrade unless you configure NSX Edge for high availability. For more information, see the Configure NSX Edge for High Availability (HA) section of Preparing NSX-T Before Deploying Tanzu Kubernetes Grid Integrated Edition.

Verify Your Antrea Environment Configuration

If you are upgrading Tanzu Kubernetes Grid Integrated Edition in an environment using Antrea networking, perform the following steps:

Verify that UDP port 6081 is open on all worker node VMs.
Verify that TCP port 8091 is open on all control plane node VMs.
Verify your environment configuration meets the Antrea networking requirements. For more information, see Network Requirements in the Antrea GitHub repository.

Note: Port 6081 must be open on all of the worker node VMs and port 8091 must be open on all control plane node VMs in the clusters you create in an Antrea networking environment.

Clean Up or Fix Failed Kubernetes Clusters

Clean up or fix any previous failed attempts to create TKGI clusters with the TKGI Command Line Interface (TKGI CLI) by performing the following steps:

View your deployed clusters by running the following command:
```
tkgi clusters
```
If the Status of any cluster displays as FAILED, continue to the next step. If no cluster displays as FAILED, no action is required. Continue to the next section.
To troubleshoot and fix failed clusters, perform the procedure in Cluster Creation Fails.
To clean up failed BOSH deployments related to failed clusters, perform the procedure in Cannot Re-Create a Cluster that Failed to Deploy.
After fixing and cleaning up any failed clusters, view your deployed clusters again by running tkgi clusters.

For more information about troubleshooting and fixing failed clusters, see the Knowledge Base.
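When the cluster list is long, the check for failed clusters can be narrowed with a filter. The following minimal sketch prints every row of the tkgi clusters output in which some column is exactly "failed", ignoring case, so it does not depend on the position of the Status column. The function name failed_clusters is hypothetical.

```shell
# Sketch: print each row of `tkgi clusters` output that has a field equal
# to "failed" (case-insensitive).
failed_clusters() {
  awk '{ for (i = 1; i <= NF; i++) if (tolower($i) == "failed") { print; break } }'
}

# Usage: tkgi clusters | failed_clusters
```

Running the same filter again after the cleanup steps should print nothing.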

Verify Kubernetes Clusters Have Unique External Hostnames

Verify that existing Kubernetes clusters have unique external hostnames by checking for multiple Kubernetes clusters with the same external hostname. Perform the following steps:

Log in to the TKGI CLI. For more information, see Logging in to Tanzu Kubernetes Grid Integrated Edition. You must log in with an account that has the UAA scope of pks.clusters.admin. For more information about UAA scopes, see Managing Tanzu Kubernetes Grid Integrated Edition Users with UAA.
View your deployed TKGI clusters by running the following command:
```
tkgi clusters
```
For each deployed cluster, run tkgi cluster CLUSTER-NAME to view the details of the cluster. For example:
```
$ tkgi cluster my-cluster
```
Examine the output to verify that the Kubernetes Master Host is unique for each cluster.
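The comparison in the last step can be sketched as two small filters: one that extracts the Kubernetes Master Host value from the tkgi cluster output, and one that reports any hostname seen more than once. This is an illustration only; the function names master_host and duplicate_hosts are hypothetical, and the cluster names in the usage comment are placeholders.

```shell
# Sketch: read `tkgi cluster CLUSTER-NAME` output and print the
# Kubernetes Master Host value with trailing whitespace removed.
master_host() {
  awk -F': *' '/^ *Kubernetes Master Host:/ { sub(/[ \t\r]+$/, "", $2); print $2 }'
}

# Sketch: read one hostname per line and print each hostname that
# occurs more than once.
duplicate_hosts() {
  sort | uniq -d
}

# Usage:
#   for name in CLUSTER-A CLUSTER-B; do
#     tkgi cluster "$name" | master_host
#   done | duplicate_hosts
```

Any hostname printed by duplicate_hosts is shared by at least two clusters.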

Verify TKGI Proxy Configuration

Verify your current TKGI proxy configuration by performing the following steps:

Check whether an existing proxy is enabled:
1. Log in to Ops Manager.
2. Click the VMware Tanzu Kubernetes Grid Integrated Edition tile.
3. Click Networking.
4. If HTTP/HTTPS Proxy is Disabled, no action is required. Continue to the next section. If HTTP/HTTPS Proxy is Enabled, continue to the next step.
Verify the No Proxy field values do not contain an underscore character, for example, my_host.mydomain.com.

Warning: An underscore character in the No Proxy field can cause your upgrade to fail. If an existing No Proxy field value contains an underscore character, or you plan to add a value containing an underscore, contact Support.
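If the No Proxy field holds many entries, the underscore check can be done on a copy of the value. The following minimal sketch assumes the value is a comma-separated list, as copied from the tile; the function name no_proxy_underscores is hypothetical.

```shell
# Sketch: print each No Proxy entry that contains an underscore.
# $1: the No Proxy value as a comma-separated list.
# Exit status is 0 if at least one such entry is found, non-zero otherwise.
no_proxy_underscores() {
  printf '%s\n' "$1" | tr ',' '\n' |
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    grep '_'
}

# Usage: no_proxy_underscores 'localhost,my_host.mydomain.com'
```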

Check PodDisruptionBudget Value

A Tanzu Kubernetes Grid Integrated Edition upgrade can hang indefinitely if any Kubernetes app has a PodDisruptionBudget with maxUnavailable set to 0, because the affected nodes can never finish draining.

To ensure that no apps have a PodDisruptionBudget with maxUnavailable set to 0:

Run the following kubectl command to verify the PodDisruptionBudget as the cluster administrator:
```
kubectl get poddisruptionbudgets --all-namespaces
```
Examine the output to verify that no app displays 0 in the MAX UNAVAILABLE column.
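In clusters with many PodDisruptionBudgets, the check above can be narrowed to the offending entries. The following is a minimal sketch that uses kubectl's custom-columns output to print only the namespace, name, and spec.maxUnavailable of each budget, and then filters for a value of 0 or 0%. The column headers and the function name blocking_pdbs are hypothetical choices.

```shell
# Sketch: print NAMESPACE/NAME for each PodDisruptionBudget whose
# maxUnavailable is 0 or 0%.
# Input: three whitespace-separated columns (namespace, name, maxUnavailable)
# preceded by one header row.
blocking_pdbs() {
  awk 'NR > 1 && ($3 == "0" || $3 == "0%") { print $1 "/" $2 }'
}

# Usage:
#   kubectl get poddisruptionbudgets --all-namespaces \
#     -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,MAX_UNAVAILABLE:.spec.maxUnavailable |
#     blocking_pdbs
```

Budgets that do not set maxUnavailable appear with a placeholder value in that column and are not reported by this filter.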

(Optional) Configure Node Drain Behavior

During the Tanzu Kubernetes Grid Integrated Edition upgrade process, worker nodes are cordoned and drained. Workloads can prevent worker nodes from draining and cause the upgrade to fail or hang.

To prevent hanging cluster upgrades, you can configure default node drain behavior using the following methods:

Configure with the TKGI Tile
Configure with the TKGI CLI

The new default behavior takes effect during the next upgrade, not immediately after configuring the behavior.

Configure with the TKGI Tile

To configure node drain behavior in the Tanzu Kubernetes Grid Integrated Edition tile, see Worker Node Hangs Indefinitely in Troubleshooting.

Configure with the TKGI CLI

To configure default node drain behavior with the TKGI CLI:

View the current node drain behavior by running the following command:

```
tkgi cluster CLUSTER-NAME --details
```

Where CLUSTER-NAME is the name of your cluster.

For example:

```
$ tkgi cluster my-cluster --details

  Name:                     my-cluster
  Plan Name:                small
  UUID:                     f55ed6c4-c0a7-451d-b735-56c89fdb2ad7
  Last Action:              CREATE
  Last Action State:        succeeded
  Last Action Description:  Instance provisioning completed
  Kubernetes Master Host:   my-cluster.tkgi.local
  Kubernetes Master Port:   8443
  Worker Nodes:             3
  Kubernetes Master IP(s):  10.196.219.88
  Network Profile Name:
  Kubernetes Settings Details:
    Set by Cluster:
    Kubelet Node Drain timeout (mins)            (kubelet-drain-timeout):               10
    Kubelet Node Drain grace-period (mins)       (kubelet-drain-grace-period):          10
    Kubelet Node Drain force                     (kubelet-drain-force):                 true
    Set by Plan:
    Kubelet Node Drain force-node                (kubelet-drain-force-node):            true
    Kubelet Node Drain ignore-daemonsets         (kubelet-drain-ignore-daemonsets):     true
    Kubelet Node Drain delete-local-data         (kubelet-drain-delete-local-data):     true
```

Configure the default node drain behavior by running the following command:

```
tkgi update-cluster CLUSTER-NAME FLAG
```

Where:

- CLUSTER-NAME is the name of your cluster.
- FLAG is an action flag for updating the node drain behavior.

For example:

```
$ tkgi update-cluster my-cluster --kubelet-drain-timeout 1 --kubelet-drain-grace-period 5

Update summary for cluster my-cluster:
Kubelet Drain Timeout: 1
Kubelet Drain Grace Period: 5
Are you sure you want to continue? (y/n): y
Use 'tkgi cluster my-cluster' to monitor the state of your cluster
```

For a list of the available action flags for setting node drain behavior, see tkgi update-cluster in TKGI CLI.
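To compare drain settings across clusters, the relevant lines of the tkgi cluster CLUSTER-NAME --details output can be reduced to flag=value pairs. The following minimal sketch relies on the line layout shown in the example above, where each setting ends with its flag name in parentheses, a colon, and the value. The function name drain_settings is hypothetical.

```shell
# Sketch: read `tkgi cluster CLUSTER-NAME --details` output and print one
# "kubelet-drain-...=VALUE" line per node drain setting.
# Lines that do not contain a (kubelet-drain-...) flag are ignored.
drain_settings() {
  sed -n 's/.*(\(kubelet-drain-[a-z-]*\)):[[:space:]]*\([^[:space:]]*\).*/\1=\2/p'
}

# Usage: tkgi cluster CLUSTER-NAME --details | drain_settings
```

Applied to the example output above, the first line printed would be kubelet-drain-timeout=10.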