This topic contains release notes for Tanzu Kubernetes Grid Integrated Edition (TKGI) v1.20.
Release Date: August 19, 2024
Release Details

| | |
|---|---|
| Version | v1.20.0 |
| Release date | August 19, 2024 |

Internal Component Versions

| Component | Version | Release Notes |
|---|---|---|
| Antrea | v1.8.2* | Release Notes |
| cAdvisor | v0.47.2 | |
| Cloud Providers | AWS: v1.29.2*, Azure: v1.29.7*, vSphere: v1.29.0* | Release Notes: AWS, Azure, vSphere |
| Containerd | Linux: v1.6.33*, Windows: v1.6.33* | |
| CoreDNS | v1.11.1+vmware.10* | |
| CSI Driver for vSphere | v3.3.0* | Release Notes |
| etcd | v3.5.12* | |
| Harbor | v2.11.0 | Release Notes |
| Kubernetes | v1.29.6* | Release Notes |
| Metrics Server | v0.7.0* | |
| NCP | v4.2.0.0* | Release Notes |
| Percona XtraDB Cluster (PXC) (in BOSH pxc-release) | PXC: v8.0.36-28, pxc-release: v1.0.29 | Release Notes: PXC, pxc-release |
| UAA | v77.14.0 | |
| Velero | v1.12.1* | Release Notes |
| Wavefront | Wavefront Collector: v1.29.0, Wavefront Proxy: v13.4 | |

Stemcell Compatibility

| Stemcell | Supported Versions |
|---|---|
| Ubuntu Jammy stemcells | See Retrieve Product Version Compatibilities from the Tanzu API in the Broadcom Support KB. |
| Windows stemcells | v2019.75* or later |

Interoperability

| Product | Supported Versions | Release Notes |
|---|---|---|
| Ops Manager | See Retrieve Product Version Compatibilities from the Tanzu API in the Broadcom Support KB. | |
| VMware Aria Operations Management Pack for Kubernetes | v2.0 | Release Notes |
| VMware Cloud Foundation (VCF) | v5.1.1, v5.1, v4.5.2 | Release Notes: v5.1.1, v5.1, v4.5.2 |
| VMware NSX** | See VMware Product Interoperability Matrices***. | |
| vSphere | | |

Management Console

| Component | Version | Release Notes |
|---|---|---|
| TKGI Management Console | v1.20.0 | |
| Installed Ops Manager version | v3.0.31* | Release Notes |
| Installed Harbor Registry version | v2.11.0* | Release Notes |
| Ubuntu Jammy stemcell | v1.506 | Release Notes |

Note: The component versions supported by TKGI Management Console might differ from or be more limited than the versions supported by TKGI.
* Components marked with an asterisk have been updated.
** As of May 7, 2024, NSX networking and firewall components are sold separately from TKGI.
*** Migration from NSX Management Plane API to NSX Policy API requires VMware NSX v4.0.1.1 or later. NSX v4.0.1.1 supports only 50% of NSX Management Plane API scale. To use Policy API at 100% of Management Plane API scale, use NSX v4.1.1 or later.
The supported upgrade paths to Tanzu Kubernetes Grid Integrated Edition v1.20.0 are from TKGI v1.19.2 and earlier v1.19 patches.
TKGI v1.20.0 has no breaking changes.
TKGI v1.20.0 includes the following new features and changes:

- Supports configuring cluster access to private registries with `--private-registries`, as described in Configuring Cluster Access to Private Registries. The `Private Registries:` line in the output of `tkgi cluster` indicates whether it is enabled.
- Supports changing a cluster's AZs by running `tkgi update-cluster` with a compute profile that specifies new AZs and the `--enforce-compute-profile-update` option. See Using Compute Profiles (vSphere) for details and limitations.
- Adds CA certificate `front_proxy_ca_2024` and leaf certificate `front_proxy_client_2024`, set by the Kubernetes API server options `--requestheader-client-ca-file` and `--proxy-client-cert-file`, to enable third-party extension server authentication as described in Configure the Aggregation Layer in the Kubernetes documentation.
- Removes `hostPath` values from the cAdvisor pod.
- Management Console:
TKGI v1.20.0 resolves the following issues:
TKGI v1.20.0 also incorporates fixes from patch releases of previous minor lines, listed in the TKGI 1.19, TKGI 1.18, and TKGI 1.17 Release Notes. These fixes include, but are not limited to:

- Pods stuck in a `CrashLoopBackOff` state

TKGI v1.20.0 deprecates the following:
TKGI v1.20.0 has the following known issues:
TKGI supports using a public cloud CSI Driver on a TKGI-provisioned cluster.
Installing a Public Cloud CSI Driver on a TKGI Cluster
If you plan to use a public cloud CSI Driver on a TKGI-provisioned cluster, VMware recommends you take additional steps before installing the CSI Driver:
For most public clouds, VMware recommends you follow the CSI Driver installation procedure recommended by the public cloud provider.
For installing the Azure CSI Driver on a TKGI cluster, VMware recommends you follow the procedure in the How to install Azure file/disk CSI driver onto TKGI 1.14 cluster knowledge base article in the VMware Tanzu Support Hub.
Managing a TKGI Cluster That Uses a Public Cloud CSI Driver
If you have enabled a public cloud CSI Driver on a TKGI cluster, you must take additional steps when deleting, upgrading, or updating the cluster:
Updating a Cluster on a Public Cloud
When updating a cluster that uses a public cloud CSI Driver:
To prepare a single-worker node cluster for updating:
Upgrading a Cluster on a Public Cloud
When upgrading a cluster that uses a public cloud CSI Driver:
To prepare a single-worker node cluster for upgrading:
Deleting a Cluster on a Public Cloud
When deleting a cluster that uses a public cloud CSI Driver:
You can only change a cluster's control plane AZs, as described in Using Compute Profiles (vSphere), under both of the following conditions:

- The cluster has at least three control plane nodes.
- You do not change multiple AZs at the same time. Each time you run `tkgi update-cluster --compute-profile ... --enforce-compute-profile-update`, the `cluster_customization.control_plane.az_names` value can only change one AZ name from its previous value.

Failure to meet these conditions can result in `etcd` data loss.
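As an illustration only, the sketch below shows a one-AZ-at-a-time change. The profile layout and AZ names are assumptions; only the `cluster_customization.control_plane.az_names` field and the CLI flags above come from this topic, so follow Using Compute Profiles (vSphere) for the authoritative format.

```
# Hypothetical compute profile that changes ONE control plane AZ (az-1 -> az-4).
# The surrounding JSON structure and all names are illustrative assumptions.
cat > compute-profile-new-az.json <<'EOF'
{
  "name": "control-plane-az-update",
  "parameters": {
    "cluster_customization": {
      "control_plane": {
        "az_names": ["az-4", "az-2", "az-3"]
      }
    }
  }
}
EOF

# Create the profile, then apply it with the enforce flag described above.
tkgi create-compute-profile compute-profile-new-az.json
tkgi update-cluster MY-CLUSTER --compute-profile control-plane-az-update --enforce-compute-profile-update
```

Repeat the process, changing one AZ name each time, until all intended control plane AZs are updated.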
Symptom
On clusters configured to use a containerd registry and Istio CNI, upgrading the TKGI version without also upgrading the stemcell fails with errors kubelet cannot find istio-cni binary
and nsx fails to recieve message header
.
This error does not occur when you upgrade to a new stemcell along with the new TKGI version.
Explanation
When a TKGI cluster upgrade drains a node, it leaves the node's Istio CNI agent and CNI configuration in a corrupted state.
If the cluster nodes are not automatically re-created by a stemcell change, the corrupted Istio CNI state remains.
Workaround
For clusters that use both Containerd and Istio CNI:
If you have already encountered this issue, re-create all worker nodes using the bosh recreate
command:
Run the bosh vms
command to list the cluster VMs:
bosh -d service-instance-DEPLOYMENT-ID vms
Where DEPLOYMENT-ID
is the BOSH-generated ID of your Kubernetes cluster deployment.
For each VM instance listed as worker/UUID
in the output, run bosh recreate VM-NAME
:
bosh -d service-instance-DEPLOYMENT-ID recreate worker/UUID
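For clusters with many worker nodes, the same re-creation can be scripted as a convenience; this loop is not part of the documented procedure, and the `--column` filter and non-interactive `-n` flag are standard BOSH CLI options.

```
# Re-create every worker instance of the cluster deployment, one at a time.
# DEPLOYMENT-ID is the BOSH-generated ID of your Kubernetes cluster deployment.
DEPLOYMENT="service-instance-DEPLOYMENT-ID"

bosh -d "$DEPLOYMENT" instances --column=Instance | grep '^worker/' | \
while read -r worker; do
  bosh -d "$DEPLOYMENT" -n recreate "$worker"
done
```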
In the future, you can avoid this issue by upgrading a cluster’s stemcell whenever you upgrade its TKGI version.
Symptom
In a TKGI environment with Antrea networking, when using the management console to create or edit a compute profile as described in Define Compute Profile, the following form fields do not accept input:
This issue does not apply to TKGI environments with NSX networking.
Symptom
For Windows worker clusters that authenticate users via a group Managed Service Account (gMSA) in Microsoft AD, upgrading the clusters to a new Windows stemcell may cause users to be unable to log in to the cluster. Valid credentials for containers on the cluster may no longer work.
Logfile join-domain/pre-start.stdout.log
contains:
WARNING: The changes will take effect after you restart the computer WIN-<ID-STRING>.
Already joined to domain
Explanation
When BOSH upgrades a VM’s Windows stemcell, it re-creates the VM and then triggers a join-domain
job to reconnect it with its gMSA group. Reconnecting with gMSA requires a second VM reboot, but TKGI does not currently trigger the reboot automatically because its timing would interfere with other upgrade operations.
Workaround
After upgrading a Windows cluster with gMSA to a new stemcell, manually reboot its worker nodes:

1. Run `bosh vms` to list the names of the worker nodes, and record their `Deployment` ID and `Instance` IDs.
2. For the Windows worker nodes, which have `Instance` IDs that begin with `worker/`, log in to them and restart them as follows:
    1. Run `bosh -d DEPLOYMENT-ID ssh INSTANCE-ID`.
    2. Run `powershell` to open a PowerShell session.
    3. Run the following script, which restarts the VM and returns you to your local shell:
Set-Service bosh-agent -StartupType Automatic
Set-Service bosh-dns-windows -StartupType Automatic
Set-Service bosh-dns-healthcheck-windows -StartupType Automatic
Set-Service bosh-dns-nameserverconfig-windows -StartupType Automatic
Set-Service kubelet -StartupType Automatic
Set-Service nsx-kube-proxy -StartupType Automatic
Set-Service nsx-node-agent -StartupType Automatic
Set-Service containerd -StartupType Automatic
Set-Service ovs-vswitchd -StartupType Automatic
Set-Service ovsdb-server -StartupType Automatic
Set-Service system-metrics-agent -StartupType Automatic
Get-Service bosh-agent | Select-Object -Property Name, StartType, Status
Stop-Service -Name bosh-agent -Force -NoWait
# Restart to apply changes
echo "Restarting vm"
Restart-Computer
Wait until the VM is restarted and check the pod status:
kubectl get pod POD-NAME
If the pod is stuck, restart it with `bosh restart`. For example:
bosh -d service-instance_0d0f7798-e4e9-473f-8ddb-279bc61faef0 restart worker/2deca1a2-c6ed-4e37-8dce-91b141d98e8f
Where `service-instance_0d0f7798-e4e9-473f-8ddb-279bc61faef0` is the example DEPLOYMENT-ID and `worker/2deca1a2-c6ed-4e37-8dce-91b141d98e8f` is the example VM INSTANCE-ID.
Symptom
When you deploy a workload on a TKGI-provisioned cluster with NSX networking that is running Tanzu Application Platform (TAP), you see an error Failed to create pod sandbox
and no resources are created in the cluster’s nsx-system
namespace.
Explanation
The total number of Kubernetes object labels and other tags created by both TKGI and TAP can exceed the number that is allowed by NSX.
Workaround
Create or update your network profile as described in Creating and Managing Network Profiles (NSX Only), setting the cni_configurations
parameter extensions.ncp.k8s.label_filtering_regex_list
as described under label_filtering Settings.
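For orientation only, here is a hypothetical network profile fragment; the exact nesting around `cni_configurations` and the regular expressions shown are assumptions, so follow Creating and Managing Network Profiles (NSX Only) and the label_filtering Settings section for the authoritative format.

```
# Hypothetical network profile JSON; field layout and regex values are
# illustrative assumptions, not the documented reference format.
cat > label-filtering-profile.json <<'EOF'
{
  "name": "tap-label-filtering",
  "parameters": {
    "cni_configurations": {
      "type": "nsxt",
      "parameters": {
        "extensions": {
          "ncp": {
            "k8s": {
              "label_filtering_regex_list": ["app.kubernetes.io/.*", "tap.example.com/.*"]
            }
          }
        }
      }
    }
  }
}
EOF

# Create the profile; apply it to clusters as described in the linked topic.
tkgi create-network-profile label-filtering-profile.json
```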
Symptom
After you restore Ops Manager and the TKGI API VM from backup, TKGI functions normally, but your TKGI MC tabs include the following error: “…product ‘pivotal-container service’ is not deployed…”.
Explanation
TKGI MC is associated with an Ops Manager installation with a specific name. If you give Ops Manager a new name while restoring it, TKGI MC will not recognize the restored Ops Manager and cannot manage it.
VMware vRealize Operations (vROPs) does not support Windows worker-based Kubernetes clusters and cannot be used to manage TKGI-provisioned Windows workers.
To monitor Windows-based worker node clusters with a Wavefront collector and proxy, you must first install Wavefront on the clusters manually, using Helm. For instructions, see the Wavefront section of the Monitoring Windows Worker Clusters and Nodes topic.
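As a hedged starting point only, a generic install of the public Wavefront Helm chart is sketched below, assuming the chart repository and value names shown are current; the Windows-specific values and the supported procedure are in the linked Monitoring Windows Worker Clusters and Nodes topic.

```
# Hypothetical example: install the Wavefront collector and proxy with Helm.
# Repository URL and chart values are assumptions; YOUR-CLUSTER and
# YOUR-API-TOKEN are placeholders for your Wavefront instance and token.
helm repo add wavefront https://wavefronthq.github.io/helm/
helm repo update

helm install wavefront wavefront/wavefront \
  --namespace wavefront --create-namespace \
  --set wavefront.url=https://YOUR-CLUSTER.wavefront.com \
  --set wavefront.token=YOUR-API-TOKEN \
  --set clusterName=my-tkgi-windows-cluster
```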
TKGI-provisioned Windows worker-based Kubernetes clusters inherit a Kubernetes limitation that prevents outbound ICMP communication from workers. As a result, pinging Windows workers does not work.
For information about this limitation, see Limitations > Networking in the Windows in Kubernetes documentation.
When restoring the TKGI management plane from backup as described in Restoring TKGI Management Plane Components, you may see an error like the following, along with errors for the bbr-uaadb
and pks-api
components:
```
ERROR 3780 (HY000) at line 25: Referencing column 'SESSION_PRIMARY_ID' and referenced column 'PRIMARY_ID' in foreign key constraint 'SPRING_SESSION_ATTRIBUTES_FK' are incompatible.
```
With these errors, the User Account and Authentication (UAA) database fails to restore.
You can use Velero to back up stateless TKGI-provisioned Windows workers only. You cannot use Velero to back up stateful Windows applications. For more information, see Velero on Windows in Basic Install in the Velero documentation.
TMC Data Protection feature supports privileged TKGI containers only. For more information, see Plans in the Installing TKGI topic for your IaaS.
Windows worker-based Kubernetes clusters integrated with group Managed Service Account (gMSA) cannot be managed using compute profiles.
TKGI CLI does not prevent accidentally reducing a cluster’s control plane node count using a compute profile.
Warning: Reducing a cluster’s control plane node count can destroy the cluster. Do not scale out or scale in existing control plane nodes by reconfiguring the TKGI tile or by using a compute profile. Reducing a cluster’s number of control plane nodes might remove a control plane node and cause the cluster to become inactive.
Symptom
After you delete a VM using the management console of your infrastructure provider, you notice a Windows worker node that had been on that VM is now in a notReady
state.
Solution
To identify the leftover node:

1. Run `kubectl get no -o wide`.
2. Identify nodes that are in a `notReady` state and have the same IP address as another node in the list.

To manually delete a `notReady` node:

Run `kubectl delete node NODE-NAME`, where `NODE-NAME` is the name of the node in the `notReady` state.
Symptom
You experience a “502 Bad Gateway” error from the NSX load balancer after you log in to OIDC.
Explanation
A large response header has exceeded your NSX load balancer maximum response header size. The default maximum response header size is 10,240 characters and should be resized to 16,384.
Workaround
If you experience this issue, manually reconfigure your NSX `request_header_size` to `4096` characters and your `response_header_size` to `16384`. For information about configuring NSX default header sizes, see OIDC Response Header Overflow in the Knowledge Base.
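As a hedged illustration only (the Policy API path, profile ID, and credentials shown are assumptions; the Knowledge Base article is authoritative), a header-size change on an NSX load balancer HTTP application profile might look like:

```
# Hypothetical example: PATCH an NSX Policy API LB HTTP application profile.
# NSX-MANAGER, the admin credentials, and the profile ID
# "default-http-lb-app-profile" are placeholders for your environment.
curl -k -u 'admin:PASSWORD' \
  -X PATCH "https://NSX-MANAGER/policy/api/v1/infra/lb-app-profiles/default-http-lb-app-profile" \
  -H 'Content-Type: application/json' \
  -d '{
        "resource_type": "LBHttpProfile",
        "request_header_size": 4096,
        "response_header_size": 16384
      }'
```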
You must configure a global proxy in the Tanzu Kubernetes Grid Integrated Edition tile > Networking pane before you create any Windows workers that use the proxy.
You cannot change the proxy configuration for Windows workers in an existing cluster.
For vSphere with NSX, the HTTP Proxy password field does not support the `&` or `;` special characters.
Symptom
You receive the following error after modifying your existing Harbor installation’s storage configuration:
Error response from daemon: manifest for ... not found: manifest unknown: manifest unknown
Explanation
Harbor does not support modifying an existing Harbor installation’s storage configuration.
Workaround
To modify your Harbor storage configuration, re-install Harbor. Before starting Harbor, configure the new Harbor installation with the desired configuration.
Symptom
Permissions are removed from your cluster’s files and processes after resizing the persistent disk during a cluster upgrade. The ingress controller statefulset fails to start.
Explanation
When resizing a persistent disk, BOSH migrates the data from the old disk to the new disk but does not copy the files' extended attributes.
Workaround
To resolve the problem, complete the steps in [Ingress controller statefulset fails to start after resize of worker nodes with permission denied](https://knowledge.broadcom.com/external/article/298618/) in the Broadcom Support Knowledge Base.
Symptom
You experience issues when configuring a load balancer for a multi-control plane node Kubernetes cluster or creating a service of type LoadBalancer
. Additionally, in the Azure portal, the VM > Networking page does not display any inbound and outbound traffic rules for your cluster VMs.
Explanation
As part of configuring the Tanzu Kubernetes Grid Integrated Edition tile for Azure, you enter Default Security Group in the Kubernetes Cloud Provider pane. When you create a Kubernetes cluster, Tanzu Kubernetes Grid Integrated Edition automatically assigns this security group to each VM in the cluster. However, on Azure the automatic assignment might not occur.
As a result, your inbound and outbound traffic rules defined in the security group are not applied to the cluster VMs.
Workaround
If you experience this issue, manually assign the default security group to each VM NIC in your cluster.
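If it helps, here is a sketch of the manual assignment with the Azure CLI; the resource group, NIC, and security group names are placeholders, and you should confirm the correct security group name from your Kubernetes Cloud Provider pane.

```
# Hypothetical example: attach the default security group to one cluster VM NIC.
# Repeat for every NIC in the cluster; all names below are placeholders.
az network nic update \
  --resource-group MY-RESOURCE-GROUP \
  --name MY-CLUSTER-VM-NIC \
  --network-security-group MY-DEFAULT-SECURITY-GROUP
```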
Symptom
One of your plan IDs is one character longer than your other plan IDs.
Explanation
In TKGI, each plan has a unique plan ID. A plan ID is normally a UUID consisting of 32 alphanumeric characters and 4 hyphens. However, the Plan 4 ID consists of 33 alphanumeric characters and 4 hyphens.
Solution
You can safely configure and use Plan 4. The length of the Plan 4 ID does not affect the functionality of Plan 4 clusters.
If you require all plan IDs to have identical length, do not activate or use Plan 4.
Symptom
After you stop one instance in a multiple-instance database cluster, the cluster stops, or communication between the remaining databases times out, and the entire cluster becomes unreachable.
The following might be in your UAA log:
WSREP has not yet prepared node for application use
Explanation
The database cluster is unable to recover automatically because a member is no longer available to reconcile quorum.
Symptom
Backing up vSphere persistent volumes using Velero fails and your Velero backup log includes the following error:
rpc error: code = Unknown desc = Failed during IsObjectBlocked check: Could not translate selfLink to CRD name
Explanation
This is a known issue when backing up clusters on Kubernetes v1.20 and later using the Velero Plugin for vSphere v1.1.0 or earlier.
Workaround
To resolve the problem, complete the steps in Velero backups of vSphere persistent volumes fail on Kubernetes clusters version 1.20 or higher (83314) in the Broadcom Support Knowledge Base.
Symptom
The first time that you try to create two Windows clusters at the same time, the creation of one of the clusters fails. If you run pks cluster CLUSTER-NAME
to examine the last action taken on the cluster, you see the following:
Last Action: Create
Last Action State: failed
Last Action Description: Instance provisioning failed: There was a problem completing your request. … operation: create, error-message: Failed to acquire lock … locking task id is 111, description: 'create deployment'
Explanation
This is a known issue that occurs the first time that you create two Windows clusters concurrently.
Workaround
Recreate the failed cluster. This issue only occurs the first time that you create two Windows clusters concurrently.
Symptom
After running tkgi delete-cluster
and cluster deletion has completed, the deleted cluster continues to be listed when running tkgi clusters
.
Workaround
You must manually remove the deleted cluster using a customized version of the ncp_cleanup script. For more information, see Deleting a Tanzu Kubernetes Grid Integrated Edition cluster with “tkgi delete-cluster” stuck “in progress” status in the Broadcom Support Knowledge Base.
Symptom
After you uninstall TKGI, then reinstall TKGI in the same environment, BOSH Director logs errors similar to the following:
.../gems/bosh-director-0.0.0/lib/bosh/director/deployment_plan/cloud_manifest_parser.rb:120:in `parse_vm_extensions': Duplicate vm extension name 'disk_enable_uuid' (Bosh::Director::DeploymentDuplicateVmExtensionName)
Explanation
The pivotal-container-service
cloud-config was not removed when you uninstalled the TKGI tile, and it remained active. When you reinstalled the TKGI tile, an additional pivotal-container-service
cloud-config was created, causing the metrics_server to fall into a crash-loop state.
Workaround
You must manually remove the pivotal-container-service
cloud-config after removing your TKGI deployment, including after removing the TKGI tile from Ops Manager.
For more information, see “Duplicate vm extension name” error when metrics_server runs on Director VM in Tanzu Kubernetes Grid Integrated Edition in the VMware Tanzu Community Knowledge Base.
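A hedged sketch of the manual cleanup with the BOSH CLI follows; verify the exact config name with `bosh configs` first, and treat the Knowledge Base article as authoritative.

```
# List named configs and confirm the stale cloud-config is present.
bosh configs --type=cloud

# Hypothetical cleanup: delete the leftover named cloud-config.
bosh delete-config --type=cloud --name=pivotal-container-service
```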
Symptom
Your TKGI logs include the following error:
'uaa'. Errors are:- Error filling in template 'uaa.yml.erb' (line 59: Client redirect-uri is invalid: uaa.clients.pks_cli.redirect-uri Client redirect-uri is invalid: uaa.clients.pks_cluster_client.redirect-uri)
Explanation
The TKGI API fully-qualified domain name (FQDN) for your cluster contains leading or trailing whitespace.
Workaround
Do not include whitespace in the TKGI tile API Hostname (FQDN) field.
The TMC Cluster Data Protection Backup fails in TKGI environments upgraded from an earlier version.
Symptom
The TMC Cluster Data Protection Backup fails to back up your existing clusters and logs the following error:
error executing custom action (groupResource=customresourcedefinitions.apiextensions.k8s.io, namespace=, name=ncpconfigs.nsx.vmware.com): rpc error: code = Unknown desc = error fetching v1beta1 version of ncpconfigs.nsx.vmware.com: the server could not find the requested resource
Explanation
Kubernetes v1.22 disallows the `spec.preserveUnknownFields: true` configuration in your existing clusters, so the creation of a v1 CustomResourceDefinitions configuration fails.
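To see whether an existing CRD still carries the legacy setting, an inspection command such as the following can help; the CRD name comes from the error above, and the jsonpath query is just one way to read the field.

```
# Check whether the legacy field is still set on the affected CRD.
kubectl get crd ncpconfigs.nsx.vmware.com \
  -o jsonpath='{.spec.preserveUnknownFields}{"\n"}'
```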
The TMC Cluster Data Protection Restore operation can fail when restoring multiple Antrea resources.
Symptom
The TMC Cluster Data Protection Restore fails and logs errors that requests to restore the admission webhook
have been denied.
Explanation
Velero has encountered a race condition while operating a resource. For more information, see Allow customizing restore order for Kubernetes controllers and their managed resources in the Velero GitHub repository.
TKGI does not support environments where there are multiple matching networks, such as a mixed CVDS/NVDS environment.
Symptom
TKGI logs errors similar to the following in an environment with multiple matching networks:
LastOperationstatus='failed', description='Instance provisioning failed:
There was a problem completing your request. Please contact your operations team providing the following information:
service: p.pks, service-instance-guid: ..., broker-request-id: ..., task-id: ..., operation: create,
error-message: Unknown CPI error 'Unknown' with message 'undefined method `mob' for <VimSdk::Vim::OpaqueNetwork:' in create_vm' CPI method
Explanation
TKGI cannot identify which of the matching networks you intend to use and has selected the wrong network.
Occasionally, tkgi update-cluster
hangs while updating a Windows worker node instance and the BOSH task cannot finish and exits.
Symptom
The ovsdb-server
service has stopped but other processes report that it is running.
Explanation
The ovsdb-server.pid
file uses the pid for a process that is not the ovsdb-server.
To confirm that this is the root cause for `tkgi update-cluster` to hang:

1. To verify that the `ovsdb-server` service has actually stopped, run the PowerShell `Get-Service` command on the Windows worker node.
2. To verify that other processes report that the `ovsdb-server` service is still running:
Review the ovsdb-server job-service-wrapper.err.log
log file.
The job-service-wrapper.err.log
log file is located at:
C:\var\vcap\sys\log\openvswitch-windows\ovsdb-server\job-service-wrapper.err.log
Confirm that after the flushing processes, the log includes an error similar to the following:
Pid-Guard : ovsdb-server is already runing, please stop it first
At C:\var\vcap\jobs\openvswitch-windows\bin\ovsdb-server_ctl.ps1:30 char:5
+ Pid-Guard $PIDFILE "ovsdb-server"
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: ( [Write-Error], WriteErrorException
+ FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Pid-Guard
To verify the root cause:
Run the following PowerShell commands on the Windows worker node:
$RUN_DIR = "C:\var\vcap\sys\run\openvswitch-windows"
$PIDFILE = "$RUN_DIR\ovsdb-server.pid"
$pid1 = Get-Content $PidFile -First 1
echo $pid1
$rst = Get-Process -Id $pid1 -ErrorAction SilentlyContinue
echo $rst
Confirm that the returned `ProcessName` is not `ovsdb-server`.

Workaround
To resolve this issue for a single Windows worker:
Run the following:
rm C:\var\vcap\sys\run\openvswitch-windows\ovsdb-server.pid
Wait for the `ovsdb-server` process to start.

If LDAP is enabled, Harbor private projects are inaccessible after upgrading to TKGI v1.13.0. For more information, see Private projects become inaccessible after upgrading Harbor for TKGI to v2.4.x with LDAP feature enabled in the Broadcom Support Knowledge Base.
Microsoft changed Microsoft Windows’ support for tar file commands in the January 2022 Microsoft Windows security patch.
Packaging scripts that use tar commands for Windows worker-based Kubernetes Cluster deployments can fail after the Microsoft tar command patch update has been applied.
The BOSH agent in vSphere stemcells built by stembuild v2019.43 and earlier uses tar commands that are no longer supported and will fail if the Microsoft Windows security patch has been applied.
Workaround
stembuild v2019.44 and later include a version of the BOSH agent that does not use unsupported tar commands.
If you use vSphere stemcells, use stembuild 2019.44 or later to avoid the BOSH agent tar error.
TKGI supports clusters that use NSGroup Policy API resources, but Policy API NSGroups created in one NSX version will be empty after upgrading NSX to a newer version.
Workaround
BOSH reconfigures a deployment’s NSGroup members if the deployment is redeployed.
After upgrading NSX, redeploy affected deployments to reconfigure their NSGroup members:
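As a hedged sketch only, one way to redeploy a BOSH deployment with its current manifest looks like the following; the deployment name is a placeholder, and `bosh manifest` and `bosh deploy` are standard BOSH CLI commands.

```
# Hypothetical redeploy of one affected cluster deployment using its existing manifest.
DEPLOYMENT="service-instance-DEPLOYMENT-ID"

bosh -d "$DEPLOYMENT" manifest > /tmp/"$DEPLOYMENT".yml
bosh -d "$DEPLOYMENT" -n deploy /tmp/"$DEPLOYMENT".yml
```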
When TKGI is deployed on NSX v3.2.3 and there are large numbers of pods with liveness probes, the pods on TKGI-provisioned clusters can enter a NotReady
state.
Symptom
In addition to your pods being `NotReady`, if you restart NSX Manager:

- Log entries similar to the following appear: "POST /nsxapi/api/v1/firewall/sections/.../rules?operation=insert_bottom HTTP/1.1" ...
- Your NCP logs include errors similar to:
"nsx-container-ncp" subcomp="ncp" level="ERROR" security="True" errorCode="NCP00034"] nsx_ujo.ncp.nsx.manager.firewall_service Failed to create health check rule for port ...: Service cluster: 'https://nsx-manager.example.com' is unavailable. Please, check NSX setup and/or configuration.
Description
As pods are created or deleted, DFW firewall rules are replicated for the pod’s liveness probe. In NSX v3.2.3, the firewall rules are unintentionally duplicated during this replication. After numerous pod creation/deletion events, the compounded duplication creates a DFW firewall section large enough to create noticeable delays during pod operations and, eventually, a pod NotReady
state.
Workaround
Upgrade NSX to a version that includes the fix, namely 3.2.4 or 4.1.1 or later.