This topic contains release notes for Tanzu Kubernetes Grid Integrated Edition (TKGI) v1.19.


TKGI v1.19.2

Release Date: August 9, 2024


Product Snapshot

Release Details

Version v1.19.2
Release date August 9, 2024

Internal Component Versions

Antrea v1.8.1 Release Notes
cAdvisor v0.47.2
Cloud Providers AWS: v1.28.1
vSphere: v1.28.0
Release Notes:
AWS
vSphere
Containerd Linux: v1.6.33*
Windows: v1.6.33*
CoreDNS v1.10.1+vmware.23*
CSI Driver for vSphere v3.1.2 Release Notes
etcd v3.5.12
Harbor v2.11.0* Release Notes
Kubernetes v1.28.11* Release Notes
Metrics Server v0.7.0
NCP v4.1.2.2* Release Notes
Percona XtraDB Cluster (PXC)
(in BOSH pxc-release)
v8.0.36-28
pxc-release: v1.0.29
Release Notes:
PXC
pxc-release
UAA v77.14.0*
Velero v1.12.1 Release Notes
Wavefront Wavefront Collector: v1.29.0
Wavefront Proxy: v13.4

Stemcell Compatibility

Ubuntu Jammy stemcells See Retrieve Product Version Compatibilities from the Tanzu API in the Broadcom Support KB.
Windows stemcells v2019.75* or later

Interoperability

Ops Manager See Retrieve Product Version Compatibilities from the Tanzu API in the Broadcom Support KB.
VMware Aria Operations Management Pack for Kubernetes v2.0 Release Notes
VMware Cloud Foundation (VCF) v5.1.1, v5.1, v4.5.2 Release Notes: v5.1.1, v5.1, v4.5.2
VMware NSX** See VMware Product Interoperability Matrices***.
vSphere

Management Console

v1.19.2

Note: The component versions supported by TKGI Management Console might differ from or be more limited than the versions supported by TKGI.

Installed Ops Manager version v3.0.31* Release Notes
Installed Harbor Registry version v2.11.0* Release Notes
Ubuntu Jammy stemcell v1.506* Release Notes

* Components marked with an asterisk have been updated.

** As of May 7, 2024, NSX networking and firewall components are sold separately from TKGI.

*** Migration from NSX Management Plane API to NSX Policy API requires VMware NSX v4.0.1.1 or later. NSX v4.0.1.1 supports only 50% of NSX Management Plane API scale. To use Policy API at 100% of Management Plane API scale, use NSX v4.1.1 or later.


Upgrade Path

The supported upgrade paths to Tanzu Kubernetes Grid Integrated Edition v1.19.2 are from TKGI v1.19.1, and from v1.18.4 and earlier v1.18 patches.


Breaking Changes

TKGI v1.19.2 does not include any new breaking changes.


Features and Enhancements

TKGI v1.19.2 does not include any new features.


Resolved Issues

TKGI v1.19.2 resolves the following issues in the TKGI tile for Tanzu Operations Manager:

TKGI v1.19.2 resolves the following issues in the TKGI management console:


Known Issues

Except where noted, the known issues in TKGI v1.19.2 are also in TKGI v1.19.1. For more information, see TKGI v1.19.1 Known Issues below.

TKGI v1.19.2 does not include any new known issues.


TKGI v1.19.1

Release Date: June 18, 2024


Product Snapshot

Release Details

Version v1.19.1
Release date June 18, 2024

Internal Component Versions

Antrea v1.8.1* Release Notes
cAdvisor v0.47.2
Cloud Providers AWS: v1.28.1
vSphere: v1.28.0*
Release Notes:
AWS
vSphere
Containerd Linux: v1.6.28
Windows: v1.6.28
CoreDNS v1.10.1+vmware.21*
CSI Driver for vSphere v3.1.2 Release Notes
etcd v3.5.12*
Harbor v2.10.0 Release Notes
Kubernetes v1.28.9* Release Notes
Metrics Server v0.7.0
NCP v4.1.2.1* Release Notes
Percona XtraDB Cluster (PXC)
(in BOSH pxc-release)
v8.0.36-28*
pxc-release: v1.0.29*
Release Notes:
PXC
pxc-release
UAA v77.11.0*
Velero v1.12.1* Release Notes
Wavefront Wavefront Collector: v1.29.0*
Wavefront Proxy: v13.4*

Stemcell Compatibility

Ubuntu Jammy stemcells See Retrieve Product Version Compatibilities from the Tanzu API in the Broadcom Support KB.
Windows stemcells v2019.73* or later

Interoperability

Ops Manager See Retrieve Product Version Compatibilities from the Tanzu API in the Broadcom Support KB.
VMware Aria Operations Management Pack for Kubernetes v2.0 Release Notes
VMware Cloud Foundation (VCF) v5.1.1*, v5.1*, v4.5.2 Release Notes: v5.1.1, v5.1, v4.5.2
VMware NSX** See VMware Product Interoperability Matrices***.
vSphere

Management Console

v1.19.1

Note: The component versions supported by TKGI Management Console might differ from or be more limited than the versions supported by TKGI.

Installed Ops Manager version v3.0.29* Release Notes
Installed Harbor Registry version v2.10.2* Release Notes
Ubuntu Jammy stemcell v1.465* Release Notes

* Components marked with an asterisk have been updated.

** As of May 7, 2024, NSX networking and firewall components are sold separately from TKGI.

*** Migration from NSX Management Plane API to NSX Policy API requires VMware NSX v4.0.1.1 or later. NSX v4.0.1.1 supports only 50% of NSX Management Plane API scale. To use Policy API at 100% of Management Plane API scale, use NSX v4.1.1 or later.


Upgrade Path

The supported upgrade paths to Tanzu Kubernetes Grid Integrated Edition v1.19.1 are from TKGI v1.19.0, and from v1.18.4 and earlier v1.18 patches.


Breaking Changes

TKGI v1.19.1 does not include any new breaking changes.


Features and Enhancements

  • Supports running clusters across multiple vSphere Datacenters.
    • If all of TKGI’s Availability Zones (AZs) are in the same Datacenter, you can upgrade without changing any configurations.
    • If you are running TKGI on AZs in multiple vSphere Datacenters, see Prepare to Upgrade with Multiple Datacenters for how to configure TKGI to support upgrading with multiple datacenters.
      • After upgrading TKGI with multiple datacenters, you can use the tkgi CLI to upgrade TKGI clusters (see the example after this list).

        Note: The TKGI Management Console does not yet support upgrading TKGI running on multiple datacenters.

  • Updated fluent-bit version to v2.2.3 to address CVE-2024-4323.
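
For example, a minimal upgrade sequence with the tkgi CLI; the cluster name demo-cluster is a placeholder:

    # List clusters and their current TKGI versions
    tkgi clusters
    # Upgrade one cluster to the installed TKGI version
    tkgi upgrade-cluster demo-cluster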


Resolved Issues

TKGI v1.19.1 resolves the following issues in the TKGI tile for Tanzu Operations Manager:

TKGI v1.19.1 resolves the following issues in the TKGI management console:


Known Issues

Except where noted, the known issues in TKGI v1.19.1 are also in TKGI v1.19.0. For more information, see TKGI v1.19.0 Known Issues below.

TKGI v1.19.1 includes the following known issue:

Fluent Bit pods in CrashLoopBackOff state

Fluent Bit v2.2.3 uses Go v1.21, which has a known issue.

See Fluent Bit pods are in “CrashLoopBackOff” in TKGi v1.19.1 in the Broadcom Support Knowledge Base for details and a workaround.


TKGI v1.19.0

Release Date: April 10, 2024


Product Snapshot

Release Details

Version v1.19.0
Release date April 10, 2024

Internal Component Versions

Antrea v1.8.1* Release Notes
cAdvisor v0.47.2
Cloud Providers AWS: v1.28.1
vSphere: v1.28.0*
Release Notes:
AWS
vSphere
Containerd Linux: v1.6.28
Windows: v1.6.28
CoreDNS v1.10.1+vmware.17
CSI Driver for vSphere v3.1.2 Release Notes
etcd v3.5.10*
Harbor v2.10.0 Release Notes
Kubernetes v1.28.7* Release Notes
Metrics Server v0.7.0*
NCP v4.1.2.1* Release Notes
Percona XtraDB Cluster (PXC)
(in BOSH pxc-release)
v8.0.35-27
pxc-release: v1.0.25*
Release Notes:
PXC
pxc-release
UAA v77.4.0*
Velero v1.12.1* Release Notes
Wavefront Wavefront Collector: v1.29.0*
Wavefront Proxy: v13.4*

Stemcell Compatibility

Ubuntu Jammy stemcells See Retrieve Product Version Compatibilities from the Tanzu API in the Broadcom Support KB.
Windows stemcells v2019.69 or later

Interoperability

Ops Manager See Retrieve Product Version Compatibilities from the Tanzu API in the Broadcom Support KB.
VMware Aria Operations Management Pack for Kubernetes v2.0 Release Notes
VMware Cloud Foundation (VCF) v5.1.1*, v5.1*, v4.5.2 Release Notes: v5.1.1, v5.1, v4.5.2
VMware NSX** See VMware Product Interoperability Matrices***.
vSphere

Management Console

v1.19.0

Note: The component versions supported by TKGI Management Console might differ from or be more limited than the versions supported by TKGI.

Installed Ops Manager version v3.0.25* Release Notes
Installed Harbor Registry version v2.10.0* Release Notes
Ubuntu Jammy stemcell v1.406* Release Notes

* Components marked with an asterisk have been updated.

** As of May 7, 2024, NSX networking and firewall components are sold separately from TKGI.

*** Migration from NSX Management Plane API to NSX Policy API requires VMware NSX v4.0.1.1 or later. NSX v4.0.1.1 supports only 50% of NSX Management Plane API scale. To use Policy API at 100% of Management Plane API scale, use NSX v4.1.1 or later.


Upgrade Path

The supported upgrade paths to Tanzu Kubernetes Grid Integrated Edition v1.19.0 are from TKGI v1.18.4 and earlier v1.18 patches.


Breaking Changes

The following have been removed from TKGI v1.19:

  • Google Cloud Platform: Support for the Google Cloud Platform (GCP).

  • Flannel: Antrea is the Container Networking Interface (CNI) for TKGI v1.19. Flannel CNI is no longer supported.

    • If you have clusters running with Flannel CNI on TKGI 1.18 or earlier, VMware recommends that you create clusters with Antrea CNI and move all workloads from Flannel to Antrea clusters before you upgrade TKGI to 1.19 (see the example after this list).
      • Clusters with Flannel CNI continue to run after TKGI upgrade to 1.19, but cannot be updated or upgraded.
    • For more information about Flannel CNI removal, see About Switching from the Flannel CNI to the Antrea CNI in the TKGI 1.18 documentation.
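
For example, a minimal sketch of creating a replacement cluster with the tkgi CLI, assuming the TKGI tile is already configured to use the Antrea CNI; the cluster names, hostname, and plan name are placeholders:

    # Create a new cluster; with the tile CNI set to Antrea, new clusters use the Antrea CNI
    tkgi create-cluster antrea-cluster --external-hostname antrea-cluster.example.com --plan small
    # After moving workloads to the new cluster, delete the old Flannel-based cluster
    tkgi delete-cluster flannel-cluster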


Features and Enhancements

  • Supports deploying TKGI to Oracle Cloud VMware Solution (OCVS).
  • PersistentVolumes (PVs) used by TKGI clusters on vSphere can be migrated between datastores as described in Migrating Persistent Volumes Between Datastores Provisioning Support in Kubernetes.
  • A compute profile can specify both AZs and persistent disk sizes; you can set azs and persistent_disk_in_mb in one compute profile.
  • When NSX Manager is behind an HTTP/HTTPS proxy, the TKGI CLI can rotate NSX certificates using the tkgi rotate-certificates command.
  • Because support for the SecurityContextDeny Admission Controller was dropped in TKGI 1.18 and Pod Security Policies are deprecated, the MC configuration wizard Plan pane no longer includes Admission Plugin options.
  • Cluster update retry setting: The Ops Manager TKGI tile > TKGI API > Automatic retry on cluster update operations failure option sets TKGI to retry a failed tkgi update-cluster process up to three times, improving resilience.
  • Enhancements to the tkgi upgrade-cluster --pre-check command when upgrading clusters to a new TKGI version (see the example after this list):

    • Does not upgrade clusters with nodes that BOSH has marked as ignored.
    • Outputs a warning message when nodes have disk utilization above 95%.
  • Updated Kubernetes CPIs: cloud-provider-vsphere v1.28.0, cloud-provider-aws v1.28.3
  • SSH configuration improvements address CVE-2023-48795.
  • Updated component versions to address CVEs:

    • base OS Jammy
    • SAMBA+ 4.18.4
    • golang v1.22, curl v8.4.0, jq v1.7.1
    • fluent-bit v2.2.2, telegraf v1.29.5
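
For example, minimal sketches of the two CLI workflows mentioned above; the cluster name demo-cluster is a placeholder:

    # Run the upgrade with the enhanced pre-checks, which refuse to upgrade clusters with BOSH-ignored nodes
    # and warn when node disk utilization is above 95%
    tkgi upgrade-cluster demo-cluster --pre-check
    # Rotate a cluster's NSX certificates, now supported when NSX Manager is behind an HTTP/HTTPS proxy
    tkgi rotate-certificates demo-cluster --only-nsx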


Resolved Issues

TKGI v1.19.0 resolves the following issues:

TKGI v1.19.0 also incorporates fixes from patch releases of previous minor lines, as listed in the TKGI 1.18, TKGI 1.17, and TKGI 1.16 Release Notes.


Known Issues

TKGI v1.19.0 has the following known issues:

Limitations on Using the VMware vSphere CSI Driver

The VMware vSphere CSI Driver supports a limited set of VMware vSphere features. Before enabling the vSphere CSI Driver on a TKGI cluster, confirm the cluster and storage configuration are supported by the driver. For more information, see Unsupported Features and Limitations in Deploying and Managing Cloud Native Storage (CNS) on vSphere.


Limitations on Using a Public Cloud CSI Driver

TKGI supports using a public cloud CSI Driver on a TKGI-provisioned cluster.

Installing a Public Cloud CSI Driver on a TKGI Cluster

If you plan to use a public cloud CSI Driver on a TKGI-provisioned cluster, VMware recommends you take additional steps before installing the CSI Driver:

  • For most public clouds, VMware recommends you follow the CSI Driver installation procedure recommended by the public cloud provider.

  • For installing the Azure CSI Driver on a TKGI cluster, VMware recommends you follow the procedure in the How to install Azure file/disk CSI driver onto TKGI 1.14 cluster knowledge base article in the VMware Tanzu Support Hub.

Managing a TKGI Cluster That Uses a Public Cloud CSI Driver

If you have enabled a public cloud CSI Driver on a TKGI cluster, you must take additional steps when deleting, upgrading, or updating the cluster:

Updating a Cluster on a Public Cloud

When updating a cluster that uses a public cloud CSI Driver:

  • No preparation steps are needed when updating a multi-worker node cluster.
  • To prepare a single-worker node cluster for updating (see the example after this list):

    1. Resize the cluster to two or more worker nodes before updating the cluster. For more information, see Scaling Existing Clusters.
    2. Update the cluster.
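
A minimal sketch of this sequence with the tkgi CLI, assuming a single-worker cluster named demo-cluster; the tag value in the follow-on update is a placeholder for whatever change you intend to make:

    # Scale the single-worker cluster to two worker nodes first
    tkgi update-cluster demo-cluster --num-nodes 2
    # Then apply the intended update, for example new tags
    tkgi update-cluster demo-cluster --tags "env:test"

The same scale-up step applies before upgrading a single-worker cluster with tkgi upgrade-cluster, as described in the next section.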

Upgrading a Cluster on a Public Cloud

When upgrading a cluster that uses a public cloud CSI Driver:

  • No preparation steps are needed when upgrading a multi-worker node cluster.
  • To prepare a single-worker node cluster for upgrading:

    1. Resize the cluster to two or more worker nodes before upgrading the cluster. For more information, see Scaling Existing Clusters.
    2. Upgrade the cluster. For more information on upgrading clusters, see Upgrading Clusters.

Deleting a Cluster on a Public Cloud

When deleting a cluster that uses a public cloud CSI Driver (see the example after these steps):

  1. Manually delete the workload PVCs and PVs before deleting the cluster.
  2. Delete the cluster. For more information on deleting clusters, see Deleting Clusters.
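
A minimal sketch of this sequence, assuming the workload PVCs live in a namespace named my-app and the cluster is named demo-cluster (both placeholders):

    # Delete the workload PVCs; their bound PVs are removed too when the reclaim policy is Delete
    kubectl delete pvc --all --namespace my-app
    # Confirm that no workload PVs remain
    kubectl get pv
    # Then delete the cluster
    tkgi delete-cluster demo-cluster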


TKGI Cluster creation with NSX Edge fails with “no available capacity on edge node”.

This issue is fixed in TKGI v1.19.2.

Symptom

Deploying new clusters with NSX Edge nodes fails due to a failure in the pks-nsx-t-prepare-master-vm job. On TKGI control plane VMs, the job log file /var/vcap/data/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stdout.log reports an error like:

Creating Load Balancer
create loadbalancer: update lb service: [PUT /infra/lb-services/{lb-service-id}][400] updateLBServiceBadRequest &{RelatedAPIError:{Details: ErrorCode:502001 ErrorData:<nil> ErrorMessage:Errors validating path=[/infra/lb-services/lb-pks-b5ef6df4-cd11-4461-8861-893533940ecb]. ModuleName:policy} RelatedErrors:[0xc0001b25a0]}

Explanation

When creating a cluster, TKGI creates an NSX Tier-1 gateway and attaches a load balancer to it. This becomes the cluster’s default load balancer, hosting the virtual server for the cluster’s API endpoints and ingress rules. The error occurs when the Tier-1 gateway creates the LB in the NSX routing allocation pool instead of the NSX LB allocation pool. This can cause NSX Service Router components to deploy to edge nodes with no LB capacity, resulting in cluster creation failure.

Workarounds

  • Use a different edge cluster with load balancer capacity on most nodes.
  • Add nodes to the current edge clusters, in pairs to allow deployment of both active and standby service routers.
  • Reconfigure allocation pools for the existing TKGI cluster’s Tier-1 router.
    • This does not apply to routers created for namespaces in dedicated Tier-1 topology.


Limitation on Multiple vSphere Datacenters

This issue is fixed in TKGI v1.19.1.

TKGI on vSphere does not support running workload clusters in multiple vCenter server inventories. All vSphere clusters must be managed by the same vCenter server, due to an internal vSphere CPI change from in-tree to out-of-tree.


Cannot upgrade after rotating Ops Manager CA

This issue is fixed in TKGI v1.19.1.

Symptom

With TKGI deployed by the Management Console (MC), after you rotate the Ops Manager CA certificate, you cannot upgrade TKGI. The upgrade fails with errors that the MC cannot access BOSH:

```
Error GetInstanceByID: cannot get BOSH client: 
[...]
Get https://10.110.93.3:25555/info: x509: certificate signed by unknown authority
```

Workaround

Immediately after you rotate the Ops Manager CA, run the MC Configuration Wizard, step through the configuration panes, run Generate Configuration, and then run Apply Configuration.


Wrong cluster Floating IP pools after TKGI upgrade with Management Console Automated NAT Deployment

This issue is fixed in v1.19.2.

Symptom

On TKGI deployments where users have updated cluster IP ranges using NSX Manager instead of TKGI network profiles, clusters fail with network connection errors after a TKGI upgrade through a Management Console configured for Automated NAT Deployment. NCP logs list NSX configuration errors such as Resource could not be found for IpPool.

Explanation

During TKGI upgrade, the Management Console does not check whether cluster IP Pools have been updated at the underlying NSX layer, and instead re-applies the IP pool settings as configured in TKGI. This causes an IP pool mismatch between TKGI and NSX.

Workaround

Contact Support for scripts that reallocate IP addresses to the cluster’s current floating IP pool, release unused addresses, and delete stale IP pools.

To avoid this issue, update cluster IP pools via TKGI network profiles rather than in NSX Manager.


TKGI version upgrade without new stemcell fails for Containerd runtime clusters with Istio CNI

Symptom

On clusters configured to use a containerd registry and Istio CNI, upgrading the TKGI version without also upgrading the stemcell fails with errors kubelet cannot find istio-cni binary and nsx fails to receive message header.

This error does not occur when you upgrade to a new stemcell along with the new TKGI version.

Explanation

When TKGI upgrades a cluster and drains its nodes during the upgrade, it leaves the cluster nodes’ Istio CNI agent and CNI configuration in a corrupted state.

If the cluster nodes are not automatically re-created by a stemcell change, the corrupted Istio CNI state remains.

Workaround

For clusters that use both Containerd and Istio CNI:

  • If you have already encountered this issue, re-create all worker nodes using the bosh recreate command:

    1. Run the bosh vms command to list the cluster VMs:

      bosh -d service-instance-DEPLOYMENT-ID vms
      

      Where DEPLOYMENT-ID is the BOSH-generated ID of your Kubernetes cluster deployment.

    2. For each VM instance listed as worker/UUID in the output, run bosh recreate VM-NAME:

      bosh -d service-instance-DEPLOYMENT-ID recreate worker/UUID
      
  • In the future, you can avoid this issue by upgrading a cluster’s stemcell whenever you upgrade its TKGI version.


With Antrea, Cannot Fill In Compute Profile Fields

Symptom

In a TKGI environment with Antrea networking, when using the management console to create or edit a compute profile as described in Define Compute Profile, the following form fields do not accept input:

  • Availability Zones
  • Control plane AZs
  • Control plane persistent disk size

This issue does not apply to TKGI environments with NSX networking.


GMSA authentication failures after stemcell upgrade

Symptom

For Windows worker clusters that authenticate users via a group Managed Service Account (gMSA) in Microsoft AD, upgrading the clusters to a new Windows stemcell may cause users to be unable to log in to the cluster. Valid credentials for containers on the cluster may no longer work.

Logfile join-domain/pre-start.stdout.log contains:

WARNING: The changes will take effect after you restart the computer WIN-<ID-STRING>.
Already joined to domain

Explanation

When BOSH upgrades a VM’s Windows stemcell, it re-creates the VM and then triggers a join-domain job to reconnect it with its gMSA group. Reconnecting with gMSA requires a second VM reboot, but TKGI does not currently trigger the reboot automatically because its timing would interfere with other upgrade operations.

Workaround

After upgrading a Windows cluster with GMSA to a new stemcell, manually reboot its worker nodes:

  1. Run bosh vms to list the names of the worker nodes, and record their Deployment ID and Instance IDs.

  2. For the Windows worker nodes, which have Instance IDs that begin with worker/, log in to them and restart them as follows:

  3. Run bosh -d DEPLOYMENT-ID ssh INSTANCE-ID.

  4. Execute powershell.
  5. Run the following script, which restarts the VM and returns you to your local shell:

    Set-Service bosh-agent -StartupType Automatic
    Set-Service bosh-dns-windows -StartupType Automatic
    Set-Service bosh-dns-healthcheck-windows -StartupType Automatic
    Set-Service bosh-dns-nameserverconfig-windows -StartupType Automatic
    Set-Service kubelet -StartupType Automatic
    Set-Service nsx-kube-proxy -StartupType Automatic
    Set-Service nsx-node-agent -StartupType Automatic
    Set-Service containerd -StartupType Automatic
    Set-Service ovs-vswitchd -StartupType Automatic
    Set-Service ovsdb-server -StartupType Automatic
    Set-Service system-metrics-agent -StartupType Automatic
    Get-Service bosh-agent | Select-Object -Property Name, StartType, Status
    Stop-Service -Name bosh-agent -Force -NoWait
    
    # Restart to apply changes
    echo "Restarting vm"
    Restart-Computer
    
  6. Wait until the VM is restarted and check the pod status:

    kubectl get pod POD-NAME
    
  7. If the pod is stuck, bosh restart it. For example:

    bosh -d service-instance_0d0f7798-e4e9-473f-8ddb-279bc61faef0 restart worker/2deca1a2-c6ed-4e37-8dce-91b141d98e8f
    

    Where service-instance_0d0f7798-e4e9-473f-8ddb-279bc61faef0 is the example instance group DEPLOYMENT-ID and worker/2deca1a2-c6ed-4e37-8dce-91b141d98e8f is the example VM INSTANCE-ID.


Upgrade Failure When Special Characters in vSphere Password

This issue is fixed in TKGI v1.19.1.

Symptom

When upgrading TKGI with vSphere Container Storage Plug-in (CSI) enabled, pods listed by kubectl get pods remain stuck with STATUS Pending.

Running kubectl describe on worker nodes lists Taints: node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule.

Log file /var/vcap/sys/log/vsphere-cloud-controller-manager/vsphere-cloud-controller-manager.stderr.log includes Credentials not found error, for example:

E0326 02:55:12.479807 21110 node_controller.go:236] error syncing 'b9a897b1-bf26-4460-a002-1c16a84d40a0': failed to get provider ID for node b9a897b1-bf26-4460-a002-1c16a84d40a0 at cloudprovider: failed to get instance ID from cloud provider: Credentials not found, requeuing

This behavior occurs when your vSphere password starts with a special character or contains backslash (\) characters.
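
For example, a quick way to check for the symptoms described above; NODE-NAME is a placeholder:

    # List pods stuck in Pending across all namespaces
    kubectl get pods --all-namespaces --field-selector=status.phase=Pending
    # Check a worker node for the uninitialized taint
    kubectl describe node NODE-NAME | grep -i taint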

Explanation

With the internal vSphere CPI change from in-tree to out-of-tree, the CSI driver upgrade operation parses vCenter passwords incorrectly and cannot retrieve node information. This leads to the uninitialized=true:NoSchedule taint being attached to nodes.

Workaround

Change your vSphere password to not start with a special character or contain backslash (\) characters.


NSX pod creation fails when using Tanzu Application Platform

Symptom

When you deploy a workload on a TKGI-provisioned cluster with NSX networking that is running Tanzu Application Platform (TAP), you see an error Failed to create pod sandbox and no resources are created in the cluster’s nsx-system namespace.

Explanation

The total number of Kubernetes object labels and other tags created by both TKGI and TAP can exceed the number that is allowed by NSX.

Workaround

Create or update your network profile as described in Creating and Managing Network Profiles (NSX Only), setting the cni_configurations parameter extensions.ncp.k8s.label_filtering_regex_list as described under label_filtering Settings.
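
The authoritative profile schema is in Creating and Managing Network Profiles (NSX Only); the sketch below only illustrates where the setting sits, with the nesting inferred from the dotted parameter path above, and the profile name, description, and regular expressions are placeholders:

    {
      "name": "label-filtering-profile",
      "description": "Limit which Kubernetes labels NCP copies to NSX tags",
      "parameters": {
        "cni_configurations": {
          "type": "nsxt",
          "parameters": {
            "extensions": {
              "ncp": {
                "k8s": {
                  "label_filtering_regex_list": ["app.*", "tier.*"]
                }
              }
            }
          }
        }
      }
    }

Save the JSON to a file and apply it with the TKGI CLI, for example:

    tkgi create-network-profile label-filtering-profile.json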


Updating cluster compute profile loses node drain and shutdown settings

This issue is fixed in TKGI v1.19.1.

Symptom

In the management console, when you update a Linux cluster that is created with a compute profile as described in Update Cluster Configuration, the Advanced Settings panel shows and applies incorrect defaults for node drain and pod shutdown grace period settings.

Workaround

Under Update Cluster, before you change the Compute Profile setting, click Show More to open the Advanced Settings. Record the current settings, and set them back if selecting a compute profile changes those settings.

Default settings are:

  • Node Drain Timeout: 0
  • Pod Shutdown Grace Period: 10
  • Enabled: Force drain if externally-managed pods
  • Enabled: Force drain if DaemonSet-managed pods
  • Enabled: Force drain if pods using emptyDir
  • Disabled: Force drain if pods running after timeout


Upgrading cluster with CLI loses tags

This issue is fixed in TKGI v1.19.1.

Symptom

After you run tkgi upgrade-cluster, the cluster’s tags no longer appear. This issue exists on vSphere, AWS, and Azure.

Workaround

To restore the cluster tags after upgrading a cluster with the TKGI CLI (an example follows these steps):

  1. Run tkgi cluster CLUSTER-NAME as described in Review Your Tags and copy the Tags: value from the command output.

  2. Run tkgi update-cluster CLUSTER-NAME --tags TAGS and pass in the existing tags value.
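
A minimal sketch of the two steps, assuming a cluster named demo-cluster whose previous tags were env:prod and team:platform (all placeholders):

    # 1. Review the Tags: value in the output
    tkgi cluster demo-cluster
    # 2. Re-apply the tags recorded in step 1
    tkgi update-cluster demo-cluster --tags "env:prod,team:platform"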


TKGI MC Unable to Manage TKGI after Restoring the TKGI Control Plane from Backup

Symptom

After you restore Ops Manager and the TKGI API VM from backup, TKGI functions normally, but your TKGI MC tabs include the following error: “…product ‘pivotal-container service’ is not deployed…”.

Explanation

TKGI MC is associated with an Ops Manager with a specific name. If you rename Ops Manager while restoring, your TKGI MC will not recognize the restored Ops Manager and cannot manage it.


VMware vRealize Operations Does Not Support Windows Worker-Based Kubernetes Clusters

VMware vRealize Operations (vROPs) does not support Windows worker-based Kubernetes clusters and cannot be used to manage TKGI-provisioned Windows workers.


TKGI Wavefront Requires Manual Installation for Windows Workers

To monitor Windows-based worker node clusters with a Wavefront collector and proxy, you must first install Wavefront on the clusters manually, using Helm. For instructions, see the Wavefront section of the Monitoring Windows Worker Clusters and Nodes topic.


Pinging Windows Worker Kubernetes Clusters Does Not Work

TKGI-provisioned Windows worker-based Kubernetes clusters inherit a Kubernetes limitation that prevents outbound ICMP communication from workers. As a result, pinging Windows workers does not work.

For information about this limitation, see Limitations > Networking in the Windows in Kubernetes documentation.


BOSH Backup and Restore Does Not Restore UAA Database.

When restoring the TKGI management plane from backup as described in Restoring TKGI Management Plane Components, you may see an error like the following, along with errors for the bbr-uaadb and pks-api components:

```
ERROR 3780 (HY000) at line 25: Referencing column 'SESSION_PRIMARY_ID' and referenced column 'PRIMARY_ID' in foreign key constraint 'SPRING_SESSION_ATTRIBUTES_FK' are incompatible.
```

With these errors, the User Account and Authentication (UAA) database fails to restore.


Velero Does Not Support Backing Up Stateful Windows Workloads

You can use Velero to back up only stateless workloads on TKGI-provisioned Windows workers. You cannot use Velero to back up stateful Windows applications. For more information, see Velero on Windows in Basic Install in the Velero documentation.


TMC Data Protection Feature Requires Privileged TKGI Containers

TMC Data Protection feature supports privileged TKGI containers only. For more information, see Plans in the Installing TKGI topic for your IaaS.


Windows Worker Kubernetes Clusters with Group Managed Service Account Do Not Support Compute Profiles

Windows worker-based Kubernetes clusters integrated with group Managed Service Account (gMSA) cannot be managed using compute profiles.


TKGI CLI Does Not Prevent Reducing the Control Plane Node Count

TKGI CLI does not prevent accidentally reducing a cluster’s control plane node count using a compute profile.

Warning: Reducing a cluster’s control plane node count can destroy the cluster. Do not scale out or scale in existing control plane nodes by reconfiguring the TKGI tile or by using a compute profile. Reducing a cluster’s number of control plane nodes might remove a control plane node and cause the cluster to become inactive.


Windows Cluster Nodes Not Deleted After VM Deleted

Symptom

After you delete a VM using the management console of your infrastructure provider, you notice a Windows worker node that had been on that VM is now in a notReady state.

Solution

  1. To identify the leftover node:

    kubectl get no -o wide
    
  2. Locate nodes on the returned list that are in a notReady state and have the same IP address as another node in the list.
  3. To manually delete a notReady node:

    kubectl delete node NODE-NAME
    

    Where NODE-NAME is the name of the node in the notReady state.


502 Bad Gateway After OIDC Login

Symptom

You experience a “502 Bad Gateway” error from the NSX load balancer after you log in to OIDC.

Explanation

A large response header has exceeded your NSX load balancer maximum response header size. The default maximum response header size is 10,240 characters and should be resized to 16,384.

Workaround

If you experience this issue, manually reconfigure your NSX request_header_size to 4096 characters and your response_header_size to 16384. For information about configuring NSX default header sizes, see OIDC Response Header Overflow in the Knowledge Base.


Difficulty Changing Proxy for Windows Workers

You must configure a global proxy in the Tanzu Kubernetes Grid Integrated Edition tile > Networking pane before you create any Windows workers that use the proxy.

You cannot change the proxy configuration for Windows workers in an existing cluster.


Character Limitations in HTTP Proxy Password

For vSphere with NSX, the HTTP Proxy password field does not support the following special characters: & or ;.


Error After Modifying Your Harbor Storage Configuration

Symptom

You receive the following error after modifying your existing Harbor installation’s storage configuration:

Error response from daemon: manifest for ... not found: manifest unknown: manifest unknown

Explanation

Harbor does not support modifying an existing Harbor installation’s storage configuration.

Workaround

To modify your Harbor storage configuration, re-install Harbor. Before starting Harbor, configure the new Harbor installation with the desired configuration.


Ingress Controller Statefulset Fails to Start After Resizing Worker Nodes

Symptom

Permissions are removed from your cluster’s files and processes after resizing the persistent disk during a cluster upgrade. The ingress controller statefulset fails to start.

Explanation

When resizing a persistent disk, BOSH migrates the data from the old disk to the new disk but does not copy the files’ extended attributes.

Workaround

To resolve the problem, complete the steps in Ingress controller statefulset fails to start after resize of worker nodes with permission denied (https://knowledge.broadcom.com/external/article/298618/) in the Broadcom Support Knowledge Base.


Azure Default Security Group Is Not Automatically Assigned to Cluster VMs

Symptom

You experience issues when configuring a load balancer for a multi-control plane node Kubernetes cluster or creating a service of type LoadBalancer. Additionally, in the Azure portal, the VM > Networking page does not display any inbound and outbound traffic rules for your cluster VMs.

Explanation

As part of configuring the Tanzu Kubernetes Grid Integrated Edition tile for Azure, you enter Default Security Group in the Kubernetes Cloud Provider pane. When you create a Kubernetes cluster, Tanzu Kubernetes Grid Integrated Edition automatically assigns this security group to each VM in the cluster. However, on Azure the automatic assignment might not occur.

As a result, your inbound and outbound traffic rules defined in the security group are not applied to the cluster VMs.

Workaround

If you experience this issue, manually assign the default security group to each VM NIC in your cluster.


One Plan ID Longer than Other Plan IDs

Symptom

One of your plan IDs is one character longer than your other plan IDs.

Explanation

In TKGI, each plan has a unique plan ID. A plan ID is normally a UUID consisting of 32 alphanumeric characters and 4 hyphens. However, the Plan 4 ID consists of 33 alphanumeric characters and 4 hyphens.

Solution

You can safely configure and use Plan 4. The length of the Plan 4 ID does not affect the functionality of Plan 4 clusters.

If you require all plan IDs to have identical length, do not activate or use Plan 4.


Database Cluster Stops After a Database Instance is Stopped

Symptom

After you stop one instance in a multiple-instance database cluster, the cluster stops, or communication between the remaining databases times out, and the entire cluster becomes unreachable.

The following might be in your UAA log:

WSREP has not yet prepared node for application use

Explanation

The database cluster is unable to recover automatically because a member is no longer available to reconcile quorum.


Velero Back Up Fails for vSphere PVs Attached to Clusters on Kubernetes v1.20 and Later

Symptom

Backing up vSphere persistent volumes using Velero fails and your Velero backup log includes the following error:

rpc error: code = Unknown desc = Failed during IsObjectBlocked check: Could not translate selfLink to CRD name

Explanation

This is a known issue when backing up clusters on Kubernetes v1.20 and later using the Velero Plugin for vSphere v1.1.0 or earlier.

Workaround

To resolve the problem, complete the steps in Velero backups of vSphere persistent volumes fail on Kubernetes clusters version 1.20 or higher (83314) in the Broadcom Support Knowledge Base.


Creating Two Windows Clusters at the Same Time Fails

Symptom

The first time that you try to create two Windows clusters at the same time, the creation of one of the clusters fails. If you run tkgi cluster CLUSTER-NAME to examine the last action taken on the cluster, you see the following:

Last Action: Create
Last Action State: failed
Last Action Description: Instance provisioning failed: There was a problem completing your request. … operation: create, error-message: Failed to acquire lock … locking task id is 111, description: ‘create deployment’

Explanation

This is a known issue that occurs the first time that you create two Windows clusters concurrently.

Workaround

Recreate the failed cluster. This issue only occurs the first time that you create two Windows clusters concurrently.


Deleted Clusters are Listed in Cluster Lists

Symptom

After running tkgi delete-cluster and cluster deletion has completed, the deleted cluster continues to be listed when running tkgi clusters.

Workaround

You must manually remove the deleted cluster using a customized version of the ncp_cleanup script. For more information, see Deleting a Tanzu Kubernetes Grid Integrated Edition cluster with “tkgi delete-cluster” stuck “in progress” status in the Broadcom Support Knowledge Base.


BOSH Director Logs the Error ‘Duplicate vm extension name’

Symptom

After you uninstall TKGI, then reinstall TKGI in the same environment, BOSH Director logs errors similar to the following:

.../gems/bosh-director-0.0.0/lib/bosh/director/deployment_plan/cloud_manifest_parser.rb:120:in `parse_vm_extensions': Duplicate vm extension name 'disk_enable_uuid' (Bosh::Director::DeploymentDuplicateVmExtensionName)

Explanation

The pivotal-container-service cloud-config was not removed when you uninstalled the TKGI tile, and it remained active. When you reinstalled the TKGI tile, an additional pivotal-container-service cloud-config was created, causing the metrics_server to fall into a crash-loop state.

Workaround

You must manually remove the pivotal-container-service cloud-config after removing your TKGI deployment, including after removing the TKGI tile from Ops Manager.

For more information, see “Duplicate vm extension name” error when metrics_server runs on Director VM in Tanzu Kubernetes Grid Integrated Edition in the VMware Tanzu Community Knowledge Base.
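
A minimal sketch of the manual cleanup with the BOSH CLI, run against the BOSH Director after the TKGI tile and deployment have been removed; confirm the exact config name in the bosh configs output before deleting anything:

    # List cloud configs and locate the leftover pivotal-container-service entry
    bosh configs --type=cloud
    # Delete the stale cloud config
    bosh delete-config --type=cloud --name=pivotal-container-service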


The TKGI API FQDN Must Not Include Trailing Whitespace

Symptom

Your TKGI logs include the following error:

'uaa'. Errors are:- Error filling in template 'uaa.yml.erb' (line 59: Client redirect-uri is invalid: uaa.clients.pks_cli.redirect-uri Client redirect-uri is invalid: uaa.clients.pks_cluster_client.redirect-uri)

Explanation

The TKGI API fully-qualified domain name (FQDN) for your cluster contains leading or trailing whitespace.

Workaround

Do not include whitespace in the TKGI tile API Hostname (FQDN) field.


TMC Cluster Data Protection Backup Fails After Upgrading TKGI

The TMC Cluster Data Protection Backup fails in TKGI environments upgraded from an earlier version.

Symptom

The TMC Cluster Data Protection Backup fails to back up your existing clusters and logs the following error:

error executing custom action (groupResource=customresourcedefinitions.apiextensions.k8s.io, namespace=, name=ncpconfigs.nsx.vmware.com): rpc error: code = Unknown desc = error fetching v1beta1 version of ncpconfigs.nsx.vmware.com: the server could not find the requested resource

Explanation

Kubernetes v1.22 disallows the spec.preserveUnknownFields: true configuration in your existing clusters, and the creation of a v1 CustomResourceDefinitions configuration fails.


TMC Cluster Data Protection Restore Fails When Using Antrea CNI

The TMC Cluster Data Protection Restore operation can fail when restoring multiple Antrea resources.

Symptom

The TMC Cluster Data Protection Restore fails and logs errors that requests to restore the admission webhook have been denied.

Explanation

Velero has encountered a race condition while operating a resource. For more information, see Allow customizing restore order for Kubernetes controllers and their managed resources in the Velero GitHub repository.


TKGI Does Not Support CVDS / NVDS Mixed Environments

TKGI does not support environments where there are multiple matching networks, such as a mixed CVDS/NVDS environment.

Symptom

TKGI logs errors similar to the following in an environment with multiple matching networks:

LastOperationstatus='failed', description='Instance provisioning failed:
There was a problem completing your request. Please contact your operations team providing the following information:
service: p.pks, service-instance-guid: ..., broker-request-id: ..., task-id: ..., operation: create,
error-message: Unknown CPI error 'Unknown' with message 'undefined method `mob' for <VimSdk::Vim::OpaqueNetwork:' in create_vm' CPI method

Explanation

TKGI cannot identify which of the matching networks you intend to use and has selected the wrong network.


Occasionally update-cluster Does Not Complete for Windows Workers

Occasionally, tkgi update-cluster hangs while updating a Windows worker node instance and the BOSH task cannot finish and exits.

Symptom

The ovsdb-server service has stopped but other processes report that it is running.

Explanation

The ovsdb-server.pid file contains the PID of a process that is not ovsdb-server.

To confirm that this is the root cause for tkgi update-cluster to hang:

  • To verify that the ovsdb-server service has actually stopped, run the PowerShell Get-Service cmdlet on the Windows worker node.
  • To verify that other processes report the ovsdb-server service is still running:

    1. Review the ovsdb-server job-service-wrapper.err.log log file, which is located at:

      C:\var\vcap\sys\log\openvswitch-windows\ovsdb-server\job-service-wrapper.err.log
      
    2. Confirm that after the flushing processes, the log includes an error similar to the following:

      Pid-Guard : ovsdb-server is already runing, please stop it first
      At C:\var\vcap\jobs\openvswitch-windows\bin\ovsdb-server_ctl.ps1:30 char:5
      +     Pid-Guard $PIDFILE "ovsdb-server"
      +     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          + CategoryInfo          : NotSpecified: ( [Write-Error], WriteErrorException
          + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Pid-Guard
      
  • To verify the root cause:

    1. Run the following PowerShell commands on the Windows worker node:

      $RUN_DIR = "C:\var\vcap\sys\run\openvswitch-windows"
      $PIDFILE = "$RUN_DIR\ovsdb-server.pid"
      $pid1 = Get-Content $PidFile -First 1
      echo $pid1
      $rst = Get-Process -Id $pid1 -ErrorAction SilentlyContinue
      echo $rst
      
    2. Confirm the returned ProcessName is not ovsdb-server.

Workaround

To resolve this issue for a single Windows worker:

  1. SSH to the affected worker node.
  2. Run the following:

    rm C:\var\vcap\sys\run\openvswitch-windows\ovsdb-server.pid
    
  3. Wait for the ovsdb-server process to start.
  4. Confirm the dependent services also start.


Harbor Private Projects Are Inaccessible after Upgrading to TKGI v1.13.0

If LDAP is enabled, Harbor private projects are inaccessible after upgrading to TKGI v1.13.0. For more information, see Private projects become inaccessible after upgrading Harbor for TKGI to v2.4.x with LDAP feature enabled in the Broadcom Support Knowledge Base.


Deployments Fail on TKGI Windows Worker-based Kubernetes Clusters after the January 2022 Microsoft Windows Security Patch

Microsoft changed Microsoft Windows’ support for tar file commands in the January 2022 Microsoft Windows security patch.

Packaging scripts that use tar commands for Windows worker-based Kubernetes Cluster deployments can fail after the Microsoft tar command patch update has been applied.

The BOSH agent used by vSphere stemcells built by stembuild v2019.43 and earlier uses tar commands that are no longer supported and will fail if the Microsoft Windows security patch has been applied.

Workaround

stembuild v2019.44 and later include a version of the BOSH agent that does not use unsupported tar commands.

If you use vSphere stemcells, use stembuild 2019.44 or later to avoid the BOSH agent tar error.


TKGI Clusters Fail after NSX Upgrade If They Use NSGroup Policy API Resources

TKGI supports clusters that use NSGroup Policy API resources, but Policy API NSGroups created in one NSX version will be empty after upgrading NSX to a newer version.

Workaround

BOSH reconfigures a deployment’s NSGroup members if the deployment is redeployed.

After upgrading NSX, redeploy affected deployments to reconfigure their NSGroup members (a sketch follows these steps):

  1. Re-Apply Changes on the Ops Manager UI to redeploy TKGI tile deployments.
  2. Re-deploy the affected cluster deployments.
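
A minimal sketch of step 2 using the BOSH CLI, where DEPLOYMENT-ID is the BOSH-generated ID of the affected cluster deployment; redeploying the unchanged manifest is one way to trigger BOSH to reconfigure the NSGroup members:

    # Download the deployment's current manifest
    bosh -d service-instance-DEPLOYMENT-ID manifest > /tmp/cluster.yml
    # Redeploy it unchanged to trigger reconfiguration
    bosh -d service-instance-DEPLOYMENT-ID deploy /tmp/cluster.yml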


Rotating NSX certificates fails after migrating to NSX Policy API

This issue is fixed in TKGI v1.20.0.

After migrating from NSX Management Plane API to NSX Policy API, rotating NSX certificates sometimes fails due to a mismatch between policy display name and ID.

Symptom

Running tkgi rotate-certificates CLUSTER --non-interactive --only-nsx results in the following error seen in the pks-api logs:

```
Failed to retrieve certificate of display name pks-f5703ad0-1af1-402a-8f77-8a0cb52fea58
2024-06-13 14:16:21.749 ERROR 278082 — [nio-9021-exec-8] i.p.pks.cluster.CertificateService : Unknown error occurred rotating nsx certs
```

Explanation

When TKGI first creates a cluster, it names its NSX certificates following the pattern pks-CLUSTER-ID, as both a display name and an internal name.

TKGI v1.14 and prior had a known issue: Rotating a cluster’s NSX certificates saved the new certificates under an autogenerated internal name, a GUID without a pks- prefix, and did not retain the cert’s display name.

When you migrate a cluster to the NSX Policy API, its NSX certificate is saved as a policy object with its name set to the certificate’s internal name.

The certificate rotation process retrieves certificates by their display name, so it cannot find certificates rotated in TKGI v1.14 and prior.

Workaround

See How to rotate Tanzu Kubernetes Grid Integrated Edition tls-nsx-t cluster certificate in the Broadcom Support KB.


Pods on NSX v3.2.3 Can Enter a NotReady State

When TKGI is deployed on NSX v3.2.3 and there are large numbers of pods with liveness probes, the pods on TKGI-provisioned clusters can enter a NotReady state.

Symptom

In addition to your pods being NotReady, if you restart NSX Manager:

  • Your NSX API logs include numerous repetitions of "POST /nsxapi/api/v1/firewall/sections/.../rules?operation=insert_bottom HTTP/1.1" ....
  • Your NCP logs include errors similar to:

    "nsx-container-ncp" subcomp="ncp" level="ERROR" security="True" errorCode="NCP00034"] nsx_ujo.ncp.nsx.manager.firewall_service Failed to create health check rule for port ...: Service cluster: 'https://nsx-manager.example.com' is unavailable. Please, check NSX setup and/or configuration.
    

Description

As pods are created or deleted, DFW firewall rules are replicated for the pod’s liveness probe. In NSX v3.2.3, the firewall rules are unintentionally duplicated during this replication. After numerous pod creation/deletion events, the compounded duplication creates a DFW firewall section large enough to create noticeable delays during pod operations and, eventually, a pod NotReady state.

Workaround

Upgrade NSX to a version that includes the fix: NSX v3.2.4, or v4.1.1 or later.


