This topic contains release notes for Tanzu Kubernetes Grid Integrated Edition (TKGI) v1.16.



TKGI v1.16.0

Release Date: February 28, 2023


Product Snapshot

Release Details
Version                            v1.16.0
Release date                       February 28, 2023

Component                          Version
Antrea                             v1.6.0* (Release Notes)
cAdvisor                           v0.39.1
Containerd                         Linux: v1.6.6; Windows: v1.6.6
CoreDNS                            v1.9.3+vmware.4*
CSI Driver for vSphere             v2.7.0* (Release Notes)
etcd                               v3.5.6*
Harbor                             v2.6.2* (Release Notes)
Kubernetes                         v1.25.4* (Release Notes)
Metrics Server                     v0.6.1
NCP                                v4.1.0*
Percona XtraDB Cluster (PXC)       v0.44.0
UAA                                v74.5.63*
Velero                             v1.9.5* (Release Notes)
VMware Cloud Foundation (VCF)      Incompatible**
Wavefront                          Wavefront Collector: v1.12.0*; Wavefront Proxy: v12.0*

Compatibilities                    Versions
Ops Manager                        See VMware Tanzu Network.
VMware NSX, vSphere                See VMware Product Interoperability Matrices***.
Windows stemcells                  v2019.55* or later
Ubuntu Jammy stemcells             See VMware Tanzu Network.

* Components marked with an asterisk have been updated.
** VCF is not supported with TKGI v1.16 at this time. For more information, see Interoperability with VMware Cloud Foundation Is Unavailable below.
*** In-Tree vSphere Storage Volume support requires vSphere 7.0u2 and later. Migration from NSX Management Plane API to NSX Policy API requires VMware NSX v4.0.1.1 or later.


Upgrade Path

The supported upgrade paths to Tanzu Kubernetes Grid Integrated Edition v1.16.0 are from TKGI v1.15.2 and earlier TKGI v1.15 patches.


Breaking Changes

TKGI v1.16.0 has the following breaking changes:

  • Existing Telemetry Program configuration settings are ignored and telemetry must be reconfigured.

    The terms of the Telemetry & Customer Experience Improvement Program (CEIP) have been updated. Your previous selection is reset when you upgrade to TKGI v1.16. Review VMware’s Customer Experience Improvement Program and indicate whether you want to participate on the CEIP tab of the TKGI tile.

    To reconfigure Telemetry, see VMware CEIP in the Installing Tanzu Kubernetes Grid Integrated Edition topic for your IaaS.

    For more information on the Telemetry enhancements in this release, see Telemetry Enhancements.

  • Upgrades Kubernetes to v1.25:

    • Kubernetes no longer serves the following:

      • batch/v1beta1 API version of CronJob.
      • discovery.k8s.io/v1beta1 API version of EndpointSlice.
      • events.k8s.io/v1beta1 API version of Event.
      • autoscaling/v2beta1 API version of HorizontalPodAutoscaler.
      • policy/v1beta1 API version of PodDisruptionBudget.
      • PodSecurityPolicy in the policy/v1beta1 API.

        Note: Pod Security Policy configurations must be migrated to Pod Security Admission before upgrading to TKGI v1.16. For more information, see Migrate from PSP to PSA Controller in Pod Security Admission in TKGI.

      • RuntimeClass in the node.k8s.io/v1beta1 API.

      For more information on Kubernetes v1.25 API removals, see Deprecated API Migration Guide - v1.25 in the Kubernetes documentation.

    • No longer supports In-Tree vSphere Storage Volumes on vSphere 7.0u1 and earlier. Kubernetes v1.25 supports In-Tree vSphere Storage Volumes on vSphere 7.0u2 and later only.

    For information about additional changes in Kubernetes v1.25, see CHANGELOG-1.25 in the Kubernetes GitHub repository.

  • Support for the Xenial Stemcell has been removed. For more information, see Supports the Ubuntu Jammy Stemcell below.
  • The TKGI API requires a CA certificate with a SAN field:

    • The custom CA certificate used to secure TKGI API connections must include a SAN field. If the TKGI API certificate does not include a SAN field, TKGI CLI commands will return the following error:

      An error occurred in the PKS API when processing
      

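The Kubernetes v1.25 API removals listed above can be checked against parsed manifests before upgrading. The following Python sketch is illustrative only and is not part of TKGI: the function name find_removed_apis and the sample manifests are hypothetical, and the replacement API versions are taken from the upstream Kubernetes Deprecated API Migration Guide.

```python
# Removed in Kubernetes v1.25: (apiVersion, kind) -> replacement apiVersion.
# None means there is no replacement API (PodSecurityPolicy was removed outright).
REMOVED_IN_1_25 = {
    ("batch/v1beta1", "CronJob"): "batch/v1",
    ("discovery.k8s.io/v1beta1", "EndpointSlice"): "discovery.k8s.io/v1",
    ("events.k8s.io/v1beta1", "Event"): "events.k8s.io/v1",
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): "autoscaling/v2",
    ("policy/v1beta1", "PodDisruptionBudget"): "policy/v1",
    ("policy/v1beta1", "PodSecurityPolicy"): None,
    ("node.k8s.io/v1beta1", "RuntimeClass"): "node.k8s.io/v1",
}


def find_removed_apis(manifests):
    """Return (name, kind, old apiVersion, replacement) for each affected manifest."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        if key in REMOVED_IN_1_25:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append((name, key[1], key[0], REMOVED_IN_1_25[key]))
    return findings


# Hypothetical manifests, already parsed from YAML into dictionaries.
sample = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "nightly"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
    {"apiVersion": "policy/v1beta1", "kind": "PodSecurityPolicy", "metadata": {"name": "restricted"}},
]

for name, kind, old, new in find_removed_apis(sample):
    action = f"change apiVersion to {new}" if new else "migrate to Pod Security Admission"
    print(f"{kind}/{name}: {old} is not served in v1.25; {action}")
```

A check like this only flags the apiVersion field. Some replacement APIs also change field names, so review each flagged manifest against the migration guide.
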

Features and Enhancements

TKGI v1.16.0 has the following features:


Telemetry Enhancements

Customers who participate in the CEIP receive proactive support benefits that include a weekly report based on telemetry data. Contact your Customer Success Manager to subscribe to this report. You can view a sample report at TKGI Platform Operations Report.


vSphere CSI Driver Enhancements

TKGI v1.16.0 includes the following vSphere CSI Driver enhancements:

  • Supports the snapshot and restore feature for persistent volumes. For more information, see Customize the Maximum Number of Volume Snapshots.

  • Supports using the vSphere Container Storage Interface (CSI) Driver on cluster worker nodes that are distributed across multiple data centers. For more information, see Configure CNS Data Centers in Deploying and Managing Cloud Native Storage (CNS) on vSphere.


Supports the Ubuntu Jammy Stemcell

Supports the Ubuntu Jammy Stemcell.

Support for the Ubuntu Jammy Stemcell replaces support for the Xenial Stemcell. TKGI Kubernetes cluster node VMs and the TKGI Control Plane now use the Ubuntu Jammy Stemcell.

Upgrading an existing TKGI cluster to TKGI v1.16 automatically switches the cluster to the Ubuntu Jammy Stemcell.

You must import a supported Ubuntu Jammy Stemcell before upgrading TKGI to TKGI v1.16.0 or later. For more information, see Download and Import Stemcells in Upgrading Tanzu Kubernetes Grid Integrated Edition.


Compatible with Ops Manager v3.0

TKGI v1.16 supports Ops Manager v3.0 and Ops Manager v2.10. For more information about the new features and improvements in Ops Manager v3.0, see Ops Manager v3.0 Release Notes in the Ops Manager documentation.

For information about the Ops Manager v3.0 and v2.10 patch release versions supported by TKGI v1.16.0, see Product Snapshot above.


Supports Migrating TKGI to NSX Policy API

Supports promoting TKGI and TKGI Kubernetes clusters and workloads from NSX Management Plane API to NSX Policy API on vSphere with VMware NSX v4.0.1.1 or later.

For more information, see Migrating the NSX Management Plane API to NSX Policy API - Overview.


Supports the Velero vSphere Plugin

Supports backing up and restoring TKGI and TKGI Kubernetes clusters on vSphere using the Velero vSphere Plugin.

For more information, see Installing Velero vSphere Plugin.


vSphere CSI Supports Multiple Data Centers

Supports using the vSphere Container Storage Interface (CSI) Driver on cluster worker nodes that are distributed across multiple data centers. For more information, see Configure CNS Data Centers in Deploying and Managing Cloud Native Storage (CNS) on vSphere.


Additional Features

TKGI v1.16.0 includes the following additional features:


Resolved Issues

TKGI v1.16.0 has the following resolved issues:


Deprecations

The following TKGI features have been deprecated or removed from TKGI v1.16:

  • Google Cloud Platform: Support for the Google Cloud Platform (GCP) is deprecated. Support for GCP will be entirely removed in a future TKGI version.

  • The log_dropped_traffic CNI Configuration parameter: In TKGI v1.16.0 and later, the log_dropped_traffic CNI Configuration parameter is ignored.

    To configure logging in a Network Profile, modify the log_firewall_traffic parameter. For more information, see log_settings in the cni_configurations Parameters section in Creating and Managing Network Profiles.

  • Pod Security Policy Support: Support for Kubernetes Pod Security Policy (PSP) has been entirely removed in Kubernetes v1.25. Kubernetes v1.25 instead supports Pod Security Admission. For more information, see Enabling and Configuring Pod Security Admission.

  • Flannel Support: Support for the Flannel Container Networking Interface (CNI) is deprecated. VMware recommends that you switch your Flannel CNI-configured clusters to the Antrea CNI. For more information about Flannel CNI deprecation, see About Switching from the Flannel CNI to the Antrea CNI in About Tanzu Kubernetes Grid Integrated Edition Upgrades.

  • In-Tree vSphere Storage Volume Support: In-Tree vSphere Storage volume support has been deprecated and will be entirely removed in a future Kubernetes version. The TKGI v1.17 upgrade will automatically migrate TKGI clusters from in-tree vSphere storage to vSphere CSI. VMware strongly recommends that you migrate your in-tree vSphere storage volumes to vSphere CSI volumes as soon as possible. For information on how to manually migrate In-Tree vSphere Storage volumes on existing TKGI clusters from In-Tree vSphere Storage to the automatically installed vSphere CSI Driver, see Migrate an In-Tree vSphere Storage Volume to the vSphere CSI Driver in Deploying and Managing Cloud Native Storage (CNS) on vSphere.


Known Issues

TKGI v1.16.0 has the following known issues.


TKGI MC Unable to Manage TKGI after Restoring the TKGI Control Plane from Backup

Symptom

After you restore Ops Manager and the TKGI API VM from backup, TKGI functions normally, but your TKGI MC tabs include the following error: “…product ‘pivotal-container service’ is not deployed…”.

Explanation

TKGI MC is associated with an Ops Manager that has a specific name. If you give Ops Manager a new name while restoring it, TKGI MC does not recognize the restored Ops Manager and cannot manage it.


Kubernetes Pods on NSX-T Become Stuck in a Creating State

Symptom

The pods in your TKGI Kubernetes clusters on NSX-T become stuck in a creating state. The connections between nsx-node-agent and hyperbus repeatedly close, log Couldn't connect to 'tcp://...' (error: 111-Connection refused), and have a status of COMMUNICATION_ERROR.

Explanation

For information and workaround steps for this Known Issue, see Issue 2795268: Connection between nsx-node-agent and hyperbus flips and Kubernetes pod is stuck at creating state in NSX Container Plugin 3.1.2 Release Notes in the VMware documentation.


Error: Could Not Execute “Apply-Changes” in Azure Environment

Symptom

After clicking Apply Changes on the TKGI tile in an Azure environment, you experience an error ‘…could not execute “apply-changes”…’ with either of the following descriptions:

  • {“errors”:{“base”:[“undefined method ‘location’ for nil:NilClass”]}}
  • FailedError.new(“Resource Groups in region ‘#{location}’ do not support Availability Zones”))

For example:

INFO | 2020-09-21 03:46:49 +0000 | Vessel::Workflows::Installer#run | Install product (apply changes)
2020/09/21 03:47:02 could not execute "apply-changes": installation failed to trigger: request failed: unexpected response from /api/v0/installations:
HTTP/1.1 500 Internal Server Error
Transfer-Encoding: chunked
Cache-Control: no-cache, no-store
Connection: keep-alive
Content-Type: application/json; charset=utf-8
Date: Mon, 21 Sep 2020 17:51:50 GMT
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Pragma: no-cache
Referrer-Policy: strict-origin-when-cross-origin
Server: Ops Manager
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Frame-Options: SAMEORIGIN
X-Permitted-Cross-Domain-Policies: none
X-Request-Id: f5fc99c1-21a7-45c3-7f39
X-Runtime: 9.905591
X-Xss-Protection: 1; mode=block

44
{"errors":{"base":["undefined method `location' for nil:NilClass"]}}
0

Explanation

The Azure CPI endpoint used by Ops Manager has been changed and your installed version of Ops Manager is not compatible with the new endpoint.

Workaround

Run the following Ops Manager CLI command:

om --skip-ssl-validation --username USERNAME --password PASSWORD --target https://OPSMAN-API curl --silent --path /api/v0/staged/director/verifiers/install_time/IaasConfigurationVerifier -x PUT -d '{ "enabled": false }'

Where:

  • USERNAME is the account to use to run Ops Manager API commands.
  • PASSWORD is the password for the account.
  • OPSMAN-API is the IP address for the Ops Manager API.

For more information, see Error ‘undefined method location’ is received when running Apply Change on Azure in the VMware Tanzu Knowledge Base.
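
If you script this workaround, one option is to assemble the same om invocation as an argument list so that the credentials are never interpolated into a shell string. The Python sketch below is illustrative: the function name build_disable_verifier_command is hypothetical, and the flags, path, and request body are copied from the command above.

```python
import json
import shlex


def build_disable_verifier_command(username, password, opsman_api):
    """Build the om argument list that disables IaasConfigurationVerifier."""
    return [
        "om",
        "--skip-ssl-validation",
        "--username", username,
        "--password", password,
        "--target", f"https://{opsman_api}",
        "curl",
        "--silent",
        "--path", "/api/v0/staged/director/verifiers/install_time/IaasConfigurationVerifier",
        "-x", "PUT",
        "-d", json.dumps({"enabled": False}),
    ]


# Placeholder values; substitute your own before running the command.
argv = build_disable_verifier_command("USERNAME", "PASSWORD", "OPSMAN-API")

# Print a shell-quoted form of the command for review.
print(shlex.join(argv))
```

On a machine where the om CLI is installed, the list can be passed to subprocess.run(argv, check=True).
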


VMware vRealize Operations Does Not Support Windows Worker-Based Kubernetes Clusters

VMware vRealize Operations (vROps) does not support Windows worker-based Kubernetes clusters and cannot be used to manage TKGI-provisioned Windows workers.


TKGI Wavefront Requires Manual Installation for Windows Workers

To monitor Windows-based worker node clusters with a Wavefront collector and proxy, you must first install Wavefront on the clusters manually, using Helm. For instructions, see the Wavefront section of the Monitoring Windows Worker Clusters and Nodes topic.


Pinging Windows Worker Kubernetes Clusters Does Not Work

TKGI-provisioned Windows worker-based Kubernetes clusters inherit a Kubernetes limitation that prevents outbound ICMP communication from workers. As a result, pinging Windows workers does not work.

For information about this limitation, see Limitations > Networking in the Windows in Kubernetes documentation.


Velero Does Not Support Backing Up Stateful Windows Workloads

You can use Velero to back up stateless TKGI-provisioned Windows workers only. You cannot use Velero to back up stateful Windows applications. For more information, see Velero on Windows in Basic Install in the Velero documentation.


Tanzu Mission Control Integration Not Supported on GCP

TKGI on Google Cloud Platform (GCP) does not support Tanzu Mission Control (TMC) integration, which is configured in the Tanzu Kubernetes Grid Integrated Edition tile > the Tanzu Mission Control pane.

If you intend to run TKGI on GCP, skip this pane when configuring the Tanzu Kubernetes Grid Integrated Edition tile.


TMC Data Protection Feature Requires Privileged TKGI Containers

The TMC Data Protection feature supports privileged TKGI containers only. For more information, see Plans in the Installing TKGI topic for your IaaS.


Windows Worker Kubernetes Clusters with Group Managed Service Account Do Not Support Compute Profiles

Windows worker-based Kubernetes clusters integrated with group Managed Service Account (gMSA) cannot be managed using compute profiles.


Windows Worker Kubernetes Clusters on Flannel Do Not Support Compute Profiles

On vSphere with NSX-T networking, you can use compute profiles with both Linux and Windows worker-based Kubernetes clusters. On vSphere with Flannel networking, you can apply compute profiles only to Linux clusters.


TKGI CLI Does Not Prevent Reducing the Control Plane Node Count

The TKGI CLI does not prevent you from accidentally reducing a cluster’s control plane node count using a compute profile.

Warning: Reducing a cluster’s control plane node count can destroy the cluster. Do not scale out or scale in existing control plane nodes by reconfiguring the TKGI tile or by using a compute profile. Reducing a cluster’s number of control plane nodes might remove a control plane node and cause the cluster to become inactive.


Windows Cluster Nodes Not Deleted After VM Deleted

Symptom

After you delete a VM using the management console of your infrastructure provider, a Windows worker node that had been on that VM is now in a NotReady state.

Solution

  1. To identify the leftover node:

    kubectl get no -o wide
    
  2. Locate nodes on the returned list that are in a NotReady state and have the same IP address as another node in the list.
  3. To manually delete a NotReady node:

    kubectl delete node NODE-NAME
    

    Where NODE-NAME is the name of the node in the NotReady state.
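
The selection rule in step 2 can be expressed as a small filter. This Python sketch assumes that you have already copied the node name, status, and internal IP address from the kubectl output into dictionaries. The function name find_leftover_nodes and the sample rows are hypothetical, and NotReady is the status string as kubectl prints it.

```python
from collections import Counter


def find_leftover_nodes(nodes):
    """Return names of NotReady nodes whose IP address is also used by another node."""
    ip_counts = Counter(node["ip"] for node in nodes)
    return [
        node["name"]
        for node in nodes
        if node["status"] == "NotReady" and ip_counts[node["ip"]] > 1
    ]


# Hypothetical rows transcribed from `kubectl get no -o wide` output.
nodes = [
    {"name": "worker-a", "status": "Ready", "ip": "10.0.0.5"},
    {"name": "worker-b", "status": "NotReady", "ip": "10.0.0.5"},
    {"name": "worker-c", "status": "NotReady", "ip": "10.0.0.9"},
]

print(find_leftover_nodes(nodes))
```

In this sample only worker-b qualifies: worker-c is also NotReady, but no other node shares its IP address, so it is not treated as a leftover node.
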


502 Bad Gateway After OIDC Login

Symptom

You experience a “502 Bad Gateway” error from the NSX load balancer after you log in to OIDC.

Explanation

A large response header has exceeded your NSX-T load balancer maximum response header size. The default maximum response header size is 10,240 characters and should be resized to 50,000.

Workaround

If you experience this issue, manually reconfigure your NSX-T request_header_size and response_header_size to 50,000 characters. For information about configuring NSX-T default header sizes, see OIDC Response Header Overflow in the Knowledge Base.


Difficulty Changing Proxy for Windows Workers

You must configure a global proxy in the Tanzu Kubernetes Grid Integrated Edition tile > Networking pane before you create any Windows workers that use the proxy.

You cannot change the proxy configuration for Windows workers in an existing cluster.


Character Limitations in HTTP Proxy Password

For vSphere with NSX-T, the HTTP Proxy password field does not support the following special characters: & or ;.
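
A password can be checked for these characters before you enter it in the tile. The following Python sketch is illustrative only: the function name unsupported_password_chars is hypothetical, and it checks only the two characters named above.

```python
# The two characters that the HTTP Proxy password field does not support.
UNSUPPORTED_CHARS = ("&", ";")


def unsupported_password_chars(password):
    """Return the unsupported special characters that appear in the password."""
    return [ch for ch in UNSUPPORTED_CHARS if ch in password]


# Hypothetical passwords for illustration.
print(unsupported_password_chars("p@ss;word&1"))
print(unsupported_password_chars("s3cret!"))
```
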


Error After Modifying Your Harbor Storage Configuration

Symptom

You receive the following error after modifying your existing Harbor installation’s storage configuration:

Error response from daemon: manifest for ... not found: manifest unknown: manifest unknown

Explanation

Harbor does not support modifying an existing Harbor installation’s storage configuration.

Workaround

To modify your Harbor storage configuration, re-install Harbor. Before starting Harbor, configure the new Harbor installation with the desired configuration.


Ingress Controller Statefulset Fails to Start After Resizing Worker Nodes

Symptom

Permissions are removed from your cluster’s files and processes after resizing the persistent disk during a cluster upgrade. The ingress controller statefulset fails to start.

Explanation

When resizing a persistent disk, BOSH migrates the data from the old disk to the new disk but does not copy the files’ extended attributes.

Workaround

To resolve the problem, complete the steps in Ingress controller statefulset fails to start after resize of worker nodes with permission denied (https://community.pivotal.io/s/article/5000e00001nCJxT1603094435795?language=en_US) in the VMware Tanzu Knowledge Base.


Azure Default Security Group Is Not Automatically Assigned to Cluster VMs

Symptom

You experience issues when configuring a load balancer for a multi-control plane node Kubernetes cluster or creating a service of type LoadBalancer. Additionally, in the Azure portal, the VM > Networking page does not display any inbound and outbound traffic rules for your cluster VMs.

Explanation

As part of configuring the Tanzu Kubernetes Grid Integrated Edition tile for Azure, you enter Default Security Group in the Kubernetes Cloud Provider pane. When you create a Kubernetes cluster, Tanzu Kubernetes Grid Integrated Edition automatically assigns this security group to each VM in the cluster. However, on Azure the automatic assignment might not occur.

As a result, your inbound and outbound traffic rules defined in the security group are not applied to the cluster VMs.

Workaround

If you experience this issue, manually assign the default security group to each VM NIC in your cluster.


One Plan ID Longer than Other Plan IDs

Symptom

One of your plan IDs is one character longer than your other plan IDs.

Explanation

In TKGI, each plan has a unique plan ID. A plan ID is normally a UUID consisting of 32 alphanumeric characters and 4 hyphens. However, the Plan 4 ID consists of 33 alphanumeric characters and 4 hyphens.

Solution

You can safely configure and use Plan 4. The length of the Plan 4 ID does not affect the functionality of Plan 4 clusters.

If you require all plan IDs to have identical length, do not activate or use Plan 4.
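
The length difference described above can be confirmed by counting characters. In this Python sketch, both IDs are invented for illustration and are not real TKGI plan IDs, and the function names are hypothetical.

```python
def describe_plan_id(plan_id):
    """Count the alphanumeric characters and hyphens in a plan ID."""
    alnum = sum(ch.isalnum() for ch in plan_id)
    hyphens = plan_id.count("-")
    return alnum, hyphens


def is_standard_length(plan_id):
    """True when the ID has the usual 32 alphanumeric characters and 4 hyphens."""
    return describe_plan_id(plan_id) == (32, 4)


# Hypothetical IDs for illustration only.
typical_id = "8a1f3c2e-5b7d-4e90-a1b2-c3d4e5f60718"
longer_id = "8a1f3c2e-5b7d-4e90-a1b2-c3d4e5f607189"

print(describe_plan_id(typical_id), is_standard_length(typical_id))
print(describe_plan_id(longer_id), is_standard_length(longer_id))
```
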


Database Cluster Stops After a Database Instance is Stopped

Symptom

After you stop one instance in a multiple-instance database cluster, the cluster stops, or communication between the remaining databases times out, and the entire cluster becomes unreachable.

The following might be in your UAA log:

WSREP has not yet prepared node for application use

Explanation

The database cluster is unable to recover automatically because a member is no longer available to reconcile quorum.


Velero Back Up Fails for vSphere PVs Attached to Clusters on Kubernetes v1.20 and Later

Symptom

Backing up vSphere persistent volumes using Velero fails and your Velero backup log includes the following error:

rpc error: code = Unknown desc = Failed during IsObjectBlocked check: Could not translate selfLink to CRD name

Explanation

This is a known issue when backing up clusters on Kubernetes v1.20 and later using the Velero Plugin for vSphere v1.1.0 or earlier.

Workaround

To resolve the problem, complete the steps in Velero backups of vSphere persistent volumes fail on Kubernetes clusters version 1.20 or higher (83314) in the VMware Tanzu Knowledge Base.


Creating Two Windows Clusters at the Same Time Fails

Symptom

The first time that you try to create two Windows clusters at the same time, the creation of one of the clusters fails. If you run tkgi cluster CLUSTER-NAME to examine the last action taken on the cluster, you see the following:

Last Action:              Create
Last Action State:        failed
Last Action Description:  Instance provisioning failed: There was a problem completing your request. … operation: create, error-message: Failed to acquire lock … locking task id is 111, description: ‘create deployment’

Explanation

This is a known issue that occurs the first time that you create two Windows clusters concurrently.

Workaround

Recreate the failed cluster. This issue only occurs the first time that you create two Windows clusters concurrently.


Deleted Clusters are Listed in Cluster Lists

Symptom

After you run tkgi delete-cluster and the cluster deletion completes, the deleted cluster is still listed when you run tkgi clusters.

Workaround

You must manually remove the deleted cluster using a customized version of the ncp_cleanup script. For more information, see Deleting a Tanzu Kubernetes Grid Integrated Edition cluster with “tkgi delete-cluster” stuck “in progress” status in the VMware Tanzu Knowledge Base.


BOSH Director Logs the Error ‘Duplicate vm extension name’

Symptom

After you uninstall TKGI, then reinstall TKGI in the same environment, BOSH Director logs errors similar to the following:

.../gems/bosh-director-0.0.0/lib/bosh/director/deployment_plan/cloud_manifest_parser.rb:120:in `parse_vm_extensions': Duplicate vm extension name 'disk_enable_uuid' (Bosh::Director::DeploymentDuplicateVmExtensionName)

Explanation

The pivotal-container-service cloud-config was not removed when you uninstalled the TKGI tile, and it remained active. When you reinstalled the TKGI tile, an additional pivotal-container-service cloud-config was created, causing the metrics_server to fall into a crash-loop state.

Workaround

You must manually remove the pivotal-container-service cloud-config after removing your TKGI deployment, including after removing the TKGI tile from Ops Manager.

For more information, see “Duplicate vm extension name” error when metrics_server runs on Director VM in Tanzu Kubernetes Grid Integrated Edition in the VMware Tanzu Community Knowledge Base.


The TKGI API FQDN Must Not Include Trailing Whitespace

Symptom

Your TKGI logs include the following error:

'uaa'. Errors are:- Error filling in template 'uaa.yml.erb' (line 59: Client redirect-uri is invalid: uaa.clients.pks_cli.redirect-uri Client redirect-uri is invalid: uaa.clients.pks_cluster_client.redirect-uri)

Explanation

The TKGI API fully-qualified domain name (FQDN) for your cluster contains leading or trailing whitespace.

Workaround

Do not include whitespace in the TKGI tile API Hostname (FQDN) field.
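
Leading or trailing whitespace is hard to see in a form field, so it can help to check the value before you paste it. This Python sketch is illustrative only: the function name contains_whitespace and the host name are hypothetical.

```python
def contains_whitespace(fqdn):
    """Return True if the value contains any whitespace, including at either end."""
    return any(ch.isspace() for ch in fqdn)


# Hypothetical host names; the second has a trailing space.
print(contains_whitespace("api.tkgi.example.com"))
print(contains_whitespace("api.tkgi.example.com "))
```
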


TMC Cluster Data Protection Backup Fails After Upgrading TKGI

The TMC Cluster Data Protection Backup fails in TKGI environments upgraded from an earlier version.

Symptom

The TMC Cluster Data Protection Backup fails to back up your existing clusters and logs the following error:

error executing custom action (groupResource=customresourcedefinitions.apiextensions.k8s.io, namespace=, name=ncpconfigs.nsx.vmware.com): rpc error: code = Unknown desc = error fetching v1beta1 version of ncpconfigs.nsx.vmware.com: the server could not find the requested resource

Explanation

Kubernetes v1.22 disallows the spec.preserveUnknownFields: true configuration in your existing clusters and the creation of a v1 CustomResourceDefinitions configuration fails.


TMC Cluster Data Protection Restore Fails When Using Antrea CNI

The TMC Cluster Data Protection Restore operation can fail when restoring multiple Antrea resources.

Symptom

The TMC Cluster Data Protection Restore fails and logs errors indicating that requests to restore the admission webhook were denied.

Explanation

Velero has encountered a race condition while operating on a resource. For more information, see Allow customizing restore order for Kubernetes controllers and their managed resources in the Velero GitHub repository.


TKGI Does Not Support CVDS / NVDS Mixed Environments

TKGI does not support environments where there are multiple matching networks, such as a mixed CVDS/NVDS environment.

Symptom

TKGI logs errors similar to the following in an environment with multiple matching networks:

LastOperationstatus='failed', description='Instance provisioning failed:
There was a problem completing your request. Please contact your operations team providing the following information:
service: p.pks, service-instance-guid: ..., broker-request-id: ..., task-id: ..., operation: create,
error-message: Unknown CPI error 'Unknown' with message 'undefined method `mob' for <VimSdk::Vim::OpaqueNetwork:' in create_vm' CPI method

Explanation

TKGI cannot identify which of the matching networks you intend to use and has selected the wrong network.


Occasionally update-cluster Does Not Complete for Windows Workers

Occasionally, tkgi update-cluster hangs while updating a Windows worker node instance, and the BOSH task exits without finishing.

Symptom

The ovsdb-server service has stopped but other processes report that it is running.

Explanation

The ovsdb-server.pid file contains the PID of a process that is not ovsdb-server.

To confirm that this is why tkgi update-cluster hangs:

  • To verify that the ovsdb-server service has actually stopped, run the PowerShell Get-Service cmdlet on the Windows worker node.
  • To verify that other processes report the ovsdb-server service is still running:

    1. Review the ovsdb-server job-service-wrapper.err.log log file.
      The job-service-wrapper.err.log log file is located at:

      C:\var\vcap\sys\log\openvswitch-windows\ovsdb-server\job-service-wrapper.err.log
      
    2. Confirm that after the flushing processes, the log includes an error similar to the following:

      Pid-Guard : ovsdb-server is already runing, please stop it first
      At C:\var\vcap\jobs\openvswitch-windows\bin\ovsdb-server_ctl.ps1:30 char:5
      +     Pid-Guard $PIDFILE "ovsdb-server"
      +     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          + CategoryInfo          : NotSpecified: ( [Write-Error], WriteErrorException
          + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Pid-Guard
      
  • To verify the root cause:

    1. Run the following PowerShell commands on the Windows worker node:

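      # Read the PID recorded in the ovsdb-server PID file.
      $RUN_DIR = "C:\var\vcap\sys\run\openvswitch-windows"
      $PIDFILE = "$RUN_DIR\ovsdb-server.pid"
      $pid1 = Get-Content $PIDFILE -First 1
      echo $pid1
      # Look up the process that currently owns that PID.
      $rst = Get-Process -Id $pid1 -ErrorAction SilentlyContinue
      echo $rst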
      
    2. Confirm the returned ProcessName is not ovsdb-server.

Workaround

To resolve this issue for a single Windows worker:

  1. SSH to the affected worker node.
  2. Run the following:

    rm C:\var\vcap\sys\run\openvswitch-windows\ovsdb-server.pid
    
  3. Wait for the ovsdb-server process to start.
  4. Confirm the dependent services also start.


Harbor Private Projects Are Inaccessible after Upgrading to TKGI v1.13.0

If LDAP is enabled, Harbor private projects are inaccessible after upgrading to TKGI v1.13.0. For more information, see Private projects become inaccessible after upgrading Harbor for TKGI to v2.4.x with LDAP feature enabled in the VMware Tanzu Knowledge Base.


Deployments Fail on TKGI Windows Worker-based Kubernetes Clusters after the January 2022 Microsoft Windows Security Patch

Microsoft changed Microsoft Windows’ support for tar file commands in the January 2022 Microsoft Windows security patch.

Packaging scripts that use tar commands for Windows worker-based Kubernetes Cluster deployments can fail after the Microsoft tar command patch update has been applied.

The BOSH agent used by vSphere stemcells built by stembuild v2019.43 and earlier uses tar commands that are no longer supported and will fail if the Microsoft Windows security patch has been applied.

Workaround

stembuild v2019.44 and later include a version of the BOSH agent that does not use unsupported tar commands.

If you use vSphere stemcells, use stembuild v2019.44 or later to avoid the BOSH agent tar error.


TKGI Reallocates Network Profile-Allocated FIP Pool Addresses

Symptom

The pre-start script for tkgi create-cluster fails and logs floating IP pool allocation errors in the pre-start.stderr.log similar to the following:

level=error msg="operation failed with [POST /pools/ip-pools/{pool-id}][409] allocateOrReleaseFromIpPoolConflict  &{RelatedAPIError:{Details: ErrorCode:5141 ErrorData:<nil> ErrorMessage:Requested IP Address ... is already allocated. ModuleName:id-allocation service} RelatedErrors:[]}\n"

level=warning msg="failed to allocate FIP from (pool: ...: [POST /pools/ip-pools/{pool-id}][409] allocateOrReleaseFromIpPoolConflict  &{RelatedAPIError:{Details: ErrorCode:5141 ErrorData:<nil> ErrorMessage:Requested IP Address ... is already allocated. ModuleName:id-allocation service} RelatedErrors:[]}\n"

Error: an error occurred during FIP allocation

Explanation

TKGI administrators can allocate floating IP pool IP addresses in a Network Profile configuration. The TKGI control plane allocates IP addresses from the floating IP pool without accounting for the IP addresses allocated using Network Profiles.

Workaround

TKGI allocates IP addresses starting from the beginning of a floating IP pool range. When configuring a Network Profile, allocate IP addresses starting at the end of the floating IP pool range instead of those at the beginning.
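
To pick addresses from the end of a range, you can count backward from the last address. The Python sketch below uses the standard ipaddress module. The function name addresses_from_end and the sample range are hypothetical, and the sketch does not query NSX, so it cannot tell whether an address is already allocated.

```python
import ipaddress


def addresses_from_end(first, last, count):
    """Return `count` addresses taken from the end of the inclusive range first..last."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    size = int(end) - int(start) + 1
    if count > size:
        raise ValueError("requested more addresses than the range contains")
    # Walk backward from the last address in the range.
    return [str(end - offset) for offset in range(count)]


# Hypothetical floating IP pool range (documentation address space).
print(addresses_from_end("192.0.2.10", "192.0.2.50", 3))
```
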



TKGI Clusters Fail after NSX Upgrade If They Use NSGroup Policy API Resources

TKGI supports clusters that use NSGroup Policy API resources, but Policy API NSGroups created in one NSX version will be empty after upgrading NSX to a newer version.

Workaround

BOSH reconfigures a deployment’s NSGroup members if the deployment is redeployed.

After upgrading NSX, redeploy affected deployments to reconfigure their NSGroup members:

  1. In the Ops Manager UI, click Apply Changes again to redeploy the TKGI tile deployments.
  2. Redeploy the affected cluster deployments.


Kubernetes API Server and etcd Daemon Occasionally Fail to Start During BBR Restore

The Kubernetes API server or the etcd daemon on a cluster control plane node might not start during a BBR restore, stopping the restore.

Symptom

During a BBR restore, the post-restore-unlock script occasionally times out while starting the etcd daemon or Kubernetes API server.

For example, the post-restore-unlock script shows the following when the etcd daemon fails to start:

Error attempting to run post-restore-unlock for job bbr-etcd on master...
+ NAME=post-restore-unlock
+ LOG_DIR=/var/vcap/sys/log/bbr-etcd
+ exec
++ tee -a /var/vcap/sys/log/bbr-etcd/post-restore-unlock.stdout.log
...
monit has started etcd
+ timeout 1200 /bin/bash
waiting for etcd daemon to start
Process 'etcd'     not monitored - start pending
...
waiting for etcd daemon to start
Process 'etcd'     initializing
etcd daemon was unable to start after 1200 seconds
+ exit 1 - exit code 1 

Workaround

Restart the BBR restore if the Kubernetes API server or the etcd daemon fails to start.


‘Input not an X.509 certificate’ When Applying Change on the TKGI Tile

The TKGI tile might report an error similar to the following when you click Apply Changes with a correctly formatted certificate:

Setting up key store, trust store and installing certs.
keytool error: java.lang.Exception: Input not an X.509 certificate
pre-start.stdout.log 

Explanation

The certificate contains one or more certificate keywords, for example, BEGIN or END, and does not validate.


Interoperability with VMware Aria Operations Management Pack for Kubernetes Is Unavailable

Interoperability with VMware Aria Operations Management Pack for Kubernetes is temporarily unavailable.

VMware Aria Operations Management Pack for Kubernetes is currently not compatible with TKGI v1.16. Interoperability between VMware Aria Operations Management Pack for Kubernetes and TKGI v1.16 is expected at a later time.


Interoperability with VMware Cloud Foundation Is Unavailable

Interoperability with VMware Cloud Foundation (VCF) is temporarily unavailable.

VCF is currently not compatible with TKGI v1.16. Interoperability between VCF and TKGI v1.16 is expected at a later time.


Interoperability with Tanzu Mission Control Is Unavailable

Interoperability with Tanzu Mission Control (TMC) is temporarily unavailable.

TMC is currently not compatible with Kubernetes v1.25 and cannot manage TKGI v1.16 Kubernetes clusters. Interoperability between TMC and TKGI v1.16 is expected at a later time.


Limitations on Using the VMware vSphere CSI Driver

The VMware vSphere CSI Driver supports a limited set of VMware vSphere features. Before enabling the vSphere CSI Driver on a TKGI cluster, confirm the cluster and storage configuration are supported by the driver. For more information, see Unsupported Features and Limitations in Deploying and Managing Cloud Native Storage (CNS) on vSphere.


Limitations on Using a Public Cloud CSI Driver

If you enable a public cloud CSI Driver on a TKGI cluster, you must take additional steps before deleting, upgrading, or updating the cluster.

Deleting a Cluster on a Public Cloud

When deleting a cluster that uses a public cloud CSI Driver:

  1. Manually delete the workload PVCs and PVs before deleting the cluster.
  2. Delete the cluster. For more information on deleting clusters, see Deleting Clusters.

Upgrading a Cluster on a Public Cloud

When upgrading a cluster that uses a public cloud CSI Driver:

  • No preparation steps are needed when upgrading a multi-worker node cluster.
  • To prepare a single-worker node cluster for upgrading:

    1. Resize the cluster to two or more worker nodes before upgrading the cluster. For more information, see Scaling Existing Clusters.
    2. Upgrade the cluster. For more information on upgrading clusters, see Upgrading Clusters.
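The upgrade preparation above can be sketched as a small dry-run helper. This is an illustrative sketch only: plan_cluster_upgrade is a hypothetical function, the worker count is supplied by the operator rather than queried from the cluster, and the helper prints the TKGI CLI commands instead of running them. Confirm the exact command syntax in Scaling Existing Clusters and Upgrading Clusters before use.

```shell
# Hypothetical dry-run helper: print the commands to run for a cluster that
# uses a public cloud CSI Driver. Nothing is executed against the cluster.
#   $1 = cluster name, $2 = current number of worker nodes (operator-supplied)
plan_cluster_upgrade() {
  cluster="$1"
  workers="$2"
  # A single-worker cluster must be resized to two or more workers first.
  if [ "$workers" -lt 2 ]; then
    echo "tkgi resize $cluster --num-nodes 2"
  fi
  echo "tkgi upgrade-cluster $cluster"
}
```

Calling `plan_cluster_upgrade my-cluster 1` prints the resize command followed by the upgrade command. With two or more workers, it prints only the upgrade command.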

Updating a Cluster on a Public Cloud

When updating a cluster that uses a public cloud CSI Driver:

  • No preparation steps are needed when updating a multi-worker node cluster.
  • To prepare a single-worker node cluster for updating:

    1. Resize the cluster to two or more worker nodes before updating the cluster. For more information, see Scaling Existing Clusters.
    2. Update the cluster.


The ‘kube-state-metrics’ ClusterRole Is Deleted during Cluster Upgrade

The wavefront-proxy-errand deletes the kube-state-metrics ClusterRole during cluster upgrade. The deleted ClusterRole must be manually restored after upgrading a cluster.




TKGI Management Console v1.16.0

Release Date: February 28, 2023

Note: Tanzu Kubernetes Grid Integrated Edition Management Console provides an opinionated installation of TKGI. The supported versions might differ from or be more limited than what is generally supported by TKGI.


Product Snapshot

Element Details
Version v1.16.0
Release date February 28, 2023
Installed TKGI version v1.16.0
Installed Ops Manager version v2.10.53 Release Notes
Component Version
Installed Kubernetes version v1.25.4* Release Notes
Installed Harbor Registry version v2.6.2* Release Notes
Ubuntu Jammy stemcell v1.83* Release Notes

* Components marked with an asterisk have been updated.


Upgrade Path

The supported upgrade paths to Tanzu Kubernetes Grid Integrated Edition Management Console v1.16.0 are from TKGI MC v1.15.2 and earlier TKGI MC v1.15 patches.


Breaking Changes

  • Existing Telemetry Program configuration settings are ignored and telemetry must be reconfigured.

    To reconfigure Telemetry, see VMware CEIP in the Installing Tanzu Kubernetes Grid Integrated Edition topic for your IaaS.

    For more information on the Telemetry enhancements in this release, see Telemetry Enhancements.


Features and Resolved Issues

TKGI Management Console v1.16.0 has the following features:

Telemetry Enhancements

Customers who participate in the CEIP receive proactive support benefits that include a weekly report based on telemetry data. Contact your Customer Success Manager to subscribe to this report. You can view a sample report at TKGI Platform Operations Report.


Deprecations

The following TKGI features have been deprecated or removed from TKGI Management Console v1.16:


Known Issues

The Tanzu Kubernetes Grid Integrated Edition Management Console v1.16.0 has the following known issues:


vRealize Log Insight Integration Does Not Support HTTPS Connections

Symptom

The Tanzu Kubernetes Grid Integrated Edition Management Console integration to vRealize Log Insight does not support connections to the HTTPS port on the vRealize Log Insight server.

Workaround

  1. Use SSH to log in to the Tanzu Kubernetes Grid Integrated Edition Management Console appliance VM.
  2. Open the file /lib/systemd/system/pks-loginsight.service in a text editor.
  3. Add -e LOG_SERVER_ENABLE_SSL_VERIFY=false.
  4. Set -e LOG_SERVER_USE_SSL=true.

    The resulting file should look like the following example:

    ExecStart=/bin/docker run --privileged --restart=always --network=pks
    -v /var/log/journal:/var/log/journal
    --name=pks-loginsight
    -e TYPE=gear2-vm
    -e LOG_SERVER_HOST=${LOGINSIGHT_HOST}
    -e LOG_SERVER_PORT=${LOGINSIGHT_PORT}
    -e LOG_SERVER_ENABLE_SSL_VERIFY=false
    -e LOG_SERVER_USE_SSL=true
    -e LOG_SERVER_AGENT_ID=${LOGINSIGHT_ID}
    pksoctopus/vrli-journald:v07092019
    
  5. Save the file and run systemctl daemon-reload.

  6. To restart the vRealize Log Insight service, run systemctl restart pks-loginsight.service.

Tanzu Kubernetes Grid Integrated Edition Management Console can now send logs to the HTTPS port on the vRealize Log Insight server.
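Steps 3 through 6 can be double-checked with a short script run on the appliance VM. The following is a minimal sketch: the function names loginsight_ssl_flags_set and reload_loginsight are hypothetical, and the script assumes that each flag appears in the unit file exactly as written in the example above. It confirms that both flags are present before reloading systemd and restarting the service.

```shell
# Unit file path from step 2 of the workaround.
UNIT_FILE=/lib/systemd/system/pks-loginsight.service

# Hypothetical helper: succeed only if both SSL flags are present in the file.
loginsight_ssl_flags_set() {
  file="${1:-$UNIT_FILE}"
  grep -q -e '-e LOG_SERVER_ENABLE_SSL_VERIFY=false' "$file" &&
    grep -q -e '-e LOG_SERVER_USE_SSL=true' "$file"
}

# Hypothetical helper: reload systemd and restart the service (steps 5 and 6),
# but only after the flag check passes.
reload_loginsight() {
  if loginsight_ssl_flags_set "$1"; then
    systemctl daemon-reload &&
      systemctl restart pks-loginsight.service
  else
    echo "SSL flags not found in unit file; not restarting" >&2
    return 1
  fi
}
```

Running `reload_loginsight` with no argument checks the default unit file path. Passing a path as the first argument checks a different file, which is useful for testing a copy of the file.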


vSphere HA Causes Management Console ovfenv Data Corruption

Symptom

If vSphere HA is enabled on a cluster and the host running the TKGI Management Console appliance VM reboots, vSphere HA recreates the appliance VM on another host in the cluster. Due to an issue with vSphere HA, the ovfenv data for the recreated appliance VM is corrupted, and the new appliance VM does not boot with the correct network configuration.

Workaround

  1. In the vSphere Client, right-click the appliance VM and select Power > Shut Down Guest OS.
  2. Right-click the appliance again and select Edit Settings.
  3. Select VM Options and click OK.
  4. Verify under Recent Tasks that a Reconfigure virtual machine task has run on the appliance VM.
  5. Power on the appliance VM.


Base64 encoded file arguments are not decoded in Kubernetes profiles

Symptom

Some file arguments in Kubernetes profiles are base64 encoded. When the management console displays the Kubernetes profile, some file arguments are not decoded.

Workaround

To decode a file argument, run echo "$content" | base64 --decode, where $content is the base64-encoded value.
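The decode command can be wrapped in a small function for repeated use. This is a minimal sketch: decode_file_arg is a hypothetical name, and the encoded value is assumed to be copied from the Kubernetes profile as displayed in the management console. It uses printf rather than echo so that no trailing newline is added to the input.

```shell
# Hypothetical helper: decode one base64-encoded file argument.
#   $1 = base64-encoded value copied from the Kubernetes profile
decode_file_arg() {
  printf '%s' "$1" | base64 --decode
}
```

For example, `decode_file_arg "$content"` prints the decoded contents of the value stored in the content variable.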


Network profiles not immediately selectable

Symptom

If you create network profiles and then try to apply them in the Create Cluster page, the new profiles are not available for selection.

Workaround

Log out of the management console and log back in again.


Real-Time IP information not displayed for network profiles

Symptom

In the cluster summary page, only the default IP pool, pod IP block, and node IP block values are displayed, rather than the real-time values from the associated network profile.

Workaround

None


Error After Modifying Your Harbor Storage Configuration

Symptom

You receive the following error after modifying your existing Harbor installation’s storage configuration:

Error response from daemon: manifest for ... not found: manifest unknown: manifest unknown

Explanation

Harbor does not support modifying an existing Harbor installation’s storage configuration.

Workaround

To modify your Harbor storage configuration, re-install Harbor. Before starting Harbor, configure the new Harbor installation with the desired configuration.


Windows Stemcells Must be Re-Imported After Upgrading Ops Manager

Symptom

After upgrading Ops Manager, your Management Console does not recognize a Windows stemcell imported when using the prior version of Ops Manager.

Workaround

If your Management Console does not recognize a Windows stemcell after upgrading Ops Manager:

  1. Re-import your previously imported Windows stemcell.
  2. Apply Changes to TKGI MC.


Your New Clusters Are Not Shown In Tanzu Mission Control

Symptom

After you create a cluster, Tanzu Mission Control does not include the cluster in cluster lists. Your BOSH logs contain a “Resource not found” error similar to the following:

Cluster Name in TMC: cluster-1
Cluster Name Prefix: tkgi-my-prefix-
Group Name in TMC: my-prefix-clusters
Cluster Description in TMC: VMware Enterprise PKS Attaching cluster ''tkgi-my-prefix-cluster-1'' to TMC
Fetching token successful
request POST:/v1alpha1/clusters,
response 404 Not Found:{"error":"Resource not found - clustergroup(my-prefix-clusters)
org id(d859dc9f-g622-426d-8c91-939a9f13dea9)",
"code":5,"message":"Resource not found - clustergroup(my-prefix-clusters)

Explanation

The cluster group you assign a cluster to must be defined in Tanzu Mission Control before you assign your cluster to the cluster group in the TKGI Management Console.

Workaround

To resolve the problem, complete the steps in Attaching a Tanzu Kubernetes Grid Integrated (TKGI) cluster to Tanzu Mission Control (TMC) fails with “Resource not found - clustergroup(cluster-group-name)” in the VMware Tanzu Knowledge Base.

