VMware Telco Cloud Automation 3.1 | 16 Apr 2024 | Build - TCA: 23636884 | TCA Cloud Native: 23636320 | Airgap: 23637693 | Release Code: R153

What's New

Workflow Hub

Workflow Hub serves as an umbrella orchestrator that seamlessly integrates with VMware Telco Cloud Automation, various VMware telco cloud products, and third-party tools. This empowers network operators to effortlessly create customized multi-cloud workflows, bolstered by Workflow Hub's support for the Serverless Workflow DSL.

VMware Telco Cloud Automation 3.1 introduces the following improvements to Workflow Hub:

  • Queuing workflow runs

    VMware Telco Cloud Automation supports queuing workflow runs when the system is under a heavy load of runs.

  • Workflow input validation

    Workflow input validation is supported through a data input schema.
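
    The following is a minimal, hypothetical Serverless Workflow DSL (specVersion 0.8) definition showing where a data input schema is referenced; the workflow id, schema path, and state are placeholders rather than examples shipped with the product:

     id: input-validation-example
     version: '1.0'
     specVersion: '0.8'
     name: input-validation-example
     dataInputSchema: schemas/input-validation-example.json   # JSON Schema that the run input must satisfy
     start: acknowledge-input
     states:
       - name: acknowledge-input
         type: inject
         data:
           message: input accepted                            # placeholder payload
         end: true

    If the submitted run input does not conform to the referenced schema, validation should fail before any state executes.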

Airgap OVA

  • A new built-in Visibility/Monitoring system is added in the UI.

  • Two new Photon 5 repos are added to the appliance repo list.

  • The Harbor version in the airgap server is updated to v2.7.4.

  • Support for syncing images for two new TKG versions, TKG 2.5.0 and TKG 2.4.1.

Enhanced Interoperability for VMware Telco Cloud Platform

VMware Telco Cloud Automation 3.1 enhances or adds interoperability support for the following products:

  • VMware vCenter Server: 8.0u2

  • VMware vSphere: 8.0u2

  • VMware NSX-T: 4.1.2

  • VMware Tanzu Kubernetes Grid: 2.1.1, 2.5.0

  • Kubernetes: 1.24.10, 1.26.11, 1.27.8, 1.28.4

  • VMware Cloud Director: 10.4.3, 10.5.1

  • VMware Aria Automation Orchestrator (previously known as vRealize Orchestrator): 8.14, 8.16

  • VMware Aria Operations for Logs (previously known as vRealize Log Insight): 8.14, 8.16

  • Harbor: 2.6.3, 2.7.4, 2.8.4, 2.9.1

    Note: Harbor 2.8 onwards supports OCI-based charts and images only.

  • VMware Integrated OpenStack: 7.3

Important:

For detailed cluster version compatibility, refer to CaaS Upgrade Backward Compatibility.

Branding

Updated branding logos (VMware by Broadcom) and Copyright notices have been added.

CaaS Enhancements

  • Enabled viewing supported Workload Cluster Versions for the different Management Cluster versions.

  • Support for upgrading clusters from v1.24.10 (based on TKG 2.1.1) to v1.28.4 (based on TKG 2.5) along the upgrade path.

  • Support for upgrading v1.24.17, v1.25.13, and v1.26.8 clusters (based on TKG 2.3.1) from TCA 3.0 to versions supported by TKG 2.5.

Note:

These versions must first be upgraded to 1.26.8 before they can be upgraded to TKG 2.5 supported versions.

  • Added Move To operation to change the ownership of Workload Clusters to another Management Cluster.

  • Support for classy cluster feature parity in TKG 2.5.

  • Improved Cluster Diagnosis with support for Pre-Upgrade and Post-Upgrade checks.

  • Upgrade Cluster operation with customizable Upgrade options

    • Ability to upgrade entire Cluster or individual Node Pools.

    • Ability to override VM Template for upgrades.

    • Ability to specify custom Upgrade Strategy per Node Pool.

    • Enhanced pre/post Upgrade checks.

  • Multi-TKG Support

    VMware Telco Cloud Automation officially supports TKG 2.1.1 and TKG 2.5.

  • Photon 5 support for K8S 1.27 and 1.28

    Kernel version 6 and various other enhancements are added.

VNF Enhancements

  • Ability to consume Imported, Routed, and Isolated Datacenter Group Networks for VMware Cloud Director from Network selection screen during VNF instantiation.

  • Support for refreshing DCG networks to fetch the latest DCG networks associated with the Org VDC from Network selection screen during VNF instantiation.

  • A DCG network refresh is required if the association between a DCG and its Org VDCs changes, because VMware Cloud Director does not send DCG networking-related notifications to TCA.

  • Added support for creating New Draft from Catalog for existing CSARs.

Certificate Observability

  • New dedicated Connected Endpoints view in the UI.

  • Newly added Connected Endpoints tile on the TCA Dashboard with status views.

  • Summarized, quick, and easy view of all systems connected to the TCA platform.

  • Automated monitoring of connectivity and certificate status for registered endpoints.

  • Provides certificate details for various endpoints such as K8s Clusters, vCenter, NSX, Harbor, Airgap, VMware Aria Operations for Logs, vRO, Syslog server, and others.

  • Color-coded status for certificate expiry.

  • Ability to generate and download a report of connected endpoints in .csv format.

RAN Enhancements

  • Support for Intel vRAN Boost (vRB1) for SPR-EE-MCC, SPR-EE-LCC, and SPR-SP-MCC.

  • Increased the limit on the number of SR-IOV networking adapters and PCI devices that can be configured in a CSAR from 64 to 128.

Certificate Renewal

  • The Partner Systems page in TCA now accepts an SSL certificate as input when adding or updating a Harbor partner system.

  • The TCA CaaS Harbor Add-on page accepts an SSL certificate as input when adding or updating the Harbor add-on.

  • Airgap server certificate renewal on CaaS clusters has been enhanced.

CNF and Dynamic Infrastructure Policy Enhancements

  • Granular Updates for CNF Operations with Detailed Events.

  • Prevention of deploying CNFs within System / Reserved Namespaces.

  • ENS Receive Side Scaling (RSS) support - Tech Preview.

  • Helm is upgraded from 3.8 to 3.13.3.

  • Support added for creating New Draft from Catalog for existing CSARs.

  • Restricted namespaces for CNF deployment for Kubernetes based VIMs.

  • Seamless, automated, and smart switch of DIP between Photon-3 and Photon-5

    Cluster Upgrades automatically upgrade DIP to newer Photon-5 spec based on the CNF CSAR definitions.

  • Defining Dynamic Infrastructure Policy (DIP) for multiple Photon versions within CNF CSARs.

    A single CSAR can now define Dynamic Infrastructure Policies (DIPs) for multiple Photon versions and be used to deploy and configure CNFs across Photon-3 and Photon-5 environments.

IPv6 Improvements

  • Support for deploying an IPv6/IPv4 dual-stack TKG 2.5 (v1.28.4) management cluster, with IPv6 as the primary IP family; the cluster endpoint IP version is IPv6.

  • Support for deploying an IPv6/IPv4 dual-stack v1.28.4 classy standard workload cluster, with IPv6 as the primary IP family; the endpoint IP version is IPv6 when the management cluster is IPv6/IPv4 dual-stack.

  • Support for deploying a single-stack classy standard workload cluster when the management cluster is IPv6/IPv4 dual-stack.

  • Lifecycle management of IPv6/IPv4 dual-stack management and workload clusters, in both air-gapped and non-air-gapped environments.

  • Lifecycle management of add-ons on dual-stack management and workload clusters, except Istio. TKG standard add-ons follow the TKG 2.5 support matrix.

  • Allocation of both IPv6 and IPv4 addresses to cluster nodes of IPv6/IPv4 dual-stack management and workload clusters via a DHCP server.

  • Kubernetes Service/Pod network: Antrea and Calico support dual stack.

  • CNI plugins (Multus/Whereabouts): the secondary interface of a pod supports dual stack.

  • Support for NFS service wiring on a dedicated secondary single-stack IPv6 interface.

  • Support for backup and restore of dual-stack workload clusters.

Licensing

  • Older CPU-based licenses are deprecated; an alarm is raised when a CPU license key is applied. Customers are requested to obtain a new license key from the licensing team.

  • Three license editions have been added. Moving forward, these are the only fully supported licenses:

    • Telco Automation Essentials 

    • Telco Automation Advanced

    • VMware Telco Cloud Automation for Service Management and Orchestration

  • Multiple term licenses can be applied on TCA to increase the license capacity.

  • Workflow Hub is enabled when the Telco Automation Advanced license or the VMware Telco Cloud Automation for Service Management and Orchestration license is applied.

K8S Policy Enhancements

  • A Pod Security Isolation mode is newly added to VIMs, CNFs, and Global Configuration, in addition to the existing Permissive and Restricted modes.

  • Independent configuration enabled for the PSA policy and the K8s policy.

  • Grant policies based on filters on VIM.

Infrastructure Automation and Host Config Operator Enhancements

  • UI enhancements.

  • Security and Resiliency improvements.

  • Support for configuring advanced ESXi settings on cell site hosts.

  • Integration with Workflow Hub.

Telco Cloud Automation Appliance Enhancements

  • TCA 3.1 supports direct migration from TCA 2.3 appliances without the need to first move to TCA 3.0.

  • Updated Migration tool with an interactive CLI and live status to enhance the user experience.

  • Updated upgrade mechanism for upgrading from TCA 3.0 to TCA 3.1.

  • TCA 3.1 requires 275 GB of disk, 80 GB less than TCA 3.0.

  • Support for auto-rotation of TCA Single Node Cluster.

  • Upgraded the TCA Cluster from 1.26.5 to 1.27.2.

  • Support for syslog server as an endpoint over TLS.

Alarms

  • Newly added action to Force Delete Alarms.

  • Automatic purge of stale Alarms.

Important Notes

Infrastructure Automation workflows and BMA workflows in Workflow Hub are supported from the TCA 3.1 release onwards.

Downloading the BYOI Template

Important:

Ensure that you are using the latest ovftool version to upload the templates.

Download Photon BYOI Templates for VMware Tanzu Kubernetes Grid

To download the Photon BYOI templates:

  1. Go to the VMware Customer Connect site at https://customerconnect.vmware.com/.

  2. From the top menu, select Products and Accounts > All Products.

  3. On the All Downloads page, scroll down to VMware Telco Cloud Automation and click Download Product.

  4. On the Download VMware Telco Cloud Automation page, ensure that the version selected is 3.1.

  5. Click the Drivers & Tools tab.

  6. Expand the category VMware Telco Cloud Automation 3.0 Photon BYOI Templates for TKG.

  7. Corresponding to Photon BYOI Templates for VMware Tanzu Kubernetes Grid 2.1.1, 2.2, 2.3.1, 2.4 and 2.5, click Go To Downloads.

  8. On the Download Product page, download the appropriate Photon BYOI template.

Download RAN Optimized BYOI Templates for VMware Tanzu Kubernetes Grid

To download RAN optimized BYOI templates:

  1. Go to the VMware Customer Connect site at https://customerconnect.vmware.com/.

  2. From the top menu, select Products and Accounts > All Products.

  3. On the All Downloads page, scroll down to VMware Telco Cloud Automation and click Download Product.

  4. On the Download VMware Telco Cloud Automation page, ensure that the version selected is 3.1.

  5. Click the Drivers & Tools tab.

  6. Expand the category VMware Telco Cloud Automation 3.0 RAN Optimized BYOI Template for TKG.

  7. Corresponding to RAN Optimized Photon BYOI Templates for VMware Tanzu Kubernetes Grid 2.5, click Go To Downloads.

  8. On the Download Product page, download the appropriate Photon BYOI template.

Download RAN Optimized Single Node Cluster BYOI Templates for VMware Tanzu Kubernetes Grid

To download RAN optimized Single Node Cluster BYOI templates:

  1. Go to the VMware Customer Connect site at https://customerconnect.vmware.com/.

  2. From the top menu, select Products and Accounts > All Products.

  3. On the All Downloads page, scroll down to VMware Telco Cloud Automation and click Download Product.

  4. On the Download VMware Telco Cloud Automation page, ensure that the version selected is 3.1.

  5. Click the Drivers & Tools tab.

  6. Expand the category VMware Telco Cloud Automation 3.0 RAN Optimized BYOI Single Node Cluster template for TKG.

  7. Corresponding to RAN Optimized Single Node Cluster Photon BYOI Templates for VMware Tanzu Kubernetes Grid 2.5, click Go To Downloads.

  8. On the Download Product page, download the appropriate Photon BYOI template.

Discontinued Features

  • Troubleshooting node customization via CCLI (CCLI show nodepolicy/esxinfo/vmconfigset subcommand)

  • VMC

  • Airgap Appliance Management Interface

Resolved Issues

Backup and Restore

  • Issue 3008183: Passphrase content for CN is different from VM based setups.

  • Issue 3308649: The restored schedule does not take effect after backup-restore or migration. No periodic backups are taken even if a schedule is present.

  • Issue 3294837: Failed to back up a PV with the nfs-client backend using the restic plugin in an upgraded workload cluster.

  • Issue 3293997: Static routes are not persisted post backup restore.

  • Issue 3292755: When configuring FTP server settings, 'Use SSH keys' does not generate the key in the required format.

NF Workflow

  • Issue 3251591: The Network Function page shows empty with the error "Error while getting VNF instances. Reason: JSONObject[\"href\"] not found."

CaaS Cluster Automation

  • Issue 3288670: Autoscaler failed to take effect on legacy and standard workload clusters.

CaaS Automation - Addons

  • Issue 3113300: Failed to delete an existing storage class in the backend after removing it by editing the vsphere-csi add-on configuration.

  • Issue 3310972: The Use Reference Configs button is missing from the Prometheus add-on deployment page.

CaaS Automation - Test Automation

  • Issue 3266341: The management cluster's Control Plane nodes were tagged with the nodepool label after scale out or remediation via MHC.

Cluster Operator

  • Issue 3304068: Classy standard cluster updates to variables cannot take effect when cluster pod security is disabled.

Platform

  • Issue 3307853: Uploading new Web cert overrides App Mgmt cert as well.

ZTP Backend

  • Issue 3292455: Post resync operation, host remains in NOT_PROVISIONED state if parent vCenter domain is not resynced.

Orchestrator UI

  • Issue 3288032: Missing node pool name in CNF Instance expanded view.

RBAC

  • Issue 3310411: VIM alarm acknowledgment is not working as expected.

  • Issue 3308632: The owner of a resource cannot perform LCM operations if the matching filter criteria set for the logged-in user are no longer met.

Known Issues

Workflow Hub

  • Issue 3367969: The error message is unclear when the model key is missing in the hosts template for the create-csg-hcp-infrastructure workflow.

    Workaround

    A new state, invalid_payload_model_missing_from_host, is introduced in the prepare-csg-hcp-paylaod.yaml workflow to handle this issue. Refer to https://gitlab.eng.vmware.com/core-build/swf-runtime/-/merge_requests/427/diffs for the code changes.

     - name: invalid_payload_model_missing_from_host
       condition: '${ (all(.candhost.host[]; has("model"))) | not }'
       transition: ERROR_INVALID_PAYLOAD_CANDIDATE_HOST_MODEL_MISSING

     - name: ERROR_INVALID_PAYLOAD_CANDIDATE_HOST_MODEL_MISSING
       type: operation
       actions:
         - functionRef:
             refName: failWorkflow
             arguments:
               errorMessage: '${ .candhost.host | map(select(has("model") | not) | .name) | " FAILED: Input Payload is missing key attribute model for following hosts : " + join(", ")}'
       end: true
  • Issue 3366953: For some workflows that are not created via the Workflow Hub drag-and-drop UI, the edit page could show wrong data.

    Workaround:

    In the Workflow Listing page, click on the workflow to view the source. Copy the source. 

    In the Edit page, click on the code view, and paste the source. You can then make edits on this source and save your workflow. 

  • Issue 3367587: Rarely, a workflow execution might be complete, but the state would still be shown as 'running'. 

    Workaround:

    There is no workaround to update the state, but from the workflow steps tab the executor can identify that the workflow execution is complete. 

  • Issue 3367583: A workflow run might rarely be stuck with no steps.

    Workaround:  

    Edit the workflow and save without making any changes. The stuck run should start after some time. 

  • Issue 3367613: The import/export workflows feature is only supported for workflows that do not use any schemas. 

    On importing a workflow that uses schemas, validations related to the workflow-schema relationship might not work. Workflows using dataInputSchemas will be stuck. 

    Workaround: NA

  • Issue 3367406: On editing a workflow, the workflow ID is shown as the name in the canvas view blocks.

    Workaround: NA

  • Issue 3354213: Excluding specific diagnostic tests from execution during the pre/post-upgrade check procedure requires manual configuration.

    Workaround:

    Edit the caas-upgrade-config workflow and, under the key excludedUpgradeTests, add the list of test cases to exclude for the management and workload clusters.

    Example: To exclude the syslog server test case for both the management and workload clusters, edit and update excludedUpgradeTests in the upgrade config as follows:

    excludedUpgradeTests:
      mgmtCluster:
        - Management Cluster Remote Logger Settings Diagnosis
    
      wkldCluster:
        - Management Cluster Remote Logger Settings Diagnosis
        - Workload Cluster Remote Logger Settings Diagnosis

CaaS

  • Issue 3362048: When TCA is upgraded from 2.3 to 3.1, a PVC with the storageClass set to nfs-client remains in a pending state after creation. The corresponding PV is also not automatically created.

    Workaround:

    This is due to a change in the provisioner of the nfs-client storageClass after the upgrade to TCA 3.1. To resolve this issue, delete the original storage class; the Kapp controller automatically creates the correct storageClass within ten minutes.

    Additionally, if you do not want to wait for the Kapp controller to automatically create a new storageClass (within ten minutes), you can manually create a new one. Ensure that you provide the correct provisioner in the storageClass. See the sketch below.
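
    A minimal, hedged sketch of the deletion step, assuming the storage class keeps the default name nfs-client; verify the name in your environment first:

     kubectl get storageclass
     kubectl delete storageclass nfs-client
     # wait up to ~10 minutes, then confirm that the Kapp controller recreated it with the new provisioner
     kubectl get storageclass nfs-client -o jsonpath='{.provisioner}'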

  • Issue 3336570: When you are using vSphere 8.0U2, if you run velero backup create for backup, the upload CR remains stuck in the In Progress status, and in vCenter, many tasks show the following error: Delete a virtual object snapshot. The operation is not allowed in the current state.

    Workaround:

    Option 1: Use Restic instead to perform the backup.

    Option 2: Manually clean the backup snapshot from the vCenter inventory, as the backup process has actually completed successfully.

  • Issue 3360361: Photon 5 does not recognize the 'noproxy' value if a space-separated list is given.

    Workaround:

    Remove the spaces from the noproxy setting and use a comma-separated list, as illustrated below.
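
    A hedged illustration of the general form; the exact setting name and where it is configured depend on your proxy configuration:

     # space-separated (not recognized by Photon 5):
     #   noproxy: "localhost 127.0.0.1 10.0.0.0/8 .example.com"
     # comma-separated with no spaces:
     noproxy: "localhost,127.0.0.1,10.0.0.0/8,.example.com"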

  • Issue 3366464: While upgrading an IPv6-only workload cluster from 1.24 to 1.25, the new control plane node is stuck bootstrapping due to an etcd certificate SAN extension validation failure.

    1. The new control plane node is always Not Ready, and in the UI it is always in the Provisioning state.

    2. Log in to the new control plane node with the capv account from the TCA-CP root user, and check /var/log/cloud-init-output.log; it reports:

      [2024-03-22 13:52:01] [etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s [2024-03-22 14:24:50] error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available
    3. Check the etcd logs under /var/log/containerd/etcd*.log; they keep reporting warnings about rejected connections, as follows, where the remote-addr value is the endpoint IP of the cluster.

      2024-03-22T07:28:21.689160874Z stderr F {"level":"warn","ts":"2024-03-22T07:28:21.688Z","caller":"embed/config_logging.go:160","msg":"rejected connection","remote-addr":"[3083::701]:39578","server-name":"","ip-addresses":["3083::8f84","127.0.0.1","::1"],"dns-names":["localhost","wc-24-np1-control-plane-nkdg9"],"error":"tls: \"3083::701\" does not match any of DNSNames [\"localhost\" \"wc-24-np1-control-plane-nkdg9\"] (lookup wc-24-np1-control-plane-nkdg9 on [3083::1]:53: no such host)"}

    Workaround:

    1. Log in via SSH to the control plane node that holds the endpoint IP, from the TCA-CP root user, and switch to root mode.

      # ssh capv@<problematic cluster endpoint IP> 
      # sudo su
    2. Add address-label policies to force the OS to select the node IP as the source IP address instead of the endpoint IP.

      In this example, the node network is 3083::/64, the node DHCP IP is 3083::8f84/128, and the endpoint IP is 3083::701/128.

      root [ /home/capv ]# ip addrlabel add prefix 3083::/64 label 20 
      root [ /home/capv ]# ip addrlabel add prefix 3083::8f84/128 label 20 
      root [ /home/capv ]# ip addrlabel add prefix 3083::701/128 label 21 
      root [ /home/capv ]# ip addrlabel list 
      prefix 3083::701/128 label 21 
      prefix 3083::8f84/128 label 20 
      prefix ::1/128 label 0 
      prefix ::/96 label 3 
      prefix ::ffff:0.0.0.0/96 label 4 
      prefix 3083::/64 label 20 
      prefix 2001::/32 label 6 
      prefix 2001:10::/28 label 7 
      prefix 3ffe::/16 label 12 
      prefix 2002::/16 label 2 
      prefix fec0::/10 label 11 
      prefix fc00::/7 label 5 
      prefix ::/0 label 1
    3. Log in to the management cluster from the TCA-CP root user with the capv account and delete the problematic machine.

      ssh capv@<management cluster IP> 
      kubectl get machine -n <workload cluster name> 
      kubectl delete machine -n <workload cluster name> <problematic machine>
    4. Monitor until the problematic node is destroyed and a new one is created; the upgrade then continues.

    5. Log in to the UI to trigger the upgrade again; the cluster shows Provisioned after a while.

  • Issue 3284154: While upgrading a management cluster from Kubernetes version 1.24 to 1.25, the wrong template may be used for control plane and worker node deployment if there is more than one template of the same version in vCenter. The node clone might fail in vCenter if the auto-selected node cannot be accessed by the selected vSphere Cluster. vCenter can report a Clone VM task failure with the message: Cannot connect to host, and cluster creation fails.

    This is a known TKG issue that is fixed in TKG 2.3.1. However, cluster upgrades from 1.24 to 1.25 use TKG 2.2, for which the fix is not applicable.

    Workaround:

    Delete redundant k8s cluster templates from vCenter server.

  • Issue 3365195: The workload cluster is stuck in a moving status when it becomes inaccessible in the middle of the move process or the vCenter certificate is changed.

    CAPI/CAPV does not mark the cluster status as Provisioned if the workload cluster is inaccessible in the middle of the move process, which causes the move failure.

    Workaround:

    1. SSH into source and target management cluster.

    2. Restart the CAPI/CAPV pods with the following kubectl commands:

      kubectl rollout restart deploy/capi-controller-manager -n capi-system

      kubectl rollout restart deploy/capv-controller-manager -n capv-system

    3. The move completes automatically after the cluster is back to normal.

  • Issue 3356408: After upgrading TCA from 3.0 to 3.1, running the pre-upgrade diagnosis prior to upgrading the management cluster add-on fails with the error: unsupported version of the tca-diagnosis-operator Add-On.

    Workaround:

    After TCA is upgraded to 3.1, the pre-upgrade diagnosis cannot be run on management clusters created on TCA 3.0 that are not yet upgraded to the next Kubernetes version (1.27.x). Pre-validation of the cluster before upgrading to 1.27.x can be done by running a generic cluster diagnosis. Do not select any case selector from the Generic Diagnosis tab. This diagnoses the overall cluster health.

  • Issue 3358500: A workload cluster upgrade may sometimes leave a few pods in the Terminating state.

    Workaround:

    1. SSH into workload cluster control plane.

    2. Run the following kubectl commands to remove the node that is in the NotReady,SchedulingDisabled state:

    kubectl get node
    kubectl delete node <node_name>
  • Issue TEAC-17962: The CaaS Operation could encounter an error, failedToLoginMaxUserSessionCountReached, indicating that the maximum session count has been exceeded on vCenter Server.

    Cannot log into vCenter https://10.202.215.1 with provided credentials: {"status":"failure","statusCode":503,"details":"","result":{"type":"com.vmware.vapi.std.errors.service_unavailable","value":{"error_type":"SERVICE_UNAVAILABLE","messages":[{"args":["550","550","[email protected]"],"default_message":"User session count is limited to 550. Existing session count is 550 for user [email protected].","id":"com.vmware.vapi.endpoint.failedToLoginMaxUserSessionCountReached"}]}}}

    By logging in to vCenter Server, it is observed that k8s-csi-useragent and k8s-capv-useragent are holding a large number of sessions, leading to the maximum user session count being reached.

    Workaround:

    Follow the KB article vCenter HTTP Sessions expiring sooner than configured (88668) to restart the vCenter Server, which frees up the sessions by removing long-lived idle sessions.

  • Issue TKG-27607: Forced deletion of a node pool may become stuck in the processing state if the host of the cell site goes down.

    Workaround:

    After the host is removed from the vCenter inventory, the node pool deletion completes successfully without any additional action required from TCA.

  • Issue 3353024: Cluster creation may fail with the below error if there are too many Kubernetes node templates on vCenter.

    The TKG log reports the following error:

    Internal error occurred: failed calling webhook "tkr-vsphere-resolver-webhook.tanzu.vmware.com": failed to call webhook: Post "https://tkr-vsphere-resolver-webhook-service.tkg-system.svc:443/resolve-template?timeout=30s": context deadline exceeded

    Workaround:

    Delete all unused Kubernetes node templates from vCenter Server and retry the cluster creation operation. As another option, follow the TKG KB article to increase the resources of tkr-source-controller: https://kb.vmware.com/s/article/92524.

  • Issue 3354663: The Fluent Bit add-on is in a CrashLoopBackOff state on a dual-stack or IPv6 workload cluster with a TKG version greater than or equal to 2.4.0.

    Workaround:

    Log in to the management cluster control plane and modify the configurations in the secret referenced by the Fluent-bit addon.

    1. Log in to the management cluster control plane.

      ssh capv@<mgmtcluster_controlplane_ip>
    2. Obtain the original configurations saved in the secret.

      kubectl get secret fluent-bit-tca-addon-secret -n <workload_cluster_name> -o jsonpath='{.data.values\.yaml}' | base64 --decode
    3. Under fluent_bit, at the same level as config, add ipv6Primary: true, and remove the HTTP_Listen configuration item from the [Service] section in the config. For example:

      # please note: retain the line breaks at the end
      fluent_bit:
        ipv6Primary: true
        config:
          service: |
            [Service]
                Flush               1
                Log_Level           info
                Daemon              off
                Parsers_File        parsers.conf
                HTTP_Server         On
                HTTP_Port           2020
    4. Encode the modified configurations above with base64 and update the secret.

      kubectl patch secret fluent-bit-tca-addon-secret -n <workload_cluster_name> --type='json' -p='[{"op": "replace", "path": "/data/values.yaml", "value": "<base64_encoded_new_value>"}]'
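
      A small, hedged sketch of the encoding step, assuming the edited configuration from step 3 is saved locally as new-values.yaml (a placeholder file name):

      base64 -w0 new-values.yaml
      # use the resulting string as <base64_encoded_new_value> in the patch command above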
  • Issue 3366288: Management cluster upgrade might fail due to the TCA API default timeout (about 3.5 hours). In this case, if the upgrade task is still running in the backend, the cluster upgrade status may be inconsistent between the UI and the backend.

    Relevant log location:

    From the k8s-bootstrapper pod, you can check the backend status of the management cluster upgrade.

    1. Log in to the k8s-bootstrapper pod from TCA-CP:

      # kubectl exec -it <k8s-bootstrapper-pod-name> -ntca-cp-cn bash
    2. Find the target management cluster ID using the management cluster name:

      # curl http://localhost:8888/api/v1/managementclusters
    3. Check the target management cluster upgrade status with the management cluster ID:

      # curl http://localhost:8888/api/v1/managementcluster/<target-mc-id>/status

    For the current management cluster upgrade operation, the default timeout of the TCA API (about 3.5 hours) does not exactly match the backend, which can cause a cluster status inconsistency between the UI and the backend.

    Workaround:

    Check the management cluster upgrade status from the k8s-bootstrapper pod (refer to the relevant log section above).

    If the cluster upgrade is complete and its status is running, retry the management cluster upgrade from the UI.

    If the cluster upgrade is in progress and the status is upgrading, wait until the upgrade is complete and then retry.

  • Issue 3368498: After a Kubernetes cluster certificate auto-rotation, you cannot create pods on worker nodes that use Calico.

    Upstream issue: https://github.com/projectcalico/calico/issues/7846.

    This issue impacts new application installation and uninstallation.

    Workaround:

    Restart the Calico DaemonSet pods one by one in a non-maintenance window by running the command below:

    kubectl delete pod -n kube-system {$CALICO-NODE_POD}
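
    A hedged sketch for iterating over the pods, assuming the DaemonSet pods carry the usual k8s-app=calico-node label:

    kubectl get pod -n kube-system -l k8s-app=calico-node -o name
    # delete the listed pods one at a time, waiting for each replacement to become Ready before continuing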
  • Issue 3364693: Failed to update the Airgap CA certificate on an original workload cluster in TCA 3.1.0 that was migrated from TCA 2.3.

    After migration, the add-ons of the management cluster and workload cluster are not yet upgraded to the new add-on versions.

    Workaround:

    If the test bed was migrated from TCA 2.3 to TCA 3.1.0, ensure that the admin user has followed the TCA Deployment Guide (Post Migration) to update the SSH key.

    root@tca-cp [ /home/admin ]# cp -rf /root/.ssh/ /root/.ssh.bak
    root@tca-cp [ /home/admin ]# kubectl --kubeconfig /etc/kubernetes/admin.conf get cm override-kbs-app-cr-config -n tca-cp-cn -o jsonpath={.data."data\.yml"} | sed -n '/-----BEGIN OPENSSH PRIVATE KEY-----/,/-----END OPENSSH PRIVATE KEY-----/p'| sed -e 's/^[ \t]*//' > /root/.ssh/id_rsa
    root@tca-cp [ /home/admin ]# kubectl --kubeconfig /etc/kubernetes/admin.conf get cm override-kbs-app-cr-config -n tca-cp-cn -o jsonpath={.data."data\.yml"} | sed -n '/-----END OPENSSH PRIVATE KEY-----/, /managementCluster:/{/-----END OPENSSH PRIVATE KEY-----/!{/managementCluster:/!p}}'| sed -e 's/^[ \t]*//'| awk -v RS= '{$1=$1}1'| sed 's|.*: \(.*\)|\1|' > /root/.ssh/id_rsa.pub
    
    # check if ssh login 
    root@tca-cp [ /home/admin ]# ssh capv@{$CLUSTER_NODE_IP_ADDRESS}

    These steps are for migrated clusters that have not been upgraded yet. If clusters have been upgraded after migration, follow the procedure in the TCA 3.1 Deployment User Guide to update the Airgap certificate of the cluster.

    1. Download the script tarball here. Upload it to the TCA-M appliance and unpack the tarball as the root user.

      root@tca-mgr [ /home/admin ]# tar -zxvf tca3.1.0_update_ca_ext.tar.gz 
      v3.1.0-ext/ 
      v3.1.0-ext/ansible/ 
      v3.1.0-ext/update_ca.py 
      v3.1.0-ext/ansible/update_node_ca.yml
    2. Save the certificate in a file and run the update_ca.py update-cert-db command to update the Airgap certificate in the TCA-M database.

      Note:

      This procedure causes a control plane rolling update. If the management cluster is a classy cluster, it causes a rolling update of the management cluster worker nodes as well.

      root@tca-mgr [ /home/admin/v3.1.0-ext ]# ./update_ca.py update-cert-db --fqdn {$AIRGAP_SERVER_FQDN} --cafile {$CERTIFICATE_FILE_PATH}

      In case of connection errors while running Ansible to update the Airgap certificate on a node, you can run the update command again.

    3. Download the script tarball here. Upload it to the TCA-CP appliance and unpack the tarball as the root user.

      root@tca-cp [ /home/admin ]# tar -zxvf tca3.1.0_update_ca_ext.tar.gz 
      v3.1.0-ext/ 
      v3.1.0-ext/ansible/ 
      v3.1.0-ext/update_ca.py 
      v3.1.0-ext/ansible/update_node_ca.yml
    4. Save the certificate in a file and run the update_ca.py update-mgmtcluster command to update the Airgap trusted root certificate of the specified management cluster.

      root@tca-cp [ /home/admin/v3.1.0-ext ]# ./update_ca.py update-mgmtcluster --cafile {$CERTIFICATE_FILE_PATH} --name {MANAGEMENT_CLUSTER_NAME}
    5. Run the update_ca.py update-workloadcluster command to update the Airgap trusted root certificate of the specified workload cluster.

      root@tca-cp [ /home/admin/v3.1.0-ext ]# ./update_ca.py update-workloadcluster --mc {MANAGEMENT_CLUSTER_NAME} --name {WORKLOAD_CLUSTER_NAME}
  • Issue 3356478: The capv user is locked after three unsuccessful logon attempts within 15 minutes.

    Journal log sample:

    Mar 27 07:15:55 cp-stardard-cluster-1-control-plane-zdfgm sshd[3767202]: pam_faillock(sshd:auth): Consecutive login failures for user capv account temporarily locked

    Per Photon 5 STIG requirement (PHTN-50-000108), the Photon operating system must automatically lock an account until the locked account is released by an administrator when three unsuccessful logon attempts occur in 15 minutes.

    Workaround:

    1. Use key-based authentication to SSH as the capv user from TCA-CP.

    2. Release the locked account:

       # faillock --user capv --reset
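
     Optionally, before resetting, the recorded failures can be listed with the same tool:

       # faillock --user capv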
  • Issue 3298251: Failed to restore a Testnf with a PV attached using Velero. The Testnf needs node customization to make the pod run, and the node customization is triggered by CNF remedy. The CNF remedy should be triggered after the Velero restore is complete, but the Velero restore cannot complete if there is a PV pending to be restored.

    Workaround:

    1. Click Instantiate for the target Testnf on the new target workload cluster (all input parameters should be the same as the previous input used when it was instantiated on the old workload cluster).

    2. Wait for completion and click Terminate.

    3. Wait for the termination to complete.

    4. Restore via the Velero CLI.

    5. Click Remedy on the old instance to remedy it to the new target workload cluster.

  • Issue 3362331: Etcd/kube-apiserver pods exit on the cluster when a control plane node's DHCP IP changes.

    To avoid this issue, follow the TKG guide on Node Networking to make the control plane node addresses static and non-expiring. Otherwise, if a control plane DHCP address changes, apply the workaround.

    Workaround:

    For each cluster control plane node whose node IP has changed:

    1. On the DHCP server, bind the control plane node MAC address to its original IP address.

    2. SSH to the control plane node with the capv account from TCA-CP root mode and run:

      sudo dhcpclient
    3. Check whether the original IP is back by running:

      ifconfig eth0

Airgap Appliance Upgrade

  • Issue 3365805: During airgap upgrades, failures may occur with the error message iso does not exist. This typically happens when the airgap ISO file name is provided exactly as downloaded, for example: TCA_AIRGAP_APPLIANCE-upgrade-bundle-3.1.0-23473011.iso.

    Workaround:

    1. Modify the name of the upgrade ISO file to a simpler text, such as update.iso.

    2. Edit the file /usr/local/airgap/scripts/vars/user-inputs.yml. Locate the field local_iso_path: and modify the value to include the correct path along with the updated ISO image name (see the example after these steps).

    3. Save the changes to the user-inputs.yml file.

    4. Rerun the command agctl upgrade to initiate the upgrade process with the corrected ISO file name.
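
    A hypothetical example of the relevant field in /usr/local/airgap/scripts/vars/user-inputs.yml after the rename; the path is a placeholder:

      local_iso_path: /home/admin/update.iso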

Airgap Appliance Rsync

  • Issue 3367593: Airgap rsync operation may encounter occasional failures when executed multiple times.

    Workaround:

    Run the following commands on the airgap server as the root user:

    1. rm -f /etc/yum.repos.d/*  

    2. cp /usr/local/airgap/backup_repo/* /etc/yum.repos.d/  

    3. agctl rsync

Migration

  • Issue 3367899: If compute cluster domain(s) exist in TCA 2.3.x Infrastructure Automation, migration to TCA 3.0/3.1 is not supported.

    This is a prerequisite for migration. The Compute Clusters functionality in Infrastructure Automation is deprecated, and migration is not supported for compute clusters.

    You need to delete the compute clusters in Infrastructure Automation (TCA Manager Web UI (443) > login > Infrastructure Automation > Domains > Compute Cluster).

    Workaround:

    1. Revert the partially migrated appliances (tcamigctl revert CLI).

    2. In TCA 2.3 UI, delete compute cluster domains (TCA Manager Web UI (443) > login > Infrastructure Automation > Domains > Compute Cluster).

    3. Retry the migration. 

Certificate Management/Appliance Manager

  • Issue 3360387: For a TCA appliance, if an invalid pair of public certificate chain and private key is provided, the secret contains a malformed certificate/key string. Due to this, both ports (443 and 9443) for the TCA appliance will not start.

    Workaround:

    A manual edit of the secret in Kubernetes is needed:

    Run kubectl edit secret ingress-tls-secret -n tca-mgr (or -n tca-cp-cn), and provide the proper key and certificate. See the sketch below.
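
    The following is a hedged sketch of the manual fix, assuming the corrected certificate chain and private key are available as local files (placeholder names) and that the secret stores them base64-encoded, as Kubernetes secrets normally do:

    base64 -w0 corrected-cert-chain.pem    # value for the certificate entry in the secret
    base64 -w0 corrected-private-key.pem   # value for the key entry in the secret
    kubectl edit secret ingress-tls-secret -n tca-mgr   # use -n tca-cp-cn on a TCA-CP appliance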

Certificate Observability/Airgap

  • Issue 3360752: For customers who have upgraded from TCA 2.3/3.0 to TCA 3.1 and are using an Airgap server, the Connected Endpoints UI in TCA Manager may show the Airgap server as "untrusted".

    Workaround:

    There are a couple of workarounds:

    Option 1: Go to the TCA Control Plane Appliance Management (9443) portal > Administration > Certificate > Trusted Certificate, and save the CA certificate of the Airgap server using the file or content option if it is missing from the current trusted CA certificates.

    Option 2: Find a workload cluster deployed using the Airgap server and perform a cluster lifecycle operation, such as a cluster upgrade.

Certificate Observability/Multitenancy

  • Issue 3364575: Multitenancy is not supported for Certificate Observability service.

    Unless a non-default tenant shares the endpoint with the default tenant, or the endpoint is inherited as part of a parent-child relationship, the endpoint is not shown in the view for a default tenant login. For the default tenant login, although an endpoint owned by other (non-default) tenants is not listed in the portal, it may still appear in the Connected Endpoints listing.

    Workaround: NA

Certificate Observability/Appliance Manager

  • Issue 3361603: Upon successful upgrade from TCA 3.0 to 3.1, Log Management endpoint is not automatically registered with the Certificate Observability service.

    Workaround:

    After TCA is successfully upgraded to 3.1, in TCA Appliance Management (9443 portal), edit the Log Management settings and re-add the VMware Aria Operations for Logs (vRLI) details.

Photon

  • Issue 3362102: While creating workload clusters with IPv6, pulling images fails when an HTTPS proxy is used.

    In an IPv6 environment that uses an HTTPS proxy to access the internet, connecting to the HTTPS proxy server fails, and the error Can't use SSL_get_servername is returned when openssl connects to the proxy server.

    Workaround:

    Provide the concatenated ca.crt along with <server>.crt and <server>.key as the proxy certificate, as illustrated below.
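
    A hypothetical sketch of the concatenation, assuming the proxy server certificate, CA certificate, and private key are available as local files with placeholder names:

    cat server.crt ca.crt > proxy-cert.crt
    # supply proxy-cert.crt together with server.key as the proxy certificate and key in the proxy configuration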

Infrastructure Automation

  • Issue 2957277: This issue is seen if you register vCenter on the TCA-CP appliance manager by IP and add the same vCenter in ZTP by FQDN. Only FQDNs must be used when adding vCenters in ZTP, but the appliance manager supports vCenter addition by IP.

    Workaround:

    From TCA 3.0 onwards, Infrastructure Automation mandates the use of FQDN when adding vCenter(s). However, vCenter registration in TCA-CP allows using IP addresses. In such a scenario, a failure is seen when applying a host config profile on a cell site group. To avoid this failure, the recommendation is to use the vCenter FQDN when registering it in the appliance manager.

  • Issue 3357082: Delete Host APIs /ztp/v1/csgs/<csg-id>/hosts/<host-id> and /ztp/v1/clusters/<cluster-id>/hosts/<host-id> do not process the query parameters wipeDisks and forceDelete correctly.

    If the API call /ztp/v1/csgs/<csg-id>/hosts/<host-id>?wipeDisks=false&forceDelete=false is invoked, it is treated as wipeDisks=true and forceDelete=true. In the triggered workflow, wipe disk and force delete are attempted for the host. Wipe Disks causes the datastores on the host to be removed. Force Delete causes the host to be cleaned from the ZTP inventory even if there was a failure during the deletion process.

    The issue occurs when hosts are present in a CSG domain and the host deletion API call is invoked with the query parameters wipeDisks and forceDelete set to false.

    Workaround: 

    For the delete API call, only provide a query parameter if its value is true. For false, do not include the query parameter (see the sketch after this workaround).

    Alternatively, use the POST API for bulk host deletion - /ztp/v1/hosts/deletion with relevant payload.
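
    A hedged sketch of the two call shapes using the API paths listed above; the base URL and identifiers are placeholders, and authentication headers are omitted:

    # omit the parameters entirely when they should be treated as false
    curl -k -X DELETE "https://<tca-manager>/ztp/v1/csgs/<csg-id>/hosts/<host-id>"

    # include a parameter only when it should be true
    curl -k -X DELETE "https://<tca-manager>/ztp/v1/csgs/<csg-id>/hosts/<host-id>?forceDelete=true"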

Embedded Workflows

  • Issue 3366555: Editing the input value of a workflow in a CSAR does not reflect the changes when saving the CSAR or saving the workflow.

    Workaround:

    Clicking Update Package after editing each input value resolves the issue.

    Alternatively, the appropriate input value can be provided at run time.

Techsupport bundle

  • Issue 3370110: Tech support bundle generation for CaaS clusters may fail if run in parallel.

    The support bundle service allows a user to trigger multiple support bundle requests, but KBS allows only one CaaS cluster log collection request at a time. On the support bundle side, this is mitigated by showing a warning to the user that a subsequent request to collect CaaS cluster logs will fail if one is already running.

    Workaround:

    Retry generating tech support bundle for CaaS clusters.

Appliance Manager/Workflow Hub

  • Issue 3370483: Switching the authentication from Active Directory to vCenter for TCA-Manager without providing SSO details in the TCA-Manager appliance management UI (9443 portal) leads to login failures for existing and new logins to the TCA-Manager orchestrator UI. This may also lead to failures of workflows from Workflow Hub.

    When switching the authentication from Active Directory to vCenter for TCA-Manager, allowing the vCenter details to be persisted without adding SSO in the TCA-Manager appliance management UI (9443 portal) leaves the system with both vCenter and Active Directory details, which may confuse the customer about which authentication is currently active.

    Workaround:

    1. Log into the TCA-Manager appliance manager UI (9443 portal). 

    2. Edit the Active Directory settings again and save.

    3. Delete the vCenter details.

Node Customization

  • Issue 3358499: If the CSAR has any SR-IOV network adapters without [igb_uio/vfio_pci] drivers and 'Upgrade Hardware Version' is not selected explicitly, node customization fails with an error.

    The vmconfig status is Failed and the nodeconfig status is Configuring. vmconfig failed: plugin vmReconfigPlugin reconcile failed: reconfigure VM failed with error *types.DeviceUnsupportedForVmVersion.

    This is because TCA does not upgrade the VM hardware version implicitly, and the supported hardware version for SR-IOV is 17.

    Workaround:

    If any of the network adapters require SR-IOV, you should also enable "Upgrade Hardware Version", because virtual machines require a minimum hardware version of 'vmx-17' for this operation to succeed.

CNF LCM

  • Issue 3371402: CNF upgrade retry skips the node customization if the previous node customization failed during CNF upgrade.

    Workaround:

    Roll back the failed upgrade and perform a fresh upgrade instead of clicking Upgrade Retry.

ZTP

  • Issue 3359732: If a host is installed with more than one accelerator tools/drivers such as both vrantools and ibbdtools, using either tool to configure an accelerator device will result in the following error: Config support for this device is not available.

    As a result of this issue, the host will fail to be added into a cell site group.

    Workaround:

    If a host has both vranpf and ibbd drivers installed for Intel Accelerators, then uninstall the ibbd driver before applying host profile by running commands below:

    esxcli software vib remove -n ibbdtools

    esxcli software vib remove -n ibbd-pf

Security Fixes

  • TEAC-16684: Fixed the iptables rules to block internal repository port 5000.

  • TEAC-16020: Syslog over TLS is supported.

  • TEAC-16019: VMware Aria Operations for Logs configuration hardened to not require credentials.

  • PR 3359030: Hardened the API response to exclude sensitive data when rolling back a CNF.

  • PR 3356582: Hardened the VIM Tenants API to exclude the cluster kubeconfig.

  • PR 3346570: Hardened the multitenancy IDPs API to exclude sensitive information.

  • PR 3347711: Hardened the job API to exclude passwords.

  • PR 3365884: Hardened the tech support bundle to exclude sensitive data.

  • TEAC-15414: STIG hardening for the Photon 5 BYOI template.

  • TEAC-15830: Capability to update new Harbor certificates for CaaS components.

  • PR 3347962: Encryption of IDP credentials for multitenancy use cases.

  • PR 3363032: Hardened sensitive information protection in audit logs.
