VMware Telco Cloud Automation 2.1 | 21 JUL 2022 | Build - VM-based: 20142564, HA-based: 20142097
Check for additions and updates to these release notes.
Support for Helm Dependency
VMware Telco Cloud Automation supports Helm dependency for scaling, re-configuring, and terminating a CNF instance.
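For reference, a Helm dependency is declared in the dependencies section of the parent chart's Chart.yaml (Helm 3 format). The sketch below is illustrative only; the chart names and repository URL are placeholders and are not specific to VMware Telco Cloud Automation.
apiVersion: v2
name: parent-cnf            #! placeholder parent chart packaged in the CNF CSAR
version: 1.0.0
dependencies:
  - name: common-library    #! placeholder subchart name
    version: "1.2.0"
    repository: "https://charts.example.com/stable"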
Improved Concurrency and Throughput Capabilities within VMware Telco Cloud Automation
VMware Telco Cloud Automation 2.1 improves concurrency handling across the product. Features such as Infrastructure Automation (Cell Site Hosts), CNFs, and Cluster Automation now support higher concurrency limits for faster and more stable production deployments. For more information on the scale and concurrency limits of VMware Telco Cloud Automation, see https://configmax.vmware.com.
Diagnosis on a CaaS Cluster or a Node Pool
The user can now run diagnosis on a CaaS Cluster or a Node Pool from the VMware Telco Cloud Automation UI, with multiple preset tests to choose from. The user can run these tests as a health check before or after an operation such as an upgrade, or check the health of a specific component such as ETCD or CNI. VMware Telco Cloud Automation provides users with separate diagnosis tests for Management Clusters, Workload Clusters, and Node Pools, enabling a thorough diagnosis throughout the stack.
SSH Access to Appliances
The user can access the SSH terminal of the backing VMware Telco Cloud Automation Control Plane (TCA-CP) appliance through the Virtual Infrastructure tab.
NOTE: This functionality is available for users with System Administrator permissions.
CaaS Infrastructure Upgrade and Redesign
VMware Telco Cloud Automation 2.1 offers improved CaaS operability by upgrading the current CaaS infrastructure API version to v2. With this upgrade, users now have increased control over cluster failures and can view the status of all the components at a granular level, giving them the ability to act on a failure while the cluster creation is in progress. Also, during the cluster creation process, the user can edit or delete a node pool or an add-on at any point in the event of an error.
The v2 upgrade provides the following enhancements:
NOTE: These new features are only available on v2 clusters.
Granular upgrade support - The user can now upgrade VMware Telco Cloud Automation Control Plane and Worker node pools separately.
AKO Operator support for clusters - AVI Load-Balancer Kubernetes Operator Lifecycle Management offers improved network life cycle management.
Support for Tanzu Kubernetes Grid extensions Prometheus and Fluentbit.
The user can now edit Calico and Antrea while the cluster creation is in progress.
Removal of tasks - Events now follow a declarative model.
Stretch Cluster Support on the UI - Stretched cluster on VMware Cloud allows a Service Provider to manage multiple vCenter Kubernetes clusters on the VMware Cloud.
A new user interface that is based on the upgraded v2 APIs.
NOTE: The existing CaaS v1 APIs are still available with VMware Telco Cloud Automation 2.1.
Kubectl Access
Secure Kubectl access enables restricted access to VMware Tanzu Kubernetes Grid through VMware Telco Cloud Automation. This feature gives users the ability to issue one-time passwords and tokens. These tokens, moreover, have expiry dates and system administrators have the ability to revoke them.
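As an illustrative sketch (not the exact TCA workflow), a token issued this way can be passed to standard kubectl flags; the endpoint, CA file, and token values below are placeholders.
# Placeholders: substitute the cluster endpoint, CA certificate, and the token issued from the TCA UI.
kubectl --server=https://<workload-cluster-endpoint>:6443 \
  --certificate-authority=ca.crt \
  --token=<one-time-token> \
  get nodes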
Active Directory Support
Integration with an external Active Directory system for user authentication, in addition to vCenter Server authentication.
VIO Support Extension
VIO support extension allows enhanced platform awareness features to be configured automatically on VMware Telco Cloud Platform through VMware Telco Cloud Automation. Automated VIO configurations and customizations support new high-performance applications and related services (5G, vRAN) that require these capabilities. New configuration options are now available through the VNFD and CSAR.
HA-Based VMware Telco Cloud Automation Improvements
Support for upgrading cloud-native VMware Telco Cloud Automation deployed in an airgapped environment.
Support for deploying cloud-native VMware Telco Cloud Automation with customized cluster sizes.
IPV6 Support
VMware Telco Cloud Automation supports IPv6 for a new deployment in an airgapped environment. The user can use either IPv6 or IPv4, but not a combination of both.
Note:
When using IPv6, the user can register components only with FQDN.
The user must deploy DNS, DHCP, and NTP using IPv6.
The user must deploy vCenter and VMware ESXi server using IPv6.
VMware NSX-T and VMware vRO do not support IPv6. Any feature of VMware Telco Cloud Automation that uses these products cannot work in an IPv6 environment.
At present, IPv6 does not support the following:
Network Slicing
Cloud-native deployment using user interface
Infrastructure Automation
EKS-A Validation Support
You can now instantiate CNFs on EKS-A Kubernetes clusters. This improvement in EKS functionality allows the Service Provider to extend the environments in which their workloads can be managed, supporting lifecycle management of CNFs running on EKS-A.
TKG Management Cluster - Backup and Restore
VMware Telco Cloud Automation 2.1 supports using VM snapshot to back up and restore the entire TKG management cluster nodes on top of the same infrastructure (vCenter, network configurations, and datastores). Partial backup and restore of TKG management cluster nodes is not supported. No persistent volumes are allowed to be added into the TKG management cluster. If added, the VM snapshot-based backup and restore will fail.
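As a quick pre-check before taking the snapshot, assuming kubectl access to the management cluster context, you can confirm that no persistent volumes exist:
# An empty result confirms that no persistent volumes have been added to the management cluster.
kubectl get pv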
Infrastructure Automation
Starting from release 2.3, VMware Telco Cloud Automation will terminate support for creating a central data center, regional data center, and compute cluster using Infrastructure Automation. The feature enters maintenance mode starting from releases 2.1 and 2.2. Post termination, users will have the option to add pre-deployed data centers through Infrastructure Automation in a VM-based deployment.
Starting from 2.2, VMware Telco Cloud Automation deprecates the support for CN bootstrapping using Infrastructure Automation.
Delete Domain Behavior Changed
In VMware Telco Cloud Automation 2.1, the process to delete domains has changed. The current and previous behaviors are as follows:
Previous behavior:
To delete a domain, the user had to remove the domain definition from the domains list, and the back end took care of the deletion.
Below is a sample cloud spec file with two domains, test1 and test2.
{
"domains": [
{
"name": "test1",
...
},
{
"name": "test2",
...
}
],
"settings": {
...
},
"appliances": [
...
],
"images": {
...
}
}
Below is the modified cloud spec to be posted for deleting the domain test1 (remove test1 from the domains list).
{
"domains": [
{
"name": "test2",
...
}
],
"settings": {
...
},
"appliances": [
...
],
"images": {
...
}
}
Current behavior:
To delete domains, the user adds a list of strings (the names of the domains to delete) to the deleteDomains field in the cloud spec. For example, "deleteDomains": ["cdc1", "rdc1"].
It is optional to include or exclude the domain to be deleted in the domains list. Below is an example of the new behavior, where test1 is provided in the deleteDomains list to delete that domain.
{
"domains": [
{
"name": "test1",
...
},
{
"name": "test2",
...
}
],
"settings": {
...
},
"appliances": [
...
],
"images": {
...
},
"deleteDomains": ["test1"]
}
Customize FluentBit Add-on
VMware Telco Cloud Automation 2.1 exposes only a few FluentBit variables in the UI. However, through the YAML content, you can customize additional variables that are not available in the UI. The steps are:
Log on to the Management cluster and download the imgpkg executable binary from https://vmwaresaas.jfrog.io/artifactory/generic-cnf-dev/airgap/imgpkg-linux-amd64-v0.12.0.
Run kubectl get package -A | grep fluent to get the FluentBit package name with version:
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ kubectl get package -A | grep fluent
tanzu-package-repo-global fluent-bit.tanzu.vmware.com.1.7.5+vmware.1-tkg.1 fluent-bit.tanzu.vmware.com 1.7.5+vmware.1-tkg.1 22h43m33s
tanzu-package-repo-global fluent-bit.tanzu.vmware.com.1.7.5+vmware.2-tkg.1 fluent-bit.tanzu.vmware.com 1.7.5+vmware.2-tkg.1 22h43m33s
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$
Select one package name and set the image_url environment variable to point to the FluentBit package image URL with the command image_url=$(kubectl -n tanzu-package-repo-global get packages fluent-bit.tanzu.vmware.com.1.7.5+vmware.2-tkg.1 -o jsonpath='{.spec.template.spec.fetch[0].imgpkgBundle.image}')
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ image_url=$(kubectl -n tanzu-package-repo-global get packages fluent-bit.tanzu.vmware.com.1.7.5+vmware.2-tkg.1 -o jsonpath='{.spec.template.spec.fetch[0].imgpkgBundle.image}')
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ echo $image_url
projects-stg.registry.vmware.com/tkg/packages/standard/fluent-bit@sha256:264bfbefb2430c422cb691637004ed5dbf4a4d0aac0b0cb06ee19a3c81b1779e
Use the imgpkg tool to download the FluentBit bundle and get the default values.yaml file for the FluentBit add-on.
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ imgpkg pull -b $image_url -o /tmp/fluent-bit.tanzu.vmware.com.1.7.5+vmware.2-tkg.1
Pulling bundle 'projects-stg.registry.vmware.com/tkg/packages/standard/fluent-bit@sha256:264bfbefb2430c422cb691637004ed5dbf4a4d0aac0b0cb06ee19a3c81b1779e'
Extracting layer 'sha256:9117de69d77240fa52e38e1a434bcda69b9185f5fafd2a85eedbd06c06beb57c' (1/1)
Locating image lock file images...
One or more images not found in bundle repo; skipping lock file update
Succeeded
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ cp /tmp/fluent-bit.tanzu.vmware.com.1.7.5+vmware.2-tkg.1/config/values.yaml ./fluent-bit-data-values.yaml
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ ls fluent-bit-data-values.yaml
fluent-bit-data-values.yaml
To install the FluentBit add-on with the customized configuration, customize the default values.yaml and copy the new YAML content (from the line fluent_bit: onwards) to the VMware Telco Cloud Automation UI.
#@data/values
#@overlay/match-child-defaults missing_ok=True
---
namespace: "tanzu-system-logging"
#! Required params for supported output plugins
fluent_bit: --------------- from this line onwards
config:
#! https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/variables
service: |
[Service]
Flush 1
Log_Level info
Daemon off...
Customize Prometheus add-on
VMware Telco Cloud Automation 2.1 exposes only a few Prometheus variables in the UI. However, through the YAML content, you can customize additional variables that are not available in the UI. The steps are:
Log on to the management cluster and download the imgpkg executable binary from https://vmwaresaas.jfrog.io/artifactory/generic-cnf-dev/airgap/imgpkg-linux-amd64-v0.12.0.
Run kubectl get package -A | grep prometheus to get the Prometheus package name with version:
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ kubectl get package -A | grep prometheus
tanzu-package-repo-global prometheus.tanzu.vmware.com.2.27.0+vmware.1-tkg.1 prometheus.tanzu.vmware.com 2.27.0+vmware.1-tkg.1 22h23m13s
tanzu-package-repo-global prometheus.tanzu.vmware.com.2.27.0+vmware.2-tkg.1 prometheus.tanzu.vmware.com 2.27.0+vmware.2-tkg.1 22h23m13s
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$
Select one package name and set the image_url environment variable to point to the Prometheus package image URL with the command image_url=$(kubectl -n tanzu-package-repo-global get packages prometheus.tanzu.vmware.com.2.27.0+vmware.1-tkg.1 -o jsonpath='{.spec.template.spec.fetch[0].imgpkgBundle.image}').
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ image_url=$(kubectl -n tanzu-package-repo-global get packages prometheus.tanzu.vmware.com.2.27.0+vmware.1-tkg.1 -o jsonpath='{.spec.template.spec.fetch[0].imgpkgBundle.image}')
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ echo $image_url
projects-stg.registry.vmware.com/tkg/packages/standard/prometheus@sha256:27af034c1c77bcae4e1f7f6d3883286e34419ea2e88222642af17393cd34e46a
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$
Use the imgpkg tool to download the Prometheus bundle and get the default values.yaml file for the Prometheus add-on.
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ imgpkg pull -b $image_url -o /tmp/prometheus-package-2.27.0+vmware.1-tkg.1
Pulling bundle 'projects-stg.registry.vmware.com/tkg/packages/standard/prometheus@sha256:27af034c1c77bcae4e1f7f6d3883286e34419ea2e88222642af17393cd34e46a'
Extracting layer 'sha256:44798ebd112b55ea792f0198cf220a7eaed37c2abc531d6cf8efe89aadc8bff2' (1/1)
Locating image lock file images...
One or more images not found in bundle repo; skipping lock file update
Succeeded
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ cp /tmp/prometheus-package-2.27.0+vmware.1-tkg.1/config/values.yaml prometheus-data-values.yaml
capv@mgmt-sb-sbm-control-plane-5wqtn [ ~ ]$ ls prometheus-data-values.yaml
prometheus-data-values.yaml
To install the Prometheus add-on with the customized configuration, customize the default values.yaml and copy the new YAML content (from the line prometheus: onwards) to the VMware Telco Cloud Automation UI.
#@data/values
---
#! The namespace in which to deploy prometheus.
namespace: tanzu-system-monitoring
prometheus: --- from this line onwards
deployment:
#! Number of prometheus servers
replicas: 1
#! Replaces the prometheus-server container args with the ones provided below (i.e. Does not append).
containers:
args: []
resources: {}
podAnnotations: {}
podLabels: {}
configmapReload:
containers:
args: []
resources: {}
#! Prometheus service configuration
service:
type: "ClusterIP"
port: 80
targetPort: 9090
labels: {}
annotations: {}
pvc:
annotations: {}
storageClassName: null
accessMode: ReadWriteOnce
storage: "150Gi"
#! The prometheus configuration
config:
prometheus_yml: |
global:
evaluation_interval: 1m
scrape_interval: 1m
scrape_timeout: 10s
rule_files:
- /etc/config/alerting_rules.yml
...
Note:
The Prometheus add-on depends on the CSI add-on. Make sure that the CSI add-on is provisioned normally before you install the Prometheus add-on.
The VMware Telco Cloud Automation vsphere-csi add-on supports creating only ReadWriteOnce PVCs by default. If you want to use the ReadWriteMany or ReadOnlyMany accessMode for the Prometheus PVC, you must use the VMware Telco Cloud Automation nfs-client add-on or another CSI to make sure that the PVC can be created successfully.
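For illustration, the pvc section of prometheus-data-values.yaml could be adjusted as follows. The storage class name is an assumption; use the class actually provided by the nfs-client add-on (or your CSI) in your environment.
pvc:
  annotations: {}
  #! assumed storage class name exposed by the nfs-client add-on; verify with kubectl get storageclass
  storageClassName: "nfs-client"
  accessMode: ReadWriteMany
  storage: "150Gi"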
AVI objects (aviinfrasetting, gatewayclass, gateway)
VMware Telco Cloud Automation 2.1 does not support creation of AVI objects (aviinfrasetting, gatewayclass, gateway CRs) on Workload clusters. You can still leverage the NF onboarding ability for a customized PaaS configuration until a generic cluster policy is available.
Appliance Size Limit Updates and One-Off Upgrade for 2.1
VMware Telco Cloud Automation 2.1 has updated the VM Appliance Size limits. CPU, Memory, and HDD are all increased from the default limits that were present until VMware Telco Cloud Automation 2.0. The limits are increased for improved scale performance numbers, new services, and better support.
All newly deployed VMware Telco Cloud Automation 2.1 appliances (TCA-M, TCA-CP and TCA-Bootstrapper) have the new limits.
Upgrading from VMware Telco Cloud Automation 2.0 to 2.1 comes with an additional step to ensure that the new limits are applied correctly on the new appliances.
For details on upgrade, see the VMware Telco Cloud Automation User Guide.
VM-Based New Resource Requirements
The VM-based VMware Telco Cloud Automation deployments have additional resource requirement options. For more information, see the System Requirements > VM-Based Deployment section of the VMware Telco Cloud Automation 2.1 Deployment Guide.
Infrastructure Automation Cloud-Native Upgrade - The release-specific TKG image URL must be manually updated in the Images section of Infrastructure Automation.
Upload the TKG image URL manually in the Images section of Infrastructure Automation.
CSG creation on a Predeployed domain fails with the error message root password cannot be empty for VC in domain cdc.
On a freshly installed VMware Telco Cloud Automation, or one that has no cloud spec defined so far, if you upload a valid cloud spec without appliance IDs, the cloud spec file is accepted. However, subsequent cloud spec change or upload operations always require appliance passwords because there are no appliance IDs.
From the API, you can upload passwords each time. From the UI, however, the passwords are not included on any change-and-save operation, and VMware Telco Cloud Automation displays the error message root/admin/audit password needed for {appliance} for {domain}.
Re-uploading the same cloud spec generates the appliance IDs. To use the latest cloud spec with appliance IDs, from the Domain section, click Refresh. After completing these steps, you can perform the operations normally.
Infrastructure Automation expects a password with a minimum of 13 characters for vCenter on the Global Appliance Configuration page. This can be a problem for pre-deployed setups where the vCenter password has fewer than 13 characters.
On the Global Appliance Configuration page, provide a dummy 13-character password and save the configuration.
On the domain level, go to the Appliance Override section and change the vCenter password as required.
Infrastructure Automation requires the Single Sign-On (SSO) credential of the VMware Telco Cloud Automation Manager; without it, the host profile application fails.
The current implementation fetches this information from the Central Data Center (CDC), which is configured with the vCenter account used in the VMware Telco Cloud Automation Manager SSO configuration. In deployment topologies where the user does not provision a Central Data Center but directly deploys a Regional Data Center (RDC), Infrastructure Automation fails to apply the Host Profile.
Create a Central Data Center by importing the vCenter and the account used to configure single sign-on on the VMware Telco Cloud Automation Manager.
When adding a pre-deployed host to a pre-deployed Cell Site Group domain by its IP Address, the host gets provisioned, but the Hostconfig Profile fails to apply.
Add the ESXi host to the vCenter Server using FQDN only. Then add the pre-deployed host to Infrastructure Automation using FQDN.
Deleting multiple ESXi hosts at once is not recommended due to vSAN limitations. This is specifically for non-Cell Site hosts.
Delete each ESXi host individually and let the deletion process complete before deleting the next ESXi host.
If the logon username is more than 20 characters, the authentication itself succeeds but the group retrieval for the user fails, causing the login to VMware Telco Cloud Automation to fail.
VMware Telco Cloud Automation does not support Active Directory users with usernames longer than 20 characters.
Ensure that usernames are less than 20 characters in length.
New Active Directory users with the option to 'Change Password on next logon' cannot log in to VMware Telco Cloud Automation.
Set the user password in Active Directory before logging in to VMware Telco Cloud Automation.
Unable to obtain Node Pool during CNF re-instantiation on CaaS V2 clusters.
None
On v2 clusters, if node pools and TBRs are not upgraded, VMware Telco Cloud Automation displays an error icon instead of an info icon.
VMware Telco Cloud Automation displays the following as an error: Current TBR is not compatible. Upgrade to latest TBR version. You can perform Edit Nodepool/Cluster and update the TBR reference. This message is displayed under the following scenarios:
After transforming the imported clusters from v1 to v2.
After upgrading the control plane in v2 clusters (on node pools).
Cause: The availability of latest TBRs causes the existing TBRs to be marked as not compatible for cluster control plane and node pools.
Purpose: The purpose of the error message is to indicate to users that newer versions of TBRs are available.
The error message provides information that an upgrade is available. Users need not perform any action as long as the cluster is within the supported Kubernetes version.
Diagnosis on Management/Workload cluster fails.
Diagnosis on Management/Workload cluster fails with the error message diagnosis on non-running management cluster cdc-mgmt-cluster-v1-22-9 is not supported. Status:
When running diagnostics, VMware Telco Cloud Automation may display an error that the Management cluster is not running even when the cluster is running.
Log in to the Appliance Management portal for VMware Telco Cloud Automation Control Plane.
Click Appliance Summary tab.
Under Telco Cloud Automation Services, restart the TCA Diagnosis API Service.
Unable to add Harbor as add-on.
This issue can be reproduced by performing the following steps:
Create or update the Harbor add-on of a cluster.
Add or update Harbor on the same cluster through Partner Systems.
Use the following workaround options:
Option 1: Add, update, or remove Harbor associations only through the Partner Systems tab.
Option 2: Perform add, update, or remove operations only on the Harbor add-on.
Collect Tech-Support Bundle UI issue.
After deleting a transformed cluster, the Collect Tech-Support bundle UI still lists the deleted cluster names.
None.
After deleting a transformed cluster, creating a v1 cluster using the same name or IP address fails.
Transform a Workload cluster abc with endpoint IP 10.10.10.10. After transforming the cluster, delete the Workload cluster abc using the v2 Delete API option. Now, when you create a cluster using v1 API with the name abc or with endpoint IP 10.10.10.10, it fails.
None.
On v2 clusters, the Delete Node Pool operation hangs if the v2 cluster contains a label with a dot in it.
If the Delete Node Pool operation has not been initiated, follow these steps:
Edit the node pool and delete the label that contains a dot within it.
Delete the node pool.
If the Delete Node Pool operation has been initiated and the delete operation hangs, follow the steps from KB <<link to be updated>>.
Workload Cluster upgrade fails on a VM-based VMware Telco Cloud Automation environment.
To fix this issue, log in to either the bootstrapper VM or the Telco Cloud Automation Control Plane and remove duplicate entries or fix misaligned lines in the /root/.config/tanzu/config.yaml file.
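A quick way to locate duplicate entries before editing (a sketch using the file path referenced elsewhere in these notes):
# Duplicate 'kind: ClientConfig' lines indicate the section that must be cleaned up.
grep -n 'kind: ClientConfig' /root/.config/tanzu/config.yaml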
Prometheus/Alertmanager has issues when deployed on TKG.
Within the Prometheus add-on, the alertmanager pod might be in a CrashLoopBackOff state in vCenter 70u2 and vCenter 70u3 deployments.
Provide the cluster.advertise-address in the alertmanager deployment YAML.
containers:
  - name: prometheus-alertmanager
    image: "prom/alertmanager:v0.20.0"
    imagePullPolicy: "IfNotPresent"
    args:
      - --config.file=/etc/config/alertmanager.yml
      - --storage.path=/data
      - --cluster.advertise-address=127.0.0.1:9093
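After applying the change, you can confirm that the pod recovers; the namespace below assumes the default tanzu-system-monitoring namespace used by the Prometheus add-on values.
# The alertmanager pod should move from CrashLoopBackOff to Running.
kubectl get pods -n tanzu-system-monitoring | grep alertmanager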
Updating the AKO configuration with AKODeploymentConfig does not take effect.
Updating load-balance-and-ingress-service (AKO) add-on configuration is not supported.
Uninstall and re-install the load-balance-and-ingress-service (AKO) add-on with the new configuration.
Fluent-bit pod remains in the CrashLoopBackoff state on Worker nodes where the cpu-manager-policy is set to static.
Change the cpu-manager-policy to none.
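One way to confirm the active policy on a worker node, assuming SSH access to the node, is to inspect the kubelet CPU manager state file:
# After the change, policyName should report "none" instead of "static".
sudo cat /var/lib/kubelet/cpu_manager_state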
vsphere-csi-node is in a CrashLoopBackOff state in the TKG 1.5.3 RC1 build for a Workload cluster spanning vCenters.
vsphere-csi is not supported in scenarios where the Cluster is deployed across multiple vCenters.
None.
Workload Cluster upgrade fails on a cloud-native environment.
Cluster Upgrade fails intermittently with the following error message:
unknown command "login" for "tanzu"
Did you mean this?
plugin
: Error: unknown command "login" for "tanzu"
Did you mean this?
plugin
Run 'tanzu --help' for usage.
Log in to the VMware Telco Cloud Automation Control Plane SSH session using the admin user and then switch to the root user.
Edit /root/.config/tanzu/config.yaml.
kind: ClientConfig
metadata:
creationTimestamp: null
network-separation-beta: "false"
standalone-cluster-mode: "false"
current: cdc-mgmt-cluster
kind: ClientConfig
metadata:
creationTimestamp: null
Remove the duplicate line for kind: ClientConfig.
Also fix the indentation for network-separation-beta and standalone-cluster-mode.
The nodeconfig-operator add-on stays in Configuring and its AddonDeployReady condition remains False.
The cluster connection is not ready for a while, and the nodeconfig-operator add-on stays in the Configuring phase.
Restart the tca-kubecluster-operator pod.
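A sketch of restarting the pod with kubectl; the tca-system namespace is an assumption based on the namespace shown elsewhere in these notes, so adjust it if the pod runs in a different namespace.
# Locate the pod, then delete it so that its controller recreates it.
kubectl get pods -n tca-system | grep tca-kubecluster-operator
kubectl delete pod <tca-kubecluster-operator-pod-name> -n tca-system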
Node policy is still configuring after 30 minutes in CaaS kubecluster operator pipeline.
A cluster node gets stuck during boot-up with no IP Address assigned to it, which results in the node being in a NotReady status.
Reset the node Virtual Machine through the vCenter client.
Scaling out the Control Plane nodes of a Workload cluster from 1 to 3 fails.
Retry the scale out operation on Control Plane nodes.
Workload Cluster upgrade fails from Kubernetes version v1.20.8 to v1.22.9.
It is strongly recommended to create or ensure that the Management Clusters have 3 or more Control Plane nodes before upgrading VMware Telco Cloud Automation to 2.1.
Case 1: Upgrade fails and reports an error message "timeout: poll control plane ready for removing SCTPSupport=true".
Workaround:
Retry upgrade on the user interface.
If retry fails, check the CAPI log to find logs related to the new Control Plane machine and status.
The new Control Plane machine status is Running. The new Control Plane node cloud-init log contains "could not find a JWS signature in the cluster-info ConfigMap for token ID".
Contact tech-support.
# Confirm the new Control Plane
[root@10 /home/admin]# kubectl get machine -n cdc-work-cluster1-v1-21-2
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
cdc-work-cluster1-v1-21-2-work-master-control-plane-29xgb cdc-work-cluster1-v1-21-2 cdc-work-cluster1-v1-21-2-work-master-control-plane-29xgb vsphere://423b959d-59aa-47d9-56cf-d657b165ab0a Running 2d14h v1.21.2+vmware.1
cdc-work-cluster1-v1-21-2-work-master-control-plane-9ptnm cdc-work-cluster1-v1-21-2 cdc-work-cluster1-v1-21-2-work-master-control-plane-9ptnm vsphere://423b1c01-4b7f-c27d-6458-fb6a4e3c3984 Running 2d10h v1.21.2+vmware.1
cdc-work-cluster1-v1-21-2-work-worker-np1-575bc4c56-h89d6 cdc-work-cluster1-v1-21-2 cdc-work-cluster1-v1-21-2-work-worker-np1-575bc4c56-h89d6 vsphere://423b7444-e75f-066a-eb67-d9631c939e3d Running 2d14h v1.21.2+vmware.1
cdc-work-cluster1-v1-21-2-work-worker-np10-65b5c46bd6-2kcqm cdc-work-cluster1-v1-21-2 cdc-work-cluster1-v1-21-2-work-worker-np10-65b5c46bd6-2kcqm vsphere://423b993d-dbda-2dbd-0943-d0800ac6a0aa Running 2d14h v1.21.2+vmware.1
# Look for the IP address of the new Control Plane node
[root@10 /home/admin]# kubectl get machine -n cdc-work-cluster1-v1-21-2 cdc-work-cluster1-v1-21-2-work-master-control-plane-9ptnm -oyaml
# log in through SSH
[root@10 /home/admin]# ssh capv@<new-control-plane-node-IP>
capv@cdc-work-cluster1-v1-21-2-work-master-control-plane-9ptnm [ ~ ]$ sudo su
root [ /home/capv ]# cat /var/log/cloud-init-output.log
# If you see "could not find a JWS signature in the cluster-info ConfigMap for token ID" message in the cloud-init log, and if kube-vip does not float on this faulty Control Plane, you can delete the faulty Control Plane.
***If kube-vip floats on the faulty Control Plane, contact TKG team to debug and provide a workaround.
# Log in to TCA-CP and switch to root user
[root@10 /home/admin]# kubectl edit deployment -n capi-kubeadm-bootstrap-system capi-kubeadm-bootstrap-controller-manager
# Update bootstrap-token-ttl=15m to bootstrap-token-ttl=25m. This will cause capi-kubeadm-bootstrap-system pod to restart
# Delete faulty Control Plane machine
kubectl delete machine -n cdc-work-cluster1-v1-21-2 cdc-work-cluster1-v1-21-2-work-master-control-plane-9ptnm
# After the faulty Control Plane machine is deleted, the new Control Plane is created automatically. After the faulty Control Plane is deleted and the new Control Plane is running, retry upgrade on the user interface.
# If the New Control Plane node cloud-init still reports error "could not find a JWS signature ", increase the value of bootstrap-token-ttl.
Case 2: Error: could not find server "cdc-mgmt-cluster".
Workaround:
Navigate to /root/.config/tanzu/config.yaml and confirm if the Management cluster context exists. If not, contact TKG tech-support to debug.
After finding the root cause, perform the following workaround.
Old Value:
[root@10 /home/admin]# cat /root/.config/tanzu/config.yaml
apiVersion: config.tanzu.vmware.com/v1alpha1
clientOptions:
  cli:
    discoverySources:
    - oci:
        image: projects.registry.vmware.com/tkg/packages/standalone/standalone-plugins:v0.11.6-1-g90440e2b_vmware.1
        name: default
    edition: tkg
  features:
    cluster:
      custom-nameservers: "false"
      dual-stack-ipv4-primary: "false"
      dual-stack-ipv6-primary: "false"
    global:
      context-aware-cli-for-plugins: "true"
    management-cluster:
      custom-nameservers: "false"
      dual-stack-ipv4-primary: "false"
      dual-stack-ipv6-primary: "false"
      export-from-confirm: "true"
      import: "false"
      network-separation-beta: "false"
      standalone-cluster-mode: "false"
kind: ClientConfig
metadata:
  creationTimestamp: null
New Value:
[root@10 /home/admin]# cat /root/.config/tanzu/config.yaml
apiVersion: config.tanzu.vmware.com/v1alpha1
clientOptions:
  cli:
    discoverySources:
    - oci:
        image: projects.registry.vmware.com/tkg/packages/standalone/standalone-plugins:v0.11.6-1-g90440e2b_vmware.1
        name: default
    edition: tkg
  features:
    cluster:
      custom-nameservers: "false"
      dual-stack-ipv4-primary: "false"
      dual-stack-ipv6-primary: "false"
    global:
      context-aware-cli-for-plugins: "true"
    management-cluster:
      custom-nameservers: "false"
      dual-stack-ipv4-primary: "false"
      dual-stack-ipv6-primary: "false"
      export-from-confirm: "true"
      import: "false"
      network-separation-beta: "false"
      standalone-cluster-mode: "false"
current: cdc-mgmt-cluster
kind: ClientConfig
metadata:
  creationTimestamp: null
servers:
- managementClusterOpts:
    context: cdc-mgmt-cluster-admin@cdc-mgmt-cluster
    path: /root/.kube-tkg/config
  name: cdc-mgmt-cluster
  type: managementcluster
Verify if the information is correct:
[root@10 /home/admin]# tanzu login --server cdc-mgmt-cluster
✔ successfully logged in to management cluster using the kubeconfig cdc-mgmt-cluster
Checking for required plugins...
All required plugins are already installed and up-to-date
[root@10 /home/admin]#
Transforming Management and Workload cluster results in the Node Policy status temporarily going to a Configuring state.
The status automatically returns to a normal state after a few minutes.
None.
Failed to upgrade Management Cluster: One control plane node displays the status as SchedulingDisabled.
Pods on an existing Control Plane node cannot be terminated because the cgroup path cannot be removed. This is similar to the upstream Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/97497.
Restart the affected Control Plane node. After this, the upgrade continues automatically in the backend.
Retry the upgrade operation for the Cluster from VMware Telco Cloud Automation UI.
New vCenterSub configurations cannot be added to existing stretched Workload Clusters that already have other vCenterSub configurations.
None.
New Cluster deployments might become slow if there are other clusters with many Node Pools.
None.
Updating ako-operator add-on configuration is not supported. For example: Avi Controller credentials, certificates, and so on.
Uninstall and re-install the ako-operator add-on with the new configuration.
Objects on Avi Controller are not deleted automatically when the load-balance-and-ingress-service add-on is uninstalled from the Workload cluster.
Delete the objects from the Avi Controller UI directly.
Cluster LCM operations might fail due to a root cause where the api-server of the corresponding Management Cluster is restarted.
This happens because the Control Plane nodes run on low-performance hosts, which results in some pods in CrashLoopBackOff with error messages such as no route to host or leader election lost.
Restart the pods within the Management Cluster that are in a CrashLoopBackOff state.
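For example, against the Management Cluster context (pod and namespace names are placeholders):
# List pods stuck in CrashLoopBackOff, then delete them so that their controllers recreate them.
kubectl get pods -A | grep CrashLoopBackOff
kubectl delete pod <pod-name> -n <namespace>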
Management Cluster IPv6 endpoint IP cannot be accessed at times.
When a user logs in to the Control Plane VM console and then runs ip addr, it displays eth0 with dadfailed tentative.
[root@tca /home/admin]# kubectl config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
* lcm-mc-admin@lcm-mc lcm-mc lcm-mc-admin
minikube minikube minikube default
[root@tca /home/admin]# kubectl get pods -n tca-system
kubectl get nodepolicy -A
^C
[root@tca /home/admin]# kubectl get nodes
^C
[root@tca /home/admin]# kubectl config use-context lcm-mc-admin@lcm-mc
Switched to context "lcm-mc-admin@lcm-mc".
[root@tca /home/admin]# kubectl get nodes
Unable to connect to the server: dial tcp [2013:930::700]:6443: i/o timeout
Reboot or restart the Cluster Control Plane nodes from the vCenter client.
Deployment of IPv6-based Workload Clusters with Kubernetes versions prior to 1.22.x results in known TKG limitations where the VSPHERE_CONTROL_PLANE_ENDPOINT IP Address is assigned to the node and host network pods.
This can further result in potential IP Address conflicts when a Workload Cluster has more than one Control Plane node.
It is recommended that IPv6-based Workload Clusters are deployed with Kubernetes version 1.22.x (or later).
Node Pool Scale Out operation does not work on Workload Clusters (v2) that do not have their Control Plane and Node Pools upgraded to 1.22.x.
Upgrade the Control Plane and Node Pool to 1.22.x and then perform the Scale Out operation.
The vsphere-csi daemonset does not always load the latest configuration after a restart, due to which the nodes are not labeled with multi-zone information (zone and region information). In addition, the topologyKeys are not populated either.
Log in to the Workload Cluster Control Plane node.
Restart the vsphere-csi-node daemonset manually after applying the csi params from VMware Telco Cloud Automation and verifying that the operation is complete from VMware Telco Cloud Automation.
Run the following command to restart the vsphere-csi daemonset: kubectl rollout restart ds vsphere-csi-node -n kube-system
Verify the node labels after 2 or more minutes.
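A sketch of the verification; the exact label keys depend on the CSI driver and topology configuration in your environment.
# Check for zone/region labels on the nodes and for populated topologyKeys on the CSINode objects.
kubectl get nodes --show-labels | grep -i -E 'zone|region'
kubectl get csinodes -o yaml | grep -A 3 topologyKeys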
Unable to add VMware Cloud Director VIM.
VIM registration is not working in a cloud-native VMware Telco Cloud Automation deployed with VMware Cloud Director.
This issue will be addressed in an upcoming patch release of VMware Telco Cloud Automation.
Configuring Active Directory with TLS in Cloud-Native VMware Telco Cloud Automation environment fails.
HA-based VMware Telco Cloud Automation does not support configuring Active Directory with TLS.
None.
Cannot switch from Active Directory to vCenter authentication for the Appliance Manager UI if Cloud-Native VMware Telco Cloud Automation is deployed with Active Directory.
You cannot switch the authentication model from Active Directory to vCenter authentication for HA-based VMware Telco Cloud Automation.
None.
Post Control Plane BOM Upgrade, the Edit Node Pool operation lists Control Plane Kubernetes version templates instead of Node Pool Kubernetes version templates.
Re-select the VMware Telco Cloud Automation BOM release version to the existing version. The VM templates will now be listed correctly.
In the CSAR Designer, Add Network Adapter with device type vmxnet3 does not show Resource Name.
Design a CSAR with targetDriver set to [igb_uio/vfio_pci] and resourceName.
Onboard the CSAR.
Download the CSAR and delete it from VMware Telco Cloud Automation.
Edit the downloaded CSAR in a text editor:
Remove 'targetDriver'.
Add 'interfaceName' [the interface name inside the Guest OS for this adapter]. For example:
network:
  devices:
    - deviceType: vmxnet3
      networkName: network1
      resourceName: res2
      interfaceName: net1
      count: 1
      isSharedAcrossNuma: false
Upload the edited CSAR file to VMware Telco Cloud Automation.
To edit a CSAR file:
Unzip the CSAR.
Edit the Definitions/NFD.yaml with the changes listed in the steps above.
Run zip -r <new_name>.csar TOSCA-Metadata/ Definitions/ Artifacts/ NFD.mf
vmconfig-operator supports access to vCenter Server with port 443 only.
vmconfig-operator supports Clusters deployed within vCenter Server environments that run on port 443 only. vCenter Server environments with custom ports are not supported.
None.
Open Terminal fails if TCA-M logged-in user has a different domain
Open Terminal fails if TCA-M logged-in user has a domain different from the one configured within TCA-M Appliance Management.
Workaround:
Use the same domain name for the primary user configured within TCA-M Appliance Management.
The primary user must have vCenter admin access to read users and groups.
The kubectl command overwrites from the beginning of the line after a few characters on the remote SSH terminal.
Resize the SSH terminal window before using lengthy commands on the console.
In the vim editor, open terminal contents display repeated characters at the end.
The terminal opened through VMware Telco Cloud Automation displays repeated characters at the end within the vim editor.
Within vim, run the following command: :set nottyfast
For details, see the vim documentation.
Network slicing is not supported for Active Directory Authentication-enabled environments in greenfield and brownfield scenarios.
None.
Adding a custom port Harbor to v2 clusters causes the Harbor add-on to freeze.
Add the custom port Harbor to Partner Systems first and then register the same to the cluster as an extension.
Kernel Arguments applied to Node pool during NF instantiation are not visible on the UI.
Cluster Automation CaaS UI does not render the list of Kernel arguments applied to Workload Clusters → Node Pools as part of Node Customizations.
The customizations are applied and are visible through the APIs.
Terminating and then instantiating an existing CNF on a transformed cluster fails.
After transforming a cluster from v1 to v2, terminating and reinstantiating the network function displays the error Unable to find node pool: np1 for cluster.
Reselect the cloud and the node pool during CNF reinstantiation.
CSAR deployment fails for 38+ Helm charts at a time through VMware Telco Cloud Automation.
VMware Telco Cloud Automation does not support more than 25 values.yaml files in a single instantiation.
Split the CSAR into two, with at most 25 values files in each.
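As a rough pre-check, assuming the charts' values files follow a values*.yaml naming convention inside the CSAR (file names vary between CSARs):
# Counts candidate values files in the archive; keep the count at 25 or fewer per CSAR.
unzip -l network-function.csar | grep -c -i 'values.*\.yaml'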
After backing up and restoring from one VMware Telco Cloud Automation appliance to another, operations on the CaaS clusters fail.
See the VMware KB article https://kb.vmware.com/s/article/89083.
When configuring the FTP server, the Backup Directory Path must be a relative path.
The tooltip suggests providing an absolute path, whereas the user must provide a relative path.
None.
Configuring SFTP server with a passphrase throws an error.
When configuring the SFTP server with a passphrase for Backup and Restore, the following error is displayed:
“java.io.FileNotFoundException: /common/appliance-management/backup-restore/id_rsa_backup_restore.pub (No such file or directory)”, “Failed to configure FTP server!"
None.
A domain can be provisioned without any global settings present, but provisioning does not proceed further and the status is not conveyed in the UI.
Not providing passwords for appliance overrides during domain modification succeeds, but provisioning later fails.
The Apply Hostconfig Profile task for firmware upgrade fails with the error "503 Service Unavailable".
When a pre-deployed domain is added without using appliance overrides, the state of the domain should not become Provisioned.
CDC deployment requires the user to manually upload the certificate for provisioning to start.
Must allow get/post/put/patch/delete API calls to VMware Telco Cloud Automation/Control Plane for clusters after Infrastructure Automation is migrated successfully.
Multiple issues post backup and restore.
No mongo dump in Techsupport bundle for VMware Telco Cloud Automation Control Plane only deployment.
WLD domain deployment fails during VMware Telco Cloud Automation Control Plane deployment with the error Failed to deploy TCA CLUSTER 69247203-c518-4bdf-95e4-1e77cf3d078d. Reason: compatibility file (/root/.config/tanzu/tkg/compatibility/tkg-compatibility.yaml) already exists.
VMware Telco Cloud Automation shows the scaling details when reconfiguring a CNF.
CNF instance should not have Heal and Scale to level in the list of available operations.
No IP when you restart a Worker Node for CPI with multi vCenter.
httpd does not start after upgrading from 1.9.1 to 2.0.
VMware Telco Cloud Automation Manager and Control Plane are exposed to clickjacking attacks.
Privilege escalation to root user.
Insecure sudo config - Privilege Escalation from admin user to root.