Troubleshooting procedures for Infrastructure Automation.
Cluster Creation Failure
- etcdserver leader change
- etcdserver timeout
- Internet connectivity problem (the DNS configuration in the bootstrapper VM is incorrect), due to which the container images cannot be downloaded from the internet. A quick DNS check from the bootstrapper VM is sketched after this list.
- For an air-gapped configuration, the bootstrapper is not able to connect to the airgap server, or the airgap server is not able to connect to the cluster nodes.
- Unable to connect to the Bootstrapper Service.
- Cluster creation timeout issues.
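As a first check for the internet connectivity symptom, the following shell sketch verifies DNS resolution and registry reachability from the bootstrapper VM. The host projects.registry.vmware.com is only an illustrative example; substitute the registry your deployment actually pulls container images from.
# Inspect the DNS servers configured on the bootstrapper VM.
cat /etc/resolv.conf
# Verify that the container registry resolves and responds over HTTPS.
nslookup projects.registry.vmware.com
curl -kI https://projects.registry.vmware.com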
Workaround
- If you have already created a bootstrapper cluster, delete the bootstrapper cluster before deleting the management cluster.
- To delete the VMware Telco Cloud Automation cluster, use clusterType=MANAGEMENT.
- To delete the bootstrapper cluster, use clusterType=WORKLOAD.
- Use the GET API of the appliance manager to fetch the cluster details (a curl sketch of the full flow follows this list). For example:
  GET API: https://<VM IP>:9443/api/admin/clusters?clusterType=<MANAGEMENT or WORKLOAD>
  From the response, use the id as the clusterId.
- Pass the id from the response to the DELETE API. This API deletes the cluster details from VMware vCenter and Tanzu Kubernetes Grid. For example:
  DELETE API: https://<VM IP>:9443/api/admin/clusters/<clusterId>?clusterType=<MANAGEMENT or WORKLOAD>
- Use the status API to monitor the cluster deletion. For example:
  GET API: https://<VM IP>:9443/api/admin/clusters/<clusterId>/status?clusterType=<MANAGEMENT or WORKLOAD>
  If the output shows {"status":"Deleted","phase":"Teardown"}, the system has deleted the cluster and released the related resources.
- To reuse the controlPlaneEndpointIP, use the force delete option to remove the cluster from the local DB. For example:
  DELETE API: https://<VM IP>:9443/api/admin/clusters/<clusterId>?clusterType=<MANAGEMENT or WORKLOAD>&forcedDelete=true
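The following is a minimal curl sketch of the same delete flow, assuming the appliance manager API is reachable at <VM IP>:9443 and using clusterType=MANAGEMENT as an example. The -k flag (skip certificate verification) and the omission of authentication headers are assumptions; adjust both to match your environment.
# 1. Fetch the cluster details and note the "id" field in the response.
curl -k "https://<VM IP>:9443/api/admin/clusters?clusterType=MANAGEMENT"
# 2. Delete the cluster, passing the id as clusterId.
curl -k -X DELETE "https://<VM IP>:9443/api/admin/clusters/<clusterId>?clusterType=MANAGEMENT"
# 3. Poll the status until it shows {"status":"Deleted","phase":"Teardown"}.
curl -k "https://<VM IP>:9443/api/admin/clusters/<clusterId>/status?clusterType=MANAGEMENT"
# 4. If the controlPlaneEndpointIP must be reused, force-delete the entry from the local DB.
curl -k -X DELETE "https://<VM IP>:9443/api/admin/clusters/<clusterId>?clusterType=MANAGEMENT&forcedDelete=true"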
Service Installation Failure
- Helm API failed: release: <service_name> not found
- Helm API failed: release name <service_name> in namespace <namespace_name> is already in use
- Release "service_name" failed: etcdserver: leader changed
- Failed to deploy <service_name>-readiness
Workaround
VMware Telco Cloud Automation automatically uninstalls any failed service. You can perform a Resync to continue with the deployment.
If the automatic uninstall of the service fails and the error message includes the line "Manual cleanup of the failed service may be necessary before performing a resync", uninstall the failed service manually and perform a resynchronisation from Infrastructure Automation.
Steps to manually uninstall a failed service
To check for the failed service installation on the bootstrapper virtual machine (VM), execute the command helm ls -n <Namespace-name> (a filter that lists only failed releases is sketched after the example output).
Example
helm ls -n tca-mgr
[root@tca-b-cdc1 /home/admin]# helm ls -n tca-mgr
NAME           NAMESPACE  REVISION  UPDATED                                  STATUS    CHART                APP VERSION
istio-ingress  tca-mgr    1         2021-11-20 17:18:25.309656522 +0000 UTC  deployed  istio-ingress-2.0.0  1.10.3
kafka          tca-mgr    1         2021-11-20 17:15:43.734275545 +0000 UTC  deployed  kafka-2.0.0          2.12-2.5.0
mongodb        tca-mgr    1         2021-11-20 17:11:17.794039072 +0000 UTC  deployed  mongodb-2.0.0        3.2.5
redisoperator  tca-mgr    1         2021-11-20 17:15:55.431688075 +0000 UTC  deployed  redisoperator-2.0.0  1.0.0
redisservice   tca-mgr    1         2021-11-20 17:16:58.087135033 +0000 UTC  deployed  redisservice-2.0.0   6.0-alpine
tca            tca-mgr    1         2021-11-20 17:18:46.328884407 +0000 UTC  deployed  tca-2.0.0            2.0.0
zookeeper      tca-mgr    1         2021-11-20 17:15:34.075735519 +0000 UTC  deployed  zookeeper-2.0.0      3.4.9
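When many releases are listed, you can narrow the output to failed releases with the --failed filter of helm list. This is a minimal sketch, assuming the Helm CLI on the bootstrapper VM is Helm 3, which supports this flag.
# Show only releases whose last operation failed in the tca-mgr namespace.
helm ls -n tca-mgr --failed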
To recover from the service installation failure, uninstall the failed service on the Bootstrapper VM terminal using the command helm uninstall <Helm Service Name> -n <Namespace-name>
Example:
helm uninstall tca -n tca-mgr
To verify the successful uninstallation of the service, re-execute the command helm ls -n <Namespace-name>. If the list does not show the Helm service, the uninstallation is successful.
[root@tca-b-cdc1 /home/admin]# helm ls -n tca-mgr
NAME           NAMESPACE  REVISION  UPDATED                                  STATUS    CHART                APP VERSION
istio-ingress  tca-mgr    1         2021-11-20 17:18:25.309656522 +0000 UTC  deployed  istio-ingress-2.0.0  1.10.3
kafka          tca-mgr    1         2021-11-20 17:15:43.734275545 +0000 UTC  deployed  kafka-2.0.0          2.12-2.5.0
mongodb        tca-mgr    1         2021-11-20 17:11:17.794039072 +0000 UTC  deployed  mongodb-2.0.0        3.2.5
redisoperator  tca-mgr    1         2021-11-20 17:15:55.431688075 +0000 UTC  deployed  redisoperator-2.0.0  1.0.0
redisservice   tca-mgr    1         2021-11-20 17:16:58.087135033 +0000 UTC  deployed  redisservice-2.0.0   6.0-alpine
zookeeper      tca-mgr    1         2021-11-20 17:15:34.075735519 +0000 UTC  deployed  zookeeper-2.0.0      3.4.9
Perform the resynchronisation through Infrastructure Automation using Resync.
Site-Pairing Failure
- etcdserver leader change
- etcdserver timeout
- Socket-timeout issue
Workaround
Perform the resynchronisation through Infrastructure Automation using Resync.
Cloud Builder Validation Failure
During the Management or Workload domain deployment, Cloud Builder 4.3 performs a set of validations. Some validations can fail as expected but have no impact on the domain deployment. For example, Cloud Builder validates the gateway configuration for the vMotion and vSAN networks. A user may not have configured gateways for vMotion and vSAN because those networks are configured in their respective L2 domains. In such a situation, Infrastructure Automation fails the domain deployment because the Cloud Builder validation fails. The user can skip the Cloud Builder validation using the following procedures and then perform a resync on the failed domain to continue.
- Log in to the VMware Telco Cloud Automation manager using SSH.
- Switch to the root user.
- Open the file /common/lib/docker/volumes/tcf-manager-config/_data/cloud_spec.json.
- Set the field validateCloudBuilderSpec to false (a command-line sketch follows this procedure). For example:
  "settings": {
      "ssoDomain": "vsphere.local",
      "pscUserGroup": "Administrators",
      "saas": "10.202.228.222",
      "enableCsiZoning": true,
      "validateCloudBuilderSpec": false,
      "csiRegionTagNamingScheme": "region-{domainName}",
      "clusterCsiZoneTagNamingScheme": "zone-{domainName}",
      "hostCsiZoneTagNamingScheme": "zone-{hostname}",
      "dnsSuffix": "telco.example.com",
      "ntpServers": [ "10.166.1.120" ],
- Resync the failed domain.
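If you prefer to make the change from the shell instead of a text editor, the following is a minimal sketch. It assumes the field appears exactly once in the file, written as "validateCloudBuilderSpec": true with a single space after the colon; back up the file before editing.
# Back up the file, flip the flag, and confirm the change.
cp /common/lib/docker/volumes/tcf-manager-config/_data/cloud_spec.json /tmp/cloud_spec.json.bak
sed -i 's/"validateCloudBuilderSpec": true/"validateCloudBuilderSpec": false/' /common/lib/docker/volumes/tcf-manager-config/_data/cloud_spec.json
grep validateCloudBuilderSpec /common/lib/docker/volumes/tcf-manager-config/_data/cloud_spec.json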
- Log in to the bootstrapper VM using SSH.
- Switch to the root user.
- Open the file /common/lib/docker/volumes/tcf-manager-config/_data/cloud_spec.json.
- Set the field validateCloudBuilderSpec to false (the command-line sketch above applies here as well). For example:
  "settings": {
      "ssoDomain": "vsphere.local",
      "pscUserGroup": "Administrators",
      "saas": "10.202.228.222",
      "enableCsiZoning": true,
      "validateCloudBuilderSpec": false,
      "csiRegionTagNamingScheme": "region-{domainName}",
      "clusterCsiZoneTagNamingScheme": "zone-{domainName}",
      "hostCsiZoneTagNamingScheme": "zone-{hostname}",
      "dnsSuffix": "telco.example.com",
      "ntpServers": [ "10.166.1.120" ],
- Resync the failed domain.
- Log in to the bootstrapper VM using SSH.
- Switch to the root user.
- Open a shell in the tcf-manager container using the command kubectl exec -it tca-tcf-manager-0 -n tca-mgr bash.
- Open the file /opt/vmware/tcf/config/cloud_spec.json.
- Set the field validateCloudBuilderSpec to false (a kubectl-based sketch follows this procedure). For example:
  "settings": {
      "ssoDomain": "vsphere.local",
      "pscUserGroup": "Administrators",
      "saas": "10.202.228.222",
      "enableCsiZoning": true,
      "validateCloudBuilderSpec": false,
      "csiRegionTagNamingScheme": "region-{domainName}",
      "clusterCsiZoneTagNamingScheme": "zone-{domainName}",
      "hostCsiZoneTagNamingScheme": "zone-{hostname}",
      "dnsSuffix": "telco.example.com",
      "ntpServers": [ "10.166.1.120" ],
- Resync the failed domain.
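For the container-based procedure, the edit can also be scripted from the bootstrapper VM without an interactive shell. This sketch makes the same assumptions as the one above (the field occurs exactly once and is currently true); the pod name tca-tcf-manager-0 and namespace tca-mgr come from the steps above.
# Flip the flag inside the tcf-manager pod and confirm the change.
kubectl exec tca-tcf-manager-0 -n tca-mgr -- sed -i 's/"validateCloudBuilderSpec": true/"validateCloudBuilderSpec": false/' /opt/vmware/tcf/config/cloud_spec.json
kubectl exec tca-tcf-manager-0 -n tca-mgr -- grep validateCloudBuilderSpec /opt/vmware/tcf/config/cloud_spec.json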