Troubleshooting procedures for Infrastructure Automation.

Cluster Creation Failure

The following can cause cluster creation failure:
  • etcdserver leader change.
  • etcdserver timeout.
  • Internet connectivity problems, such as an incorrect DNS configuration in the bootstrapper VM, that prevent the container images from being downloaded from the internet (see the connectivity check after this list).
  • For air-gapped configurations, the bootstrapper cannot connect to the airgap server, or the airgap server cannot connect to the cluster nodes.
  • Unable to connect to the Bootstrapper Service.
  • Cluster creation timeout issues.
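
If an internet connectivity or DNS problem is suspected, a quick check from the bootstrapper VM can confirm whether name resolution and registry access work. This is a minimal sketch; the registry host name below is only an example, so substitute the registry that your deployment actually pulls container images from.

# Verify DNS resolution from the bootstrapper VM (example registry host)
nslookup projects.registry.vmware.com
# Verify HTTPS reachability to the registry
curl -kI --max-time 10 https://projects.registry.vmware.com/v2/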

Workaround

Delete the cluster and then perform a resynchronisation (Resync) from Infrastructure Automation.
Note:
  • If you have already created a bootstrapper cluster, delete the bootstrapper cluster before deleting the management cluster.
  • To delete the VMware Telco Cloud Automation Cluster, use clusterType=MANAGEMENT.
  • To delete Bootstrapper Cluster, use clusterType=WORKLOAD.

Steps to Delete a VMware Telco Cloud Automation Cluster
  1. Use the GET API of the appliance manager to fetch the cluster details.
    • For example, GET API: https://<VM IP>:9443/api/admin/clusters?clusterType=<MANAGEMENT or WORKLOAD>.
    • From the response, note the id field; this is the clusterId used in the following steps.
  2. Pass the clusterId to the DELETE API.
    • For example, https://<VM IP>:9443/api/admin/clusters/<clusterId>?clusterType=<MANAGEMENT or WORKLOAD>.
    • This API deletes the cluster details from VMware vCenter and Tanzu Kubernetes Grid.
  3. Use the status API to monitor the cluster deletion.
    • For example, GET API: https://<VM IP>:9443/api/admin/clusters/<clusterId>/status?clusterType=<MANAGEMENT or WORKLOAD>.
    • If the output shows {"status":"Deleted","phase":"Teardown"}, the system has deleted the cluster and released the related resources.
  4. If needed, use the force delete option to remove the cluster entry from the local database so that the controlPlaneEndpointIP can be reused (a curl sketch of the complete workflow follows these steps).
    • For example, https://<VM IP>:9443/api/admin/clusters/<clusterId>?clusterType=<MANAGEMENT or WORKLOAD>&forcedDelete=true.
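
The commands below are a minimal curl sketch of the workflow above, using MANAGEMENT as the cluster type. The VM IP and clusterId are placeholders, and any authentication options are omitted; add whatever credentials or tokens your appliance manager requires, and use -k only if you accept the self-signed certificate.

# 1. List clusters and note the id of the cluster to delete
curl -k "https://<VM IP>:9443/api/admin/clusters?clusterType=MANAGEMENT"
# 2. Delete the cluster from VMware vCenter and Tanzu Kubernetes Grid
curl -k -X DELETE "https://<VM IP>:9443/api/admin/clusters/<clusterId>?clusterType=MANAGEMENT"
# 3. Poll the deletion status until it reports {"status":"Deleted","phase":"Teardown"}
curl -k "https://<VM IP>:9443/api/admin/clusters/<clusterId>/status?clusterType=MANAGEMENT"
# 4. Optional force delete, to release the controlPlaneEndpointIP from the local database
curl -k -X DELETE "https://<VM IP>:9443/api/admin/clusters/<clusterId>?clusterType=MANAGEMENT&forcedDelete=true"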

Service Installation Failure

The following errors can cause service installation failures (a sketch for inspecting the failed release follows this list):
  • Helm API failed: release: <service_name> not found
  • Helm API failed: release name <service_name> in namespace <namespace_name> is already in use
  • Release "service_name" failed: etcdserver: leader changed
  • Failed to deploy <service_name>-readiness
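
If it is unclear which release failed, the following is a minimal sketch for inspecting the releases in the affected namespace from the bootstrapper VM; the namespace tca-mgr and the release name tca are taken from the example output later in this section and are only illustrations.

# List all releases in the namespace, including failed and pending ones
helm ls -a -n tca-mgr
# Show the detailed status of a specific release
helm status tca -n tca-mgr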

Workaround

VMware Telco Cloud Automation automatically uninstalls any failed service. You can then perform a Resync to continue with the deployment.

If the automatic uninstall of the service fails and the error message includes the line "Manual cleanup of the failed service may be necessary before performing a resync", uninstall the failed service manually and then perform a resynchronisation from Infrastructure Automation.

Steps to Manually Uninstall a Failed Service

To check the installation status of the services on the bootstrapper virtual machine (VM), execute the command helm ls -n <Namespace-name>

Example

helm ls -n tca-mgr

[root@tca-b-cdc1 /home/admin]# helm ls -n tca-mgr
NAME            NAMESPACE   REVISION    UPDATED                                 STATUS      CHART               APP VERSION
istio-ingress   tca-mgr     1           2021-11-20 17:18:25.309656522 +0000 UTC deployed    istio-ingress-2.0.0 1.10.3    
kafka           tca-mgr     1           2021-11-20 17:15:43.734275545 +0000 UTC deployed    kafka-2.0.0         2.12-2.5.0
mongodb         tca-mgr     1           2021-11-20 17:11:17.794039072 +0000 UTC deployed    mongodb-2.0.0       3.2.5     
redisoperator   tca-mgr     1           2021-11-20 17:15:55.431688075 +0000 UTC deployed    redisoperator-2.0.0 1.0.0     
redisservice    tca-mgr     1           2021-11-20 17:16:58.087135033 +0000 UTC deployed    redisservice-2.0.0  6.0-alpine
tca             tca-mgr     1           2021-11-20 17:18:46.328884407 +0000 UTC deployed    tca-2.0.0           2.0.0     
zookeeper       tca-mgr     1           2021-11-20 17:15:34.075735519 +0000 UTC deployed    zookeeper-2.0.0     3.4.9

To recover from the installation failure, uninstall the failed service from the bootstrapper VM terminal using the command helm uninstall <Helm Service Name> -n <Namespace-name>

Example:

helm uninstall tca -n tca-mgr

To verify that the service was uninstalled successfully, re-execute the command helm ls -n <Namespace-name>. If the list no longer shows the Helm service, the uninstallation is successful.

[root@tca-b-cdc1 /home/admin]# helm ls -n tca-mgr
NAME            NAMESPACE   REVISION    UPDATED                                 STATUS      CHART               APP VERSION
istio-ingress   tca-mgr     1           2021-11-20 17:18:25.309656522 +0000 UTC deployed    istio-ingress-2.0.0 1.10.3    
kafka           tca-mgr     1           2021-11-20 17:15:43.734275545 +0000 UTC deployed    kafka-2.0.0         2.12-2.5.0
mongodb         tca-mgr     1           2021-11-20 17:11:17.794039072 +0000 UTC deployed    mongodb-2.0.0       3.2.5     
redisoperator   tca-mgr     1           2021-11-20 17:15:55.431688075 +0000 UTC deployed    redisoperator-2.0.0 1.0.0     
redisservice    tca-mgr     1           2021-11-20 17:16:58.087135033 +0000 UTC deployed    redisservice-2.0.0  6.0-alpine    
zookeeper       tca-mgr     1           2021-11-20 17:15:34.075735519 +0000 UTC deployed    zookeeper-2.0.0     3.4.9

Perform the resynchronisation through Infrastructure Automation using Resync.

Site-Pairing Failure

The following can cause site-pairing failures:
  • etcdserver leader change
  • etcdserver timeout
  • Socket-timeout issue

Workaround

Perform the resynchronisation through Infrastructure Automation using Resync.

Cloud Builder Validation Failure

During Management or Workload domain deployment, Cloud Builder 4.3 performs a set of validations. Some of these validations can fail as expected, yet have no impact on the domain deployment. For example, Cloud Builder validates the gateway configuration for the vMotion and vSAN networks, but you may intentionally not have configured gateways for vMotion and vSAN because they reside in the same respective L2 domains. In such a situation, Infrastructure Automation fails the domain deployment because the Cloud Builder validation fails. You can skip the Cloud Builder validation using one of the following procedures and then perform a resync on the failed domain to continue.

For VM-based deployment, use the following procedure:
  1. Log in to the VMware Telco Cloud Automation manager using SSH.
  2. Switch to the root user.
  3. Open the file /common/lib/docker/volumes/tcf-manager-config/_data/cloud_spec.json.
  4. Set the field validateCloudBuilderSpec to false, as shown in the following excerpt (a command-line sketch follows these steps).
       "settings": {
            "ssoDomain": "vsphere.local",
            "pscUserGroup": "Administrators",
            "saas": "10.202.228.222",
            "enableCsiZoning": true,
            "validateCloudBuilderSpec": false,
            "csiRegionTagNamingScheme": "region-{domainName}",
            "clusterCsiZoneTagNamingScheme": "zone-{domainName}",
            "hostCsiZoneTagNamingScheme": "zone-{hostname}",
            "dnsSuffix": "telco.example.com",
            "ntpServers": [
                "10.166.1.120"
            ],
  5. Resync the failed domain.
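
A minimal command-line alternative to editing the file manually is sketched below; it assumes the value is currently set to true exactly as in the excerpt and that you are the root user, so verify the file afterwards. The same edit applies to the HA-based procedure that follows, which uses the identical file path on the bootstrapper VM.

# Back up the spec, then set validateCloudBuilderSpec to false
SPEC=/common/lib/docker/volumes/tcf-manager-config/_data/cloud_spec.json
cp "$SPEC" "$SPEC.bak"
sed -i 's/"validateCloudBuilderSpec": true/"validateCloudBuilderSpec": false/' "$SPEC"
# Confirm the change
grep validateCloudBuilderSpec "$SPEC"
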
For HA-based deployment, use the following procedure:
  1. Log in to the bootstrapper VM using SSH.
  2. Switch to the root user.
  3. Open the file /common/lib/docker/volumes/tcf-manager-config/_data/cloud_spec.json.
  4. Set the field validateCloudBuilderSpec to false, as shown in the following excerpt.
       "settings": {
            "ssoDomain": "vsphere.local",
            "pscUserGroup": "Administrators",
            "saas": "10.202.228.222",
            "enableCsiZoning": true,
            "validateCloudBuilderSpec": false,
            "csiRegionTagNamingScheme": "region-{domainName}",
            "clusterCsiZoneTagNamingScheme": "zone-{domainName}",
            "hostCsiZoneTagNamingScheme": "zone-{hostname}",
            "dnsSuffix": "telco.example.com",
            "ntpServers": [
                "10.166.1.120"
            ],
  5. Resync the failed domain.
For Management or Workload domain deployment, use the following procedure:
  1. Log in to the bootstrapper VM using SSH.
  2. Switch to the root user.
  3. Open a shell in the tcf-manager container using the kubectl exec -it tca-tcf-manager-0 -n tca-mgr bash command.
  4. Open the file /opt/vmware/tcf/config/cloud_spec.json.
  5. Set the field validateCloudBuilderSpec to false, as shown in the following excerpt (a command-line sketch follows these steps).
       "settings": {
            "ssoDomain": "vsphere.local",
            "pscUserGroup": "Administrators",
            "saas": "10.202.228.222",
            "enableCsiZoning": true,
            "validateCloudBuilderSpec": false,
            "csiRegionTagNamingScheme": "region-{domainName}",
            "clusterCsiZoneTagNamingScheme": "zone-{domainName}",
            "hostCsiZoneTagNamingScheme": "zone-{hostname}",
            "dnsSuffix": "telco.example.com",
            "ntpServers": [
                "10.166.1.120"
            ],
  6. Resync the failed domain.
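
For this containerized case, a similar sketch can apply the edit from the bootstrapper VM without opening an interactive shell. It assumes that sed and grep are available in the container image and that the value is currently set to true; verify the file before resyncing.

# Set validateCloudBuilderSpec to false inside the tcf-manager pod
kubectl exec tca-tcf-manager-0 -n tca-mgr -- sed -i 's/"validateCloudBuilderSpec": true/"validateCloudBuilderSpec": false/' /opt/vmware/tcf/config/cloud_spec.json
# Confirm the change
kubectl exec tca-tcf-manager-0 -n tca-mgr -- grep validateCloudBuilderSpec /opt/vmware/tcf/config/cloud_spec.json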