Problems can occur within different layers of the VMware Telco Cloud Service Assurance stack. The symptoms can manifest as application unavailability, incorrect operations, or degradation in performance. These problems can occur during deployment as well as later during operation.

Deployment Container Issues

  • Cannot connect to the Kubernetes cluster from the deployment container: Running kubectl get nodes --kubeconfig /root/.kube/<your-kubernetes-cluster-kubeconfig-file> from inside the deployment container sometimes hangs. Follow this procedure to resolve the issue:
    1. Ensure that you have the right KUBECONFIG file in your deployment container, using the command:
      1. ls /root/.kube/<your-kubernetes-cluster-kubeconfig-file>
      2. If it is not present, verify that it is present on your deployment host/VM, using the command:
        ls $HOME/.kube/<your-kubernetes-cluster-kubeconfig-file>
    2. Verify that the IP address of the Kubernetes API server is reachable from your deployment host/VM (a scripted version of this check appears after this list):
      1. Get the IP address of the server from the KUBECONFIG file and ping the IP address:
        apiVersion: v1
        clusters:
        - cluster:
            certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeU1EUXhPREl6TXpZMU1sb1hEVE15TURReE5USXpNelkxTWxvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBS1RlCmcvcFNaVG0wdnNiU2g2RFlZazJ1ZFRtZW51VytTdGNDV0RPb2NUYk1YbDBrclh0S2lDQ0R2UWdlTlVoT3RRZ0UKU2VPUzRnZEcweXJvSkswY09qMndjSHVlc0E3bGVYOVNxZUpacklhYUlUZTh5eEVtckJPbHVOZjdhdUhoS2UwYgprNFQwSlZja2F1T2VxTDB0YmQ0UTE1T2F4SS9VOEIyc2I5VTNvS1oxWUhQNlNWS05rWUthVjFFS1AxYXoxWjg1CmJnVTIzbS9KUy9URGR1aXV1aVkrU3lJd2c4dW1VSXlvbVBLbzFRU1RSblRmK2haSDduUEk1aEhwK2dtdXpudUYKK0NuWDNqQmgxcUpSenhLNkxiMGRITWZHT2NwbExsYTg1eTJDcTBVd3BQdmwzcDZIUkswZFdjcnhuSE1wMzNuOApIb0FnS3NRK0Q5NVV3QllzeThFQ0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZIOWJnMzMySXhWNTdNZ0FlbmxYYmxyRGhlenBNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFBb2NZRHFvblRyR0RVTExtZkpFRUt4dHVPdWdwd1VaWFo5QXBNdnNNNFJBR09FbG1DaQpTOTdtQU15dkE2dndjQlZDL3BOaXNIditaSGdJbHF4UWJJMXptWnJGb0g2bkE4Q1BCN2VwcDAxbFlvVEsxOGxOClNyN0ZyNlZMRlVYcWNsaFhKYndDNDVWVm4zYk9OQ2pobGFLV0pzV0hsNDBGV0RwNllmYnhWbzVQa0FEQjhGYUgKY3p1Q0pab3VXUGJ1R2M5VjBuOVB2dG1wbVh6NG1rS2xwaTlVTlF4TGx2bWtDRVlVYW1Na1Y2QzlxQ3pDSkp6bwowbzdDN3FHc2RZR0Q1d3VqY2JFSEE5a2ZVek9kSFRBalBrYmw5RUQyUGFCdFlONHJaT1h4L3RwaW94N3BYRjBzCjR2bTFuc3VLK2FFT2pwSHRoV25WQm4zTmkzb0VhZENsSGxuSwotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
            server: https://10.180.13.14:6443  <- Ping this server IP from the deployment host
      2. If the IP address is not reachable through ping, there may be connectivity issues between your deployment VM and the Kubernetes cluster. Contact your IT administrator for resolution.
    3. If the IP address is reachable but the kubectl command still hangs inside the container, exit and relaunch the container with the --network host option:
      $ docker run \
          --rm \
          -v ${TCSA_WORK_SPACE}/tcx-deployer:/root/tcx-deployer \
          -v $HOME/.ssh:/root/.ssh \
          -v $HOME/.kube:/root/.kube \
          -v /var/run/docker.sock:/var/run/docker.sock \
          -v $(which docker):/usr/local/bin/docker:ro \
          -v $HOME/.docker/config.json:/root/.docker/config.json:ro \
          -v /etc/docker:/etc/docker:rw \
          --network host \
          -it $DEPLOYMENT_IMAGE \
          bash
      1. Then rerun the kubectl get nodes --kubeconfig /root/.kube/<your-kubernetes-cluster-kubeconfig-file> command. If it still hangs, there may be more fundamental connectivity issues between your deployment VM and the Kubernetes cluster. Contact your IT administrator for resolution.
  • Contents of the deployer bundle are not visible inside the container: If the tcx-deployer folder was deleted on the host, extracting the tar.gz file again or recreating the folder with mkdir does not make the contents visible inside the running container. To resolve this:
    1. Exit the deployment container.
    2. Extract the deployer tar.gz bundle on the deployment host (if not done).
    3. Restart the container by running the docker run command again.
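
The reachability check in step 2 of the first issue above can be scripted. The following is a minimal sketch, assuming a single-cluster kubeconfig with the file name used in the steps; if ICMP is blocked on your network, the curl probe against the API server port is an alternative to ping.

# Extract the API server address from the kubeconfig used in the steps above
KUBECONFIG_FILE=/root/.kube/<your-kubernetes-cluster-kubeconfig-file>
API_SERVER=$(grep 'server:' "${KUBECONFIG_FILE}" | awk '{print $2}')          # for example, https://10.180.13.14:6443
API_IP=$(echo "${API_SERVER}" | sed -e 's|https://||' -e 's|:.*||')

# ICMP reachability check from the deployment host/VM
ping -c 3 "${API_IP}"

# If ICMP is blocked, probe the API server port directly; any HTTP response (even 401/403) means it is reachable
curl -k --connect-timeout 5 "${API_SERVER}/version"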

Elasticsearch Kibana Troubleshooting

If Elasticsearch goes down in the middle of Kibana initialization, the initialization fails. This failure is critical to Kibana: the Kibana index is left in a broken state, and you must remove it manually.

Perform the following steps to delete the Kibana index and restart the pod:
  1. Delete Kibana index.
    curl -XDELETE http://elasticsearch:9200/.kibana*
  2. Delete pod.

    After you delete the pod, Kubernetes creates a new pod.

    kubectl delete pods <POD-ID>
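
The elasticsearch:9200 endpoint resolves only inside the cluster, so if you cannot run curl from a pod that already has access, one option is to run the deletion through an Elasticsearch pod. A minimal sketch, assuming an Elasticsearch master pod named elasticsearch-master-0 in the current namespace:

# Delete the broken Kibana index from inside an Elasticsearch pod (pod name is an assumption; adjust to your deployment)
kubectl exec -it elasticsearch-master-0 -- curl -XDELETE http://elasticsearch:9200/.kibana*

# Find and delete the Kibana pod; Kubernetes recreates it automatically
kubectl get pods | grep kibana
kubectl delete pods <POD-ID>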

VMware Telco Cloud Service Assurance Installation Issues

VMware Telco Cloud Service Assurance installation is triggered by running the tcx_app_deployment.sh script. The script executes two main stages: the initialization stage and the installation stage.
  • Troubleshooting initialization issues:
    1. VMware Telco Cloud Service Assurance initialization (pushing artifacts and deploying core components) is executed by a Python script called tcx_install.zip.
    2. If the tcx_app_deployment.sh script exits with a failure and the failure message includes a pattern similar to the following example:
      Traceback (most recent call last):
      04:09:08   File "/tmp/Bazel.runfiles_405hciz4/runfiles/tcx/scripts/tcx_install.py", line 215, in <module>
      04:09:08     main()
      Then the deployment failed during the initialization stage. Attach the initialization logs to the support ticket and contact your IT administrator for resolution.
    3. How to get product initialization logs: Errors during the execution of the tcx_install.zip Python script are logged to a file named tcx_installer_log.log under the scripts directory of the unpacked deployer bundle on the deployment host. You can attach these logs when filing a support ticket.
  • Troubleshooting VMware Telco Cloud Service Assurance installation issues:
    1. If the tcx_app_deployment.sh script exits with a failure message:
      'failed to deploy all apps successfully..Current product status is:.."

      Then, the deployment failed during the product installation stage.

      Follow the procedure:
      1. Launch the deployment container so that you can use kubectl. Refer to the VMware Telco Cloud Service Assurance Deployment Guide.
      2. Set the KUBECONFIG variable to your cluster's kubeconfig file using the command:
        export KUBECONFIG=/root/.kube/<your-kubernetes-cluster-kubeconfig-file>
      3. Use the following kubectl commands in the deployment container to help narrow down the issue. You can attach the output of each kubectl command to your support request (a consolidated sketch that collects these outputs appears after this list):
        1. Get the current product status:
          kubectl get tcxproduct tcsa
        2. If the message is "The following App CRs are still reconciling" or "The following App CRs failed", check the status of each App in the message by running the following command:
          kubectl describe app <app-name>
        3. In the output of the above command, look for the Useful Error Message at the bottom. This message identifies the exact resource (Deployment, StatefulSet, ReplicaSet, Job, and so on) that is failing.
        4. Depending on the resource that is failing or stuck Reconciling, run the kubectl describe command for that resource to get more information:
          kubectl describe deployment <deployment-name>
          kubectl describe service <service-name>
          kubectl describe statefulset <statefulset-name>
          kubectl describe daemonset <daemonset-name>
          kubectl describe job <job-name>
        5. Once you have narrowed down to the appropriate resource, if the above commands do not provide adequate information, get information from the pods owned by the resource:
          kubectl get pods -A | grep <app-name>
          kubectl describe pod <pod-name> -n <pod-namespace>   # where <pod-name> and <pod-namespace> are the name and namespace of the pod obtained from the previous command
          and
          kubectl logs <pod-name> -n <pod-namespace>
        6. Get the product installation logs. Product installation is executed by a service called Admin Operator. Obtain the installation logs from the admin-operator pod by running the following commands:
          kubectl get pods | grep admin-operator
          kubectl logs <admin-operator-pod-name> # where <admin-operator-pod-name> is the pod name from the previous command

        Attach these logs while filing a support request.

  • Post deployment, if the hdfs-datanode pod crashes, follow this procedure to resolve the issue:
    • Workaround steps on a TKG cluster:
      • Identify the hdfs datanode pod with the issue:
        kubectl get pods | grep hdfs
      • Delete the corresponding pvc:
        kubectl get pvc | grep hdfs
        kubectl delete pvc <pvc_id> --wait=false
      • Delete the corresponding pv:
        kubectl get pv | grep hdfs
        kubectl delete pv <pv_id> --wait=false
      • Delete the failing pod:
        kubectl delete pod <pod_id>
        Note: The pv and pvc are recreated when the pod is recreated.
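
The drill-down described in the installation troubleshooting steps above can be collected in a single pass. The following is a minimal sketch that gathers the product status, the App CR details, and the admin-operator logs into files for a support request; the output file names are illustrative, and the placeholders are the same ones used in the steps.

# Run inside the deployment container with KUBECONFIG exported as described above
kubectl get tcxproduct tcsa > tcxproduct-status.txt

# Describe the failing App CR and look for the Useful Error Message at the bottom
kubectl describe app <app-name> > app-describe.txt

# Pods owned by the failing app, plus their descriptions and logs
kubectl get pods -A | grep <app-name>
kubectl describe pod <pod-name> -n <pod-namespace> > pod-describe.txt
kubectl logs <pod-name> -n <pod-namespace> > pod-logs.txt

# Product installation logs from the admin-operator pod
ADMIN_OPERATOR_POD=$(kubectl get pods | grep admin-operator | awk '{print $1}')
kubectl logs "${ADMIN_OPERATOR_POD}" > admin-operator.log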

Getting Additional Information for CNFs

VMware Telco Cloud Automation manager UI provides information about CNF instantiation. To obtain additional information, run helm CLI commands against the CNFs.

  1. To list all the CNFs:
    root [ ~/tcx/scripts ]# helm list
    NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
    admin-opera-39526-teo8r default         1               2022-06-27 22:21:23.436735731 +0000 UTC deployed        admin-operator-0.2.1    1.2.0
    tcx-init-00-ca7ba-gxhcq default         1               2022-06-27 22:11:55.083527832 +0000 UTC deployed        tcx-init-0.0.1          0.0.1
  2. To get the Helm deployment configuration for a CNF:
    helm get all tcx-init-00-ca7ba-gxhcq
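
In addition to helm list and helm get all, the standard helm status, helm history, and helm get values commands can provide the release state, revision history, and user-supplied values. A minimal sketch, reusing the release name from the example above:

# Current state of the release
helm status tcx-init-00-ca7ba-gxhcq

# Revision history of the release
helm history tcx-init-00-ca7ba-gxhcq

# User-supplied values for the release
helm get values tcx-init-00-ca7ba-gxhcq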

Resolving Edge Services Port Conflict Issue

During initial deployment, the kafka-edge service assigns a random port to the ingress gateway in charge of exposing the kafka-edge to external clients. This random port, in certain circumstances, may conflict with another port assigned in other parts of the deployment. Due to the port conflict, the kafka-edge service may not be deployed properly and can continue to report a Reconciled Failed state until the conflict is resolved. To determine the availability of the edge services and whether the port conflict is the reason for the failure, execute the following command from the deployer container:
kubectl get app kafka-edge
If the status shown is not Reconcile succeeded, then look at the details of the app by describing the app with the following command:
kubectl describe app kafka-edge 
At the bottom of the output, look for the status or error message, which may state the port conflict with a message similar to this:
Useful Error Message: kapp: Error: Applying create service/istio-edge-ingressgateway (v1) namespace: kafka-edge:
Creating resource service/istio-edge-ingressgateway (v1) namespace: kafka-edge:
Service "istio-edge-ingressgateway" is invalid: spec.ports[3].nodePort:
Invalid value: 32097: provided port is already allocated (reason: Invalid)
Events: <none>
Once you have this information and the port number causing the conflict, determine which service is using the port:
kubectl get svc -A | grep <port>
For example:
[root@wdc-10-220-142-191 ~]# kubectl get svc -A | grep 32097
istio-system   istio-ingressgateway   NodePort   100.68.40.18   <none>   15021:32737/TCP,80:32097/TCP,443:30002/TCP   3h28m
The preceding example points to the istio-ingressgateway service under the istio-system namespace. To correct this, you must delete the service and wait for it to be recreated, at which point it is assigned another random port. To delete the service in the preceding example, use the following command:
kubectl delete svc -n istio-system istio-ingressgateway

Verify that the deleted service has been restored and that the kafka-edge service reports Reconcile succeeded.
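
A minimal way to verify the recovery, reusing the commands from this section:

# Confirm that the ingress gateway service has been recreated with a new node port
kubectl get svc -n istio-system istio-ingressgateway

# Re-run until kafka-edge reports Reconcile succeeded
kubectl get app kafka-edge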

Support Bundle for Offline Troubleshooting

Another way to gather troubleshooting information is to use the Application Support Bundle, as follows:

  1. Go to https://<Telco-Cloud-Service-Assurance-UI-IP>.
  2. Select Administration > Application Support Bundle.
  3. In the Collection Start Time field, enter a date and time; one hour of service logs is collected starting from the specified time. The service log repository uses the UTC (Coordinated Universal Time) time zone, so the default timestamp is in UTC. Select the date from the date picker and enter the time in HH:mm format, for example, 15:45. You can change the date and time according to your desired time slot.
    Note: Only the service logs in the support bundle are affected by the Collection Start Time field. However, the Kubernetes cluster information in the support bundle reflects the latest deployment only.
  4. Click the Create Application Support Bundle button.
    Note: While the collection is in progress, you can click the Cancel button to cancel the support bundle collection.
  5. After the support bundle is collected, click Download to download the bundle. The support bundle is a tar.gz file that contains the following details:
    • Node details.
    • Service details.
    • Pod details for VMware Telco Cloud Service Assurance applications.
    • Logs for VMware Telco Cloud Service Assurance.
    • Pod details for K8s.
    • Pod logs for the services.
    • Persistent volume details.
    • Config map details.
    • K8s service account and corresponding role binding information.
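
The downloaded support bundle is a standard tar.gz archive and can be inspected locally. A minimal sketch, assuming an illustrative file name of support-bundle.tar.gz:

# List the contents of the support bundle without extracting it
tar -tzf support-bundle.tar.gz

# Extract the bundle into a working directory for offline review
mkdir -p support-bundle
tar -xzf support-bundle.tar.gz -C support-bundle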

Service Logs for Troubleshooting

Service logs are collected through the ELK pipeline and presented on the Service Logs page. You can search and explore the logs through the embedded Kibana log browser.

On the Service Logs page, you can perform the following operations.
  • Customize the fields that you want to view.
  • Filter the logs by criteria.
  • Search the log by text message.

For more detailed functions, see Kibana Discover documentation.

Elasticsearch data and Events pods are crashing in a longevity setup

To check the probe readiness, use the command:
#kubectl describe pod/elasticsearch-data-x

The output displays the latest event, which shows that the readiness probe failed.

Root Cause Analysis: The master nodes in the Elasticsearch cluster are not in sync.
  1. Log in to one of the Elasticsearch master pods by executing the command:
    #kubectl exec -it elasticsearch-master-x -- bash
  2. Check the cluster master node status by using the command:
    #curl http://elasticsearch:9200/_cat/nodes

    The output lists the cluster nodes with their IP addresses, roles, and names.

  3. To log out, enter the exit command.
  4. Compare the IP addresses reported for the following master nodes with the IP addresses of the elasticsearch-master pods:
    • elasticsearch-master-0
    • elasticsearch-master-1
    • elasticsearch-master-2
    Obtain the pod IP addresses using the command:
      #kubectl get pods -o wide | grep elasticsearch-master

      The output lists the elasticsearch-master pods with their IP addresses.

      If the IP address of any master node does not match, there might be a problem with the master cluster formation.

Workaround:
  1. Shut down the master nodes, using the command:
    #kubectl scale --replicas=0 statefulset.apps/elasticsearch-master
  2. Ensure that all the master nodes are down.
  3. Restart the master nodes, using the command:
    # kubectl scale --replicas=3 statefulset.apps/elasticsearch-master
  4. Ensure that all the master nodes and data nodes are up, running, and ready (see the verification sketch below).
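
A minimal verification sketch for steps 2 and 4, reusing the commands from this section:

# Step 2: confirm that all elasticsearch-master pods have terminated
kubectl get pods | grep elasticsearch-master

# Step 4: re-run until all Elasticsearch pods are Running and READY, then re-check cluster membership (any master pod works)
kubectl get pods | grep elasticsearch
kubectl exec -it elasticsearch-master-0 -- curl http://elasticsearch:9200/_cat/nodes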

Arango database cluster not reconciled

To troubleshoot the pods, follow these steps:
  1. Delete all the data pods (pods starting with arangodb-prmr*), using the command:
    kubectl delete pod <podname>
    Note: Run the command for all three data pods.
  2. Ensure that all pods are terminated (see the verification sketch after these steps).
  3. If the pods take a long time to terminate, terminate them forcefully using the command:
    kubectl patch pod/<podname>  --type json -p $'- op: remove\n  path: /metadata/finalizers'
    For example:
    kubectl patch pod/arangodb-prmr-kkmjqqxc-b542f9  --type json -p $'- op: remove\n  path: /metadata/finalizers'
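
To confirm that the data pods have terminated and have been recreated, a minimal sketch (the grep for the App CR avoids assuming its exact name):

# Re-run until all arangodb-prmr data pods show STATUS Running and all containers READY
kubectl get pods | grep arangodb-prmr

# Check whether the Arango app has reconciled (the App CR name may differ in your deployment)
kubectl get app | grep -i arango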

Postgres and dependent services not reconciled

Sometimes Postgres and dependent services such as Keycloak, Grafana, Apiservice, Analytics-service, Alerting-rest, and Admin-api do not get reconciled during the deployment.

If the deployment fails due to this issue, follow this procedure to resolve it:
  1. Run the following commands:
    kubectl delete job postgres-init-db
    kubectl delete postgres postgres
    kubectl delete job/tcx-grafana-deployer-job
    kubectl delete job/tco-grafana-deployer-job
  2. Wait for some time (five to ten minutes) so that all applications can be reconciled, and then execute the command:
    kubectl get tcxproduct

    Output: All App CRs reconciled successfully.

VMware Telco Cloud Service Assurance user interface displays an error message “Internal Server Error”

After a successful login, the VMware Telco Cloud Service Assurance user interface intermittently displays an “Internal Server Error” message.

Cause:

The Keycloak logs show the following error: Caused by: org.postgresql.util.PSQLException: ERROR: cannot execute INSERT in a read-only transaction.

There was a switchover in Postgres from primary to secondary (read-only) because the monitor marked the Postgres pod unhealthy. Postgres connections are stateful: once established, they are not terminated until either party closes the connection. Because the session was not terminated, the connection that the Keycloak service had established became read-only after the switchover, and the user interface displays an error message.

Workaround:

Try to terminate the connections with Postgres by deleting one Keycloak pod at a time:
kubectl delete pod keycloak-0
Note: If the keycloak-0 pod is up and the issue still persists, delete keycloak-1 and then keycloak-2, one at a time, so that there is no downtime.
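
A minimal sketch of the one-at-a-time deletion, using kubectl wait so that each Keycloak pod is Ready again before you delete the next one (pod names as in the note above; add -n <namespace> if Keycloak does not run in the current namespace):

# Delete keycloak-0 and wait until it is Ready again, so that there is no downtime
kubectl delete pod keycloak-0
kubectl wait --for=condition=Ready pod/keycloak-0 --timeout=300s

# If the issue persists, repeat the same two commands for keycloak-1 and then keycloak-2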

Flink service not reconciled

The Flink service does not get reconciled after stopping and starting the AKS cluster on Azure. Run the following commands to resolve the issue:
kubectl exec -it zookeeper-0 -- bash -c '/opt/vmware/vsa/apache-zookeeper-3.7.1-bin/bin/zkCli.sh deleteall /flink/flink'
kubectl delete pod flink-jobmanager-0 flink-jobmanager-1
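
To confirm the recovery, a minimal verification sketch (the grep avoids assuming the exact App CR name for Flink):

# Re-run until the job manager pods are Running and READY
kubectl get pods | grep flink-jobmanager

# Check whether the Flink app reports Reconcile succeeded (the App CR name may differ in your deployment)
kubectl get app | grep -i flink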