VMware Telco Cloud Service Assurance 2.1.0 | 19 JAN 2022

Check for additions and updates to these release notes.

About VMware Telco Cloud Service Assurance

VMware Telco Cloud Service Assurance is a real-time automated service assurance solution designed to holistically monitor and manage complex virtual and physical infrastructure and services end to end, from mobile core to the RAN to the edge. From a single pane of glass, VMware Telco Cloud Service Assurance provides cross‑domain, multi‑layer, automated assurance in a multi‑vendor and multi‑cloud environment. It provides operational intelligence to reduce complexity, perform rapid root cause analysis, and see how problems impact services and customers across all layers, lowering costs and improving customer experience.

For information about setting up and using VMware Telco Cloud Service Assurance, see the VMware Telco Cloud Service Assurance Documentation.

What's New

VMware Telco Cloud Service Assurance release 2.1.0 brings together various features and enhancements across platform, networking, and virtual infrastructure management areas. This release introduces the following major features:

  • Closed Loop Remediation Actions, which can be invoked manually or automatically against Telco Cloud Automation or 3rd party OSS/BSS tools based on RCA and pre-defined policies.

  • vRAN Assurance now extends topology visibility and RCA to CellSites.

  • Network Slicing Assurance is now generally available (GA). It integrates with VMware Telco Cloud Automation (TCA), and Network Slices are monitored automatically.

  • User interface enhancements such as Geo-Map and Traffic Map views and customized multi-tenant views to represent infrastructure connectivity.

  • Programmable Data Collector SDK, as part of the Unified Data Collector Framework, to onboard 3rd party data sources for Fault and Performance monitoring.

  • Platform Modernization features supporting deployment on AWS EKS, VMware Telco Cloud Service Assurance 2.0.1 to 2.1 upgrade, and Backup and Restore.

    Note:

    Upgrade is not supported for demo footprint.

  • Scalable footprints supporting monitoring of up to 75K, 100K, 125K, 150K, 175K, and 200K managed devices.

  • Revised sizing calculator based on events, metrics, and number of managed devices for quick sizing of VMware Telco Cloud Service Assurance deployments.

  • From this release, all VMware Smart Assurance customers can deploy VMware Telco Cloud Service Assurance to interoperate with existing VMware Smart Assurance (10.1.5 and above) and MnR (7.x and above) deployments for unified monitoring in VMware Telco Cloud Service Assurance.

  • Network Configuration Manager (NCM) Reporting in VMware Telco Cloud Service Assurance interoperates with existing NCM deployments.

  • Pipeline Reporting provides single-pane-of-glass visibility for Network Operations Centers (NOCs) from the Day 0 or Day 1 deployment or configuration stage to the Day 2 operations stages of their environment.

Note: VMware Telco Cloud Service Assurance v2.1.0 does not support Kubernetes 1.20.x deployments.

Closed Loop Remediation

This release introduces an early access feature to drive policy-driven, closed-loop, automated remediation actions based on RCA and Analytics in VMware Telco Cloud Service Assurance. The new Policy Driven UI in the VMware Telco Cloud Service Assurance 2.1 release gives users the capability to create remediation actions based on root cause and anomaly events. A remediation action can be configured to be automated or manual based on user preference. A remediation action can also invoke a workflow in an orchestrator such as VMware Telco Cloud Automation (TCA), for example to scale a K8s cluster up or down, or integrate with a ticketing system such as ServiceNow to open a trouble ticket, send a Slack message or an email, and so on.

Network Slicing Assurance

  • Automated discovery and monitoring of Network Slice and underlying infrastructure integrating with orchestrators such as Telco Cloud Automation (TCA).

  • Monitor the health of virtual Network Slices spanning Radio and Mobile Core networks delivered over physical and virtual infrastructure and underlay transports such as IP/MPLS.

  • RCA and service impact analysis isolate network degradations across a multi-domain service delivery stack (RAN infrastructure, Network Slice, service, virtual and physical devices, and WAN transports) and proactively pinpoint the root cause.

  • Trigger closed-loop remediation to Orchestrators using standard APIs that proactively optimize the Network Slice.

vRAN Assurance Enhancements

  • Extend the ability to visualize CellSite connectivity from vDU to CellSite in Topology Map or Topology Explorer.

  • Enhanced Root Cause and Impact Analysis to visualize the impact of failures on vRAN Infrastructure such as DU, CU, CaaS, VIM, and Physical to CellSites.

Unified Data Collection Framework

  • Programmable Data Collector SDK enables the creation of new custom collectors for 3rd party data sources.

  • Provides capabilities to collect and export infrastructure data such as metrics, topology, and events for monitoring and assurance use cases.

  • Onboard new 3rd party data sources based on various protocols such as REST, KAFKA, SNMP, and so on.

  • A simple, easy-to-use user interface enables users to integrate new data using KAFKA and enrich data streams.

  • Mapper, Mediation, or Enrichment framework to onboard data sources to a common model for Assurance, Fault, and Performance Management.

  • Supports cloud native K8s platform for distributed data collection.

User Interface Enhancements

  • Ability to customize and share Notification Log, Topology Explorer, and Map Explorer views with multiple users for multi-tenant use cases.

  • Map Explorer enhancements now include geographical and traffic maps for L2/L3 infrastructure maps.

  • Browse Details View is enhanced with Topology Map information.

  • The Metric Catalog feature now supports CRUD operations. Users can now add customized metrics and events from VMware Telco Cloud Service Assurance.

VMware Smart Assurance Interoperability

  • Support for VMware Telco Cloud Service Assurance interoperability with:

    • Domain Managers (IP, MPLS, ESM, NPM) and Service Assurance Domain Managers (supporting hierarchical and aggregated SAM).

    • Management and Reporting (MnR) for dashboarding and reporting.

    • Network Configuration Manager (NCM).

  • Network Configuration Manager interoperability with the currently deployed NCM offers the following reports:

    • Device Compliance Reports

    • Configuration Change Reports

    • Device Summary Reports

    Note:

    VMware Telco Cloud Service Assurance is interoperable with 10.1.5.x, 10.1.7.x, and 10.1.9.x versions of the Domain Managers, NCM 10.1.11.0 and 10.1.8.0, and MnR 7.4.1.1 and 7.3.05.

  • Upgrade and migrate support for VMware Telco Cloud Service Assurance Domain Managers.

    Note:
    • Upgrade support: Domain Managers version 11.0.1 to 11.1.0 upgrade is supported.

    • Migration support: Domain Managers version 10.1.5.x, 10.1.7.x, 10.1.9.x, and 11.0.1 to 11.1.0 migration is supported.

NCM Reporting Support

  • Network Configuration Manager Reporting offers detailed reports for:

    • Device Compliance Reports

    • Configuration Change Reports

    • Device Summary Reports

Platform Modernization

  • VMware Telco Cloud Service Assurance can now be deployed on AWS using Elastic Kubernetes Service (EKS) cluster in addition to Microsoft Azure Kubernetes Service (AKS) and VMware Tanzu Kubernetes Grid (TKG) cluster.

  • Supports monitoring of new footprint sizes such as 75K, 100K, 125K, 175K, and 200K.

  • Supports Backup and Restore of VMware Telco Cloud Service Assurance with VMware vSAN and across clusters and footprints.

  • Revised easy-to-use sizing calculator to identify exact footprint recommendations based on various factors such as the number of devices, events, metrics, and retention interval.

Additional Features

  • Kafka Collector is enhanced to support M&R format through Kafka Mapper.

  • Remediation Rules and Actions are now available in the Administration UI. The UI is integrated with the events to provide all the known root cause and anomaly events. You can pick any of the events and create a rule entry to remediate that event.

  • TMF 642 compliant APIs for Alarm/Notification Management.

Fixed Issues

  • Get user federation authentication management APIs have a permission issue in TKG and Azure deployments.

  • The API to get the Kubernetes cluster status does not work in Azure deployments.

  • Occasionally, ElasticSearch data service pods crash in a longevity setup.

  • After an incremental scale from one footprint to another, the VMware Telco Cloud Service Assurance user interface shows the older base footprint instead of the upgraded destination footprint (for example, Footprint: 50k instead of 100k).

Known Issues

  • Kibana-init reconcile fails after upgrade from VMware Telco Cloud Service Assurance 2.0.1 to 2.1.0.

    There is no functional impact.

  • In a Longevity environment, ElasticSearch, Nginx, and Prometheus pods are down in the VMware Telco Cloud Service Assurance 2.1.0 setup.

    After some days, a few services go down although the TKG cluster shows as healthy.

    To restore the pods, follow this procedure:

    1. Get the list of pvcs by executing:

      kubectl get pvc | grep elasticsearch

    2. Once you determine which pvc needs to be removed, first delete the finalizers by editing the pvc with the following command:

      kubectl edit pvc data-elasticsearch-data-1Y

    3. In edit mode, the page looks like the following. Remove the finalizers lines (the finalizers: key and its kubernetes.io/pvc-protection entry) and save:

      # Please edit the object below. Lines beginning with a '#' will be ignored,
      # and an empty file will abort the edit. If an error occurs while saving this file will be
      # reopened with the relevant failures.
      #
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        annotations:
          kapp.k14s.io/delete-strategy: orphan
          pv.kubernetes.io/bind-completed: "yes"
          pv.kubernetes.io/bound-by-controller: "yes"
          volume.beta.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
          volume.kubernetes.io/storage-provisioner: csi.vsphere.vmware.com
        creationTimestamp: "2022-09-13T19:27:52Z"
        finalizers:
        - kubernetes.io/pvc-protection
        labels:
          adminoperator.tcx.product/delete-strategy: "true"
          es-operator-dataset: elasticsearch-data
      

    After that, the deletion of the pvc is successful and the creation of a new pvc allows the pod to be restored.
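    As a non-interactive alternative to editing the pvc by hand, the finalizers can be cleared with kubectl patch. The following is a minimal sketch, not a procedure shipped with the product: the function name clear_pvc_finalizers and the KUBECTL override variable are illustrative assumptions, and the pvc name must be the one identified in step 1.

```shell
# Illustrative helper (an assumption, not part of the product): clears the
# finalizers on one pvc so that a pending delete can complete.
# KUBECTL can be overridden, for example to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

clear_pvc_finalizers() {
  pvc_name="$1"
  if [ -z "$pvc_name" ]; then
    echo "usage: clear_pvc_finalizers <pvc-name>" >&2
    return 1
  fi
  # A JSON merge patch that sets finalizers to null removes the field,
  # which has the same effect as deleting the lines in kubectl edit.
  "$KUBECTL" patch pvc "$pvc_name" --type=merge -p '{"metadata":{"finalizers":null}}'
}

# Example: clear_pvc_finalizers <pvc-name>
```

    Removing the pvc-protection finalizer bypasses the check that the pvc is no longer in use, so apply it only to the pvc you have already decided to remove.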

  • After upgrade and migration, the ESM Server hangs.

    The ESM Server hangs the first time it is started after an upgrade or migration from earlier versions. The server does not respond to any dmctl commands and the SAM Console freezes. You are unable to navigate in the SAM Console.

    You can stop the ESM Server using dmquit, or kill the ESM Server, after the ESM Server is started for the first time post upgrade or migration.

    Once the server is completely stopped, restart the ESM Server. After the server is restarted, the hang is no longer observed, and the dmctl commands start responding.

  • Instance count mismatch observed.

    An instance count mismatch is observed between SAM and VMware Telco Cloud Service Assurance.

    Execute the following script from the ArangoDB coordinator pod:

    kubectl get pods | grep arangodb-crdn
    kubectl exec -it <arangodb-crdn-pod> -- bash
    cd /opt/vmware/vsa/infra/install/Scripts
    ./viewUpdate.sh

  • ElasticSearch Grafana dashboard cluster status shows YELLOW with the number 23 in the upgraded setup.

    The expression for yellow uses the wrong offset: it must be +2, not +22. The expression in the upgraded setup is:

    "expr": "elasticsearch_cluster_health_status{job=\"$job\",instance=\"$instance\",cluster=\"$cluster\",color=\"red\"}==1 or (elasticsearch_cluster_health_status{job=\"$job\",instance=\"$instance\",cluster=\"$cluster\",color=\"green\"}==1)+4 or (elasticsearch_cluster_health_status{job=\"$job\",instance=~\"$instance\",cluster=\"$cluster\",color=\"yellow\"}==1)+22",
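    To see where the number 23 comes from: each branch of the expression above matches a health series whose value is 1 and then adds an offset, so yellow evaluates to 1 + 22 = 23 instead of 1 + 2 = 3. The sketch below only reproduces that arithmetic in shell; the function name panel_value is illustrative and not part of the dashboard.

```shell
# Illustrative arithmetic only: mirrors the offsets used in the panel expression.
# Each health series equals 1 for the active color; the expression adds an offset.
panel_value() {
  color="$1"
  yellow_offset="$2"   # 22 in the upgraded setup, 2 after the fix
  case "$color" in
    red)    echo $((1)) ;;
    green)  echo $((1 + 4)) ;;
    yellow) echo $((1 + yellow_offset)) ;;
    *)      echo "unknown color: $color" >&2; return 1 ;;
  esac
}

panel_value yellow 22   # prints 23, the value shown on the dashboard
panel_value yellow 2    # prints 3, the value after the fix
```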
  • Prometheus pod is down in Longevity environment.

    The VMware Telco Cloud Service Assurance interface cannot be used in a Longevity environment because the Prometheus pod is down.

    The issue is caused by storage: the volume is mounted read-only. The workaround is to delete the affected pods and let them be recreated.

    To delete the pods, use the following command:

    kubectl delete pod <POD_NAME>

  • The SDK REST Custom collector pod is spun up and stays in a still state when the REST simulator is down.

    The SDK REST Custom collector is up and running after the simulator is up.

    However, the REST pods that were spun up and are in an error state remain in the still state. These pods do not consume any memory or CPU resources, and there is no functional impact.

  • Uninstall does not remove the default-tcops-scheduledbackup pod.

    This pod is generated as part of a Kubernetes job resource, which is in the completed state. This completed pod does not have any impact on the cluster.

    root [ ~/tcx-deployer/scripts/deployment ]# kubectl get pods 
    NAME                                           READY   STATUS      RESTARTS   AGE 
    default-tcops-scheduledbackup-27832260-wffvd   0/1     Completed   0          9h 
    root [ ~/tcx-deployer/scripts/deployment ]# kubectl get jobs 
    NAME                                     COMPLETIONS   DURATION   AGE 
    default-tcops-scheduledbackup-27832260   1/1           2m1s       9h
    

    To delete the job, run the following command:

    # kubectl delete job default-tcops-scheduledbackup-<suffix>

    Where <suffix> is the random number generated by the job.
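    When several completed backup jobs have accumulated, the job names can be selected by prefix instead of typing each suffix. The following is a minimal sketch under that assumption; the function name list_backup_jobs is illustrative, and the prefix is taken from the job name shown in the output above.

```shell
# Illustrative filter: reads `kubectl get jobs -o name` output on stdin and
# prints only the scheduled-backup jobs (one job.batch/<name> per line).
list_backup_jobs() {
  grep '^job\.batch/default-tcops-scheduledbackup-' || true
}

# Usage against a live cluster (deletes every matching job; -r is a GNU xargs
# option that skips the delete when nothing matches):
#   kubectl get jobs -o name | list_backup_jobs | xargs -r kubectl delete
```

    Review the output of list_backup_jobs before piping it to the delete command.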

  • Topology Maps or Topology Explorer does not have any ingested Topology data after upgrade from VMware Telco Cloud Service Assurance 2.0.1 to 2.1.0.

    All the pods are up and running, but none of the devices appear in the Topology Map or Topology Explorer window. SAM Servers are shown as running in the SAM Integration Page.

    Notification and Metrics data are displayed in the respective user interface.

    Execute the script from the ArangoDB coordinator pod:

    kubectl exec -it <arangodb-crdn-pod> -- bash
    cd /opt/vmware/vsa/infra/install/Scripts
    ./viewUpdate.sh

    This runs the viewUpdate.js script. After execution, the ingested Topology data is visible in Topology Maps or Topology Explorer.
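    The manual steps (find the coordinator pod, then run viewUpdate.sh inside it) can be combined into one non-interactive call. The following is a minimal sketch, assuming the coordinator pod name contains arangodb-crdn; the function name first_crdn_pod is illustrative.

```shell
# Illustrative filter: reads `kubectl get pods` output on stdin and prints the
# name of the first pod whose name contains "arangodb-crdn".
first_crdn_pod() {
  awk '$1 ~ /arangodb-crdn/ { print $1; exit }'
}

# Usage against a live cluster:
#   pod="$(kubectl get pods | first_crdn_pod)"
#   kubectl exec "$pod" -- bash -c 'cd /opt/vmware/vsa/infra/install/Scripts && ./viewUpdate.sh'
```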

  • VMware Telco Cloud Service Assurance job Instantiation and Terminate status is not shown correctly in the VMware Telco Cloud Automation user interface.

    The VMware Telco Cloud Service Assurance job Instantiation and Terminate status is shown as success, even though the VMware Telco Cloud Service Assurance deployment is in progress.

    To check the VMware Telco Cloud Service Assurance job Instantiation and Terminate status, use the following kubectl command:

    root [ ~/tcx-deployer/scripts ]# kubectl get tcxproduct

  • An error message appears during migration of IP, SAM, and ESM domain managers.

    The following error message appears during migration of IP, SAM, and ESM, and .conflict files are created for sm_merge and version.pm:

    ----
    Merge Process aborted
    /opt/InCharge11/IP/smarts/local/bin/system/sm_merge: line 1: $'\177ELF\002\001\001': command not found
    /opt/InCharge11/IP/smarts/local/bin/system/sm_merge: line 2: $'w\267P\343\316\301\024W\026': command not found
    ------

    There is no functional impact, and the errors can be ignored. Delete the sm_merge.conflict and version.pm.conflict files before starting the server.

  • Incremental scale fails when VMware Telco Cloud Service Assurance scale is triggered without the Node or VM scale up.

    After deployment, if the incremental scale is triggered without scaling up the VM or Node, the incremental scale fails with the error: Insufficient CPU capacity.

    If you then increase the VM or Node capacity as per the footprint and re-trigger the incremental scale, the incremental scale fails again even though sufficient resource capacity is provided.

    As a workaround, ensure that the Node or VM scale is done as per the destination footprint, and then:

    1. Run the following command:

       kubectl delete validatingwebhookconfiguration admin-operator-webhook

    2. Re-trigger the incremental scale operation.

    The incremental scale then passes, and all the apps are scaled up as per the specified destination footprint.

  • The log_level messages are displaying 'unknown' in Service logs (Kibana logs).

    When a user navigates to Administration > Service Logs and clicks the application service logs, the log level filter displays 'unknown' for the log_level messages of any service, for example Apiservice, elasticsearch, and so on.

  • Unable to delete the cloned console of default Summary View.

    Note: You are able to perform all required operations using Edit option.

  • VMware Telco Cloud Service Assurance currently does not support connections to SAM server when Broker is configured in secure mode.

    Currently there is no workaround. The Broker must be configured in non-authenticated mode.

    Note: EDAA related operations including the Acknowledge, Ownership, Server Tools, Browse Details > Containment and Browse Details > Domain Manager are not supported when Broker is configured in secure mode.

  • When the number of hops of connectivity is increased, you may experience performance issues in the topology maps.

    There might be performance issues in the rendering of Redundancy Group and SDN connectivity map types in the Map Explorer view. This issue is observed on deployments with a complex topology where the topology maps may stop working when the number of hops of connectivity is increased.

  • Broker failover is not supported in VMware Telco Cloud Service Assurance.

    Primary Broker fails in the Domain manager failover environment.

    Currently, a Broker (multi-broker) failover requires manual intervention: you need to log in to VMware Telco Cloud Service Assurance and change the Broker IP address to point to the new Broker IP.

    Procedure:

    1. Go to https://<IP address of the Control Plane Node>.

    2. Navigate to Administration > Configuration > Smarts Integration.

    3. Delete the existing Smarts Integration Details.

    4. Re-add the Smarts Integration Details by pointing them to the secondary Broker.

  • Weekly indexes are not displayed while creating custom reports; only daily and hourly indexes are shown as part of reports.

    Procedure for workaround:

    1. Select Configurations > Data Sources from the left side menu bar.

    2. Click Add Data Source.

    3. Select Elasticsearch.

    4. Enter a relevant name based on the metric type for which the weekly index needs to be created (for example: Week-Network-Interface) and the Elasticsearch HTTP URL as http://elasticsearch:9200. For reference, see any other VMware Telco Cloud Service Assurance data source.

    5. Enter the Index Name based on the metric type for which the weekly index needs to be created ([vsametrics-week-networkinterface-]YYYY.MM) and select Pattern "Monthly".

    6. Enter the Time Field Name timestamp and Version 7+.

    7. Keep the rest of the fields at their default values.

    8. Click Save & Test.

  • Notification count mismatch between SAM and the VMware Telco Cloud Service Assurance UI because notifications with the Owner field set to SYSTEM are not filtered. By default, no filters are set in VMware Telco Cloud Service Assurance.

    Manually apply a filter that keeps only notifications whose Owner field does not contain SYSTEM in the VMware Telco Cloud Service Assurance Notification Console window by following these steps:

    1. Go to Default Notification Console.

    2. Click Customize View.

    3. Go to Filters and provide a Filter Set Name, for example Filterout SYSTEM Notifications.

    4. In the Filter section, add an attribute with the following condition:

      Property = Owner

      Expression = regex

      Value = ~(SYSTEM+)

    5. Click Update.

    Verify that the Default Notification Console has only those notifications whose owner is not set to SYSTEM. The default notification count must match between SAM and the VMware Telco Cloud Service Assurance UI.

  • The Containment, Browse Detail, and Notification Acknowledge/Unacknowledge operations do not work when the primary Tomcat server fails in an HA environment.

    In a Failover deployment, when the primary Tomcat fails, the UI operations including the Notification Acknowledgement, Containment, Browse Detail, and Domain Managers fail.

    When the primary Tomcat instance fails in a failover environment, you can manually point VMware Telco Cloud Service Assurance to a secondary Tomcat instance.

    Procedure:

    1. Go to https://<IP address of the Control Plane Node>.

    2. Navigate to Administration > Configuration > Smarts Integration.

    3. Delete the existing Smarts Integration Details.

    4. Re-add the Smarts Integration Details by editing the EDAA URL and pointing it to the secondary Tomcat Instance.

  • The SAM server is getting listed in the Domain Manager section instead of Presentation SAM section.

    During Smarts integration and configuration, INCHARGE SA (SAM server) is listed in the Domain Manager section. This problem occurs only when the SAM server is started in Non-EDAA Mode.

    To get listed under Presentation SAM section, start the SAM server in EDAA Mode.

  • While starting the server, the Map error warning message appears for INCHARGE-SA and INCHARGE-OI in the respective logs.

    There is no functional impact.

  • You must discover ESX Servers to get the Virtual Machine Down event. Currently, the Virtual Machine Down event is not generated if the corresponding ESX Servers are not discovered in the IP Server. So, it is recommended to discover Virtual Machines to get proper root cause events.

  • On a RHEL 7.8 machine, when SAM services are started, brcontrol shows an IPv6 entry for servers, which impacts communication between servers.

    On RHEL 7.8, if you start any domain manager as a service, the domain gets registered to a broker using both the v4 and v6 IP address spaces. As a result, the domain manager v6 entry goes to the DEAD state in the brcontrol output, and communication between the servers sometimes fails.

    Note: The issue is also detected on some machines with RHEL 7.2 and 7.6.

    To prevent a domain from running in v6 mode, allow only v4 by setting the following flag in the runcmd_env.sh file:

    SM_IP_VERSIONS=v4

    Restart the domain manager, after updating runcmd_env.sh file.
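    The edit can be scripted so that repeated runs do not add duplicate lines. The following is a minimal sketch, assuming runcmd_env.sh holds plain NAME=value lines; the function name set_ipv4_only is illustrative, and the path to runcmd_env.sh is passed in by the caller because it depends on the installation.

```shell
# Illustrative helper: makes sure the given env file contains exactly one
# SM_IP_VERSIONS line, set to v4. Assumes plain NAME=value lines in the file.
set_ipv4_only() {
  env_file="$1"
  [ -f "$env_file" ] || { echo "no such file: $env_file" >&2; return 1; }
  tmp_file="$(mktemp)" || return 1
  # Drop any existing SM_IP_VERSIONS assignment, then append the v4 setting.
  grep -v '^SM_IP_VERSIONS=' "$env_file" > "$tmp_file" || true
  echo 'SM_IP_VERSIONS=v4' >> "$tmp_file"
  # Write back through the original file so its ownership and mode are kept.
  cat "$tmp_file" > "$env_file"
  rm -f "$tmp_file"
}

# Example: set_ipv4_only <path-to>/runcmd_env.sh
```

    Back up runcmd_env.sh before changing it, and restart the domain manager afterwards as described above.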

  • Topology synchronization is taking more than 10 minutes for 25k devices, when latency between SAM and VMware Telco Cloud Service Assurance is more than 5 milliseconds.

    When the latency increases, the topology synchronization time increases.

    Ensure that the latency between VMware Telco Cloud Service Assurance (Topology Collector) and SAM Presentation server is less than 5 milliseconds.

  • Notification processing rate is slower, when the latency between SAM and VMware Telco Cloud Service Assurance is greater than 5 milliseconds.

    When the latency between VMware Telco Cloud Service Assurance and SAM Presentation server increases, notification processing rate goes down.

    Ensure that the latency between VMware Telco Cloud Service Assurance (Notification Collector) and SAM Presentation server is less than 5 milliseconds.
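    A quick way to get a rough reading against the 5 millisecond guideline is to look at the average round-trip time reported by ping from the collector side. The following is a minimal sketch: ICMP round-trip time is only an approximation of the latency the collectors experience, and the function names avg_rtt_ms and below_5ms are illustrative.

```shell
# Illustrative helpers: ICMP round-trip time is used here only as a rough
# proxy for the latency between the collector and the SAM Presentation server.

# Reads `ping` output on stdin and prints the average RTT in ms, taken from
# the summary line "rtt min/avg/max/mdev = a/b/c/d ms".
avg_rtt_ms() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Succeeds only when the given RTT is a value below the 5 ms guideline.
below_5ms() {
  awk -v rtt="$1" 'BEGIN { exit !(rtt != "" && rtt + 0 < 5) }'
}

# Usage:
#   rtt="$(ping -c 5 <sam-presentation-host> | avg_rtt_ms)"
#   if below_5ms "$rtt"; then echo "latency OK: ${rtt} ms"; else echo "latency too high: ${rtt} ms"; fi
```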

  • The KPI feature will be supported in an upcoming release of VMware Telco Cloud Service Assurance.
