The release notes cover the following topics:
About VMware Telco Cloud Operations
VMware Telco Cloud Operations is a real-time automated service assurance solution designed to bridge the gap between the virtual and physical worlds. It provides holistic monitoring and network management across all layers for rapid insights, lower costs, and an improved customer experience. Powered by machine learning (ML) capabilities, VMware Telco Cloud Operations automatically establishes dynamic performance baselines, identifies anomalies, and alerts operators when abnormal behavior is detected.
VMware Telco Cloud Operations simplifies the extraction, enrichment, and analysis of network data across multi-vendor environments, turning it into actionable notifications and alerts to manage the growing business needs of Telcos in an SDN environment.
For information about setting up and using VMware Telco Cloud Operations, see the VMware Telco Cloud Operations Documentation.
What's New in this Release
VMware Telco Cloud Operations v1.3 introduces the following enhancements:
- Refined Enrichment User Experience:
  - Refined wizard-style enrichment user interface to tag metrics, events, and topology data in VMware Telco Cloud Operations based on external data.
  - Simplified external data upload, allowing the user to upload a CSV file containing external data through the enrichment user interface.
- Kafka Collector and Mapper:
  - Provides the ability to consume infrastructure metric data into VMware Telco Cloud Operations through the Kafka open messaging interface.
  - Provides the ability to map consumed metric data into the VMware Telco Cloud Operations metric format for KPI computation, anomaly detection, reporting, and dashboarding.
- VMware Telco Cloud Operations Services High Availability Support:
  - HA support for the Event, Catalog, DM adapter, Esdb-proxy, and Persistence services has been introduced.
- SDWAN - VeloCloud support:
  - VeloCloud versions up to 4.2.0 are supported in this release.
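The Kafka collector consumes metric payloads from a configured topic. As an illustrative sketch only (the topic name, field names, and payload layout below are assumptions for illustration, not the documented VMware Telco Cloud Operations metric schema), a producer might publish JSON records such as:

```shell
# Hypothetical sample payload; the real field names are defined by the
# VMware Telco Cloud Operations Kafka mapper configuration.
TIMESTAMP=$(date -u +%s)
PAYLOAD=$(printf '{"deviceName":"router-01","metricType":"InterfaceUtilization","value":42.5,"timestamp":%s}' "$TIMESTAMP")
echo "$PAYLOAD"

# A producer would then publish it, for example with the standard Kafka CLI:
#   echo "$PAYLOAD" | kafka-console-producer.sh \
#       --bootstrap-server <kafka-host>:9092 --topic <metrics-topic>
```

The mapper then translates such records into the internal metric format used for KPI computation and dashboarding.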
For information about system requirements, hardware requirements, patch installation, and sizing guidelines, see the VMware Telco Cloud Operations Deployment Guide.
Resolved Issues
- Enrichment stream name field is not editable.
If a user wants to edit the stream name after creating an enrichment stream, the option to edit the name is not available.
Known Issues
- The deployment may fail when you use the automated deployment tool.
When you deploy VMware Telco Cloud Operations using the automated deployment tool, the deployment of the worker node may fail with the error: Failed to send data.
Workaround: Modify the VCENTER_IP configuration parameter in the deploy.settings file to use the fully qualified domain name (FQDN). For more information about modifying the deploy.settings file, see the VMware Telco Cloud Operations Deployment Guide.
- When the number of hops of connectivity is increased, you may experience performance issues in the topology maps.
There might be performance issues in the rendering of Redundancy Group, MPLS, Metro-E, and SDN connectivity map types in the Map Explorer view. This issue is observed on deployments with a complex topology, where the topology maps may stop working when the number of hops of connectivity is increased.
- VMware Telco Cloud Operations currently does not support connections to the SAM server with broker authentication, EDAA authentication, or Edge Kafka authentication.
For a workaround, see the Security Recommendation section in the VMware Telco Cloud Operations Deployment Guide.
Note: EDAA-related operations, including Acknowledge, Ownership, Server Tools, Browse Details > Containment, and Browse Details > Domain Manager, are not supported when the Smarts Broker is configured in secure mode.
- Broker failover is not supported in VMware Telco Cloud Operations.
The primary Broker fails in the Smart Assurance failover environment.
Workaround: Currently, when a Broker (multi-broker) failover happens in Smart Assurance, manual intervention is required: log in to VMware Telco Cloud Operations and change the Broker IP address to point to the new Broker.
Procedure:
- Go to https://<IP address of the Control Plane Node>.
- Navigate to Administration > Configuration > Smarts Integration.
- Delete the existing Smarts Integration details.
- Re-add the Smarts Integration details, pointing them to the secondary Broker.
- Statistics - Tunnel reports for SDWAN display an unknown elastic error if a specific device is not selected in the Edge filter.
Workaround: To avoid the error, remove the ALL option for Edge.
Procedure to disable the ALL option: Statistics Tunnel > Dashboard Settings > Variables > Edge > Disable the Include All option.
- When Smarts is restarted without repos multiple times, the Viptela ControlNode controller status goes to the OTHER/UNKNOWN state.
Workaround: Use the below command on the control plane node to delete the respective stale Viptela collectors:
kubectl delete deployments.apps <viptela deployment app instance>
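To find the stale collector name before deleting it, you can list deployments and filter for Viptela. A minimal sketch, assuming the stale collector's deployment name contains the string `viptela` (the exact name, `viptela-collector-1` below, is a placeholder); `KUBECTL` is set to `echo kubectl` here as a dry run so the commands are only printed:

```shell
# Dry run: change to KUBECTL="kubectl" to execute against the real cluster.
KUBECTL="echo kubectl"

# Simulated 'kubectl get deployments.apps' output for illustration;
# on a live control plane node you would pipe the real command instead.
DEPLOYMENTS="events
topology
viptela-collector-1"

# Select the Viptela deployments and issue a delete for each one.
for d in $(printf '%s\n' "$DEPLOYMENTS" | grep viptela); do
  $KUBECTL delete deployments.apps "$d"
done
```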
- The VMware Telco Cloud Operations Health Status Pod report displays an empty value for some pods. This indicates that those pods ran for some time, consumed some CPU and memory resources, but no longer exist.
Workaround: To select a smaller range, go to the Gear icon at the top right of the report, uncheck the Hide time picker option, and go back to the report.
- Weekly indexes are not displayed while creating custom reports; only daily and hourly indexes are shown as part of reports.
Workaround:
- Select Configurations > Data Sources from the left side menu bar.
- Click Add Data Source.
- Select Elasticsearch.
- Enter a relevant name based on the metric type for which the weekly index needs to be created (for example: Week-Network-Interface) and the Elastic HTTP URL as http://elasticsearch:9200. Refer to any other VMware Telco Cloud Operations data source for reference.
- Enter the Index Name based on the metric type for which the weekly index needs to be created, for example [vsametrics-week-networkinterface-]YYYY.MM, and select the Pattern "Monthly".
- Enter the Time Field Name timestamp and Version 7+.
- Keep the rest of the fields at their default values.
- Click Save & Test.
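As a quick sanity check that weekly indexes actually exist in Elasticsearch before adding the data source, you can derive the current index name from the pattern above and query the index list (a sketch; the host and port follow the values above, and `curl` access to the cluster is assumed):

```shell
# The index name is built as [vsametrics-week-networkinterface-]YYYY.MM
# with a Monthly pattern, so the current month's index is:
PREFIX="vsametrics-week-networkinterface-"
CURRENT_INDEX="${PREFIX}$(date -u +%Y.%m)"
echo "Current weekly index: $CURRENT_INDEX"

# On a host with access to the cluster you could then list matching indexes:
#   curl -s "http://elasticsearch:9200/_cat/indices/${PREFIX}*?v"
```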
- Notification count mismatch between SAM and the VMware Telco Cloud Operations UI due to non-filtering of notifications with the Owner field set to SYSTEM. By default, no filters are set in VMware Telco Cloud Operations.
Workaround: Manually apply a filter to remove notifications whose Owner field is set to SYSTEM in the VMware Telco Cloud Operations Notification Console window by following the below steps:
- Go to the Default Notification Console.
- Click Customize View.
- Go to Filters and provide a Filter Set Name, for example Filterout SYSTEM Notifications.
- In the Filter section, Add Attribute with the below condition:
Property = Owner
Expression = regex
Value = ~(SYSTEM+)
- Click Update.
Verify that the Default Notification Console has only those notifications whose Owner is not set to SYSTEM. The default notification count must match between SAM and the VMware Telco Cloud Operations UI.
- Netflow-9 Statistics, Netflow-9 Trends, Netflow-5 Statistics, and Netflow-5 Trends reports display the error message "Failed to parse query" with the default time interval of 3 hours.
Workaround: Select smaller time intervals, for example: 15 minutes, 30 minutes, or 1 hour.
- When the Kafka server is configured with a wrong IP, or the Kafka node goes down during discovery, the VeloCloud discovery hangs for 20 minutes before exiting. This is the case even when the messagePollTimeout of the VCO Access setting is set to a lower value.
Workaround: In the esm-param.conf file, add the below line, replacing <kafka ip address> and <time in seconds>, and restart the server:
MessagePollTimeoutPeriodInSeconds-<kafka ip address> <time in seconds>
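Appending the parameter can be scripted as in the sketch below. The Kafka address 192.0.2.10, the 30-second timeout, and the local file path are placeholder assumptions; substitute your own values and the actual location of esm-param.conf on the server:

```shell
# Placeholder values; substitute your Kafka broker IP and desired timeout.
KAFKA_IP="192.0.2.10"
TIMEOUT_SECONDS=30
CONF_FILE="./esm-param.conf"   # use the real path on your server

# Append the poll-timeout parameter in the documented format.
echo "MessagePollTimeoutPeriodInSeconds-${KAFKA_IP} ${TIMEOUT_SECONDS}" >> "$CONF_FILE"

tail -n 1 "$CONF_FILE"
# Remember to restart the server afterwards for the change to take effect.
```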
- Containment, Browse Detail, and Notification Acknowledge/Unacknowledge do not work when the primary Tomcat server fails in a Smart Assurance HA environment.
In a Smart Assurance failover deployment, when the primary Tomcat fails, UI operations including Notification Acknowledgement, Containment, Browse Detail, and Domain Managers fail.
Workaround: When the primary Tomcat instance fails in a Smart Assurance failover environment, you can manually point VMware Telco Cloud Operations to the secondary Tomcat instance.
Procedure:
- Go to https://<IP address of the Control Plane Node>.
- Navigate to Administration > Configuration > Smarts Integration.
- Delete the existing Smarts Integration details.
- Re-add the Smarts Integration details, editing the EDAA URL to point to the secondary Tomcat instance.
- An error message appears in the Grafana report.
When a user logs out from the Operational UI and tries to launch a report from the Grafana user interface, an error message appears.
Workaround: Refresh or relaunch the Grafana UI to log out.
- The SDWAN Flow Top N Summary report displays an error message.
In the SDWAN Flow Top N Summary report, the Grafana Bar Gauge widget does not support substantial time intervals.
Workaround: Set a smaller time interval (for example, 24 hours) for the flow reports. If you need a substantial time interval, follow this procedure:
- Click Edit from the report.
- Expand the Interval in the last row (Date Histogram) of the query, and set it to a higher interval (for example, 7d).
- Save the report.
- The SAM server is listed in the Domain Manager section instead of the Presentation SAM section.
During Smarts integration and configuration, INCHARGE SA (the SAM server) is listed in the Domain Manager section. This problem occurs only when the SAM server is started in non-EDAA mode.
Workaround: To get the server listed under the Presentation SAM section, start the SAM server in EDAA mode.
- Disk usage is not mentioned in the VMware Telco Cloud Operations Health Status Node report.
In the Health Status Node report, the disk usage does not specify which Kubernetes cluster node (Controlplane, Arango, ElasticSearch, Domain Manager, Kafka, and so on) the disk belongs to.
Workaround:
- Click Edit in the Disk usage panel.
- Click the Field tab.
- Click the Display Name and No value fields (no need to enter any value).
The node names appear.
- Click Save and Apply.
- Some of the DataCenter Summary reports take a long time to display.
On a 100k footprint with 10 million records sent per polling cycle, the DataCenter reports take longer than usual to display.
Workaround: Perform the following procedure on the report side:
- Reduce the default time interval from 24 hr to 12 hr or 6 hr.
- If the issue still persists, point the data source to the hourly index for the panel that is showing the error.
- The Topology pod is down due to a Redis service failure.
In one of the 100k deployments, the Topology pod went down due to a Redis service failure, and notification sync in VMware Telco Cloud Operations became very slow.
Workaround: Apply the following procedure to restart the Redis cluster and its dependent services. On the control plane node, perform the below steps:
- Scale down the events pods using the command: kubectl scale deployment <events_POD> --replicas=0
- Scale down the topology pods using the command: kubectl scale deployment <topology_POD> --replicas=0
- Delete the Redis deployment using the command: kubectl delete deployment redis
- cd to /home/clusteradmin/kubernetes and run: kubectl apply -f redis.yaml
- Once Redis comes up, scale the Topology and Events pods back up:
kubectl scale deployment <events_POD> --replicas=1
kubectl scale deployment <topology_POD> --replicas=1
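The restart sequence above can be collected into a single sketch. This is a dry run: `KUBECTL` is set to `echo kubectl` so the commands are printed rather than executed, and the deployment names `events` and `topology` are placeholders for the real names from your cluster:

```shell
# Dry run; change to KUBECTL="kubectl" to execute on the control plane node.
KUBECTL="echo kubectl"

# Placeholder deployment names; use the real ones from 'kubectl get deployments'.
EVENTS_POD="events"
TOPOLOGY_POD="topology"

$KUBECTL scale deployment "$EVENTS_POD" --replicas=0
$KUBECTL scale deployment "$TOPOLOGY_POD" --replicas=0
$KUBECTL delete deployment redis
# cd /home/clusteradmin/kubernetes   # on the real control plane node
$KUBECTL apply -f redis.yaml
# Wait for Redis to come up before scaling the dependent services back.
$KUBECTL scale deployment "$EVENTS_POD" --replicas=1
$KUBECTL scale deployment "$TOPOLOGY_POD" --replicas=1
```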
- Security Vulnerability
CVE-2021-3449 -- An OpenSSL TLS server may crash if sent a maliciously crafted renegotiation ClientHello message from a client. If a TLSv1.2 renegotiation ClientHello omits the signature_algorithms extension (where it was present in the initial ClientHello), but includes a signature_algorithms_cert extension then a NULL pointer dereference will result, leading to a crash.
- When the Arango worker node hosting the Flink services (Job Manager/Task Manager) goes down, ingestion of Topology, Metrics, and Events might not work correctly.
Flink services are not deployed in HA mode. If an Arango worker node goes down, the enrichment service might not be fully operational, which causes ingestion services to stop processing until the node is restored.
Workaround: Bring up the Arango worker node and restart the enrichment streams after the node is up. Refer to the VMware Telco Cloud Operations Troubleshooting Guide for more information.
- An authorization error message appears in HTML code when a user does not have Grafana edit permission.
When a Role is created for a user with only the "Dashboard & Reporting" view permission, and the user attempts to edit any Dashboard or Reporting settings in Grafana, the authorization error appears in HTML format.
- A workaround must be applied to a VMware Telco Cloud Operations 1.2 cluster before installing the VMware Telco Cloud Operations 1.3 update.
A defect in VMware Telco Cloud Operations 1.2 prevents installation of the VMware Telco Cloud Operations 1.3 patch unless the following workaround is applied.
Workaround: Extend the expiry of the patcher account on all nodes in the VMware Telco Cloud Operations cluster. Run the following script on the Control Plane Node:
#!/bin/sh
if [ "$SSH_PASSWORD" = "" ]; then
    echo "Please set the SSH_PASSWORD environment variable to the root password"
    exit 1
fi
for ip in $(kubectl get nodes -o wide | awk '{print $6}' | grep -v 'INTERNAL-IP'); do
    sshpass -p "$SSH_PASSWORD" ssh root@$ip -o StrictHostKeyChecking=no "chage -m 0 -M 99999 -I -1 -E -1 patcher"
done
The root password must be the same for all the nodes in this script, and it must be exported as SSH_PASSWORD in the environment before running the script.
For example, if the script was created in /tmp as file 'unexpire.sh', and the password was 'rootpassword', the script must be run as:
# export SSH_PASSWORD=rootpassword
# /tmp/unexpire.sh
- On 50k and 100k footprint deployments, disk space may be exhausted on the arangoworker nodes due to checkpoint files not being removed.
In some situations, checkpoint files used in the stream processing service are not removed when no longer required. This can eventually result in disk space exhaustion on some of the arangoworker nodes, which can lead to stream processing tasks, such as KPI computation and enrichment, failing.
Workaround: Unwanted checkpoint files must be removed.
If disk exhaustion occurs, execute the following script on all arangoworker nodes:
echo "Cleaning up old checkpoints..."
CHECKPOINT_PATH="/var/vmware/flink/checkpoints"
if [ -d $CHECKPOINT_PATH ]; then
    JOB_IDS=$(ls $CHECKPOINT_PATH)
    if [ ! -z "$JOB_IDS" ]; then
        for JOB_ID in $JOB_IDS; do
            echo "cleaning up job $JOB_ID"
            if [ "$(ls -A $CHECKPOINT_PATH/$JOB_ID)" ]; then
                find $CHECKPOINT_PATH/$JOB_ID/* -maxdepth 0 -mtime +1 -exec rm -rf {} \;
            fi
        done
    fi
fi
If pods are evicted from an arangoworker node due to disk pressure, the taskmanagers will not move back once the checkpoints are cleaned up. To force them to move back, delete the pods and Kubernetes will recreate them evenly across the arangoworker nodes.
kubectl delete pod -l run=taskmanager
To remove any extra evicted pods, run the following command:
kubectl get pods | grep Evicted | awk '{print $1}' | xargs kubectl delete pod
To avoid disk exhaustion occurring in the first place:
- SSH to the control plane node as the clusteradmin user.
- Change to the /home/clusteradmin/kubernetes directory
- Create a file called flink-cleanup.yaml with the following content:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: flink-cleanup
  namespace: vmware-smarts
spec:
  selector:
    matchLabels:
      run: flink-cleanup
  template:
    metadata:
      labels:
        run: flink-cleanup
    spec:
      containers:
      - name: flink-cleanup
        image: registry.cluster.omega.local:8443/omega/omega-patching-runner:1.3.0-9
        command:
        - sh
        - "-c"
        - |
          echo "Starting flink cleanup script."
          while true; do
            echo "Cleaning up old checkpoints..."
            CHECKPOINT_PATH="/var/vmware/flink/checkpoints"
            if [ -d $CHECKPOINT_PATH ]; then
              JOB_IDS=$(ls $CHECKPOINT_PATH)
              if [ ! -z "$JOB_IDS" ]; then
                for JOB_ID in $JOB_IDS; do
                  echo "cleaning up job $JOB_ID"
                  if [ "$(ls -A $CHECKPOINT_PATH/$JOB_ID)" ]; then
                    find $CHECKPOINT_PATH/$JOB_ID/* -maxdepth 0 -mtime +1 -exec rm -rf {} \;
                  fi
                done
              fi
            fi
            echo "going to sleep..."
            sleep 6h
          done
        volumeMounts:
        - name: flink-data
          mountPath: /var/vmware/flink
      nodeSelector:
        runin: arango
      volumes:
      - name: flink-data
        persistentVolumeClaim:
          claimName: flink-pvc
- Apply the configuration by running the command:
kubectl apply -f flink-cleanup.yaml
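The retention rule the DaemonSet applies (delete checkpoint entries older than one day via `find -mtime +1`) can be exercised locally with throwaway files. In this self-contained sketch, a temporary directory stands in for /var/vmware/flink/checkpoints:

```shell
# Build a throwaway checkpoint tree: one stale entry, one fresh entry.
CHECKPOINT_PATH=$(mktemp -d)/job-1234
mkdir -p "$CHECKPOINT_PATH"
touch "$CHECKPOINT_PATH/chk-old" "$CHECKPOINT_PATH/chk-new"
# Backdate chk-old by 3 days so the -mtime +1 test matches it.
touch -d "3 days ago" "$CHECKPOINT_PATH/chk-old"

# Same retention rule as the cleanup script: remove entries older than 1 day.
find "$CHECKPOINT_PATH"/* -maxdepth 0 -mtime +1 -exec rm -rf {} \;

# Only chk-new should remain.
ls "$CHECKPOINT_PATH"
```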
- The Enrichment Key content assist dropdown menu is displayed partially.
In the Enricher configuration UI, when a user enters a backslash ( \ ) in the Enrichment Key field, the content assist dropdown menu is displayed partially. Only the first few characters of each property name are displayed.
Workaround: The user can still click the partially displayed entry, or enter the first few characters in the Enrichment Key field after the backslash ( \ ), upon which the entry is displayed completely in the Enrichment Key field with the correct syntax. Refer to the following lists for the names and descriptions of all the entries in the dropdown to assist in making choices:
For data types VMware Telco Cloud Operations Metric and MnR metric, here are the property names in the same order as the dropdown list:
Data Source: An IP address or a name indicates the event data source
Device Name: Name of the device where metric is collected
Device Type: Type of the device where metric is collected
Entity Name: Name of the entity on a device
Entity Type: Type of the entity on a device
Instance: Event instance
Metric Type: Metric type under the event type
Tags: Event tags
Type: Type of the event; usually indicates the event type per vendor interface.
For data type VMware Telco Cloud Operations Event, here are the property names in the same order as the dropdown list:
Acknowledged: Indicates if this event has been acknowledged
Active: Indicates if this event is currently active
Category: Category of this event. The event category represents a broad categorization of the event, for example: availability vs. performance.
Certainty: The certainty of this event.
Class Display Name: Display name for the event class.
Class Name: Class name of the object where this event occurred. This attribute along with InstanceName and EventName uniquely identify this event.
Clear On Acknowledge: Indicates if this event should be cleared when it is acknowledged. Set this to TRUE only for events that do not expire nor have sources that generate a clear.
Closed At: Time at which the event was closed.
Element Class Name: The class name of the topology element associated with the event in the repository where this event resides. This may or may not have the same value as ClassName.
Element Name: The name of the topology element associated with the event in the repository where this event resides. This may or may not have the same value as InstanceName. The string is empty if there is no related element.
Event Display Name: Display name for the event Name.
Event Name: Name of the event. This attribute along with ClassName and InstanceName uniquely identify this event.
Event State: The current state of this event. ACTIVE: The event is currently active. WAS_ACTIVE: The event was active, but we lost contact with the event source. INACTIVE: The event is inactive. UNINITIALIZED: The event has not been notified yet; the object does not yet represent a notified event.
Event Text: The textual representation of the event.
Event Type: Indicates the nature of the event. A MOMENTARY event has no duration. An authentication failure is a good example. A DURABLE event has a period during which the event is active and after which the event is no longer active. An example of a durable event is a link failure.
First Notified At: First notification time
Impact: A quantification of the impact of this event on the infrastructure and/or business processes. There are no pre-defined semantics to the value of this attribute other than a larger numeric value indicates a larger impact.
In Maintenance: Indicate if this event occurs during maintenance.
Instance Display Name: Display name for the event instance.
Instance Name: Instance name of the object where this event occurred. This attribute along with ClassName and EventName uniquely identify this event.
Is Problem: A notification is a problem when all of the original event types are PROBLEM or UNKNOWN. There must be at least one PROBLEM, i.e. UNKNOWN by itself is not a problem.
Is Root: Is this a root notification?
Last Changed At: Time of last event change.
Name: Name of object.
Occurrence Count: The number of occurrences of this event starting from FirstNotifiedAt until LastNotifiedAt.
Opened At: Time at which the event was opened.
Owner: The name of the user that is responsible for handling this event.
Polling State: The polling state of the event source.
Severity: An enumerated value that describes the severity of the event from the notifier's point of view:
1 - Critical indicates action is needed NOW and the scope is broad, for example, an outage of a critical resource.
2 - Major indicates action is needed NOW.
3 - Minor indicates action is needed, but the situation is not serious at this time.
4 - Unknown indicates that the element is unreachable, disconnected, or in an otherwise unknown state.
5 - Normal is used when an event is purely informational.
Source: Source of this event.
Source Domain Name: The name(s) of the domain(s) or domainGroups that have originally diagnosed and notified - directly or indirectly - current occurrences of this event. If there is more than one original domain, the attribute lists each, separated by a comma. When the notification is cleared, the last clearing domain stays in the value.
Source Event Type: The type(s) of the events(s), i.e. 'PROBLEM', 'EVENT', 'AGGREGATE' in the source domains that have notified current occurrences of this event. If there is more than one domain the attribute lists each separated by a comma, in the same order as SourceDomainName.
Source Info: Additional information about the source of this event.
Source Specific: Source Specific.
Trouble Ticket ID: Trouble ticket ID
User Defined1: User defined field1.
User Defined10: User defined field10.
User Defined11: User defined field11.
User Defined12: User defined field12.
User Defined13: User defined field13.
User Defined14: User defined field14.
User Defined15: User defined field15.
User Defined16: User defined field16.
User Defined17: User defined field17.
User Defined18: User defined field18.
User Defined19: User defined field19.
User Defined2: User defined field2.
User Defined20: User defined field20.
User Defined3: User defined field3.
User Defined4: User defined field4.
User Defined5: User defined field5.
User Defined6: User defined field6.
User Defined7: User defined field7.
User Defined8: User defined field8.
User Defined9: User defined field9.
For data type VMware Telco Cloud Operations Topology, here are the property names in the same order as the dropdown list:
Action: Action of the topology record
Collector Name: Name of the collector that collects the topology record
Collector Type: Type of the Collector where the topology record is collected
Creation Class Name: The name of the most-derived class of this instance
Description: A textual description of the object
Discovery ID: Discovery ID of the topology record
Display Class Name: The string shown in the GUI when this object's class is displayed
Display Name: The string shown in the GUI when this object's name is displayed.
Force Refresh: Indicate whether to force refresh topology information
Group Name: Group name of the topology record
ID: ID of the topology record
Initialized: Indicate whether the topology information is initialized
Is Managed: The IsManaged attribute determines if an ICIM_ManagedSystemElement should be monitored by the management system. An unmanaged object will never have associated instrumentation. This attribute is readonly.
Job ID: Job ID of the topology record
Name: Name of the topology record
Network Number: The network number (computed from Address and Netmask)
Observer: Indicate whether this record is an observer
Opened At: Timestamp when the topology instance opened at
Service Name: Name of external server used for imported events and instrumented attributes
Source: Creation MIB source for this entity.
System Name: The name of the ICIM_System containing this element.
Type: Type of the topology record
Value: Value of the topology record