Selecting and configuring a monitoring system

You can select and configure a system to continuously monitor Operations Manager component performance and health.

Selecting a monitoring platform

VMware recommends using Healthwatch to monitor your deployment. Healthwatch is a service tile developed and supported by VMware and available on Broadcom Support.

Many third-party systems can also be used to monitor an Operations Manager deployment.

Monitoring platform types

Monitoring platforms support two types of monitoring:

- A dashboard for active monitoring when you are at a keyboard and screen
- Automated alerts for when your attention is elsewhere

Some monitoring solutions offer both in one package. Others require you to combine a separate dashboard tool and alerting tool.

Monitoring platforms

Many monitoring options are available, both open-source and commercial. Platforms commonly used by Operations Manager customers include:

Healthwatch by VMware
VMware Partner Services available on Broadcom Support:
Other Commercial Services
- VMware vRealize Operations (vROPS)
Open-Source Tooling
- Prometheus + Grafana
- OpenTSDB

VMware Cloud Ops tools

The VMware Cloud Ops Team manages two types of deployments for internal VMware use: open-source Cloud Foundry, and Operations Manager.

For Cloud Foundry, VMware Cloud Ops uses several monitoring tools. The Datadog Config repository provides an example of how the VMware Cloud Ops team uses a customized Datadog dashboard to monitor the health of its open-source Cloud Foundry deployments.

To monitor Operations Manager deployments, VMware Cloud Ops leverages a combination of Healthwatch and Google Stackdriver.

Key inputs for platform monitoring

BOSH VM and Operations Manager component health metrics

Most monitoring service tiles for Operations Manager come packaged with the Firehose nozzle necessary to extract the BOSH and Operations Manager metrics leveraged for platform monitoring. Nozzles are programs that consume data from the Loggregator Firehose. Nozzles can be configured to select, buffer, and transform data, and to forward it to other apps and services.

The nozzles gather the component logs and metrics streaming from the Loggregator Firehose endpoint. For more information about the Firehose, see Loggregator architecture.

Operations Manager component metrics originate from the Metron agents on their source components, then travel through Dopplers to the Traffic Controller.
The Traffic Controller aggregates both metrics and log messages system-wide from all Dopplers, and emits them from its Firehose endpoint.
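
The select, buffer, transform, and forward steps that a nozzle performs can be sketched in a few lines of Python. This is a minimal illustration only, not the Loggregator API: the `Envelope` class, the `Nozzle` class, the batch size, and the sample component and metric names are all hypothetical stand-ins, and a real nozzle would read from the Firehose endpoint rather than from an in-memory list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Envelope:
    """Hypothetical, simplified stand-in for one Firehose message."""
    origin: str   # emitting component (illustrative names below)
    kind: str     # "metric" or "log"
    name: str     # metric name (empty for logs)
    value: float  # metric value (0.0 for logs)


class Nozzle:
    """Selects metric envelopes, buffers them, and forwards transformed batches."""

    def __init__(self, forward: Callable[[List[dict]], None], batch_size: int = 2):
        self.forward = forward
        self.batch_size = batch_size
        self.buffer: List[dict] = []

    def consume(self, stream: Iterable[Envelope]) -> None:
        for env in stream:
            if env.kind != "metric":  # select: skip log messages
                continue
            # transform: flatten into a shape a downstream system might accept
            self.buffer.append({"metric": f"{env.origin}.{env.name}", "value": env.value})
            if len(self.buffer) >= self.batch_size:
                self.flush()
        self.flush()  # forward whatever is left when the stream ends

    def flush(self) -> None:
        if self.buffer:
            self.forward(self.buffer)
            self.buffer = []  # start a fresh list; the forwarded batch is untouched


# Usage: collect forwarded batches in a list instead of sending them anywhere.
sent: List[List[dict]] = []
nozzle = Nozzle(forward=sent.append, batch_size=2)
nozzle.consume([
    Envelope("gorouter", "metric", "latency", 12.5),
    Envelope("cc", "log", "", 0.0),
    Envelope("bbs", "metric", "ConvergenceLRPDuration", 3.0),
    Envelope("gorouter", "metric", "total_requests", 100.0),
])
```

Here the log envelope is dropped, and the three metric envelopes are forwarded as one batch of two followed by one batch of one. In practice the `forward` callable would post to the monitoring platform of your choice.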

For information about high-signal-value metrics and capacity scaling indicators in an Operations Manager deployment, see Key Performance Indicators and Key capacity scaling indicators.

Continuous functional smoke tests

Operations Manager includes smoke tests, which are functional unit and integration tests on all major system components. By default, whenever an operator upgrades to a new version of VMware Tanzu Application Service for VMs (TAS for VMs), these smoke tests run as a post-deploy errand.

VMware recommends adding higher-resolution monitoring by running continuous smoke tests, or Service Level Indicator tests, which measure user-defined features and check the results against expected levels.

Healthwatch automatically runs these tests for TAS for VMs Service Level Indicators.
The VMware Cloud Ops Cloud Foundry smoke tests repository offers additional testing examples.
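
As a rough sketch of what a Service Level Indicator test does, the Python below runs a probe, times it, and compares the result against an expected level. It is not how Healthwatch implements its tests. The function names, the result fields, and the five-second limit are assumptions made for illustration, and the probes are placeholders for a real user-facing action such as pushing a test app.

```python
import time
from typing import Callable, Dict


def run_sli_check(probe: Callable[[], None], max_seconds: float) -> Dict[str, object]:
    """Run one probe and report whether it succeeded within the expected time."""
    start = time.monotonic()
    try:
        probe()
        succeeded = True
    except Exception:
        succeeded = False  # any exception counts as a failed check
    elapsed = time.monotonic() - start
    return {
        "succeeded": succeeded,
        "elapsed_seconds": elapsed,
        "within_expected_level": succeeded and elapsed <= max_seconds,
    }


def fake_push_probe() -> None:
    """Hypothetical placeholder for a real user-facing action."""
    time.sleep(0.01)


def failing_probe() -> None:
    """Hypothetical probe that simulates an outage."""
    raise RuntimeError("simulated failure")


# Usage: a scheduler would call run_sli_check repeatedly and emit the results as metrics.
result = run_sli_check(fake_push_probe, max_seconds=5.0)
failed = run_sli_check(failing_probe, max_seconds=5.0)
```

Running a check like this on a short, fixed interval is what makes the smoke test continuous: each run produces a data point that a dashboard or alert can evaluate.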

For information about how to set up Concourse to generate custom component metrics, see Metrics in the Concourse documentation.

Warning and critical thresholds

To properly configure your monitoring dashboard and alerts, you must establish which thresholds drive alerting and the red/yellow/green dashboard behavior.

Some key metrics have relatively fixed thresholds, with similar threshold numbers recommended across different foundations and use cases. These metrics tend to revolve around the health and performance of key components that can impact the performance of the entire system.

Other metrics of operational value are more dynamic in nature. This means that you must establish a baseline and yellow/red thresholds suitable for your system and its use cases. You can establish initial baselines by watching values of key metrics over time and noting what seems to be a good starting threshold level that divides acceptable and unacceptable system performance and health.
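
One possible way to turn observed values into starting thresholds is sketched below in Python. The approach shown, placing the yellow and red thresholds a chosen number of standard deviations above the observed mean, is an assumption made for illustration and not a VMware recommendation. It also assumes a metric where higher values are worse. The function names, multipliers, and sample values are hypothetical.

```python
import statistics
from typing import List, Tuple


def derive_thresholds(samples: List[float], yellow_sigmas: float = 2.0,
                      red_sigmas: float = 3.0) -> Tuple[float, float]:
    """Return (yellow, red) thresholds placed above the observed baseline."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)  # population standard deviation of the history
    return mean + yellow_sigmas * stdev, mean + red_sigmas * stdev


def classify(value: float, yellow: float, red: float) -> str:
    """Map a metric value to a dashboard color, assuming higher is worse."""
    if value >= red:
        return "red"
    if value >= yellow:
        return "yellow"
    return "green"


# Usage: values of one metric observed over time (illustrative numbers only).
history = [40.0, 42.0, 38.0, 41.0, 39.0]
yellow, red = derive_thresholds(history)
status_normal = classify(41.0, yellow, red)
status_elevated = classify(43.0, yellow, red)
status_critical = classify(50.0, yellow, red)
```

Thresholds computed this way are only a starting point. Whether a given level really separates acceptable from unacceptable behavior still has to be judged against how your system performs.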

Continuous evolution

Effective platform monitoring requires continuous evolution.

After you establish initial baselines, VMware recommends that you continue to refine your metrics and tests to balance early detection against alert fatigue. Periodically revisit the dynamic measures recommended in Key Performance Indicators and Key capacity scaling indicators to ensure that they still suit the current system configuration and its usage patterns.