This topic describes how developers can monitor and troubleshoot their apps using App Metrics.
App Metrics helps you understand and troubleshoot the health and performance of your apps through indicators, data, and visualizations.
The following sections describe a standard workflow for using App Metrics to monitor or troubleshoot your apps.
In a browser, navigate to metrics.sys.DOMAIN and log in with your User Account and Authentication (UAA) credentials. From the search bar, choose the app for which you want to view metrics and logs. App Metrics respects UAA permissions, so you can view any app that runs in a space you have access to.
App Metrics displays app data for a given time frame. See Change the Time Frame below to adjust the time frame for the dashboard.
The charts show time along the horizontal axis. You can change the time frame for all charts and the logs by using the time selector options at the top of the window. You can select from several pre-set timescales or select a custom date range.
Zoom: From within any chart, click and drag to zoom in on areas of interest. This adjusts all of the charts, and the logs, to show data from that time frame.
Auto-refresh mode allows the metrics charts and logs on your dashboard to be updated on a timed interval as data is ingested.
To enable auto-refresh, click the REFRESH button next to the time selection options at the top right of the dashboard. This enables live updating of metrics and logs data for the currently selected time frame.
Note: The default auto-refresh interval is set to one minute and is currently not configurable.
App Metrics relays metric data at the app process level to allow for an in-depth troubleshooting experience, even across rolling deployments. You can view the metrics related to a specific process and drill down further into specific instances within those processes, which correspond directly to the processes and app instances shown in Apps Manager.
By default, the dashboard displays metrics aggregated across all processes. To view metrics for a specific process, select a process type from the dropdown near the upper-left of the dashboard.
With a specific process type selected, the metrics charts will display aggregate data from all instances within the selected process type.
To view metrics for the individual instances within the selected process, select the “Instances” radio button at the upper-right of the dashboard.
To view metrics for a specific app instance (or selection of specific instances), select the desired instance(s) from the legend along the bottom of any chart on the dashboard while the “Instances” radio is selected.
The default metrics charts included with App Metrics provide high-level indicators of the Four Golden Signals for monitoring the health of apps running on distributed systems: Latency, Traffic, Errors, and Saturation.
The following sections explain how to use each of the charts on the dashboard to monitor and troubleshoot your app.
Note: If apps are not configured for network traffic, they show No Data or zeros for the default Latency, Traffic, and Errors metrics.
The following Container Metrics charts are available on the App Metrics dashboard to help monitor resource saturation:
CPU usage percentage:
A spike in CPU might point to a process that is computationally heavy. Scaling app instances can relieve the immediate pressure, but you need to investigate the app to better understand and fix the root cause.
Memory usage percentage:
A consistent, gradual increase in memory might mean a resource leak in the code. Scaling app memory can relieve the immediate pressure, but you need to find and resolve the underlying issue so that it does not occur again.
Disk usage percentage:
A spike in disk might mean the app is writing logs to files instead of STDOUT, caching data to local disk, or serializing large sessions to disk.
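Scaling, as mentioned for the CPU and memory charts above, can relieve immediate pressure. A minimal sketch using the cf CLI, assuming a hypothetical app named my-app:

# Add instances to relieve CPU pressure
cf scale my-app -i 4

# Raise the per-instance memory limit to relieve memory pressure
cf scale my-app -m 1G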
In addition, the Events chart helps you correlate these metrics with events for your app, including: Crash, Fail (staging failures), Update, Stop, Start, and SSH.
Note: The SSH event corresponds to someone successfully using SSH to access a container that runs an instance of the app.
See the following topics for more information about app events:
You can add custom metrics charts to your dashboard, including Spring Boot Actuator and Micrometer metrics, by defining the custom metrics you want to monitor and including them in an indicator document for your app.
To get custom, Actuator, or Micrometer metrics into the Metric Store, you must bind Metric Registrar to your app and register your endpoint. For more information, see Configuring the Metric Registrar.
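For example, a minimal sketch using the Metric Registrar cf CLI plugin, assuming the plugin is available from the CF-Community repository and a hypothetical app named my-app exposes metrics at /metrics:

# Install the Metric Registrar cf CLI plugin
cf install-plugin -r CF-Community "metric-registrar"

# Register the app's metrics endpoint so it is scraped into the Metric Store
cf register-metrics-endpoint my-app /metrics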
If you want to view custom metrics, you can configure your apps to emit those metrics through the Loggregator Firehose and then view them on the App Metrics dashboard.
In addition, Spring Boot apps that implement Actuator or Micrometer metrics emit these metrics out of the box, without any changes to source code.
For Metric Registrar to accurately report Spring metrics, the Metric Registrar configuration must be updated in the Tanzu Application Service tile: remove the id tag from the list of Blocked tags in the Metric Registrar settings.
An indicator document is a YAML document that specifies which app you want to monitor and the indicators you want to use to monitor it.
There are three steps to creating an indicator document:
First, verify that the metrics are being emitted. After you have configured Metric Registrar to scrape your metrics endpoint, you can check that endpoint for the metric names. If you are using a Prometheus-style metrics endpoint, hit your app's metrics endpoint at app.domain/metrics and look for the desired metric.
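For example, a quick check from the command line, assuming your app is routed at my-app.example.com and you are looking for a hypothetical metric named customMetricName500:

# Fetch the Prometheus-style endpoint and search for the metric name
curl -s https://my-app.example.com/metrics | grep customMetricName500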
To validate Spring Boot Actuator and Micrometer metrics, see Metrics in Spring Boot Actuator: Production-ready Features in the Spring Boot documentation.
After you have the metric name, write a PromQL query for visualizing the metric.
Find additional example PromQL for any of the default charts on the dashboard by clicking Info in the upper-right of any chart, or by visiting the PromQL Query Examples documentation.
Use the PromQL Explorer to test out PromQL before putting it in an indicator document:
1. Click the + button at the bottom right of the dashboard.
2. Test out queries to see how the graph looks before placing it in an indicator document.
Note: PromQL should always include the source_id tag for non-admin users. App Metrics supports a $sourceId parameter in PromQL, which automatically refers to the source ID of the current app. Example: cpu{source_id="$sourceId"}
After you have the PromQL ready, put it in an indicator document.
For example, if you have a custom metric customMetricName500 and want to graph the number of errors over a one-minute period, your PromQL query is sum(avg_over_time(customMetricName500{source_id="$sourceId"}[1m])). The following is an example of the YAML for an indicator document:
apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument

metadata:
  labels:
    deployment: "my deployment name"

spec:
  product:
    name: org,space,app-name
    version: 0.0.1

  indicators:
  - name: CustomErrorCount500
    promql: "sum(rate(customMetricName500{source_id='$sourceId'}[1m]))"
    documentation:
      title: "Custom Metric 500 Errors"
    presentation:
      units: "none"
The org,space,app-name in the example above determines which app these indicators are applied to. Replace org,space,app-name with the org, space, and app name of the app dashboard that you want to customize.
App Metrics uses a derivative version of the Indicator Protocol. For more information about the App Metrics-supported indicator document schema, see Indicator Document Template Reference.
You can add custom monitoring to your dashboard’s indicators by creating a custom monitor document for your app.
Monitors are linked to specific indicators, so the first step to adding custom monitoring and alerting to your app is to verify the names of the indicators you would like to monitor.
You can view the indicator name of each chart on your app's dashboard by hovering over the chart, clicking the kebab menu in the upper-right corner, and selecting Info.
The indicator name can correspond to one of your custom indicators or to one of the default indicator names: RequestCount, HttpLatency, ErrorCount, CPU, MemoryPercentage, and DiskPercentage.
Once you have the indicator names, you can create a monitor document that defines thresholds for your indicators and the webhook to send alerts to. The following is an example of the YAML for a monitor document:
product: org,space,app-name
webhook_url: https://my-slack-webhook.com

monitors:
- name: 500 Errors For Application
  indicator: ErrorCount
  warning:
    operator: gte
    threshold: 1.0
    duration: 1m
    only_every: 1h
  critical:
    operator: gte
    threshold: 2.0
    duration: 1m
    only_every: 15m
The org,space,app-name above determines which app these monitors are applied to. Replace it with the org, space, and app name of the app you want to monitor. The https://my-slack-webhook.com URL is where alerts are sent when a threshold is surpassed. Slack is currently the only supported integration, but other webhook platforms may work if they accept a "text" payload.
For more detailed information on the monitor document schema, see the Monitor Document Template Reference.
The Logs view displays app log data ingested from the Loggregator Reverse Log Proxy (RLP):
Note: Logs with non-UTF-8 characters or non-standard UUID app GUIDs are not stored.
You can interact with the Logs view in the following ways:
By default, the most recent 1,000 log lines are displayed in the logs drawer. Click SHOW 1000 MORE LOGS to load more.
You can query Metric Store and Log Store directly to access the raw data.
To query Metric Store, consult the documentation for Using Metric Store.
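For example, a minimal sketch of a direct Metric Store query, assuming Metric Store exposes its Prometheus-compatible /api/v1/query endpoint at metric-store.SYSTEM_DOMAIN (see Using Metric Store for the authoritative API):

export SYSTEM_DOMAIN="<YOUR_SYSTEM_DOMAIN>"
export SOURCE_ID="$(cf app <YOUR_APP> --guid)"

# Query the app's CPU metric; the source_id label is required for non-admin users
curl --get -H "Authorization: $(cf oauth-token)" \
  "https://metric-store.$SYSTEM_DOMAIN/api/v1/query" \
  --data-urlencode "query=cpu{source_id=\"$SOURCE_ID\"}"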
When querying the API via HTTPS, each request must have the Authorization header set with a UAA-provided token.
GET /v1/sources/{sourceID}/logs
Issues a query against Log Store data.
Path Parameters:
- sourceID – The app or component source ID. The app source ID is the same as the app GUID.
Query Parameters:
- query – Filters logs by label. Supported labels are message, message_type, source_type, and instance_id.
  - message – Regex to search the log message body. Use the backtick operator in case of \.
  - message_type – The file descriptor the log was written to, OUT or ERR.
  - source_type – The source of the log, any subset of {"API","APP","CELL","HEALTH","LGR","RTR","SSH","STG"} connected by pipes, for example "APP|API".
  - instance_id – Filter based on the instance ID of the app or component that wrote the log.
- >= 1. Defaults to 1.
- asc or desc. Defaults to desc.

Example Request

export SYSTEM_DOMAIN="<YOUR_SYSTEM_DOMAIN>"
export SOURCE_ID="$(cf app <YOUR_APP> --guid)"
curl --get -H "Authorization: $(cf oauth-token)" \
"https://log-store.$YOUR_SYSTEM_DOMAIN/v1/sources/$SOURCE_ID/logs" \
--data-urlencode 'query={message=~"Error.*"}' \
--data-urlencode 'startTime=2020-03-24T06:55:00Z' \
--data-urlencode 'endTime=2020-03-24T06:59:00Z'
Response Body
{
"metadata": {
"count": 1,
"links": {}
},
"items": [
{
"instance_id": "0",
"message": "Error: Sample query didn't work",
"message_type": "OUT",
"source_id": "50efa176-bd06-42d1-bac8-672aab387e75",
"source_type": "APP/PROC/WEB",
"timestamp": "2020-03-24T06:57:29.788299446Z"
}
]
}