You can monitor and troubleshoot apps using App Metrics.
App Metrics helps you to understand and troubleshoot the health and performance of your apps by offering the following indicators, data, and visualizations:
The following sections describe a standard workflow for using App Metrics to monitor and troubleshoot your apps.
In a browser, go to `metrics.sys.DOMAIN` and log in with your User Account and Authentication (UAA) credentials. Select an app from the search bar to view its metrics and logs.
App Metrics respects UAA permissions so you can view any app that runs in a space that you have access to.
App Metrics displays app data for a specific time frame on the dashboard.
The charts show time along the horizontal axis. You can change the time frame for all charts and the logs by using the time selector options. You can select from several preset timescales or select a custom date range.
In addition, from any chart, you can click and drag to zoom in on areas of interest. This adjusts all of the charts, and logs, to show data for that time frame.
Auto-refresh mode updates the metrics charts and logs on your dashboard at a set interval as new data is received.
The default auto-refresh interval is set to one minute and is currently not configurable.
App Metrics relays metric data at the app process level to allow for in-depth troubleshooting, even across a rolling deployment. You can view the app metrics related to a specific process and then drill down further into the individual instances within that process. This corresponds directly to the processes and app instances shown in Apps Manager.
The dashboard displays metrics aggregated across all processes by default. To view metrics for a specific process, select a process type from the dropdown near the upper-left of the dashboard.
When you select a specific process type, the metrics charts display aggregate data from all instances for the selected process type.
To view metrics for the individual instances within the selected process, select the Instances radio button at the upper-right of the dashboard.
To view metrics for a specific app instance or selection of specific instances, select your instance or instances from the legend of any chart on the dashboard and select the Instances radio button.
The default metrics charts that are included with App Metrics provide high-level indicators based on the four golden signals for monitoring the health of apps running on distributed systems:
The following sections explain how to use each of the charts on the dashboard to monitor and troubleshoot your app.
If apps are not configured for network traffic, they show No Data or zeros for the default Latency, Traffic, and Errors metrics.
Latency: Average latency of a request in milliseconds. A spike in response time means your users are waiting longer. Scaling app instances can spread that workload over more resources and result in faster response times.
Traffic: Number of network requests per minute. A spike in HTTP requests means more users are using your app. Scaling app instances can reduce the response time.
Errors: Number of network request errors per minute. A spike in HTTP errors means one or more 5xx errors occurred. Check your app logs for more information.
The following Container Metrics charts are available on the App Metrics dashboard to help monitor resource saturation:
CPU usage percentage: A spike in CPU might point to a process that is computationally heavy. Scaling app instances relieves the immediate pressure, but you must further investigate the app to better understand and ultimately fix the root cause.
Memory usage percentage: A consistent, gradual increase in memory might mean a resource leak in the code. Scaling app memory relieves the immediate pressure, but you must find and resolve the underlying issue so that it does not occur again.
Disk usage percentage: A spike in disk might mean the app is writing logs to files instead of STDOUT, caching data to local disk, or serializing large sessions to disk.
The Events chart helps you correlate metrics with events for your app. These events include:
The SSH event corresponds to a successful SSH connection to a container that runs an instance of the app.
See the following topics for more information about app events:
You can add custom metrics charts to your dashboard, including the Spring Boot Actuator and Micrometer metrics. You define the custom metrics that you want to monitor and include them in the indicator document for your app.
If you want to view custom metrics, you can configure your apps to emit those metrics out of the Loggregator Firehose and then view these metrics on the App Metrics dashboard.
In addition, Spring Boot apps that use Spring Boot Actuator or Micrometer emit these metrics out of the box, without any changes to source code.
For Metric Registrar to accurately report Spring metrics, you must update the Metric Registrar configuration in the Tanzu Application Service tile: remove the `id` tag from the list of Blocked tags in the Metric Registrar settings.
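If your app exposes a Prometheus-style metrics endpoint, one way to have Metric Registrar scrape it is with the Metric Registrar CLI plugin. The following is a minimal sketch; the app name my-app and the /metrics path are placeholders, and it assumes the plugin is available from the CF-Community plugin repository:

# Install the Metric Registrar CLI plugin (one-time setup).
cf install-plugin -r CF-Community "metric-registrar"

# Register the app's Prometheus-style endpoint so Metric Registrar scrapes it.
cf register-metrics-endpoint my-app /metrics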
An Indicator document is a YAML document that specifies which app you want to monitor and the indicators you want to use to monitor it.
To create an indicator document, follow these steps:
Verify that the metrics are being emitted. After you configure Metric Registrar to scrape your metrics endpoint, check the endpoint for the metric names you expect.
If you use a Prometheus-style metrics endpoint, check your app's metrics endpoint at `app.domain/metrics` and search for the desired metric.
To validate Spring Boot Actuator and Micrometer metrics, see Metrics in Spring Boot Actuator: Production-ready Features in the Spring Boot documentation.
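For example, assuming a Prometheus-style endpoint and a hypothetical custom metric named customMetricName500, you can confirm the metric appears with a quick check from the command line:

# Fetch the metrics endpoint and search for the metric name.
# Replace app.domain with your app's route.
curl -s https://app.domain/metrics | grep customMetricName500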
After you have the metric name, write a PromQL query to see the metric. You can find example PromQL for any of the default charts on the dashboard by clicking Info on any chart, or by visiting the PromQL Query Examples documentation.
Use the PromQL Explorer to test out PromQL before you insert it in the indicator document:
Click the + button on the dashboard.
Test out the queries to see how the graph appears before you place it in the indicator document.
The PromQL query must include the `source_id` tag for non-admin users. App Metrics supports a `$sourceId` parameter in PromQL that automatically refers to the source ID of the current app. For example:
cpu{source_id="$sourceId"}
After you have the PromQL query ready, insert it in the indicator document.
For example, if you have a custom metric customMetricName500
and want to graph the amount of errors over a one minute time period, then your PromQL query is:
`sum(avg_over_time(customMetricName500{source_id="$sourceId"}[1m]))`
This is an example of the YAML file for an indicator document:
apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument
metadata:
  labels:
    deployment: "my deployment name"
spec:
  product:
    name: org,space,app-name
    version: 0.0.1
  indicators:
  - name: CustomErrorCount500
    promql: "sum(rate(customMetricName500{source_id='$sourceId'}[1m]))"
    documentation:
      title: "Custom Metric 500 Errors"
    presentation:
      units: "none"
The `org,space,app-name` value in the example determines which app these indicators are applied to. Replace it with the org, space, and app name of the app dashboard that you want to customize.
App Metrics uses a derivative version of the Indicator Protocol.
You can add custom monitoring and alerting to your dashboard indicators by creating a monitor document for your app.
Monitors are linked to specific indicators, so the first step to adding custom monitoring and alerting to your app is to verify the names of the indicators you want to monitor.
You can view the indicator names for each chart on your app’s dashboard by pointing to the desired chart, clicking the three vertical dots and selecting Info.
The indicator corresponds to one of your custom indicators or to one of the following default indicator names:
After you have the indicator names, you can create a monitor document that defines the threshold for each indicator and the webhook to send alerts to.
The following example is a YAML file for a monitor document:
product: org,space,app-name
webhook_url: https://my-slack-webhook.com
monitors:
- name: 500 Errors For Application
  indicator: ErrorCount
  warning:
    operator: gte
    threshold: 1.0
    duration: 1m
    only_every: 1h
  critical:
    operator: gte
    threshold: 2.0
    duration: 1m
    only_every: 15m
The `org,space,app-name` value defines which app these monitors are applied to. Replace it with the org, space, and app name of the app you want to monitor. The `webhook_url` value, `https://my-slack-webhook.com` in this example, is the URL where alerts are sent when a threshold is surpassed.
The Slack application is currently the only supported use case, but other webhook platforms might work if they accept a text payload.
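Before you reference the webhook in a monitor document, you can verify that it accepts a simple text payload. This sketch assumes a standard Slack incoming webhook at the URL used in the example above:

# Send a test message; Slack incoming webhooks accept a JSON body with a "text" field.
curl -X POST -H 'Content-Type: application/json' \
  --data '{"text":"App Metrics monitor test"}' \
  https://my-slack-webhook.com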
For more detailed information on the monitor document schema, see Monitor Document Template Reference.
View the following videos to enhance your understanding of metric charts:
The Logs view displays app log data ingested from the Loggregator Reverse Log Proxy (RLP):
Logs with non-UTF-8 characters or non-standard UUID app GUIDs are not stored.
You can interact with the Logs view in the following ways:
By default, the most recent 1,000 log lines are displayed in the logs drawer. You can click SHOW 1000 MORE LOGS to load more.
You can query Metric Store and Log Store directly to access raw data.
To query Metric Store, consult the documentation for Using Metric Store.
Make note of the following prerequisites before you query the Log Store:
When you query the API through HTTPS, each request must have the `Authorization` header set with a UAA-provided token.
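For example, one way to obtain such a token is with the cf CLI, which the request example at the end of this section also uses:

# Print a UAA token for the currently logged-in cf user.
cf oauth-token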
GET /v1/sources/{sourceID}/logs
Issues a query against Log Store data.
Path Parameters:
sourceID – The GUID of the app whose logs you want to query.
Query Parameters:
query – Filters logs on the following tags: `message`, `message_type`, `source_type`, and `instance_id`.
message – Regex to search the log message body. Use the backtick operator if the regex contains `\`.
message_type – The file descriptor the log was written to, `OUT` or `ERR`.
source_type – The source of the log, any subset of `{"API","APP","CELL","HEALTH","LGR","RTR","SSH","STG"}` connected by pipes. For example, `"APP|API"`.
instance_id – Filter based on the instance ID of the app or component that wrote the log.
For example:
{
  "metadata": {
    "count": 1,
    "links": {}
  },
  "items": [
    {
      "instance_id": "0",
      "message": "Error: Sample query didn't work",
      "message_type": "OUT",
      "source_id": "50efa176-bd06-42d1-bac8-672aab387e75",
      "source_type": "APP/PROC/WEB",
      "timestamp": "2020-03-24T06:57:29.788299446Z"
    }
  ]
}
Must be >= 1. Defaults to 1.
asc or desc. Defaults to desc.
For example:
export SYSTEM_DOMAIN="<YOUR_SYSTEM_DOMAIN>"
export SOURCE_ID="$(cf app <YOUR_APP> --guid)"
curl --get -H "Authorization: $(cf oauth-token)" \
  "https://log-store.$SYSTEM_DOMAIN/v1/sources/$SOURCE_ID/logs" \
  --data-urlencode 'query={message=~"Error.*"}' \
  --data-urlencode 'startTime=2020-03-24T06:55:00Z' \
  --data-urlencode 'endTime=2020-03-24T06:59:00Z'