Monitor and troubleshoot apps with App Metrics

You can monitor and troubleshoot apps using App Metrics.

Understanding the health and performance of your apps

App Metrics helps you to understand and troubleshoot the health and performance of your apps by offering the following indicators, data, and visualizations:

Latency: Response times for your app.
Traffic: Number of requests made for your app.
Errors: HTTP errors thrown by your app.
Saturation (Container Metrics): Three charts measuring CPU, memory, and disk consumption percentages.
Custom Metrics: User-customizable charts for measuring app performance, such as Spring Boot Actuator and Micrometer metrics, or user-defined custom business metrics.
App Events: A chart of update, start, stop, crash, SSH, and staging failure events.
Logs: A list of app logs that you can search, filter, and download.

The following sections describe a standard workflow for App Metrics to monitor or troubleshoot your apps.

Viewing an app

In a browser, go to metrics.sys.DOMAIN and log in with your User Account and Authentication (UAA) credentials. Select an app from the search bar to view metrics and logs.

App Metrics respects UAA permissions so you can view any app that runs in a space that you have access to.

App Metrics displays app data for a specific time frame on the dashboard.

Changing the time frame

The charts show time along the horizontal axis. You can change the time frame for all charts and the logs by using the time selector options. You can select from several preset timescales or select a custom date range.

In addition, from any chart, you can click and drag to zoom in on areas of interest. This adjusts all of the charts, and logs, to show data for that time frame.

Auto refresh the dashboard

Auto-refresh mode updates the metrics charts and logs on your dashboard at a timed interval as data is received.

The default auto-refresh interval is set to one minute and is currently not configurable.

Viewing metrics at the process and app instance level

App Metrics relays metric data at the app process level to allow for in-depth troubleshooting, even across a rolling deployment. You can view the metrics for a specific process and narrow the view further to specific instances within that process. These processes and instances correspond directly to the processes and app instances shown in Apps Manager.

The dashboard displays metrics aggregated across all processes by default. To view metrics by specific process, select a process type from the dropdown near the upper-left of the dashboard.

When you select a specific process type, the metrics charts display aggregate data from all instances for the selected process type.

To view metrics for the individual instances within the selected process, select the Instances radio button at the upper-right of the dashboard.

To view metrics for a specific app instance or selection of specific instances, select your instance or instances from the legend of any chart on the dashboard and select the Instances radio button.

Interpreting metrics

The default metrics charts that are included with App Metrics provide high-level indicators for the four golden signals for monitoring the health of apps running on distributed systems:

Latency
Traffic
Errors
Saturation

The following sections explain how to use each of the charts on the dashboard to monitor and troubleshoot your app.

Network metrics

If apps are not configured for network traffic, they show No Data or zeros for the default Latency, Traffic, and Errors metrics.

Latency: Average latency of a request in milliseconds. A spike in response time means your users are waiting longer. Scaling app instances can spread that workload over more resources and result in faster response times.
Traffic: Number of network requests per minute. A spike in HTTP requests means more users are using your app. Scaling app instances reduces the response time.
Errors: Number of network request errors per minute. A spike in HTTP errors means one or more 5xx errors occurred. Check your app logs for more information.

Monitor resource saturation with container metrics

The following Container Metrics charts are available on the App Metrics dashboard to help monitor resource saturation:

CPU usage percentage: A spike in CPU might point to a process that is computationally heavy. Scaling app instances relieves the immediate pressure, but you must further investigate the app to better understand and ultimately fix the root cause.
Memory usage percentage: A consistent, gradual increase in memory might mean a resource leak in the code. Scaling app memory relieves the immediate pressure, but you must find and resolve the underlying issue so that it does not occur again.
Disk usage percentage: A spike in disk might mean the app is writing logs to files instead of STDOUT, caching data to local disk, or serializing large sessions to disk.

Correlate metrics to events

The Events chart helps you correlate metrics with events for your app. The chart shows the following events:

Crash
Fail (staging failures)
Update
Stop
Start
SSH

The SSH event corresponds to you successfully using SSH to access a container that runs an instance of the app.

Adding custom metric charts

You can add custom metrics charts to your dashboard, including the Spring Boot Actuator and Micrometer metrics. You define the custom metrics that you want to monitor and include them in the indicator document for your app.

If you want to view custom metrics, you can configure your apps to emit those metrics out of the Loggregator Firehose and then view these metrics on the App Metrics dashboard.

In addition, Spring Boot apps with actuators or Micrometer metrics implemented emit these metrics out of the box, without any changes to source code.

Configuring Metric Registrar for Spring Metrics

For Metric Registrar to report Spring metrics accurately, you must update the Metric Registrar configuration in the Tanzu Application Service tile.

Remove the id tag from the list of Blocked tags in the Metric Registrar settings of the Tanzu Application Service tile.

Creating an Indicator document

An Indicator document is a YAML document that specifies which app you want to monitor and the indicators you want to use to monitor it.

These are the steps to create an indicator document:

Find the metric you want to monitor.
Write the PromQL query.
Add the PromQL to your indicator document.

Finding the metric name

Verify that the metrics are being emitted. After you configure Metric Registrar to scrape your metrics endpoint, check that endpoint for the metric names.

If you use a Prometheus style metrics endpoint, check your app’s metrics endpoint at app.domain/metrics and search for the desired metric.
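As a minimal sketch of that check, the following Python snippet lists the metric names found in Prometheus-style text output. The sample text and the function name `metric_names` are hypothetical and only illustrate the format; in practice you would paste in the body returned by your app's metrics endpoint.

```python
def metric_names(exposition_text):
    """Return the sorted, unique metric names found in Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="v"} 1.0  or  name 1.0
        name = line.split("{", 1)[0].split()[0]
        names.add(name)
    return sorted(names)


# Hypothetical endpoint output, for illustration only.
SAMPLE = """\
# HELP customMetricName500 Count of custom 500 errors
# TYPE customMetricName500 counter
customMetricName500{path="/checkout"} 3
jvm_memory_used_bytes{area="heap"} 1.2e+07
jvm_memory_used_bytes{area="nonheap"} 4.0e+06
"""

print(metric_names(SAMPLE))
```

If the metric you expect is missing from the list, the app is likely not emitting it yet.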

To validate Spring Boot Actuator and Micrometer metrics, see Metrics in Spring Boot Actuator: Production-ready Features in the Spring Boot documentation.

Writing a PromQL query

After you have the metric name, write a PromQL query to see the metric:

Find additional example PromQL for any of the default charts on the dashboard by clicking Info on any chart, or see the PromQL Query Examples documentation.
Use the PromQL Explorer to test out PromQL before you insert it in the indicator document:
1. Click the + button on the dashboard.
2. Test out the queries to see how the graph appears before you place it in the indicator document.
  
PromQL queries must include the source_id tag for non-admin users. App Metrics supports a $sourceId parameter in the PromQL query, which automatically refers to the source ID of the current app.

For example:

cpu{source_id="$sourceId"}
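To make the placeholder concrete, here is a small Python sketch of what the expansion amounts to. It only illustrates the substitution with `string.Template`; it is not how App Metrics implements it, and the function name `expand_source_id` is hypothetical. The GUID is a sample value.

```python
from string import Template


def expand_source_id(promql, app_guid):
    """Fill the $sourceId placeholder with a concrete app GUID (illustration only)."""
    return Template(promql).substitute(sourceId=app_guid)


# Sample GUID; for a real app, `cf app <YOUR_APP> --guid` prints the value.
example_guid = "50efa176-bd06-42d1-bac8-672aab387e75"
print(expand_source_id('cpu{source_id="$sourceId"}', example_guid))
```

The expanded query is the form to use when you query outside the dashboard, where the placeholder is not filled in for you.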

Add the PromQL query to your Indicator Document

After you have the PromQL query ready, insert it in the indicator document.

For example, if you have a custom metric customMetricName500 and want to graph the number of errors over a one-minute time period, then your PromQL query is:

`sum(avg_over_time(customMetricName500{source_id="$sourceId"}[1m]))`
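The arithmetic behind this query can be sketched in Python: `avg_over_time` averages the samples of each series inside the one-minute window, and `sum` then adds those averages across series. The sample values below are invented, and treating each app instance as one series is an assumption made for illustration.

```python
def avg_over_time(samples):
    """Average of the samples that fall inside the range window for one series."""
    return sum(samples) / len(samples)


def sum_of_averages(series_by_instance):
    """Mimic sum(avg_over_time(metric[1m])): average per series, then add them up."""
    return sum(avg_over_time(s) for s in series_by_instance.values() if s)


# Invented samples from the last minute, one list per (assumed) series.
window = {
    "instance-0": [2.0, 4.0],       # average 3.0
    "instance-1": [1.0, 1.0, 4.0],  # average 2.0
}
print(sum_of_averages(window))  # 5.0
```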

This is an example of the YAML file for an indicator document:

apiVersion: indicatorprotocol.io/v1
kind: IndicatorDocument

metadata:
  labels:
    deployment: "my deployment name"

spec:
  product:
    name: org,space,app-name
    version: 0.0.1

  indicators:
    - name: CustomErrorCount500
      promql: "sum(rate(customMetricName500{source_id='$sourceId'}[1m]))"
      documentation:
        title: "Custom Metric 500 Errors"
      presentation:
        units: "none"

The org,space,app-name in the example determines which app these indicators are applied to. Replace org,space,app-name with the org, space, and app name of the app dashboard that you want to customize.
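If you maintain indicator documents for several apps, a small generator can keep them consistent. The following Python sketch renders a document with the same fields as the example above. The function name `indicator_document` and the org, space, and app names are hypothetical, and only the fields shown in the example are covered.

```python
import json


def indicator_document(org, space, app, indicators, deployment="my deployment name"):
    """Render an indicator document with the fields used in the example above."""
    lines = [
        "apiVersion: indicatorprotocol.io/v1",
        "kind: IndicatorDocument",
        "",
        "metadata:",
        "  labels:",
        f"    deployment: {json.dumps(deployment)}",
        "",
        "spec:",
        "  product:",
        f"    name: {org},{space},{app}",
        "    version: 0.0.1",
        "",
        "  indicators:",
    ]
    for ind in indicators:
        # json.dumps produces a double-quoted string, which YAML also accepts.
        lines += [
            f"    - name: {ind['name']}",
            f"      promql: {json.dumps(ind['promql'])}",
            "      documentation:",
            f"        title: {json.dumps(ind['title'])}",
            "      presentation:",
            f"        units: {json.dumps(ind.get('units', 'none'))}",
        ]
    return "\n".join(lines) + "\n"


EXAMPLE = [{
    "name": "CustomErrorCount500",
    "promql": "sum(rate(customMetricName500{source_id='$sourceId'}[1m]))",
    "title": "Custom Metric 500 Errors",
}]
print(indicator_document("my-org", "my-space", "my-app", EXAMPLE))
```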

Indicator document schema

App Metrics uses a derivative version of the Indicator Protocol.

Custom monitoring and alerting

You can add custom monitoring and alerting to your dashboard indicators by creating a monitor document for your app.

Creating a Monitor document

Monitors are linked to specific indicators, so the first step to adding custom monitoring and alerting to your app is to verify the names of the indicators you want to monitor.

You can view the indicator names for each chart on your app’s dashboard by pointing to the desired chart, clicking the three vertical dots and selecting Info.

The indicator corresponds to one of your custom indicators or to one of the following default indicator names:

RequestCount
HttpLatency
ErrorCount
CPU
MemoryPercentage
DiskPercentage

After you have the indicator names, you can create a monitor document that defines the threshold for each indicator and the webhook to send alerts to.

The following example is a YAML file for a monitor document:

product: org,space,app-name

webhook_url: https://my-slack-webhook.com

monitors:
  - name: 500 Errors For Application
    indicator: ErrorCount
    warning:
       operator: gte
       threshold: 1.0
       duration: 1m
       only_every: 1h
    critical:
       operator: gte
       threshold: 2.0
       duration: 1m
       only_every: 15m

The org,space,app-name value defines which app these monitors apply to. Replace it with the org, space, and app name of the app you want to monitor.

Set webhook_url to the URL where alerts are sent when a threshold is crossed. The example uses https://my-slack-webhook.com as a placeholder.

Slack is currently the only supported webhook target, but other webhook platforms might work if they accept a text payload.
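To reason about which alert level a given value would trigger, the following Python sketch classifies a single indicator value against the warning and critical thresholds from the example monitor. It is a simplified model under stated assumptions: only the `gte` operator shown above is handled, the `duration` and `only_every` fields are not modeled, and the function names are hypothetical.

```python
# Thresholds copied from the example monitor document above.
MONITOR = {
    "name": "500 Errors For Application",
    "indicator": "ErrorCount",
    "warning": {"operator": "gte", "threshold": 1.0},
    "critical": {"operator": "gte", "threshold": 2.0},
}


def matches(rule, value):
    """Return True if the value satisfies the rule. Only 'gte' is modeled here."""
    if rule["operator"] != "gte":
        raise ValueError(f"operator not covered by this sketch: {rule['operator']}")
    return value >= rule["threshold"]


def classify(monitor, value):
    """Return 'critical', 'warning', or 'ok' for one indicator value."""
    # Check the more severe level first so it wins when both match.
    for level in ("critical", "warning"):
        if level in monitor and matches(monitor[level], value):
            return level
    return "ok"


for errors_per_minute in (0.0, 1.0, 2.5):
    print(errors_per_minute, classify(MONITOR, errors_per_minute))
```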

Monitor document schema

For more detailed information on the monitor document schema, see Monitor Document Template Reference.

Logs

The Logs view displays app log data ingested from the Loggregator Reverse Log Proxy (RLP).

Logs with non-UTF-8 characters or non-standard UUID app GUIDs are not stored.

You can interact with the Logs view in the following ways:

Keyword: Perform a keyword search. While you filter on keywords, the results are reduced to only the log lines that contain the matching criteria. Matching terms are also highlighted in blue.
Highlight: Enter a term to visually highlight within your search. The terms are highlighted in orange within the current filter results.
Sources: Choose which sources to display logs from. For more information, see Log Types and Their Messages.
Download: Download a file containing logs for the current search.
Copy: Click the copy icon to copy the text of the log.

By default, the most recent 1,000 log lines are displayed in the logs drawer. You can click SHOW 1000 MORE LOGS to load more.

Direct data access

You can query Metric Store and Log Store directly to access raw data.

Metric Store API

To query Metric Store, see Using Metric Store.

Log Store API

Make note of the following prerequisites before you query the Log Store:

Authorization and Authentication

When you query the API through HTTPS, each request must have the Authorization header set with a UAA-provided token.

Querying through HTTP endpoints

GET /v1/sources/{sourceID}/logs

Issues a query against Log Store data.

Path Parameters:

sourceID – The app or component source ID. App source ID is the same as app GUID.

Query Parameters:

query is a PromQL label selector query for filtering logs on message, message_type, source_type, and instance_id.
- message – RegEx to search the log message body. If the pattern contains an escape sequence such as \. use backtick quoting.
- message_type – The file descriptor the log was written to, OUT or ERR.
- source_type – The source of the log, any subset of {"API","APP","CELL","HEALTH","LGR","RTR","SSH","STG"} connected by pipes. For example, "APP|API".
- instance_id – Filter based on the instance ID of the app or component that wrote the log.

For example, a query returns a response in the following form:

{
  "metadata": {
    "count": 1,
    "links": {}
  },
  "items": [
    {
      "instance_id": "0",
      "message": "Error: Sample query didn't work",
      "message_type": "OUT",
      "source_id": "50efa176-bd06-42d1-bac8-672aab387e75",
      "source_type": "APP/PROC/WEB",
      "timestamp": "2020-03-24T06:57:29.788299446Z"
    }
  ]
}

startTime is an optional UNIX timestamp in nanoseconds or RFC3339. Defaults to 10 minutes ago and must be before end time.
endTime is an optional UNIX timestamp in nanoseconds or RFC3339. Defaults to now and must be after start time.
limit is an optional maximum number of logs to return. Defaults to 100.
page is an optional number of the page of logs to be returned, must be >= 1. Defaults to 1.
order is an optional order in which the logs are returned, asc or desc. Defaults to desc.
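The constraints on these parameters can be checked on the client before a request is sent. The following Python sketch validates the documented rules and drops any parameter that is not set, so the API defaults apply. The function names are hypothetical, and the RFC3339 helper handles at most microsecond precision.

```python
from datetime import datetime, timezone


def rfc3339_to_unix_nanos(value):
    """Convert an RFC3339 timestamp with an offset or 'Z' to UNIX nanoseconds."""
    # fromisoformat() in older Python versions rejects a trailing "Z".
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    delta = dt - datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000


def log_query_params(query=None, start_time=None, end_time=None,
                     limit=None, page=None, order=None):
    """Validate Log Store query parameters and omit the ones left unset."""
    if page is not None and page < 1:
        raise ValueError("page must be >= 1")
    if order is not None and order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    if start_time and end_time:
        if rfc3339_to_unix_nanos(start_time) >= rfc3339_to_unix_nanos(end_time):
            raise ValueError("startTime must be before endTime")
    params = {"query": query, "startTime": start_time, "endTime": end_time,
              "limit": limit, "page": page, "order": order}
    # Unset parameters are left out so the server-side defaults are used.
    return {key: value for key, value in params.items() if value is not None}


print(log_query_params(query='{message=~"Error.*"}',
                       start_time="2020-03-24T06:55:00Z",
                       end_time="2020-03-24T06:59:00Z"))
```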

For example:

export SYSTEM_DOMAIN="<YOUR_SYSTEM_DOMAIN>"
export SOURCE_ID="$(cf app <YOUR_APP> --guid)"
curl --get -H "Authorization: $(cf oauth-token)" \
     "https://log-store.$SYSTEM_DOMAIN/v1/sources/$SOURCE_ID/logs" \
     --data-urlencode 'query={message=~"Error.*"}' \
     --data-urlencode 'startTime=2020-03-24T06:55:00Z' \
     --data-urlencode 'endTime=2020-03-24T06:59:00Z'
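
The same request can be sketched in Python with only the standard library. This mirrors the curl example above: the host name pattern and the Authorization header come from that example, while the function names are hypothetical. `fetch_logs` is defined but not called here, because it needs a reachable Log Store and a valid token (for example, the output of `cf oauth-token`).

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def logs_url(system_domain, source_id, **params):
    """Build the Log Store query URL with URL-encoded query parameters."""
    base = f"https://log-store.{system_domain}/v1/sources/{source_id}/logs"
    return f"{base}?{urlencode(params)}" if params else base


def fetch_logs(system_domain, source_id, token, **params):
    """Send the GET request with the UAA token and return the decoded JSON body."""
    request = Request(logs_url(system_domain, source_id, **params),
                      headers={"Authorization": token})
    with urlopen(request) as response:
        return json.load(response)


def messages(body):
    """Extract the log message text from a response body."""
    return [item["message"] for item in body.get("items", [])]


# Offline demonstration with a response shaped like the example shown earlier.
sample_body = json.loads("""
{"metadata": {"count": 1, "links": {}},
 "items": [{"instance_id": "0",
            "message": "Error: Sample query didn't work",
            "message_type": "OUT",
            "source_type": "APP/PROC/WEB"}]}
""")
print(logs_url("example.com", "50efa176-bd06-42d1-bac8-672aab387e75",
               query='{message=~"Error.*"}'))
print(messages(sample_body))
```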