This topic describes how to monitor Metric Store. It includes key scaling indicators (KSIs) to guide Metric Store scaling decisions.

Overview

Metric Store can be scaled vertically at this time. When the persistent disk is nearly full, Metric Store drops the oldest data first. When memory or CPU is nearly exhausted, scale Metric Store vertically.

Key Scaling Indicators

Metrics Emitted by Metric Store

Metric Store publishes metrics for monitoring the Metric Store itself. You can use these metrics to observe the health of the Metric Store and determine if the VMs and disks are appropriately scaled.

Nozzle Job Metrics

The nozzle is the process that ingresses envelopes from Loggregator’s Reverse Log Proxy (RLP).

| Reports | Metric | Type | Notes |
|---|---|---|---|
| Number of envelopes sent to colocated Metric Store | metric_store_nozzle_ingress_envelopes_total | counter | When increasing, the nozzle is correctly consuming from the RLP. |
| Number of envelopes dropped when reading from RLP | metric_store_nozzle_dropped_envelopes_total | counter | If this number is increasing at a steady rate, it may indicate that you need to scale up the size of your VMs. |
| Number of points dropped in outbound channel | metric_store_nozzle_dropped_points_total | counter | Should always be zero. A nonzero value may be useful in debugging. |
| Number of points written to its colocated Metric Store | metric_store_nozzle_egress_points_total | counter | |
| Number of errors writing to a remote node | metric_store_nozzle_egress_errors_total | counter | If this number is consistently increasing, it may indicate network issues or an overloaded Metric Store node. |
| Total duration spent writing points | metric_store_nozzle_egres_duration_seconds | gauge | |
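Several notes above depend on whether a counter is "increasing at a steady rate". The following is a minimal sketch of one way to check that from a handful of scraped values. The `Sample` container and both helper functions are hypothetical names introduced for illustration and are not part of Metric Store. The sketch also assumes that a decrease in a counter means the emitting process restarted and the counter reset to zero.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One scraped value of a counter (hypothetical container, not a Metric Store type)."""
    timestamp: float  # seconds
    value: float      # cumulative counter value at that time


def counter_rate(earlier: Sample, later: Sample) -> float:
    """Return the per-second increase of a counter between two samples."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be ordered by increasing timestamp")
    delta = later.value - earlier.value
    if delta < 0:
        # Assumption: the counter reset to zero (process restart), so the
        # later value is the whole increase since the reset.
        delta = later.value
    return delta / elapsed


def is_steadily_increasing(samples: Sequence[Sample]) -> bool:
    """Return True if the counter grew in every consecutive interval."""
    rates = [counter_rate(a, b) for a, b in zip(samples, samples[1:])]
    return bool(rates) and all(rate > 0 for rate in rates)
```

For example, passing three consecutive samples of metric_store_nozzle_dropped_envelopes_total to `is_steadily_increasing` returns True only if envelopes were dropped in both intervals, which is the pattern the table associates with undersized VMs.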

CF Auth Proxy Job Metrics

| Reports | Metric | Type | Notes |
|---|---|---|---|
| Duration in seconds of requests made to the auth proxy | metric_store_auth_proxy_request_duration_seconds | gauge | |
| Duration in seconds of external requests made to CAPI | metric_store_auth_proxy_capi_request_duration_seconds | gauge | |

Metric Store Job Metrics

| Reports | Metric | Type | Notes |
|---|---|---|---|
| Number of points ingressed to colocated Metric Store | metric_store_ingress_points_total | counter | This should be steadily increasing at a relatively consistent rate. |
| Number of points successfully written to storage engine | metric_store_written_points_total | counter | This should be steadily increasing at a relatively consistent rate. |
| Time spent writing points to the storage engine | metric_store_write_duration_seconds | gauge | |
| Percentage of free space on persistent disk | metric_store_disk_free_ratio | gauge | |
| Number of shards removed due to time-based expiration | metric_store_expired_shards_total | counter | |
| Number of shards removed due to disk space threshold | metric_store_pruned_shards_total | counter | |
| Days of data stored on disk | metric_store_storage_days | gauge | |
| Size of the index | metric_store_index_size_bytes | gauge | |
| Number of unique series stored in the index | metric_store_series_count | gauge | |
| Number of unique measurements stored in the index | metric_store_measurements_count | gauge | |
| Number of errors encountered reading from the storage engine | metric_store_read_errors_total | counter | |
| Time spent retrieving tag values from the storage engine | metric_store_tag_values_query_duration_seconds | gauge | |
| Time spent retrieving measurement names from the storage engine | metric_store_measurement_names_query_duration_seconds | gauge | |
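Because Metric Store drops the oldest data when the disk fills, the shard and retention metrics above can hint at whether disk size is what limits how much history is kept. The following is a rough sketch of that check. The function name and the `desired_days` parameter are hypothetical and stand in for whatever retention target your deployment uses. The sketch also assumes the two pruned-shard readings come from the same process with no restart in between.

```python
def disk_limits_retention(
    pruned_before: float,
    pruned_after: float,
    storage_days: float,
    desired_days: float,
) -> bool:
    """Return True if disk pressure appears to be shortening retention.

    pruned_before, pruned_after: two readings of metric_store_pruned_shards_total.
    storage_days: current reading of metric_store_storage_days.
    desired_days: hypothetical retention target chosen by the operator.
    """
    # Shards were removed for disk space between the two readings.
    pruned_recently = pruned_after > pruned_before
    # Less history is on disk than the operator wants to keep.
    below_target = storage_days < desired_days
    return pruned_recently and below_target
```

A True result suggests that a larger persistent disk may be worth considering. Shards counted by metric_store_expired_shards_total are removed by age rather than disk space, so they are left out of this check.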

Metric Store Remote Metrics

| Reports | Metric | Type | Notes |
|---|---|---|---|
| Size of a replayer queue | metric_store_replayer_disk_usage_bytes | gauge | |
| Number of errors encountered writing to a replayer queue | metric_store_replayer_queue_errors_total | counter | |
| Number of bytes written to a replayer queue | metric_store_replayer_queued_bytes_total | counter | |
| Number of errors encountered reading from a replayer queue | metric_store_replayer_read_errors_total | counter | |
| Number of errors encountered replaying writes to a remote node | metric_store_replayer_replay_errors_total | counter | |
| Number of bytes successfully replayed to a remote node | metric_store_replayer_replayed_bytes_total | counter | |
| Number of points dropped while writing to a remote node | metric_store_dropped_points_total | counter | |
| Number of points successfully distributed to a remote node | metric_store_distributed_points_total | counter | |
| Time spent distributing points to a remote node | metric_store_distributed_request_duration_seconds | gauge | |
| Number of points collected by a metric-store instance from remote nodes | metric_store_collected_points_total | counter | |
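One way to read the replayer counters together is to compare how many bytes were queued with how many were replayed over the same interval. The following sketch does that. The function name is hypothetical, and the interpretation (that a positive difference means the queue is falling behind) is an assumption, not documented Metric Store behavior. Counter resets are not handled here.

```python
def replayer_backlog_growth(
    queued_before: float,
    queued_after: float,
    replayed_before: float,
    replayed_after: float,
) -> float:
    """Return bytes queued minus bytes replayed over one interval.

    queued_*: two readings of metric_store_replayer_queued_bytes_total.
    replayed_*: two readings of metric_store_replayer_replayed_bytes_total,
    taken at the same two points in time.
    """
    queued = queued_after - queued_before
    replayed = replayed_after - replayed_before
    # Positive: more was queued than replayed, so the backlog likely grew.
    return queued - replayed
```

A result that stays positive across several intervals, together with a rising metric_store_replayer_disk_usage_bytes, may point to a remote node that is slow or unreachable.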
