Automation Config exposes several system metrics that can be used for monitoring and diagnostics. These metrics are available in graphical form on the Automation Config user interface dashboard and in machine-readable form from the /metrics HTTP endpoint.
For more information about visualizing reports in the Automation Config user interface using the Dashboard, see Dashboard Reports.
Machine-readable metrics
Automation Config exports system metrics in OpenMetrics text-based format. This format is directly consumable by Prometheus and other monitoring and alerting tools.
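To give a feel for what a consumer of this text format sees, the following minimal sketch parses a few sample lines using only the Python standard library. The sample values and the redis_instance label value are made up for illustration; the metric names are taken from the Metric descriptions table. The parser handles only simple `name{labels} value` lines and is not a full OpenMetrics implementation.

```python
import re

# Made-up sample in the text exposition format; real output will differ.
SAMPLE = """\
# TYPE sse_minions gauge
sse_minions 42.0
# TYPE redis_memory_bytes gauge
redis_memory_bytes{redis_instance="localhost:6379"} 1048576.0
# EOF
"""

# Metric name, optional {labels}, then a value (timestamps and exemplars are ignored).
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, TYPE/HELP metadata, and the EOF marker
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Calling `parse_samples(SAMPLE)` yields one tuple per sample line, which is usually enough for ad hoc inspection. For production monitoring, a dedicated tool such as Prometheus is the better fit.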
Automation Config metrics configuration
Configuration for system metrics collection consists of the following settings in the /etc/raas/raas configuration file. Default values are shown.

    # System metrics settings
    metrics:
      enabled: true                # If True, enable the collection of system metrics
      prometheus: false            # If True, enable the Prometheus endpoint at /metrics
      prometheus_username:         # Static username for retrieving /metrics
      prometheus_password:         # Static password for retrieving /metrics
      snapshot_interval: 60        # How often to record snapshot metrics, in seconds
      max_query_timedelta: 86400   # Maximum timedelta for a single call to get_system_metrics, in seconds
      keep: 30                     # How long to retain metrics data, in days

The following settings control the handling of machine-readable system metrics:
- To disable metrics collection, set enabled: false. Note that this will also disable the Automation Config built-in dashboard.
- To enable the export of machine-readable metrics from the /metrics HTTP endpoint, set prometheus: true. This setting does not affect the Automation Config built-in dashboard.
- Access to the /metrics HTTP endpoint is controlled by HTTP Basic Authentication with credentials configured in prometheus_username and prometheus_password. Non-empty values are required for these settings to enable the /metrics endpoint. These credentials are stored only in the /etc/raas/raas configuration file, are not associated with any Automation Config account, and cannot be used to authenticate to Automation Config other than for accessing the /metrics HTTP endpoint.
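As a quick check that the endpoint and credentials work, the metrics can be fetched by hand. The sketch below uses only the Python standard library. The address http://localhost:8080 and the credentials prometheus / metrics are placeholders that mirror the Prometheus example later in this document; substitute the values from your own deployment.

```python
import base64
import urllib.request


def build_metrics_request(base_url, username, password):
    """Build a GET request for /metrics with an HTTP Basic Authentication header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + "/metrics")
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch_metrics(base_url, username, password, timeout=10):
    """Return the raw metrics text; urllib raises HTTPError on 401 and other error codes."""
    request = build_metrics_request(base_url, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (placeholder address and credentials):
# text = fetch_metrics("http://localhost:8080", "prometheus", "metrics")
```

If the call fails with an HTTP error, check that prometheus: true is set and that both credential settings are non-empty, as described above.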
The other settings shown above relate to the Automation Config built-in dashboard and do not affect the collection or reporting of machine-readable system metrics. In particular, the snapshot_interval setting determines how often, in seconds, metrics are recorded for display on the dashboard, and the keep setting determines how long, in days, metrics data will be kept in the database before being trimmed.
Although the /metrics HTTP endpoint is the recommended way to gather machine-readable metrics data from Automation Config, you can use the API (RaaS) to retrieve the data presented on the built-in dashboard. The stats.get_system_metrics() API call lets you query metrics data by metric name, source, and date range. The configuration item max_query_timedelta limits how much data Automation Config will return from a single API call. To get metrics data from a longer time span, you can make multiple API calls with different start and end dates.
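One way to split a long time span into several calls is sketched below. It only computes the date windows; the `query` callable is a hypothetical wrapper around stats.get_system_metrics() that you would write yourself, because the exact parameters of that call are not shown in this document. The default of 86400 seconds matches the default max_query_timedelta.

```python
from datetime import datetime, timedelta


def split_time_range(start, end, max_seconds=86400):
    """Yield (window_start, window_end) pairs, each no longer than max_seconds."""
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")
    step = timedelta(seconds=max_seconds)
    window_start = start
    while window_start < end:
        window_end = min(window_start + step, end)
        yield window_start, window_end
        window_start = window_end


def collect_metrics(query, start, end, max_seconds=86400):
    """Call query(window_start, window_end) once per window and concatenate the results.

    `query` is a hypothetical callable wrapping stats.get_system_metrics();
    it is assumed to return a list of data points for the given window.
    """
    results = []
    for window_start, window_end in split_time_range(start, end, max_seconds):
        results.extend(query(window_start, window_end))
    return results
```

For example, `list(split_time_range(datetime(2024, 1, 1), datetime(2024, 1, 3, 12)))` produces three windows: two full days and one half day. Adjacent windows share a boundary timestamp, so depending on how the API treats range endpoints you may need to drop duplicate points at the boundaries.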
Salt Master metrics configuration
The availability of some metrics depends on the configuration of the Salt masters connected to Automation Config:
- Metrics on Salt events and job returns will be accurate only if the sseapi returner is configured on the Salt masters. Metrics collection will not work properly with the sse_pgjsonb (direct-to-database) returner.
- Automation Config will collect low-level function runtime information from Salt masters that have master_stats: true set in their configuration. This option is disabled by default. See master_stats in the Salt documentation for details.
- Metrics on Salt job states will be accurate only if the job completion engine is enabled in the Salt Master Plugin:

      engines:
        - jobcompletion: {}

  This engine is enabled in the Salt Master Plugin default configuration.
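The master-side requirements above can be checked mechanically. The sketch below assumes the Salt master configuration has already been loaded into a Python dict by some other means (loading is not shown) and checks only the master_stats and engines settings named above. The function names are illustrative and are not part of Salt or Automation Config.

```python
def enabled_engines(config):
    """Return the set of engine names found in a master config dict."""
    engines = config.get("engines") or []
    if isinstance(engines, dict):
        return set(engines)  # engines given as a mapping of name -> options
    names = set()
    for entry in engines:
        if isinstance(entry, dict):
            names.update(entry)  # list form, e.g. [{"jobcompletion": {}}]
        elif isinstance(entry, str):
            names.add(entry)
    return names


def check_master_metrics_config(config):
    """Return a list of warnings about settings that affect metrics accuracy."""
    warnings = []
    if not config.get("master_stats", False):
        warnings.append(
            "master_stats is not enabled; salt_master_cmd_duration_seconds will not be reported"
        )
    if "jobcompletion" not in enabled_engines(config):
        warnings.append(
            "jobcompletion engine is not enabled; Salt job state metrics may be inaccurate"
        )
    return warnings
```

An empty list means both settings are in place. The returner configuration is not checked here because its configuration keys are not described in this document.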
Configuring Prometheus to connect to Automation Config
You can enable a Prometheus server to scrape metrics from Automation Config by adding a scrape_configs job to the Prometheus configuration (typically prometheus.yml) for each API (RaaS) server instance you have:

    scrape_configs:
      - job_name: 'sse'
        metrics_path: '/metrics'
        scheme: 'http'
        static_configs:
          - targets: ['localhost:8080']
        basic_auth:
          username: prometheus
          password: metrics

The credentials in the Prometheus configuration should match the prometheus_username and prometheus_password specified in the /etc/raas/raas configuration file, as noted above.
See the Prometheus project documentation for more information on setting up scrape targets and other Prometheus configuration topics.
Metric descriptions
The machine-readable metrics that Automation Config exports fall into several categories:
Category | Metric Name | Metric Type | Labels | Description |
---|---|---|---|---|
Salt master low-level metrics | salt_event_size_bytes | Histogram | master_id | Salt event size, in bytes |
Salt master low-level metrics | salt_master_cmd_duration_seconds | Histogram | master_id, cmd | Salt master command duration, in seconds. Reported only if master_stats is configured on the Salt master. |
Salt Master Plugin metrics | raas_master_commands_processed | Counter | master_id | SSE commands processed |
Salt Master Plugin metrics | raas_master_master_grains_pushed | Counter | master_id | Salt master grain updates pushed to Automation Config |
Salt Master Plugin metrics | raas_master_minion_keys_pushed | Counter | master_id | Minion key state updates pushed to Automation Config |
Salt Master Plugin metrics | raas_master_minion_cached_pushed | Counter | master_id | Minion cache updates pushed to Automation Config |
Salt Master Plugin metrics | raas_master_masterfs_pushed | Counter | master_id | MasterFS updates pushed to Automation Config |
Salt Master Plugin metrics | raas_master_sseapi_engine_iteration_seconds | Histogram | master_id | API (RaaS) engine iteration duration, in seconds |
Server metrics | redis_commands_executed | Counter | redis_instance | Redis commands executed (system cache) |
Server metrics | redis_memory_bytes | Gauge | redis_instance | Redis memory usage (system cache) |
Server metrics | celery_tasks_queued | Counter | raas_instance, task | Celery tasks queued (background jobs) |
Server metrics | celery_tasks_executed | Counter | raas_instance, task | Celery tasks executed (background jobs) |
Server metrics | celery_queue_length | Gauge | raas_instance | Celery queue length (background jobs waiting) |
Server metrics | raas_rpc_request_duration_seconds | Histogram | raas_instance | SSE RPC API call duration, in seconds |
PostgreSQL metrics | postgres_connections | Gauge | postgres_instance | Postgres connections |
PostgreSQL metrics | postgres_transactions | Counter | postgres_instance | Postgres transactions committed |
PostgreSQL metrics | postgres_rows_read | Counter | postgres_instance | Postgres rows read |
PostgreSQL metrics | postgres_rows_inserted | Counter | postgres_instance | Postgres rows inserted |
PostgreSQL metrics | postgres_rows_updated | Counter | postgres_instance | Postgres rows updated |
PostgreSQL metrics | postgres_rows_deleted | Counter | postgres_instance | Postgres rows deleted |
System metrics | highstate_minions | Gauge | None | Number of minions that ran a highstate job |
System metrics | highstate_minions_changed | Gauge | None | Number of minions that ran a highstate job resulting in one or more changes |
System metrics | highstate_minions_succeeded | Gauge | None | Number of minions that ran a highstate job with no failures |
System metrics | highstate_minion_duration_seconds | Gauge | None | Average per-minion duration for a highstate run |
System metrics | highstate_states | Gauge | None | Number of unique states applied in highstate runs |
System metrics | highstate_states_changed | Gauge | None | Number of states applied in highstate runs that resulted in one or more changes |
System metrics | highstate_states_succeeded | Gauge | None | Number of states applied in highstate runs with no failures |
System metrics | sse_jobs_in_progress | Counter | None | Automation Config jobs in progress |
System metrics | sse_jobs_complete_all_successful | Counter | None | Automation Config jobs complete with all successful returns |
System metrics | sse_jobs_complete_missing_returns | Counter | None | Automation Config jobs complete with one or more missing returns |
System metrics | sse_jobs_complete_with_errors | Counter | None | Automation Config jobs complete with one or more errors |
System metrics | sse_masters | Gauge | None | Total Salt masters in Automation Config |
System metrics | sse_minion_grains_deleted | Counter | master_id | Number of minion grains deleted |
System metrics | sse_minion_grains_indexing_duration_seconds | Counter | raas_instance | Minion grains indexing calculations duration, in seconds |
System metrics | sse_minion_grains_saved | Counter | master_id | Number of minion grains saved |
System metrics | sse_minion_target_match_calcs | Counter | raas_instance | Number of minion versus target group matching calculations |
System metrics | sse_minion_target_match_duration_seconds | Counter | raas_instance | Minion versus target group matching calculations duration, in seconds |
System metrics | sse_minions | Gauge | None | Total minions in Automation Config |
System metrics | sse_minions_present | Gauge | master_id | Minions present within the configured time limit (raas_presence_expiration) |
System metrics | sse_minions_lost | Gauge | master_id | Minions not present within the time limit |
System metrics | sse_minions_unknown | Gauge | master_id | Unknown minions (never present) |
System metrics | sse_users_authenticated | Gauge | None | Users authenticated to Automation Config |
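
Histogram metrics in the Prometheus and OpenMetrics text formats are exposed as several series, including a `_sum` and a `_count` series per label set. The sketch below shows how an average could be derived from those two series for one of the histograms listed above. The sample values and the raas_instance label value are made up, and the input is assumed to be a list of (name, labels, value) tuples that you have already parsed from the /metrics output.

```python
def histogram_means(samples, metric):
    """Map each label set to sum / count for one histogram metric.

    `samples` is an iterable of (name, labels_dict, value) tuples.
    Label sets with a zero or missing count are skipped.
    """
    sums, counts = {}, {}
    for name, labels, value in samples:
        key = tuple(sorted(labels.items()))
        if name == metric + "_sum":
            sums[key] = value
        elif name == metric + "_count":
            counts[key] = value
    return {key: sums[key] / counts[key] for key in sums if counts.get(key)}


# Made-up values for illustration only.
SAMPLES = [
    ("raas_rpc_request_duration_seconds_sum", {"raas_instance": "raas-1"}, 12.5),
    ("raas_rpc_request_duration_seconds_count", {"raas_instance": "raas-1"}, 50.0),
]

means = histogram_means(SAMPLES, "raas_rpc_request_duration_seconds")
# means maps (("raas_instance", "raas-1"),) to 0.25 seconds per call
```

The result is an average over everything recorded since the counters were last reset, not over a recent window. For windowed averages or percentiles, a Prometheus query over the scraped data is the more usual approach.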