You can monitor the health of the Redis for VMware Tanzu Application Service service using the logs, metrics, and Key Performance Indicators (KPIs) generated by Redis for Tanzu Application Service component VMs.

Loggregator

Redis metrics are emitted through Loggregator through the Reverse Log Proxy and Log Cache. You can use third-party monitoring tools to consume Redis metrics to monitor Redis performance and health. The Loggregator Firehose architecture endpoint is being deprecated.

As an example of how to display KPIs and metrics without the Firehose, see the CF Redis example dashboard in GitHub. This example uses Datadog. However, VMware does not endorse or support any third-party solution.

Metrics polling interval

The metrics polling interval defaults to 30 seconds. You can change this by navigating to the Metrics configuration page in Tanzu Operations Manager and entering a new value in Metrics polling interval (min: 10).

Screenshot of the metrics polling interval field. The minimum value allowed is ten seconds.

Metrics are emitted in the following format:

origin:"p-redis" eventType:ValueMetric timestamp:1480084323333475533 deployment:"cf-redis" job:"cf-redis-broker" index:"{redacted}" ip:"10.0.1.49" valueMetric:<name:"_p_redis_service_broker_shared_vm_plan_available_instances" value:4 unit:"" >

Critical logs

VMware recommends operators set up alerts on critical logs to help prevent further degradation of the Redis service. For examples of critical logs for service backups, including log messages for failed backups, backups with errors, and backups that failed to upload to destinations, see Troubleshooting in the Service Backups documentation.

Healthwatch

The Healthwatch service monitors and alerts on the current health, performance, and capacity of your service instances. By default, the Healthwatch dashboard displays core metrics and alerts configured for recommended thresholds.

For more information, see Using Healthwatch.

Key performance indicators

Key Performance Indicators (KPIs) for Redis for VMware Tanzu Application Service are metrics that operators find most useful for monitoring their Redis service to ensure smooth operation. KPIs are high-signal-value metrics that can indicate emerging issues. KPIs can be raw component metrics or derived metrics generated by applying formulas to raw metrics.

VMware recommends the following KPIs for general alerting and response with typical Redis for Tanzu Application Service installations. If using Healthwatch, some core metrics are configured by default using the recommended thresholds below. VMware recommends that operators continue to fine-tune the alert measures to their installation by observing historical trends. VMware also recommends that operators expand beyond this guidance and create new, installation-specific monitoring metrics, thresholds, and alerts based on learning from their own installations.

For how to create custom service alerts for Healthwatch, see Configuring Healthwatch alerts.

For a list of all other Redis metrics, see Other Redis metrics.

Redis for Tanzu Application Service KPIs

Total instances for on-demand service


total_instances

Description Total instances provisioned by app developers across all on-demand services and for a specific on-demand plan

Use: Track instance use by app developers.

Origin: Doppler/Firehose
Type: count
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Daily
Recommended alert thresholds Yellow warning: N/A
Red critical: N/A
Recommended response N/A

Quota remaining for on-demand service


quota_remaining

Description Number of available instances across all on-demand services and for a specific on-demand plan.

Use: Track remaining resources available for app developers.

Origin: Doppler/Firehose
Type: count
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Daily
Recommended alert thresholds Yellow warning: 3
Red critical: 0
Recommended response Increase quota allowed for the specific plan or across all on-demand services.

Total instances for shared VM service


_p_redis_service_broker_shared_vm_plan_total_instances

Description Total instances provisioned for shared-VM services.

Use: Track total shared-VM instances available for app developers.

Origin: Doppler/Firehose
Type: count
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement App-specific
Recommended alert thresholds Yellow warning: N/A
Red critical: N/A
Recommended response N/A

Redis KPIs

The metrics in this section can be used for on-demand and shared-VM service instances. You can differentiate between these service instance metrics as follows:

  • On-demand service instances:

    • Have origin p.redis
  • Shared-VM service instances:

    • Have origin p-redis
    • Their names are pre-pended with _p_redis_shared_vm_SHARED_INSTANCE_GUID/. SHARED-INSTANCE-GUID can be retrieved by running cf service SERVICE-NAME –guid.

Percent of persistent disk used


disk.persistent.percent

Description Percentage of persistent disk being used on a VM. The persistent disk is specified as an IaaS-specific disk type with a size. For example, pd-standard on GCP, or st1 on AWS, with disk size 5 GB. This is a metric relevant to the health of the VM. A percentage of disk usage approaching 100 causes the VM disk to become unusable as no more files are allowed to be written.

Use: Redis is an in-memory datastore that uses a persistent disk to backup and restore the dataset in case of upgrades and VM restarts.

Origin: BOSH HM
Type: percent
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: >75
Red critical: >90
Recommended response Ensure that the disk is at least 2.5x the VM memory for the on-demand broker and 3.5x the VM memory for cf-redis-broker. If it is, then contact VMware Tanzu Support. If it is not, then increase disk space.

Used memory percent


info.memory.used_memory / info.memory.maxmemory

Description The ratio of these two metrics returns the percentage of available memory used:
  • info.memory.used_memory is a metric of the total number of bytes allocated by Redis using its allocator (either standard libc, jemalloc, or an alternative allocator such as tcmalloc).
  • maxmemory is a configuration option for the total memory made available to the Redis instance.
Use: This is a performance metric that is most critical for Redis instances with a maxmemory-policy of allkeys-lru

Origin: Doppler/Firehose
Type: percentage
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement App-specific based on velocity of data flow. Some options are:

  1. Individual data points---Use if key eviction is in place, for example, in cache use cases.
  2. Average over last 10 minutes---Use if this gives you enough detail.
  3. Maximum of last 10 minutes
If key eviction is not in place, options 1 or 3 give more useful information to ensure that high usage triggers an alert.
Recommended alert thresholds Yellow warning: 80% Not applicable for cache usage. When used as a cache, Redis typically uses up to maxmemory and then evict keys to make space for new entries.

A different threshold might be appropriate for specific use cases of no key eviction, to account for reaction time. Factors to consider:

  1. Traffic load on app---Higher traffic means that Redis memory fills up faster.
  2. Average size of data added/ transaction---The more data added to Redis on a single transaction, the faster Redis fills up its memory.
Red critical: 90%. See warning-specific threshold information.
Recommended response No action assuming the maxmemory policy set meets your apps needs. If the maxmemory policy does not persist data as you want, either coordinate a backup cadence or update your maxmemory policy if using the on-demand Redis service.

Connected clients


info.clients.connected_clients

Description Number of clients currently connected to the Redis instance.

Use: Redis does not close client connections. They remain open until closed explicitly by the client or another script. Once the connected_clients reaches maxclients, Redis stops accepting new connections and begins producing ERR max number of clients reached errors.

Origin: Doppler/Firehose
Type: number
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: App-specific. When connected clients reaches max clients, no more clients can connect. This alert must be at the level where it can tell you that your app has scaled to a certain level and can require action.
Red critical: App-specific. When connected clients reaches max clients, no more clients can connect. This alert must be at the level where it can tell you that your app has scaled to a certain level and can require action.
Recommended response Increase max clients for your instance if using the on-demand service, or reduce the number of connected clients.

Blocked clients


info.clients.blocked_clients

Description The number of clients currently blocked waiting for a blocking request they have made to the Redis server. Redis provides two types of primitive commands to retrieve items from lists: standard and blocking. This metric concerns the blocking commands.

Standard Commands

The standard commands (LPOP, RPOP, RPOPLPUSH) immediately return an item from a list. If there are no items available the standard pop commands return nil.

Blocking Commands

The blocking commands (BLPOP, BRPOP, BRPOPLPUSH) wait for an empty list to become non-empty. The client connection is blocked until an item is added to the lists it is watching. Only the client that made the blocking request is blocked, and the Redis server continues to serve other clients.

The blocking commands each take a timeout argument that is the time in seconds the server waits for a list before returning nil. A blocking command with timeout 0 waits forever. Multiple clients can be blocked waiting for the same list. For details of the blocking commands, see: https://redis.io/commands/blpop.

Use: Blocking commands can be useful to avoid clients regularly polling the server for new data. This metric tells you how many clients are currently blocked due to a blocking command.

Origin: Doppler/Firehose
Type: number
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement App-specific. Change from baseline can be more significant than actual value.
Recommended alert thresholds Yellow warning: The expected range of the blocked_clients metric depends on what Redis is being used for:
  • Many uses have no need for blocking commands and expect blocked_clients to always be zero.
  • If blocking commands are being used to force a recipient client to wait for a required input, a raised blocked_clients might suggest a problem with the source clients.
  • blocked_clients might be expected to be high in situations where Redis is being used for infrequent messaging.
If blocked_clients is expected to be non-zero, warnings could be based on change from baseline. A sudden rise in blocked_clients could be caused by source clients failing to provide data required by blocked clients.

Red critical: There is no blocked_clients threshold critical to the function of Redis. However, a problem that is causing blocked_clients to rise might often cause a rise in connected_clients. connected_clients does have a hard upper limit and can be used to trigger alerts.
Recommended response Analysis could include:

  • Checking the connected_clients metric. blocked_clients would often rise in concert with connected_clients.
  • Establishing whether the rise in blocked_clients is accompanied by an overall increase in apps connecting to Redis, or by an asymmetry in clients providing and receiving data with blocking commands
  • Considering whether a change in blocked_clients is most likely caused by oversupply of blocking requests or undersupply of data
  • Considering whether a change in network latency is delaying the data from source clients
In general, a rise or change in blocked_clients is more likely to suggest a problem in the network or infrastructure, or in the function of client apps, rather than a problem with the Redis service.

Memory fragmentation ratio


info.memory.mem_fragmentation_ratio

Description Ratio of the amount of memory allocated to Redis by the OS to the amount of memory that Redis is using

Use: A memory fragmentation less than one shows that the memory used by Redis is higher than the OS available memory. In other packagings of Redis, large values reflect memory fragmentation. For Redis for Tanzu Application Service, the instances only run Redis, meaning that no other processes are affected by a high fragmentation ratio (e.g., 10 or 11).

Origin: Doppler/Firehose
Type: ratio
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Average over last 10 minutes
Recommended alert thresholds Yellow warning: < 1. Less than 1 indicates that the memory used by Redis is higher than the OS available memory which can lead to performance degradations.
Red critical: Same as warning threshold.
Recommended response Restart the Redis server to normalize fragmentation ratio.

Instantaneous operations per second


info.stats.instantaneous_ops_per_sec

Description The number of commands processed per second by the Redis server. The instantaneous_ops_per_sec is calculated as the mean of the recent samples taken by the server. The number of recent samples is hardcoded as 16 in the implementation of Redis.

Use: The higher the commands processed per second, the better the performance of Redis. This is because Redis is single threaded and the commands are processed in sequence. A higher throughput would thus mean faster response per request which is a direct indicator of higher performance. A drop in the number of commands processed per second as compared to historical norms could be a sign of either low command volume or slow commands blocking the system. Low command volume could be normal, or it could be indicative of problems upstream.

Origin: Doppler/Firehose
Type: count
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement Every 30 seconds
Recommended alert thresholds Yellow warning: A drop in the count compared to historical norms could be a sign of either low command volume or slow commands blocking the system. Low command volume could be normal, or it could be indicative of problems upstream. Slow commands could be due to a latency issue, a large number of clients being connected to the same instance, memory being swapped out, etc. Thus, the count is possibly a symptom of compromised Redis performance. However, this is not the case when low command volume is expected.

Red critical: A very low count or a large drop from previous counts might indicate a downturn in performance that should result in an investigation. That is unless the low traffic is expected behavior.
Recommended response A drop in the count can be a symptom of compromised Redis performance. The following are possible responses:

  1. Identify slow commands using the slowlog:
    Redis logs all the commands that take more than a specified amount of time in slowlog. By default, this time is set to 20ms and the slowlog is allowed a maximum of 120 commands. For the purposes of slowlog, execution time is the time taken by Redis alone and does not account for time spent in I/O. So it would not log slow commands solely due to network latency.

    Given that typical commands, including network latency, take about 200ms, a 20ms Redis execution time is 100 times slower. This could be indicative of memory management issues wherein Redis pages have been swapped to disk.

    To see all the commands with slow Redis execution times, type slowlog get in the redis-cli.
  2. Monitor client connections:
    Because Redis is single threaded, one process services requests from all clients. As the number of clients grows, the percentage of resource time given to each client decreases and each client spends an increasing time waiting for their share of Redis server time.

    Monitoring the number of clients can be important because there might be apps creating connections that you did not expect or your app might not be efficiently closing unused connections.

    The connected clients metrics can be used to monitor this. This can also be viewed from the redis-cli using the command info clients.
  3. Limit client connections:
    This currently defaults to 10000, but depending on the app, you might want to limit this further. To do this, run CONFIG SET maxclients NUMBER-OF-CONNECTIONS in the redis-cli. You can configure this for On-Demand service instances in Tanzu Operations Manager. Connections that exceed the limit are rejected and closed immediately.

    It is important to set maxclients to limit the number of unintended client connections. Set maxclients to 110% to 150% of your expected peak number of connections. In addition, because an error message is returned for failed connection attempts, the maxclient limit warns you that a significant number of unexpected connections are occurring. This helps maintain optimal Redis performance.
  4. Improve memory management:
    Poor memory can cause increased latency in Redis. If your Redis instance is using more memory than is available, the operating system swaps parts of the Redis process from out of physical memory and on to disk. Swapping significantly reduces Redis performance because reads from disk are about five orders or magnitude slower than reads from physical memory.

Keyspace hits / keyspace misses + keyspace hits


info.stats.keyspace_hits / info.stats.keyspace_misses + info.stats.keyspace_hits

Description Hit ratio to determine share of keyspace hits that are successful

Use: A small hit ratio (less than 60%) indicates that many lookup requests are not found in the Redis cache and apps are being forced to revert to slower resources. This might indicate that cached values are expiring too quickly or that a Redis instance has insufficient memory allocation and is deleting volatile keys.

Origin: Doppler/Firehose
Type: ratio
Frequency: 30s (default), 10s (configurable minimum)
Recommended measurement App-specific
Recommended alert thresholds Yellow warning: App-specific. In general depending how an app is using the cache, an expected hit ratio value can vary between 60% to 99% . Also, the same hit ratio values can mean different things for different apps. Every time an app gets a cache miss, it will probably go to and fetch the data from a slower resource. This cache miss cost can be different per app. The app developers might be able to provide a threshold that is meaningful for the app and its performance

Red critical: App-specific. See the warning threshold above.
Recommended response App-specific. See the warning threshold above. Work with app developers to understand the performance and cache configuration required for their apps.

BOSH Health Monitor metrics

The BOSH layer that underlies Tanzu Operations Manager generates healthmonitor metrics for all VMs in the deployment. As of Tanzu Operations Manager v2.0, these metrics are in the Loggregator Firehose by default. For more information, see BOSH System Metrics Available in Loggregator Firehose in VMware Tanzu Application Service for VMs Release Notes.

Other Redis metrics

Redis also exposes the following metrics. for more information, see the Redis documentation.

  • arch_bits
  • uptime_in_seconds
  • uptime_in_days
  • hz
  • lru_clock
  • client_longest_output_list
  • client_biggest_input_buf
  • used_memory_rss
  • used_memory_peak
  • used_memory_lua
  • loading
  • rdb_bgsave_in_progress
  • rdb_last_save_time
  • rdb_last_bgsave_time_sec
  • rdb_current_bgsave_time_sec
  • aof_rewrite_in_progress
  • aof_rewrite_scheduled
  • aof_last_rewrite_time_sec
  • aof_current_rewrite_time_sec
  • total_connections_received
  • total_commands_processed
  • instantaneous_ops_per_sec
  • total_net_input_bytes
  • total_net_output_bytes
  • instantaneous_input_kbps
  • instantaneous_output_kbps
  • rejected_connections
  • sync_full
  • sync_partial_ok
  • sync_partial_err
  • expired_keys
  • evicted_keys
  • keyspace_hits
  • keyspace_misses
  • pubsub_channels
  • pubsub_patterns
  • latest_fork_usec
  • migrate_cached_sockets
  • repl_backlog_active
  • repl_backlog_size
  • repl_backlog_first_byte_offset
  • repl_backlog_histlen
  • used_cpu_sys
  • used_cpu_user
  • used_cpu_sys_children
  • used_cpu_user_children
  • rdb_last_bgsave_status
  • aof_last_bgrewrite_status
  • aof_last_write_status
check-circle-line exclamation-circle-line close-line
Scroll to top icon