Monitoring Clusters

This topic describes how to display cluster metrics and interpret the graphs provided by the VMware Tanzu GemFire Management Console to help you monitor connected clusters.

Monitoring Controls: Information about the monitoring controls used when displaying the metrics for a GemFire cluster.
Graphs: Information about using and interpreting the graphs.

For information about configuring monitoring in the GemFire Management Console, see Configuring Monitoring.

If a Prometheus server is not used, which is the default setting, only a subset of the graphs and functionality is available. In that case, you can view only GET throughput, PUT throughput, GET latency, and PUT latency.

Monitoring Controls

To display the metrics for a GemFire cluster, select the cluster from the Dashboard and navigate to the Monitoring tab. The monitoring controls on this tab let you examine GemFire cluster metrics in greater detail.

Monitoring controls

You can modify the way the data is presented using the following features on the Monitoring tab:

Member Selector
Scale Selector
Date Selector
Refresh Rate Selector
Metrics Enabled Selector

Member Selector

Monitoring controls: member selector

The member selector drop-down filters the metrics and data displayed on the graphs. By default, the member selector is set to All Members and displays metrics from all members of the cluster.

Use the drop-down menu to select one or more members to view data for only the selected members.

Scale Selector

Monitoring controls: scale selector

The scale selector allows you to set the duration of the data window to show for all monitored data. The default window is seven days, allowing you to view a week’s worth of data at once.

Use the drop-down menu to set the duration of the data window. The maximum setting is seven days.

Date Selector

Monitoring controls: date selector

The date selector control allows you to navigate through the data. Two sets of controls are provided. The single arrows < and > shift the data view backward or forward by the amount of the scale setting, with the following exceptions:

When the scale selector is set to 3 days, the < and > arrows shift the data view backward or forward one day at a time.
When the scale selector is set to 7 days, the single arrows are disabled, because the maximum data storage and viewing capacity is seven days.

The double arrows “<<” and “>>” jump to the most recent or the oldest data.
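The single-arrow rules above can be summarized in a short Python sketch. The function name single_arrow_step is hypothetical, and any scale value other than 3 days and 7 days is assumed purely for illustration; this is not the console's own implementation.

```python
from datetime import timedelta
from typing import Optional

# Maximum data storage and viewing capacity described in this topic.
MAX_WINDOW = timedelta(days=7)


def single_arrow_step(scale: timedelta) -> Optional[timedelta]:
    """Return how far one click of < or > moves the view, or None if disabled.

    Hypothetical helper: it only restates the documented rules.
    """
    if scale >= MAX_WINDOW:
        # The window already covers all stored data, so the arrows are disabled.
        return None
    if scale == timedelta(days=3):
        # Documented exception: a 3-day window moves one day at a time.
        return timedelta(days=1)
    # Otherwise the view shifts by the scale setting itself.
    return scale
```

For example, with an assumed scale of six hours, single_arrow_step(timedelta(hours=6)) returns a six-hour step, while a seven-day scale returns None.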

Refresh Rate Selector

Monitoring controls: refresh rate selector

The refresh rate selector sets how often the graphs refresh with new data. To stop the graphs from refreshing, select Pause Refresh.

Metrics Enabled Selector

Monitoring controls: metrics enabled selector

The metrics enabled selector allows you to enable or disable the cluster metrics monitoring. When metrics are disabled, the GemFire Management Console does not retrieve metrics from the Prometheus server, and the Prometheus server does not collect metrics from the GemFire cluster.

Graphs

This section describes how to use and interpret the graphs provided by the VMware Tanzu GemFire Management Console to help you monitor connected clusters.

These graphs allow you to see metrics and statistics for connected clusters. This includes information such as GET and PUT throughput, CPU Utilization percentage, and WAN receiver throughput (when applicable).

For information about the controls for the graphs and the three primary areas where the Management Console displays graphs, see the sections below.

Graph Controls
Data
Cluster
WAN Gateway

Graph Controls

Each of the GemFire Management Console graphs includes the following controls:

arrows pointing to information, + and -, and reset

A: The information icon displays details about the data shown on the graph. In some cases, it might also highlight potential issues indicated by the data and suggest steps for addressing them.

B: The “plus” (+) and “minus” (-) buttons allow you to zoom in or out on individual graphs, offering a closer or broader view of the data. While using this function, data refresh on the specific graph is temporarily paused to facilitate detailed examination.

C: The reset button returns the graph to its default scale.

Data

These metrics provide insight into the data flow with the GemFire cluster.

Gets Throughput: The throughput of all ‘Get’ operations performed across the entire cache. Right axis (red): The ‘Get’ operation requests by the remote client applications. Left axis (blue): The ‘Get’ operation requests by the remote client applications and distributed ‘Get’ operations.
Average Get Latency: The total time taken by all ‘Get’ operations performed across the cluster divided by the number of all ‘Get’ operations performed across the cluster, resulting in the average amount of time taken per ‘Get’ operation in the system.
Puts Throughput: The throughput of all ‘Put’ operations performed across the entire cache. Right axis (red): The ‘Put’ operation requests by the remote client applications. Left axis (blue): The ‘Put’ operation requests by the remote client applications and distributed ‘Put’ operations.
Hits Per Second: The number of ‘Get’ operations per second across the entire cluster that resulted in a matched key.
Misses Per Second: The number of ‘Get’ operations per second that resulted in a miss, where the requested key was not in the cache.
Average Put Latency: The total time taken by all ‘Put’ operations performed across the cluster divided by the number of all ‘Put’ operations performed across the cluster, resulting in the average amount of time taken per ‘Put’ operation in the system.
Cache Hit Ratio: The percentage of ‘Get’ operations that return a value for the specified key. The ratio is based on both the system ‘Get’ operations and any other ‘Get’ operation across the cluster.
Function Execution Ratio: Right axis (red): Function Execution Queue size. Left axis (blue): A rate calculated from functionExecutionCalls minus functionExecutionsCompleted, which represents the failed function execution calls per second.
Clients Put Requests Rate by Server: For each server member, the number of ‘Put’ operation requests received per second on that member.
Persistent Region Overcapacity: For persistent regions that hold data in memory and on disk, this chart exists to ensure that these values stay.
Message Queue Size: The size of the GemFire subscription queue for ‘Get’, ‘Put’, ‘Destroy’, and other operations.
Client Query Rate: The GemFire query rate, which is the number of query requests from cache clients performed per second.
Average Query Time: The total time of all query requests (processQueryTime) divided by the number of requests (queryRequests), resulting in the average time spent on any given query.
Region Details: The type, name, and entry count for each region in the cache. The Partition Total rows summarize the corresponding Partitioned region.
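The averaged and ratio-based graphs in the list above reduce to simple arithmetic over counters. The following Python sketch illustrates that arithmetic only. The function names and sample numbers are hypothetical, the sketch assumes that every ‘Get’ operation is either a hit or a miss, and the console's exact calculation (for example, how it handles sampling intervals) may differ.

```python
def average_latency(total_time: float, operation_count: int) -> float:
    """Total time spent on operations divided by the number of operations.

    The same division applies to Average Get Latency, Average Put Latency,
    and Average Query Time (processQueryTime / queryRequests).
    """
    if operation_count == 0:
        return 0.0  # assumption: report zero when nothing was measured
    return total_time / operation_count


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Percentage of 'Get' operations that returned a value for the key."""
    gets = hits + misses  # assumption: each 'Get' is either a hit or a miss
    if gets == 0:
        return 0.0
    return 100.0 * hits / gets


def unfinished_function_calls(calls: int, completed: int) -> int:
    """functionExecutionCalls minus functionExecutionsCompleted."""
    return calls - completed
```

With these hypothetical inputs, average_latency(500.0, 200) gives 2.5 time units per operation, and cache_hit_ratio(90, 10) gives 90.0 percent.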

Cluster

These metrics provide insight into the GemFire cluster itself, such as CPU and memory utilization.

Server Old Gen Utilization: For each server in the cluster, the percentage of that member’s available memory that is in use.
Disk Utilization: Disk utilization by member in the cluster. The percentage remaining is computed by using the statistics gathered by Tanzu GemFire from the underlying system. Some operating systems, such as macOS, do not emit these metrics. If the operating system does not emit metrics, this graph is empty.
CPU Utilization % by Member: The current utilization percentage of this cluster’s CPUs.
Current Client Connects by Member: Displays a line for each server in the cluster and represents the number of external clients connected to that member.
Sampler Delay Duration: Delay duration measures the actual time (in ms) the Tanzu GemFire statistics sampler slept. It calculates the difference between when it went to sleep and when it woke up to sample. Sample time shows how long it took to collect the sample after it woke up.
Cluster Communication Delays: GemFire uses ‘replyWaitsInProgress’ as a means to measure intra-cluster communication and determine a stalled or failing member.
IO Waits: The time spent waiting to read or write.
Abandoned Reads/Second: GemFire monitors the AbandonedReadRequests from the cache server. A rapid increase in this number can indicate an issue with the network or the client application.
% Steal Time: The percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor. In a virtualized environment, your virtual machine (VM) shares resources, including CPU cycles, with other instances on a single host. A steal time above 2% that continues to grow typically indicates that the underlying infrastructure is not handling its VMs properly. If the steal time is greater than 10% for 20 minutes, the VM might be in a state that must be addressed, or performance will degrade significantly.
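The 10% for 20 minutes guideline for % Steal Time can be expressed as a check over a series of readings. The following Python sketch is one possible interpretation, not the console's logic: the function name, the sample format (minute timestamp and steal percentage), and the requirement that the history span the full window are all assumptions.

```python
from typing import Sequence, Tuple


def steal_time_needs_attention(
    samples: Sequence[Tuple[float, float]],
    threshold_pct: float = 10.0,
    window_minutes: float = 20.0,
) -> bool:
    """Return True if steal time stayed above the threshold for the window.

    samples: hypothetical (minute timestamp, steal percent) pairs,
    sorted in ascending time order.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    if latest - samples[0][0] < window_minutes:
        return False  # not enough history to cover the whole window
    # Keep only the readings that fall inside the most recent window.
    recent = [pct for minute, pct in samples if latest - minute <= window_minutes]
    return all(pct > threshold_pct for pct in recent)
```

For example, five readings of 12% taken every five minutes over a 20-minute span would return True, while the same series with one reading at 8% would return False. The separate guideline about a value above 2% that keeps growing is not covered by this sketch.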

WAN Gateway

When a cluster is a part of a multi-site setup, these metrics provide insight into the configured Senders and Receivers of the cluster.

WAN Receiver Throughput: For all of the Gateway Receivers for this cluster, the rate in bytes/second at which data is sent (red) or received (blue). If your WAN-connected clusters fail, you see only a red line, because the rate of received (blue) bytes is zero.
WAN Sender Queue: The rate (messages/second) at which a Gateway Sender is able to send events to a WAN connected cluster.