The VMware administrator uses the Cluster Utilization dashboard with the Cluster Contention dashboard for performance management.
Design Considerations
This dashboard supports the Cluster Contention dashboard. Use it to identify vSphere clusters with high utilization in a selected data center. When utilization exceeds 100%, performance can be negatively impacted, especially when VMs experience contention. By default, VMware Aria Operations has a 5-minute collection interval, so each data point covers 300 seconds of activity. A spike that lasts only a few seconds may not be visible if utilization is low for the remainder of the 300 seconds.
To view the common design considerations among all performance management dashboards, see the Performance Dashboards.
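The effect of the collection interval on short spikes can be illustrated with a small calculation. This is a minimal sketch, assuming the collected value behaves like an average over the 300-second interval. The per-second samples and the function name `interval_average` are hypothetical and are not part of VMware Aria Operations.

```python
# Hypothetical per-second CPU utilization (%) inside one 5-minute interval.
# Assumption: the collected data point behaves like an average over the interval.
SECONDS_PER_INTERVAL = 300


def interval_average(samples):
    """Return the mean of the samples that fall inside one collection interval."""
    return sum(samples) / len(samples)


# A 10-second spike at 100%, followed by 290 seconds at 20%.
samples = [100.0] * 10 + [20.0] * (SECONDS_PER_INTERVAL - 10)

average = interval_average(samples)
peak = max(samples)

print(f"interval average: {average:.1f}%")
print(f"true peak:        {peak:.1f}%")
```

In this illustration the interval value stays close to the baseline even though the cluster briefly ran at full utilization, which is why a short spike can go unnoticed.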
How to Use the Dashboard
- CPU(%) and Memory (%).
- Review the CPU and Memory distribution charts for an overview of the CPU and memory utilization of the clusters.
- The highest value in the last week is used. The average and the 95th percentile are not used because this is utilization, not contention. High utilization does not mean bad performance.
- One week is used instead of one day to give you a longer time horizon that also covers the weekend. Adjust the timeline as you see fit for your operations.
- Expect memory to be higher than CPU, as it is a form of cache. The Memory Consumed counter is used as it is more appropriate than the Memory Active counter.
- Low utilization can actually indicate bad performance, as not much real work gets done. The chart uses dark gray for low utilization.
- Clusters Utilization.
- The cluster utilization table lists all the clusters, sorted by the highest utilization in the last week. If the table shows green, there is no need to analyze further.
- You can change the time period to the period of your interest. The maximum value is recalculated for the selected period.
- Select a cluster from the table.
- All the utilization charts show the key utilization metrics of the selected cluster.
- For memory, the high-utilization counters (Balloon, Compressed, and Swapped) are shown explicitly. These counters can be nonzero even when utilization is below 90%, which indicates high pressure in the past. If you look only at utilization, you might think you are safe.
- The line charts show both the average and the highest value among the ESXi hosts in the cluster, because load is often unbalanced across hosts. Many settings can contribute to the imbalance (for example, DRS settings, VM Reservation, VM-Host Affinity, Resource Pool, Stretched Cluster, and Large VMs).
- The disk IOPS is split into read and write to give insight into the behavior. Some workloads are read-oriented, while others are write-oriented.
- The disk throughput is not shown as it sums all the traffic. In reality, each ESXi host has its own limit.
- The vMotion line chart is added because a high number of vMotion operations can indicate that the cluster load is volatile, assuming the DRS Automation level is not set to the most sensitive setting.
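The difference between the average and the highest value among ESXi hosts can be sketched as follows. This is an illustration only: the host names, the sample values, and the function `cluster_lines` are hypothetical, and the dashboard computes its own series from collected metrics.

```python
from statistics import mean


def cluster_lines(host_series):
    """Build the two lines shown per chart from per-host utilization samples.

    host_series maps a host name to a list of utilization percentages,
    one value per collection interval. All lists must have equal length.
    Returns (average_line, highest_line).
    """
    columns = list(zip(*host_series.values()))  # one tuple per interval
    average_line = [mean(col) for col in columns]
    highest_line = [max(col) for col in columns]
    return average_line, highest_line


# Hypothetical three-host cluster where one host carries most of the load.
hosts = {
    "esx-01": [40.0, 45.0, 50.0],
    "esx-02": [42.0, 44.0, 48.0],
    "esx-03": [85.0, 90.0, 95.0],
}

average_line, highest_line = cluster_lines(hosts)

# The peak over the period is taken from the highest line, not the average.
period_peak = max(highest_line)

# A large gap between the two lines suggests an unbalanced cluster.
gaps = [high - avg for avg, high in zip(average_line, highest_line)]

print("average:", [round(v, 1) for v in average_line])
print("highest:", highest_line)
print("peak over period:", period_peak)
print("gap:", [round(g, 1) for g in gaps])
```

In this example the cluster average looks moderate while one host is close to saturation, which the average line alone would hide.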
Points to Note
- If your operations team has a standard that utilization should not exceed a certain threshold, you can add the threshold to the line chart. The threshold line helps less technical teams see how the actual value compares with the threshold.
- Consider adding a third distribution chart. Show the balloon counter in this third chart, as it complements the consumed counter. If there is no ballooning, a high consumed value is in fact better than a lower value.
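- The workload metric can exceed 100% because it is calculated as demand / usable capacity * 100. For example, this can happen if a cluster has four hosts, each running at 100% demand, and admission control is set to 50%. Demand is then greater than the usable capacity that remains after the admission control reservation.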
- The Cluster Utilization dashboard complements the Cluster Contention dashboard. For more information, see the points to note in the Cluster Contention Dashboard.
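The workload formula in the points above (demand / usable capacity * 100) can be checked with a short calculation. This is a sketch under stated assumptions: an admission control setting of 50% is modeled as half of the total capacity being excluded from usable capacity, and the function name, parameter names, and the per-host capacity of 50 GHz are hypothetical.

```python
def workload_percent(host_demand_pct, host_capacity, reserved_fraction):
    """Compute workload as demand / usable capacity * 100.

    host_demand_pct: demand of each host as a percentage of its own capacity.
    host_capacity: capacity of a single host (hypothetical unit, e.g. GHz).
    reserved_fraction: share of total capacity assumed to be set aside by
        admission control and excluded from usable capacity.
    """
    total_capacity = host_capacity * len(host_demand_pct)
    usable_capacity = total_capacity * (1 - reserved_fraction)
    demand = sum(pct / 100 * host_capacity for pct in host_demand_pct)
    return demand / usable_capacity * 100


# Four hosts at 100% demand each, with 50% of capacity reserved.
workload = workload_percent(
    host_demand_pct=[100, 100, 100, 100],
    host_capacity=50.0,
    reserved_fraction=0.5,
)
print(f"workload: {workload:.0f}%")
```

Under these assumptions the demand of four hosts is measured against the usable capacity of two, so the result is well above 100%.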