The vSAN Contention dashboard is the primary dashboard for managing vSAN performance. The VMware administrator or architect can use it to monitor and troubleshoot the vSAN cluster performance. If you determine that there is a performance issue, use the vSAN Utilization dashboard to see if the cause for the contention is high utilization.
Design Considerations
To view the common design considerations among all performance management dashboards, see the Performance Dashboards.
The vSAN Contention dashboard complements the vSphere Cluster Capacity dashboard and shares the same design considerations. It focuses on storage and vSAN-specific metrics and does not repeat what is already covered there. It does not list non-vSAN clusters.
How to Use the Dashboard
- vSAN Peak VM Latency, vSAN Peak CPU Ready, vSAN Peak Dropped Packet.
- Review the three distribution charts for an overview of the performance of all the vSAN clusters.
- The vSAN peak VM latency chart shows the distribution of disk latency experienced by all the VMs in the cluster. Most VMs should experience latency that matches your expectation. For example, in an all-flash system, VMs should not experience disk latency above 20 ms. If your vSAN environment is all flash, you must adjust the distribution buckets to a more stringent set.
- The vSAN peak CPU ready chart shows whether any of the vSAN kernel modules has to wait for CPU. Expect this number to be near 0% and always below 1%, because vSAN should not wait for CPU time. vSAN gets higher priority than a VM world because it runs in kernel space.
- The vSAN peak dropped packet chart shows whether any of the vSAN clusters is dropping packets on the vSAN network (not the VM network). vSAN relies on the network to keep the cluster in sync. This number should be near 0% and always less than 1%.
- vSAN Clusters.
- It lists all the vSAN clusters, sorted with the worst-performing cluster first.
- It lists all the ESXi hosts, sorted by the worst performance in the last 24 hours. If the table shows all green, there is no need to analyze further. The default is 24 hours rather than one week because performance issues older than 24 hours are likely to be irrelevant.
- You can change the time period to the period of your interest. The maximum value shown is updated to reflect the selected period.
- Select a vSAN cluster from the vSAN clusters table.
- All the health charts show the KPI of the selected cluster.
- If you are using SMART, the two heat maps at the bottom of the dashboard provide early warning.
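The expectations described for the three distribution charts can be read as simple threshold checks. The following Python sketch is illustrative only and is not part of VMware Aria Operations: the `ClusterPeaks` type, its field names, the `find_violations` helper, and the sample values are all hypothetical. The thresholds (20 ms for all-flash VM latency, 1% for CPU ready and dropped packets) are the ones mentioned in the text above and may need to be adjusted for your environment.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the guidance above; treat them as starting points.
MAX_VM_LATENCY_MS = 20.0        # all-flash VMs should not exceed this
MAX_CPU_READY_PCT = 1.0         # should stay near 0% and below 1%
MAX_DROPPED_PACKET_PCT = 1.0    # should stay near 0% and below 1%


@dataclass
class ClusterPeaks:
    """Hypothetical container for the peak values of one vSAN cluster."""
    name: str
    peak_vm_latency_ms: float
    peak_cpu_ready_pct: float
    peak_dropped_packet_pct: float


def find_violations(cluster: ClusterPeaks) -> List[str]:
    """Return the names of the checks that the cluster fails."""
    violations = []
    if cluster.peak_vm_latency_ms > MAX_VM_LATENCY_MS:
        violations.append("vm_latency")
    if cluster.peak_cpu_ready_pct >= MAX_CPU_READY_PCT:
        violations.append("cpu_ready")
    if cluster.peak_dropped_packet_pct >= MAX_DROPPED_PACKET_PCT:
        violations.append("dropped_packet")
    return violations


# Made-up sample values, for illustration only.
clusters = [
    ClusterPeaks("cluster-a", 4.2, 0.1, 0.0),
    ClusterPeaks("cluster-b", 35.0, 0.3, 1.5),
]

# List the clusters with the most failed checks first.
worst_first = sorted(clusters, key=lambda c: len(find_violations(c)), reverse=True)
for c in worst_first:
    print(c.name, find_violations(c))
```

A cluster that returns an empty list corresponds to the "all green" case, where no further analysis is needed.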
Points to Note
- A large vSAN cluster can have many components, and each component can have multiple performance metrics. The total number of metrics can reach the hundreds. For example, a 10-node cluster can have 530 counters to check. VMware Aria Operations aggregates them into a set of KPIs, which reduces the number to a more manageable one. The following table shows the KPIs and how each is derived.
| Name | What it is |
| --- | --- |
| Max Capacity Disk Latency (ms) | Highest latency among all capacity disks. Take the worst, not the average, because the latency of a single capacity disk is already an average across all its VMs. If there are 50 VMs on the disk and 30 are issuing I/O to it, the average is taken across those 30. |
| Min Disk Group Write Buffer Free (%) | Lowest free capacity among all the disk group write buffers. If this number is low, one of your buffers is not large enough. While you want to maximize your cache, a low number is an early warning for capacity management. |
| Max Disk Group Read Cache/Write Buffer Latency (ms) | Each disk has a Read Cache Read Latency, a Read Cache Write Latency (for writing into cache), a Write Buffer Write Latency, and a Write Buffer Read Latency (for de-staging purposes). This KPI takes the highest of these four numbers and then the highest among all disk groups. It is the max of the max because each of the four data points is already an average of all the VMs on it. |
| Sum Disk Group Errors | Sum of bus resets plus sum of canceled commands among all the disk groups. Use the sum rather than the max because each member should return zero. |
| Count Disk Group Congestion Above 60 | The number of disk groups with congestion greater than 60. The value 60 is hardcoded in the vSAN Management Pack because it is a good starting point. Because any congestion above 60 serves as an early warning, this KPI counts how many such occurrences there are. |
| Max Disk Group Congestion | The highest congestion among all disk groups. A high number indicates that at least one disk group is not performing. |
| Min Disk Group Capacity Free (%) | The lowest free capacity among all disk groups. Low free space triggers a rebalance. |
| Min Disk Group Read Cache Hit Rate (%) | The lowest hit rate among the disk group read caches. Ensure that this number is high, because a high hit rate indicates that reads are served by the cache. |
| Sum vSAN PortGroup Packets Dropped (%) | Sum of RX dropped packets plus TX dropped packets across all vSAN VMkernel ports. You should expect no dropped packets in your vSAN network. |
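To make the aggregation rules in the table concrete, the following Python sketch computes the disk group KPIs from per-disk-group values. It is an illustration under stated assumptions, not the vSAN Management Pack implementation: the `DiskGroup` type, its field names, the `cluster_kpis` function, and the sample numbers are hypothetical. Only the rules themselves (max, min, sum, and count above 60) come from the table.

```python
from dataclasses import dataclass
from typing import Dict, List

CONGESTION_THRESHOLD = 60  # the value the table describes as hardcoded


@dataclass
class DiskGroup:
    """Hypothetical per-disk-group sample; field names are illustrative."""
    capacity_disk_latencies_ms: List[float]  # one entry per capacity disk
    write_buffer_free_pct: float
    read_cache_read_latency_ms: float
    read_cache_write_latency_ms: float
    write_buffer_write_latency_ms: float
    write_buffer_read_latency_ms: float
    bus_resets: int
    commands_canceled: int
    congestion: float
    capacity_free_pct: float
    read_cache_hit_rate_pct: float


def cluster_kpis(groups: List[DiskGroup]) -> Dict[str, float]:
    """Aggregate per-disk-group values into cluster-level KPIs."""
    if not groups:
        raise ValueError("at least one disk group is required")
    return {
        # Worst single capacity disk, not the average.
        "max_capacity_disk_latency_ms": max(
            lat for g in groups for lat in g.capacity_disk_latencies_ms
        ),
        "min_write_buffer_free_pct": min(g.write_buffer_free_pct for g in groups),
        # Max of the four cache/buffer latencies, then max across groups.
        "max_cache_buffer_latency_ms": max(
            max(
                g.read_cache_read_latency_ms,
                g.read_cache_write_latency_ms,
                g.write_buffer_write_latency_ms,
                g.write_buffer_read_latency_ms,
            )
            for g in groups
        ),
        # Sum, because every member is expected to report zero.
        "sum_errors": sum(g.bus_resets + g.commands_canceled for g in groups),
        "count_congestion_above_60": sum(
            1 for g in groups if g.congestion > CONGESTION_THRESHOLD
        ),
        "max_congestion": max(g.congestion for g in groups),
        "min_capacity_free_pct": min(g.capacity_free_pct for g in groups),
        "min_read_cache_hit_rate_pct": min(
            g.read_cache_hit_rate_pct for g in groups
        ),
    }


# Made-up sample data for two disk groups.
sample_groups = [
    DiskGroup([2.0, 5.5], 40.0, 1.0, 1.5, 0.8, 3.2, 0, 0, 10, 55.0, 97.0),
    DiskGroup([3.0], 12.0, 0.9, 4.1, 0.7, 2.0, 1, 2, 75, 30.0, 88.0),
]
kpis = cluster_kpis(sample_groups)
for name, value in kpis.items():
    print(f"{name}: {value}")
```

The sketch omits the packet-drop KPI, which is computed over the vSAN VMkernel ports rather than over disk groups.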