You use the Troubleshooting tabs to identify the root cause of problems that are not resolved by alert recommendations or simple analysis.

About this task

To further troubleshoot the symptoms of the capacity problems that are occurring on the cluster and its host systems, and to determine when those problems occurred, you use the Troubleshooting tabs to continue your investigation.

Prerequisites

Use the Analysis tabs to analyze your environment. See Analyze the State of Your Environment.

Procedure

  1. Click Environment > vSphere Hosts and Clusters > USA-Cluster.
  2. Click the Troubleshooting tab and review the symptoms.

    The Symptoms tab displays the symptoms that triggered on the selected cluster. You notice that several critical symptoms exist.

    • Cluster Compute Resource Time Remaining with committed projects is critically low

    • Cluster Compute Resource Time Remaining is critically low

    • Capacity remaining is critically low

  3. Analyze the critical symptoms.
    1. Hover your mouse over each critical symptom to identify the metric used.
    2. To view only the symptoms that affect the cluster, enter cluster in the quick filter text box.

      When you hover over Cluster Compute Resource Time Remaining with committed projects is critically low, the metric Badge|Time Remaining with committed projects (%) appears. You notice that its value is less than or equal to zero, which caused the capacity symptom to trigger and generate an alert on USA-Cluster.

  4. Click the Timeline tab to review the triggered symptoms, alerts, and events that occurred on USA-Cluster over time, and identify when the problems occurred.
    1. On the toolbar, click Select Event Type and select the event types to display.
    2. Click Date Controls and select Last 7 Days.

      Several events appear in red.

    3. Hover your mouse over each event to view the details.
    4. To display the events that occurred on the cluster's data center, click Show Ancestor Events and select Datacenter.

      Warning events for the data center appear in yellow.

    5. Hover your mouse over the warning events.

      You notice that the density is starting to drop, and that a hard threshold violation occurred on the data center late in the evening. The hard threshold violation shows that the Badge|Density metric value fell below the acceptable threshold of 25; the violation triggered at a value of 14.89.

    6. To view the affected child objects, click Show Descendant Events and select Host System.
  5. Click the Events tab to examine the changes that occurred on USA-Cluster and to determine whether any change contributed to the root cause of the alert or to other problems with the cluster.
    1. On the toolbar, click each badge and view the events that occurred.

      The Workload badge displays a graph of the events that occurred on the cluster. Several red triangles appear at various points in the graph.

      Figure: Troubleshooting events for cluster workload

    2. Hover your mouse over each red triangle.

      By reviewing the graph, you can determine whether a recurring event has caused the errors. Each event indicates that the guest file system is out of disk space. The affected objects appear in the pane below the graph.

    3. Click each red triangle to identify the affected object and highlight it in the pane below.
  6. Click the All Metrics tab to evaluate objects in the context of the environment topology and to help identify the possible cause of a problem.
    1. In the top view, select USA-Cluster.
    2. In the metrics pane, expand Badge and double-click Badge|Capacity Remaining (%).

      The Badge|Capacity Remaining (%) calculation is added to the lower right pane.

    3. In the metrics pane, double-click Density.
    4. In the metrics pane, double-click Workload.
    5. On the toolbar, click Date Controls and select Last 7 Days.

      The metric chart indicates that the capacity for the cluster remained at a steady level for the past week, but that the cluster density dropped in the last several days. The Badge|Workload (%) calculation displays the workload extremes that correspond to the density problem.
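The symptoms reviewed in steps 3 and 4 each compare a metric value against a threshold: a time-remaining value less than or equal to zero, and a Badge|Density value below 25 (triggered at 14.89). The following Python sketch models that comparison to make the trigger logic concrete. It is a hypothetical illustration under stated assumptions, not a product API: the SymptomDefinition class, the triggered_symptoms function, and the "Density hard threshold" label are invented for this example, and the actual product definitions may differ.

```python
import operator
from dataclasses import dataclass

# Hypothetical mapping of comparison names to functions (not a product API).
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class SymptomDefinition:
    """Hypothetical model of a metric-based symptom: a metric, a comparison, a threshold."""
    name: str
    metric_key: str
    op: str
    threshold: float

    def is_triggered(self, value: float) -> bool:
        # The symptom triggers when the metric value satisfies the comparison.
        return OPERATORS[self.op](value, self.threshold)


# Conditions taken from this procedure; names and structure are assumptions.
TIME_REMAINING = SymptomDefinition(
    name="Cluster Compute Resource Time Remaining with committed projects is critically low",
    metric_key="Badge|Time Remaining with committed projects (%)",
    op="<=",
    threshold=0.0,
)
DENSITY = SymptomDefinition(
    name="Density hard threshold",  # hypothetical label
    metric_key="Badge|Density",
    op="<",
    threshold=25.0,
)


def triggered_symptoms(metrics, definitions):
    """Return the names of definitions whose metric is present and triggered."""
    return [
        d.name
        for d in definitions
        if d.metric_key in metrics and d.is_triggered(metrics[d.metric_key])
    ]


if __name__ == "__main__":
    # 14.89 is the density value from this procedure; 0.0 is an illustrative time-remaining value.
    sample = {
        "Badge|Time Remaining with committed projects (%)": 0.0,
        "Badge|Density": 14.89,
    }
    for name in triggered_symptoms(sample, [TIME_REMAINING, DENSITY]):
        print(name)
```

Keeping the operator and threshold as data rather than hard-coding each check mirrors how the procedure reads each symptom: you identify the metric, then compare its current value with the threshold that caused the trigger.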

Results

You have analyzed the symptoms, timeline, events, and metrics related to the problems on your cluster, and determined that the heavy workload on the cluster has decreased the cluster density in the last several days, which indicates that the cluster is starting to run out of capacity.

What to do next

Examine the Details views, heatmaps, and data maps to interpret the properties, metrics, and alerts for your objects. Look for trends and spikes in the resources for your objects, examine how resources are distributed across your objects, and compare the use of various resource types across your objects.