You can check the performance, stability, system resources, and software versions of a vSphere Bitfusion server by performing a health check.

You can check the health status of a selected vSphere Bitfusion server and, if needed, perform troubleshooting. The health check examines the performance, stability, system resources, and software versions of the selected server and its surrounding vCenter Server environment. Each health check returns a pass, marginal, or fatal status.

For example, the health check verifies that all nodes are running, that there is enough free space, and that the connection to vCenter Server is working. To view the list of all available health checks, see List of health checks in vSphere Bitfusion.

When you deactivate a health check in the following procedure, you change the health check settings for that vSphere Bitfusion server only. A deactivated health check still runs in the background, but its status does not affect the overall health status of the server that is displayed on the Servers tab. To change the global health check settings for all vSphere Bitfusion servers, use the Settings > Global Health Check Defaults tab.
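
The effect of deactivating a check can be pictured with a minimal Python sketch. It is an illustration only, not the vSphere Bitfusion implementation: the `CheckResult` type, the `overall_status` function, and the rule that the overall status is the worst status among active checks are all assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed severity order for this sketch: pass < marginal < fatal.
SEVERITY = {"pass": 0, "marginal": 1, "fatal": 2}


@dataclass
class CheckResult:
    """Hypothetical record for one health check result."""
    name: str
    status: str          # "pass", "marginal", or "fatal"
    active: bool = True  # False when the check is deactivated for this server


def overall_status(results):
    """Return the worst status among active checks.

    Deactivated checks still carry a status (they ran in the background),
    but they are left out of the overall result.
    """
    active_statuses = [r.status for r in results if r.active]
    if not active_statuses:
        return "pass"  # assumption: nothing active means nothing to report
    return max(active_statuses, key=SEVERITY.__getitem__)


results = [
    CheckResult("diskspace", "pass"),
    CheckResult("mtu", "marginal", active=False),  # deactivated on this server
    CheckResult("vcenter_check", "pass"),
]
print(overall_status(results))
```

In this sketch the `mtu` check still reports a marginal status, but because it is deactivated the overall status remains pass. Reactivating it would make the overall status marginal.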

Procedure

  1. In the vSphere Client, select Menu (vSphere Client menu icon) > Bitfusion.
  2. On the Servers tab, select a server from the list.
  3. From the Actions drop-down menu, select Health.
    The Health logs dialog box appears and the results of the health checks are displayed. You see the status, type, name, and details of the check.
  4. (Optional) To deactivate a specific health check, click the toggle button.
  5. Click Save and Exit.

What to do next

List of health checks in vSphere Bitfusion

vSphere Bitfusion performs the following checks when a health check of a server is initiated from the vSphere Bitfusion plug-in.

Health checks list

| Name | Type | Description |
| --- | --- | --- |
| cass_buckets | Stability | Validates the bucketing used by Cassandra to store data for utilization and other items. |
| cass_node_num | Stability | Confirms that Cassandra and Bitfusion see the same number of servers in the cluster. |
| cass_nodetool | Stability | Confirms that Cassandra sees that the cluster is in a healthy state. |
| cass_replication | Stability | Confirms the replication factor. |
| compute_mode | Stability | Confirms that the GPUs have compute mode set appropriately. |
| network | Stability | Checks for dropped packets on the network. |
| ecc | Stability | Checks for ECC errors on the GPUs. |
| gpu_api | Stability | Confirms that the GPU APIs match. |
| pci_nvml | Stability | Confirms that all GPUs can be enumerated. |
| pci_p2p | Stability | Verifies that PCIe P2P is supported. |
| temperature | Stability | Verifies that the GPU temperature is below 100 degrees Celsius. |
| vcenter_check | Stability | Validates that the server can connect to vCenter Server. |
| xid | Stability | Checks for GPU Xid failures. |
| bogomips | Performance | Validates performance using the BogoMIPS metric reported by the Linux kernel. |
| hostmem | Performance | Validates that there is enough host memory on the system. |
| iface_compat | Performance | Validates that the network configuration is valid. |
| memops | Performance | Verifies that memops is enabled for the GPUs. |
| mtu | Performance | Verifies that jumbo frames are enabled for the network. |
| nvidia_stats | Performance | Validates the statistics for the GPUs. |
| nvidia_topo | Performance | Validates the host topology. |
| pci_width | Performance | Validates that the GPUs are using the maximum PCIe lane capacity. |
| ulimit_n | Performance | Verifies that the maximum file descriptor limit is appropriate. |
| diskspace | System Resource | Confirms the free space on the server. |
| install | System Resource | Validates the Bitfusion installation. |
| pciinfo | System Resource | Validates the PCI configuration. |
| shadow_mem | System Resource | Verifies that there is at least the same amount of system memory as there is frame buffer memory on the GPUs. |
| cuda_version | Software Version | Verifies the CUDA version. |
| libdep | Software Version | Verifies that the software dependencies for Bitfusion are installed. |
| driver_version | Software Version | Verifies the NVIDIA driver version. |
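
Two of the checks in the list state explicit conditions: temperature (below 100 degrees Celsius) and shadow_mem (system memory at least equal to GPU frame buffer memory). The following Python sketch expresses those two comparisons. It is not how vSphere Bitfusion implements them: the function names and parameters are hypothetical, summing the frame buffer across all GPUs is an assumed reading of the description, and the sketch returns only true or false because the thresholds for a marginal status are not documented here.

```python
def temperature_ok(gpu_temps_c, limit_c=100):
    """Return True if every GPU temperature is below the limit.

    gpu_temps_c: per-GPU temperatures in degrees Celsius (hypothetical input).
    limit_c: the 100 degree threshold from the temperature check description.
    """
    return all(temp < limit_c for temp in gpu_temps_c)


def shadow_mem_ok(system_mem_mib, gpu_framebuffer_mib):
    """Return True if system memory is at least the GPU frame buffer memory.

    Assumption: the comparison uses the sum over all GPUs in the server.
    """
    return system_mem_mib >= sum(gpu_framebuffer_mib)


# Example: a server with 64 GiB of system memory and two 16 GiB GPUs.
print(temperature_ok([45, 62]))
print(shadow_mem_ok(65536, [16384, 16384]))
```

Both calls in the example return True. A GPU at exactly 100 degrees, or a server with less system memory than total frame buffer memory, would make the corresponding function return False.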