You can check the performance, stability, system resources, and software versions of a vSphere Bitfusion server by performing a health check.
You can check the health status of a selected vSphere Bitfusion server and, if needed, troubleshoot it. The health check examines the performance, stability, system resources, and software versions of the selected server and of the surrounding vCenter Server environment. Each individual check returns a pass, marginal, or fatal status.
For example, the health check verifies that all nodes are running, that there is enough free space, and that the connection to vCenter Server is working. To view the list of all available health checks, see List of health checks in vSphere Bitfusion.
Deactivating a health check in the following procedure changes the health check settings only for that specific vSphere Bitfusion server. A deactivated health check is still performed in the background, but its status does not affect the overall health status of the server that is displayed on the Servers tab. You can change the global health check settings for all vSphere Bitfusion servers on the tab.
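The effect of deactivating a check can be illustrated with a minimal Python sketch. It is not the vSphere Bitfusion implementation. It assumes that the overall status is the most severe status among the active checks, and that severity is ordered pass, marginal, fatal. The names `Status` and `overall_status` are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Health check outcomes, ordered by assumed severity."""
    PASS = 0
    MARGINAL = 1
    FATAL = 2


def overall_status(results, deactivated=frozenset()):
    """Return the worst status among checks that are not deactivated.

    results: dict mapping check name -> Status (every check still runs).
    deactivated: check names whose results are ignored for the overall status.
    """
    active = [s for name, s in results.items() if name not in deactivated]
    # Assumption: with no active checks, nothing is reported as a problem.
    return max(active, default=Status.PASS)


results = {
    "diskspace": Status.PASS,
    "mtu": Status.MARGINAL,
    "vcenter_check": Status.FATAL,
}

print(overall_status(results).name)                                  # FATAL
print(overall_status(results, deactivated={"vcenter_check"}).name)   # MARGINAL
```

In this sketch the `vcenter_check` result is still present in `results`, which mirrors the statement that a deactivated check keeps running in the background. Only its contribution to the overall status is removed.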
Procedure
What to do next
List of health checks in vSphere Bitfusion
vSphere Bitfusion performs the following checks when a health check of a server is initiated from the vSphere Bitfusion plug-in.
Health checks list
Name | Type | Description |
---|---|---|
cass_buckets | Stability | Validates the bucketing used by Cassandra to store data for utilization and other items. |
cass_node_num | Stability | Confirms that Cassandra and Bitfusion see the same number of servers in the cluster. |
cass_nodetool | Stability | Confirms that Cassandra sees that the cluster is in a healthy state. |
cass_replication | Stability | Confirms the replication factor. |
compute_mode | Stability | Confirms that the GPUs have compute mode set appropriately. |
network | Stability | Checks whether there are dropped packets on the network. |
ecc | Stability | Checks whether there are any ECC errors on the GPUs. |
gpu_api | Stability | Confirms that the GPU APIs match. |
pci_nvml | Stability | Confirms that all GPUs can be enumerated. |
pci_p2p | Stability | Verifies that PCIe P2P is supported. |
temperature | Stability | Verifies that the GPU temperature is below 100 degrees Celsius. |
vcenter_check | Stability | Validates that the server can connect to vCenter Server. |
xid | Stability | Checks whether there are any GPU Xid failures. |
bogomips | Performance | Validates performance. The metric is used by the Linux kernel. |
hostmem | Performance | Validates that there is enough host memory on the system. |
iface_compat | Performance | Validates that the network configuration is valid. |
memops | Performance | Verifies that memops is enabled for the GPUs. |
mtu | Performance | Verifies that jumbo frames are enabled for the network. |
nvidia_stats | Performance | Validates the statistics for the GPUs. |
nvidia_topo | Performance | Validates the host topology. |
pci_width | Performance | Validates that the GPUs are using the maximum PCIe lane capacity. |
ulimit_n | Performance | Verifies that the maximum file descriptor limit is appropriate. |
diskspace | System Resource | Confirms the free space on the server. |
install | System Resource | Validates the Bitfusion installation. |
pciinfo | System Resource | Validates the PCI configuration. |
shadow_mem | System Resource | Verifies that there is at least the same amount of system memory as there is frame buffer memory on the GPUs. |
cuda_version | Software Version | Verifies the CUDA version. |
libdep | Software Version | Verifies that the software dependencies for Bitfusion are installed. |
driver_version | Software Version | Verifies the NVIDIA driver version. |
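
Two of the checks above state a concrete condition: temperature (below 100 degrees Celsius) and shadow_mem (system memory at least equal to GPU frame buffer memory). The following Python sketch shows only the comparison logic under stated assumptions. It is not how vSphere Bitfusion implements these checks. The function names and inputs are hypothetical, and the sketch assumes that shadow_mem compares system memory against the sum of the frame buffers of all GPUs.

```python
GIB = 1024 ** 3
MAX_GPU_TEMP_C = 100  # threshold taken from the temperature check description


def temperature_ok(gpu_temps_c):
    """True if every GPU temperature is below 100 degrees Celsius."""
    return all(t < MAX_GPU_TEMP_C for t in gpu_temps_c)


def shadow_mem_ok(system_mem_bytes, gpu_frame_buffer_bytes):
    """True if system memory is at least the total GPU frame buffer memory.

    Assumption: the frame buffers of all GPUs are summed before comparing.
    """
    return system_mem_bytes >= sum(gpu_frame_buffer_bytes)


# Hypothetical host: 64 GiB of system memory and two GPUs with 16 GiB each.
print(temperature_ok([65, 72]))                     # True
print(temperature_ok([65, 101]))                    # False
print(shadow_mem_ok(64 * GIB, [16 * GIB, 16 * GIB]))  # True
print(shadow_mem_ok(16 * GIB, [16 * GIB, 16 * GIB]))  # False
```

The sketch returns a plain Boolean because the source does not state whether a failed condition maps to a marginal or a fatal status.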