vSphere Bitfusion performs the following checks when a health check of a server is initiated from the vSphere Bitfusion Plug-in.

Health Checks List

Name              Type              Description
cass_buckets      Stability         Validates the bucketing that Cassandra uses to store utilization data and other items.
cass_node_num     Stability         Confirms that Cassandra and Bitfusion see the same number of servers in the cluster.
cass_nodetool     Stability         Confirms that Cassandra reports the cluster as healthy.
cass_replication  Stability         Confirms the replication factor.
compute_mode      Stability         Confirms that the GPUs have compute mode set appropriately.
network           Stability         Checks for dropped packets on the network.
ecc               Stability         Checks for ECC errors on the GPUs.
gpu_api           Stability         Confirms that the GPU APIs match.
pci_nvml          Stability         Confirms that all GPUs can be enumerated.
pci_p2p           Stability         Verifies that PCIe P2P is supported.
temperature       Stability         Verifies that the GPU temperature is below 100 degrees Celsius.
vcenter_check     Stability         Validates that the server can connect to vCenter Server.
xid               Stability         Checks for GPU Xid failures.
bogomips          Performance       Validates CPU performance by using BogoMIPS, a metric computed by the Linux kernel.
hostmem           Performance       Validates that there is enough host memory on the system.
iface_compat      Performance       Validates that the network configuration is valid.
memops            Performance       Verifies that memops is enabled for the GPUs.
mtu               Performance       Verifies that jumbo frames are enabled for the network.
nvidia_stats      Performance       Validates the statistics for the GPUs.
nvidia_topo       Performance       Validates the host topology.
pci_width         Performance       Validates that the GPUs are using the maximum PCIe lane capacity.
ulimit_n          Performance       Verifies that the limit on the number of open file descriptors is appropriate.
diskspace         System Resource   Confirms the free space on the server.
install           System Resource   Validates the Bitfusion installation.
pciinfo           System Resource   Validates the PCI configuration.
shadow_mem        System Resource   Verifies that the system memory is at least as large as the total frame buffer memory on the GPUs.
cuda_version      Software Version  Verifies the CUDA version.
libdep            Software Version  Verifies that the software dependencies for Bitfusion are installed.
driver_version    Software Version  Verifies the NVIDIA driver version.
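
A few of the checks in the list state a concrete pass condition, and those conditions can be illustrated with a short Python sketch. This is not how vSphere Bitfusion implements its health checks. The function names are hypothetical, the inputs are supplied by the caller instead of being read from the GPUs, and the min_soft_limit and min_free_bytes thresholds are assumptions, because the list does not state what Bitfusion considers appropriate.

```python
import resource  # Unix-only standard library module
import shutil


def check_shadow_mem(system_mem_bytes, gpu_framebuffer_bytes):
    """shadow_mem: system memory must be at least the total GPU frame buffer memory."""
    return system_mem_bytes >= sum(gpu_framebuffer_bytes)


def check_temperature(gpu_temps_celsius, limit_celsius=100):
    """temperature: every GPU must be below the limit (100 degrees Celsius in the list)."""
    return all(temp < limit_celsius for temp in gpu_temps_celsius)


def check_ulimit_n(min_soft_limit):
    """ulimit_n: compare the soft open-file limit with an assumed minimum."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft == resource.RLIM_INFINITY or soft >= min_soft_limit


def check_diskspace(path, min_free_bytes):
    """diskspace: compare free space on the file system with an assumed minimum."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

For example, a host with 64 GiB of system memory and two GPUs with 16 GiB of frame buffer memory each satisfies the shadow_mem condition, because 64 GiB is not less than the 32 GiB total.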