vSphere Bitfusion performs the following checks when you initiate a health check of a server from the vSphere Bitfusion Plug-in.
Health Checks List
Name | Type | Description |
---|---|---|
cass_buckets | Stability | Validates the bucketing used by Cassandra to store data for utilization and other items. |
cass_node_num | Stability | Confirms that Cassandra and Bitfusion see the same number of servers in the cluster. |
cass_nodetool | Stability | Confirms that Cassandra sees the cluster as healthy. |
cass_replication | Stability | Confirms the Cassandra replication factor. |
compute_mode | Stability | Confirms that the GPUs have compute mode set appropriately. |
network | Stability | Checks for dropped packets on the network. |
ecc | Stability | Checks for ECC errors on the GPUs. |
gpu_api | Stability | Confirms that the GPU APIs match. |
pci_nvml | Stability | Confirms that all GPUs can be enumerated. |
pci_p2p | Stability | Verifies that PCIe P2P is supported. |
temperature | Stability | Verifies that the GPU temperature is below 100 degrees Celsius. |
vcenter_check | Stability | Validates that the server can connect to vCenter Server. |
xid | Stability | Checks for GPU Xid failures. |
bogomips | Performance | Validates performance using the BogoMIPS metric reported by the Linux kernel. |
hostmem | Performance | Validates that there is enough host memory on the system. |
iface_compat | Performance | Validates that the network configuration is valid. |
memops | Performance | Verifies that memops is enabled for the GPUs. |
mtu | Performance | Verifies that jumbo frames are enabled for the network. |
nvidia_stats | Performance | Validates the statistics for the GPUs. |
nvidia_topo | Performance | Validates the host topology. |
pci_width | Performance | Validates that the GPUs are using the maximum PCIe lane capacity. |
ulimit_n | Performance | Verifies that the limit on the maximum number of open file descriptors is appropriate. |
diskspace | System Resource | Confirms the free disk space on the server. |
install | System Resource | Validates the Bitfusion installation. |
pciinfo | System Resource | Validates the PCI configuration. |
shadow_mem | System Resource | Verifies that there is at least as much system memory as there is frame buffer memory on the GPUs. |
cuda_version | Software Version | Verifies the CUDA version. |
libdep | Software Version | Verifies that the software dependencies for Bitfusion are installed. |
driver_version | Software Version | Verifies the NVIDIA driver version. |
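
Three of the checks in the table have pass conditions concrete enough to illustrate: temperature, shadow_mem, and mtu. The following Python sketch shows one way such conditions could be evaluated against values you have already collected. It is not the Bitfusion implementation. The `HostSnapshot` class, the function names, and the input values are hypothetical. The sketch also assumes that shadow_mem compares system memory with the total frame buffer across all GPUs, and that "jumbo frames" means an MTU above the standard 1500 bytes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HostSnapshot:
    """Hypothetical container for values gathered from a server.

    This is an illustration only, not a Bitfusion data structure.
    """
    gpu_temps_c: List[float]        # one temperature reading per GPU, in Celsius
    gpu_framebuffer_mib: List[int]  # frame buffer size per GPU, in MiB
    system_memory_mib: int          # total host memory, in MiB
    mtu: int                        # MTU of the network interface, in bytes


def temperature_ok(snapshot: HostSnapshot, limit_c: float = 100.0) -> bool:
    # The table states that GPU temperature must be below 100 degrees Celsius.
    return all(temp < limit_c for temp in snapshot.gpu_temps_c)


def shadow_mem_ok(snapshot: HostSnapshot) -> bool:
    # Assumption: system memory is compared with the sum over all GPUs.
    return snapshot.system_memory_mib >= sum(snapshot.gpu_framebuffer_mib)


def mtu_ok(snapshot: HostSnapshot, standard_mtu: int = 1500) -> bool:
    # Assumption: any MTU above the standard 1500 bytes counts as jumbo frames.
    return snapshot.mtu > standard_mtu


def run_checks(snapshot: HostSnapshot) -> Dict[str, bool]:
    # Keys reuse the check names from the table for readability.
    return {
        "temperature": temperature_ok(snapshot),
        "shadow_mem": shadow_mem_ok(snapshot),
        "mtu": mtu_ok(snapshot),
    }
```

For example, a snapshot with two GPUs at 65.0 and 71.5 degrees, 16384 MiB of frame buffer each, 65536 MiB of system memory, and an MTU of 9000 would pass all three conditions in this sketch.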