To perform troubleshooting or maintenance on a vSphere Bitfusion server, you must remove the server from the vSphere Bitfusion cluster.
When you power off a vSphere Bitfusion server for maintenance or troubleshooting, the health status of the vSphere Bitfusion cluster changes. While the cluster is not in a healthy state, you cannot add vSphere Bitfusion servers or perform a cluster backup operation. If half or more of the servers are powered off, the cluster is inoperable. If you plan to keep a server powered off for an extended period of time, you can avoid these risks by removing the server from the cluster.
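The "half or more" rule can be checked with simple arithmetic before you power off servers. The following Python sketch is only an illustration of that rule as stated above. The function names are hypothetical and are not part of any vSphere Bitfusion tool or API, and the sketch does not query a real cluster.

```python
def cluster_is_operable(total_servers: int, powered_off: int) -> bool:
    """Return True if fewer than half of the servers are powered off.

    Hypothetical helper: it encodes only the rule that the cluster is
    inoperable when half or more of its servers are powered off.
    """
    if total_servers <= 0:
        raise ValueError("total_servers must be a positive integer")
    if not 0 <= powered_off <= total_servers:
        raise ValueError("powered_off must be between 0 and total_servers")
    # "Half or more" powered off means inoperable, so the cluster is
    # operable only when strictly fewer than half are powered off.
    return 2 * powered_off < total_servers


def max_servers_safe_to_power_off(total_servers: int) -> int:
    """Return the largest number of servers that can be powered off
    while the cluster remains operable under the rule above."""
    if total_servers <= 0:
        raise ValueError("total_servers must be a positive integer")
    return (total_servers - 1) // 2


if __name__ == "__main__":
    for total in (1, 2, 3, 4, 5):
        limit = max_servers_safe_to_power_off(total)
        print(f"{total} server(s): at most {limit} can be powered off")
```

Note that an operable cluster is not necessarily a healthy one. Even a single powered-off server changes the health status, which blocks adding servers and cluster backups until the server is powered on again or removed from the cluster.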
The following procedure removes the server from the vSphere Bitfusion cluster immediately. Any running applications that are using the server's GPUs receive an immediate GPU failure and usually return an error condition.
Prerequisites
- Prevent new client connections to the specific server in the server settings.
- Verify that there are no running applications on the server.
Procedure
Results
What to do next
- If you deleted the server from the cluster without deleting the VM, delete the /etc/bitfusion/bitfusion-manager.yaml configuration file on the VM, reactivate the VM as a vSphere Bitfusion server, restart the vSphere Bitfusion service, and power on the VM. For more information, see Activating the vSphere Bitfusion Client in the Installing VMware vSphere Bitfusion guide and How to start and stop the vSphere Bitfusion service.
- If you deleted the server VM, you can reuse the underlying hardware as a vSphere Bitfusion server by creating a VM and deploying the vSphere Bitfusion server appliance. See How to install subsequent vSphere Bitfusion servers.