You can manually shut down the entire vSAN cluster to perform maintenance or troubleshooting.
Use the Shutdown Cluster wizard unless your workflow requires a manual shutdown. When you manually shut down the vSAN cluster, do not deactivate vSAN on the cluster.
Note: If you have a vSphere with Tanzu environment, you must follow the specified order when shutting down or starting up the components. For more information, see "Shutdown and Startup of VMware Cloud Foundation" in the VMware Cloud Foundation Operations Guide.
Procedure
- Shut down the vSAN cluster.
- Check the vSAN Skyline Health to confirm that the cluster is healthy.
- Power off all virtual machines (VMs) running in the vSAN cluster, if vCenter Server is not hosted on the cluster. If vCenter Server is hosted in the vSAN cluster, do not power off the vCenter Server VM or service VMs (such as DNS, Active Directory) used by vCenter Server.
- If vSAN file service is enabled in the vSAN cluster, you must deactivate the file service. Deactivating the vSAN file service removes the empty file service domain. If you want to retain the empty file service domain after restarting the vSAN cluster, you must create an NFS or SMB file share before deactivating the vSAN file service.
- Click the Configure tab and turn off HA. As a result, the cluster does not register host shutdowns as failures.
For vSphere 7.0 U1 and later, enable vCLS retreat mode. For more information, see the VMware knowledge base article at https://kb.vmware.com/s/article/80472.
- Verify that all resynchronization tasks are complete.
Click the Monitor tab and select vSAN > Resyncing Objects.
- If vCenter Server is hosted on the vSAN cluster, power off the vCenter Server VM.
Make a note of the host that runs the vCenter Server VM. It is the host where you must restart the vCenter Server VM.
- Deactivate cluster member updates from vCenter Server by running the following command on the ESXi hosts in the cluster. Ensure that you run the following command on all the hosts.
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
- Log in to any host in the cluster other than the witness host.
- Run the following command only on that host. If you run the command on multiple hosts concurrently, a race condition can occur and cause unexpected results.
python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare
The command returns and prints the following:
Cluster preparation is done.
Note:
- The cluster is fully partitioned after the successful completion of the command.
- If you encounter an error, resolve the issue based on the error message and try enabling vCLS retreat mode again.
- If there are unhealthy or disconnected hosts in the cluster, remove the hosts and retry the command.
- Place all the hosts into maintenance mode with No Action. If the vCenter Server is powered off, use the following command to place the ESXi hosts into maintenance mode with No Action.
esxcli system maintenanceMode set -e true -m noAction
Perform this step on all the hosts.
To avoid the risk of data unavailability when you use No Action on multiple hosts at the same time and then reboot multiple hosts, see the VMware knowledge base article at https://kb.vmware.com/s/article/60424. To perform a simultaneous reboot of all hosts in the cluster using a built-in tool, see the VMware knowledge base article at https://kb.vmware.com/s/article/70650.
- After all hosts have successfully entered maintenance mode, perform any necessary maintenance tasks and power off the hosts.
- Restart the vSAN cluster.
- Power on the ESXi hosts.
Power on the physical box where ESXi is installed. The ESXi host starts, locates the VMs, and functions normally.
If any hosts fail to restart, you must manually recover the hosts or move the bad hosts out of the vSAN cluster.
- After all the hosts are powered on and back online, exit all hosts from maintenance mode. If the vCenter Server is powered off, use the following command on the ESXi hosts to exit maintenance mode.
esxcli system maintenanceMode set -e false
Perform this step on all the hosts.
- Log in to one of the hosts in the cluster other than the witness host.
- Run the following command only on that host. If you run the command on multiple hosts concurrently, a race condition can occur and cause unexpected results.
python /usr/lib/vmware/vsan/bin/reboot_helper.py recover
The command returns and prints the following:
Cluster reboot/power-on is completed successfully!
- Verify that all the hosts are available in the cluster by running the following command on each host.
esxcli vsan cluster get
- Enable cluster member updates from vCenter Server by running the following command on the ESXi hosts in the cluster. Ensure that you run the following command on all the hosts.
esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates
- Restart the vCenter Server VM if it is powered off. Wait for the vCenter Server VM to be powered up and running. To deactivate vCLS retreat mode, see the VMware knowledge base article at https://kb.vmware.com/s/article/80472.
- Verify again that all the hosts are participating in the vSAN cluster by running the following command on each host.
esxcli vsan cluster get
- Restart the remaining VMs through vCenter Server.
- Check the vSAN Skyline Health and resolve any outstanding issues.
- (Optional) Enable vSAN file service.
- (Optional) If the vSAN cluster has vSphere Availability enabled, you must manually restart vSphere Availability to avoid the following error: Cannot find vSphere HA master agent.
To manually restart vSphere Availability, select the vSAN cluster and navigate to:
- Configure > Services > vSphere Availability > EDIT > Disable vSphere HA
- Configure > Services > vSphere Availability > EDIT > Enable vSphere HA
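Several steps in the procedure above must be repeated on every host, and the membership check is run per host. The following is a minimal sketch of a helper for that, meant for an administration workstation. It assumes SSH access to the hosts as root is enabled. The host names, the HOSTS and RUNNER variables, and both function names are hypothetical and are not part of any VMware tool. The Sub-Cluster Member Count field name reflects typical output of esxcli vsan cluster get and should be checked against your release.

```shell
#!/bin/sh
# Space-separated list of ESXi hosts (hypothetical names; replace with yours).
HOSTS="${HOSTS:-esx01.example.com esx02.example.com esx03.example.com}"

# Program used to reach a host. The default "echo" only prints what would
# run. Set RUNNER=ssh to execute for real (assumes SSH is enabled on the
# hosts and root login is permitted).
RUNNER="${RUNNER:-echo}"

# Run one command line on every host in HOSTS, stopping at the first failure.
run_on_all_hosts() {
    for host in $HOSTS; do
        "$RUNNER" "root@$host" "$1" || {
            echo "command failed on $host" >&2
            return 1
        }
    done
}

# Read `esxcli vsan cluster get` output on stdin and print the member count.
# Assumption: the output contains a "Sub-Cluster Member Count:" line.
member_count() {
    sed -n 's/^[[:space:]]*Sub-Cluster Member Count:[[:space:]]*//p'
}

# Example (dry run by default, prints one line per host):
#   run_on_all_hosts "esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates"
# Example with RUNNER=ssh set (prints one member count per host):
#   run_on_all_hosts "esxcli vsan cluster get" | member_count
```

Keeping the default runner as echo lets you review the exact command lines before anything touches a host. The reboot_helper.py commands are deliberately left out of this helper, because the procedure requires them to run on a single host only.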
- If there are unhealthy or disconnected hosts in the cluster, recover or remove the hosts from the vSAN cluster. If vCenter Server uses service VMs such as DNS or Active Directory, note them as exceptional VMs in the Shutdown cluster wizard.
Retry the above commands only after the vSAN Skyline Health shows all available hosts in the green state.
If you have a three-node vSAN cluster, the reboot_helper.py recover command cannot work when one host has failed. As an administrator, do the following:
- Temporarily remove the failed host information from the unicast agent list.
- Add the host after running the following command.
reboot_helper.py recover
The following commands remove the host from and add the host to a vSAN cluster:
esxcli vsan cluster unicastagent remove -a <IP Address> -t node -u <NodeUuid>
esxcli vsan cluster unicastagent add -t node -u <NodeUuid> -U true -a <IP Address> -p 12321
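The order of the three-node recovery steps above can be sketched as a small shell function that only prints the commands, filled in with the failed host's address and UUID. This is an illustrative sketch rather than a VMware-provided tool: the function name and the example values are hypothetical, and nothing is executed on any host.

```shell
#!/bin/sh
# Print, in order, the commands for recovering a three-node cluster in which
# one host has failed. Arguments: the failed host's IP address and node UUID.
# The function only prints text; review each line before running it.
three_node_recover_plan() {
    failed_ip="$1"
    failed_uuid="$2"
    if [ -z "$failed_ip" ] || [ -z "$failed_uuid" ]; then
        echo "usage: three_node_recover_plan <IP Address> <NodeUuid>" >&2
        return 2
    fi
    printf '%s\n' \
        "esxcli vsan cluster unicastagent remove -a $failed_ip -t node -u $failed_uuid" \
        "python /usr/lib/vmware/vsan/bin/reboot_helper.py recover" \
        "esxcli vsan cluster unicastagent add -t node -u $failed_uuid -U true -a $failed_ip -p 12321"
}

# Example with placeholder values (a documentation IP and a made-up UUID):
#   three_node_recover_plan 192.0.2.13 00000000-0000-0000-0000-000000000000
```

Printing the plan instead of executing it keeps the operator in control of where each command runs, which matters because the recover command must run on one host only.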