If a host fails and its virtual machines must be restarted, you can control the order in which the virtual machines are restarted with the VM restart priority setting. You can also configure how vSphere HA responds if hosts lose management network connectivity with other hosts by using the host isolation response setting. Other factors are also considered when vSphere HA restarts a virtual machine after a failure.
The following settings apply to all virtual machines in the cluster in the case of a host failure or isolation. You can also configure exceptions for specific virtual machines. See Customize an Individual Virtual Machine.
VM Restart Priority
VM restart priority determines the relative order in which virtual machines are allocated resources after a host failure. Such virtual machines are assigned to hosts with unreserved capacity, with the highest-priority virtual machines placed first and continuing to those with lower priority until all virtual machines have been placed or no more cluster capacity is available to meet the reservations or memory overhead of the virtual machines. A host then restarts the virtual machines assigned to it in priority order. If there are insufficient resources, vSphere HA waits for more unreserved capacity to become available, for example, when a host comes back online, and then retries the placement of these virtual machines. To reduce the chance of this situation occurring, configure vSphere HA admission control to reserve more resources for failures. Admission control allows you to control how much cluster capacity virtual machines reserve; reserved capacity is unavailable to meet the reservations and memory overhead of other virtual machines if there is a failure.
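The placement order described above can be illustrated with a minimal sketch. It is not vSphere HA code: the class names, the single memory figure per virtual machine, and the rule of choosing the host with the most unreserved capacity are all assumptions made for illustration. Virtual machines with a restart priority of Disabled are assumed to be filtered out beforehand.

```python
from dataclasses import dataclass

# Hypothetical model for illustration only; not the vSphere HA implementation.
PRIORITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


@dataclass
class Host:
    name: str
    unreserved_mb: int  # unreserved memory capacity on the host


@dataclass
class VM:
    name: str
    priority: str     # "High", "Medium", or "Low"
    required_mb: int  # reservation plus memory overhead


def place_vms(vms, hosts):
    """Assign VMs to hosts, highest restart priority first.

    Returns (placements, waiting). placements maps each host name to the
    VM names assigned to it, in priority order. waiting lists the VMs that
    could not be placed and would be retried when capacity appears.
    """
    placements = {h.name: [] for h in hosts}
    free = {h.name: h.unreserved_mb for h in hosts}
    waiting = []
    # sorted() is stable, so VMs of equal priority keep their input order.
    for vm in sorted(vms, key=lambda v: PRIORITY_ORDER[v.priority]):
        # Assumption: prefer the host with the most unreserved capacity.
        target = max(free, key=free.get, default=None)
        if target is None or free[target] < vm.required_mb:
            waiting.append(vm.name)
            continue
        free[target] -= vm.required_mb
        placements[target].append(vm.name)
    return placements, waiting
```

With two hosts offering 4096 MB and 2048 MB, a High virtual machine needing 3072 MB and a Medium one needing 2048 MB are placed, and any Low virtual machines needing 2048 MB each end up in the waiting list until more capacity becomes available.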
The values for this setting are Disabled, Low, Medium (the default), and High. The Disabled setting is ignored by the vSphere HA VM/Application monitoring feature because this feature protects virtual machines against operating system-level failures and not virtual machine failures. When an operating system-level failure occurs, the operating system is rebooted by vSphere HA, and the virtual machine is left running on the same host. You can change this setting for individual virtual machines.
A virtual machine reset causes a hard reboot of the guest operating system, but does not power cycle the virtual machine.
The restart priority settings for virtual machines vary depending on user needs. Assign higher restart priority to the virtual machines that provide the most important services.
For example, in the case of a multitier application, you might rank assignments according to functions hosted on the virtual machines.
High. Database servers that provide data for applications.
Medium. Application servers that consume data in the database and provide results on web pages.
Low. Web servers that receive user requests, pass queries to application servers, and return results to users.
If a host fails, vSphere HA attempts to register the following affected virtual machines with an active host: those that were powered on and have a restart priority setting of Disabled, and those that were powered off.
Host Isolation Response
Host isolation response determines what happens when a host in a vSphere HA cluster loses its management network connections, but continues to run. You can use the isolation response to have vSphere HA power off virtual machines that are running on an isolated host and restart them on a nonisolated host. Host isolation responses require that Host Monitoring Status is enabled. If Host Monitoring Status is disabled, host isolation responses are also suspended. A host determines that it is isolated when it is unable to communicate with the agents running on the other hosts, and it is unable to ping its isolation addresses. The host then executes its isolation response. The responses are Power off and restart VMs or Shutdown and restart VMs. You can customize this property for individual virtual machines.
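The isolation check described above combines two conditions, which the following sketch makes explicit. The function and parameter names are hypothetical, and the inputs (a list of reachable agents and a list of ping results) stand in for whatever the host agent actually observes.

```python
from enum import Enum


class IsolationResponse(Enum):
    POWER_OFF_AND_RESTART = "Power off and restart VMs"
    SHUTDOWN_AND_RESTART = "Shutdown and restart VMs"


def is_isolated(reachable_agents, isolation_ping_results):
    """Return True only if no other agent is reachable AND no isolation
    address answered a ping. Either kind of contact means not isolated."""
    return not reachable_agents and not any(isolation_ping_results)


def isolation_action(host_monitoring_enabled, reachable_agents,
                     isolation_ping_results, configured_response):
    """Return the response to execute, or None if nothing should happen."""
    # Isolation responses are suspended while Host Monitoring Status is disabled.
    if not host_monitoring_enabled:
        return None
    if is_isolated(reachable_agents, isolation_ping_results):
        return configured_response
    return None
```

A host that reaches no other agent but still gets a reply from one isolation address is not treated as isolated in this sketch, so no response is executed.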
If a virtual machine has a restart priority setting of Disabled, no host isolation response is made.
To use the Shutdown and restart VMs setting, you must install VMware Tools in the guest operating system of the virtual machine. Shutting down the virtual machine provides the advantage of preserving its state. Shutting down is better than powering off the virtual machine, which does not flush the most recent changes to disk or commit transactions. Virtual machines that are in the process of shutting down take longer to fail over while the shutdown completes. Virtual machines that have not shut down in 300 seconds, or the time specified in the advanced option das.isolationshutdowntimeout, are powered off.
After you create a vSphere HA cluster, you can override the default cluster settings for Restart Priority and Isolation Response for specific virtual machines. Such overrides are useful for virtual machines that are used for special tasks. For example, virtual machines that provide infrastructure services like DNS or DHCP might need to be powered on before other virtual machines in the cluster.
A virtual machine "split-brain" condition can occur when a host becomes isolated or partitioned from a master host and the master host cannot communicate with it using heartbeat datastores. In this situation, the master host cannot determine that the host is alive and so declares it dead. The master host then attempts to restart the virtual machines that are running on the isolated or partitioned host. This attempt succeeds if the virtual machines remain running on the isolated or partitioned host and that host lost access to the virtual machines' datastores when it became isolated or partitioned. A split-brain condition then exists because there are two instances of the virtual machine. However, only one instance is able to read or write the virtual machine's virtual disks. VM Component Protection (VMCP) can be used to prevent this split-brain condition. When you enable VMCP with the aggressive setting, it monitors the datastore accessibility of powered-on virtual machines, and shuts down those that lose access to their datastores.
To recover from this situation, when the host comes out of isolation and cannot reacquire the disk locks, ESXi generates a question on the virtual machine that has lost them. vSphere HA automatically answers this question, allowing the virtual machine instance that has lost the disk locks to power off, leaving just the instance that has the disk locks.
Factors Considered for Virtual Machine Restarts
After a failure, the cluster's master host attempts to restart affected virtual machines by identifying a host that can power them on. When choosing such a host, the master host considers a number of factors.
Before a virtual machine can be started, its files must be accessible from one of the active cluster hosts that the master can communicate with over the network.
Virtual machine and host compatibility
If there are accessible hosts, the virtual machine must be compatible with at least one of them. The compatibility set for a virtual machine includes the effect of any required VM-Host affinity rules. For example, if a rule permits a virtual machine to run on only two hosts, it is considered for placement only on those two hosts.
Of the hosts that the virtual machine can run on, at least one must have sufficient unreserved capacity to meet the memory overhead of the virtual machine and any resource reservations. Four types of reservations are considered: CPU, Memory, vNIC, and Virtual flash. Also, sufficient network ports must be available to power on the virtual machine.
In addition to resource reservations, a virtual machine can only be placed on a host if doing so does not violate the maximum number of allowed virtual machines or the number of in-use vCPUs.
If the advanced option that requires vSphere HA to enforce VM-to-VM anti-affinity rules has been set, vSphere HA does not violate these rules. Also, vSphere HA does not violate any configured per-host limits for fault-tolerant virtual machines.
If no hosts satisfy the preceding considerations, the master host issues an event stating that there are not enough resources for vSphere HA to start the VM and tries again when the cluster conditions have changed. For example, if the virtual machine is not accessible, the master host tries again after a change in file accessibility.
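The factors above act as successive filters on the candidate hosts. The sketch below models only a subset of them (file accessibility, required VM-Host affinity, CPU and memory reservations, network ports, and the maximum number of virtual machines per host). All class and field names are hypothetical. vNIC and virtual flash reservations, vCPU limits, anti-affinity rules, and fault tolerance limits are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class HostState:
    name: str
    accessible_datastores: Set[str]
    unreserved_cpu_mhz: int
    unreserved_mem_mb: int
    free_network_ports: int
    vm_count: int
    max_vms: int


@dataclass
class VMRequest:
    name: str
    datastores: Set[str]
    cpu_reservation_mhz: int
    mem_required_mb: int                     # memory reservation plus overhead
    network_ports: int
    allowed_hosts: Optional[Set[str]] = None  # required VM-Host affinity; None means any host


def eligible_hosts(vm, hosts):
    """Return the names of hosts that pass every modeled check for this VM."""
    result = []
    for h in hosts:
        if not vm.datastores <= h.accessible_datastores:
            continue  # VM files are not accessible from this host
        if vm.allowed_hosts is not None and h.name not in vm.allowed_hosts:
            continue  # excluded by a required VM-Host affinity rule
        if h.unreserved_cpu_mhz < vm.cpu_reservation_mhz:
            continue  # CPU reservation cannot be met
        if h.unreserved_mem_mb < vm.mem_required_mb:
            continue  # memory reservation or overhead cannot be met
        if h.free_network_ports < vm.network_ports:
            continue  # not enough network ports to power on
        if h.vm_count + 1 > h.max_vms:
            continue  # would exceed the per-host VM maximum
        result.append(h.name)
    return result
```

An empty result corresponds to the case in which the master host reports insufficient resources and tries again after cluster conditions change.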
Limits for Virtual Machine Restart Attempts
If the vSphere HA master agent's attempt to restart a VM, which involves registering it and powering it on, fails, this restart is retried after a delay. vSphere HA attempts these restarts for a maximum number of attempts (6 by default), but not all restart failures count against this maximum.
For example, a restart attempt most likely fails either because the VM is still running on another host or because vSphere HA tried to restart the VM too soon after it failed. In this situation, the master agent delays the retry attempt by twice the delay imposed after the last attempt, with a 1-minute minimum delay and a 30-minute maximum delay. Thus if the delay is set to 1 minute, there is an initial attempt at T=0, then additional attempts made at T=1 (1 minute), T=3 (3 minutes), T=7 (7 minutes), T=15 (15 minutes), and T=30 (30 minutes). Each such attempt is counted against the limit and only six attempts are made by default.
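The doubling schedule can be worked through with a short sketch. The function name and parameters are hypothetical, and the sketch applies strict doubling with the stated minimum and maximum delays. Strict doubling places the sixth attempt at 31 minutes rather than the 30 minutes listed above, so the final value should be read as approximate; the sketch does not try to model how the product arrives at 30.

```python
def restart_attempt_times(initial_delay_min=1, max_attempts=6,
                          min_delay=1, max_delay=30):
    """Return the minute offsets of each restart attempt.

    Assumes strict doubling of the delay after every failed attempt,
    clamped to the range [min_delay, max_delay].
    """
    times = [0]  # the initial attempt at T=0
    delay = min(max(initial_delay_min, min_delay), max_delay)
    while len(times) < max_attempts:
        times.append(times[-1] + delay)
        delay = min(delay * 2, max_delay)  # double, but never exceed the cap
    return times


print(restart_attempt_times())  # [0, 1, 3, 7, 15, 31]
```

The first five offsets match the T=0, 1, 3, 7, and 15 values in the text, and the count of six attempts matches the default limit.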
Other restart failures result in countable retries but with a different delay interval. An example scenario is when the host chosen to restart the virtual machine loses access to one of the VM's datastores after the master agent made its choice. In this case, a retry is attempted after a default delay of 2 minutes. This attempt also counts against the limit.
Finally, some retries are not counted. For example, if the host on which the virtual machine was to be restarted fails before the master agent issues the restart request, the attempt is retried after 2 minutes but this failure does not count against the maximum number of attempts.
Virtual Machine Restart Notifications
vSphere HA generates a cluster event when a failover operation is in progress for virtual machines in the cluster. The event also displays a configuration issue in the Cluster Summary tab which reports the number of virtual machines that are being restarted. There are four different categories of such VMs.
VMs being placed: vSphere HA is in the process of trying to restart these VMs.
VMs awaiting a retry: a previous restart attempt failed, and vSphere HA is waiting for a timeout to expire before trying again.
VMs requiring additional resources: insufficient resources are available to restart these VMs. vSphere HA retries when more resources become available, for example, when a host comes back online.
Inaccessible Virtual SAN VMs: vSphere HA cannot restart these Virtual SAN VMs because they are not accessible. It retries when there is a change in accessibility.
These virtual machine counts are dynamically updated whenever a change is observed in the number of VMs for which a restart operation is underway. The configuration issue is cleared when vSphere HA has restarted all VMs or has given up trying.
In vSphere 5.5 or earlier, a per-VM event is triggered for an unsuccessful attempt to restart the virtual machine. This event is disabled by default in vSphere 6.x and can be enabled by setting the vSphere HA advanced option das.config.fdm.reportfailoverfailevent to 1.