When a datastore becomes inaccessible, VMCP might not terminate and restart the affected virtual machines.
Problem
When an All Paths Down (APD) or Permanent Device Loss (PDL) failure makes a datastore inaccessible, VM Component Protection (VMCP) might not terminate and restart the affected virtual machines.
Cause
In an APD or PDL failure situation, VMCP might not terminate a virtual machine for the following reasons:
- The VM is not protected by vSphere HA at the time of the failure.
- VMCP is disabled for the virtual machine.
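The two conditions above apply to both APD and PDL failures. As a minimal sketch, assuming the VM's state is already known as two booleans (the function and parameter names are hypothetical and are not part of any vSphere API), the check looks like this:

```python
from typing import List


def general_blockers(ha_protected: bool, vmcp_enabled: bool) -> List[str]:
    """Hypothetical model: reasons that stop VMCP from terminating a VM
    after either an APD or a PDL failure. Empty list means none apply."""
    reasons = []
    if not ha_protected:
        reasons.append("VM is not protected by vSphere HA")
    if not vmcp_enabled:
        reasons.append("VMCP is disabled for this VM")
    return reasons


print(general_blockers(ha_protected=True, vmcp_enabled=True))   # []
print(general_blockers(ha_protected=True, vmcp_enabled=False))  # ['VMCP is disabled for this VM']
```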
If the failure is an APD, VMCP might also not terminate a VM for any of the following reasons:
- The APD failure is corrected before the VM is terminated.
- There is insufficient capacity on the hosts with which the virtual machine is compatible.
- During a network partition or isolation, the host affected by the APD failure cannot query the primary host for available capacity. In that case, vSphere HA defers to the user policy and terminates the VM only if the VM Component Protection setting is aggressive; with a conservative setting, the VM is not terminated.
- vSphere HA terminates APD-affected VMs only after both of the following timeouts expire, so an affected VM may simply not have reached its termination point yet:
  - APD timeout (default 140 seconds).
  - APD failover delay (default 180 seconds). For faster recovery, this can be set to 0.
Note: With these default values, vSphere HA terminates the affected virtual machine 320 seconds after the APD failure begins (140-second APD timeout + 180-second APD failover delay).
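The APD-specific reasons and the default timeline can be combined into one small model. This is a simplified sketch, not how vSphere HA is implemented: the function names and boolean inputs are hypothetical, and only the default timeout values and the aggressive/conservative behavior during a partition come from this article.

```python
from typing import List

# Default values stated in this article.
APD_TIMEOUT_S = 140          # APD timeout
APD_FAILOVER_DELAY_S = 180   # APD failover delay; can be set to 0


def apd_termination_time(apd_timeout_s: int = APD_TIMEOUT_S,
                         failover_delay_s: int = APD_FAILOVER_DELAY_S) -> int:
    """Seconds after the APD failure begins before vSphere HA may terminate the VM."""
    return apd_timeout_s + failover_delay_s


def apd_blockers(seconds_since_apd: int,
                 apd_cleared: bool,
                 capacity_available: bool,
                 partitioned_or_isolated: bool,
                 aggressive_policy: bool,
                 apd_timeout_s: int = APD_TIMEOUT_S,
                 failover_delay_s: int = APD_FAILOVER_DELAY_S) -> List[str]:
    """Hypothetical, simplified model of the APD-specific reasons listed above.
    Returns an empty list when none of them applies."""
    reasons = []
    if apd_cleared:
        reasons.append("APD failure was corrected before the VM was terminated")
    if seconds_since_apd < apd_termination_time(apd_timeout_s, failover_delay_s):
        reasons.append("APD timeout and failover delay have not both expired")
    if partitioned_or_isolated:
        # The host cannot query the primary host for capacity, so the
        # decision falls back to the VMCP setting.
        if not aggressive_policy:
            reasons.append("capacity unknown and VMCP setting is not aggressive")
    elif not capacity_available:
        reasons.append("insufficient capacity on compatible hosts")
    return reasons


print(apd_termination_time())                    # 320
print(apd_termination_time(failover_delay_s=0))  # 140
print(apd_blockers(200, False, True, False, True))
# ['APD timeout and failover delay have not both expired']
```

Setting the failover delay to 0 shortens the default window from 320 to 140 seconds, which is the "faster recovery" trade-off mentioned above.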
Solution
To address this issue, check and adjust any of the following:
- Available capacity on compatible hosts to restart the virtual machine
- User-configured timeouts and delays (APD timeout and APD failover delay)
- User settings affecting VM termination
- VM Component Protection policy
- Host monitoring and VM restart priority settings (host monitoring or VM restart priority must be enabled)
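The checklist above can be sketched as a small settings audit. This assumes the relevant cluster settings have already been collected into a plain record; the `HaSettings` fields and the policy strings are hypothetical and do not correspond to vSphere API property names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HaSettings:
    """Hypothetical snapshot of the checklist items. Field names and
    policy strings are illustrative, not vSphere API properties."""
    host_monitoring_enabled: bool
    vm_restart_priority_enabled: bool
    vmcp_apd_policy: str        # assumed values: "disabled", "conservative", "aggressive"
    apd_failover_delay_s: int
    capacity_available: bool


def settings_to_review(s: HaSettings) -> List[str]:
    """Return checklist items that may prevent termination and restart."""
    items = []
    if not (s.host_monitoring_enabled or s.vm_restart_priority_enabled):
        items.append("enable host monitoring or VM restart priority")
    if s.vmcp_apd_policy not in ("conservative", "aggressive"):
        items.append("set a VM Component Protection policy that terminates and restarts VMs")
    elif s.vmcp_apd_policy == "conservative":
        items.append("conservative policy: VMs are not terminated during a partition or isolation")
    if s.apd_failover_delay_s > 0:
        items.append(f"APD failover delay is {s.apd_failover_delay_s} s; 0 gives faster recovery")
    if not s.capacity_available:
        items.append("add capacity on hosts compatible with the VM")
    return items


for item in settings_to_review(HaSettings(False, False, "disabled", 180, False)):
    print("-", item)
```

A non-zero failover delay is reported as a tuning suggestion rather than an error, since the default of 180 seconds is a valid configuration.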