When a datastore becomes inaccessible, VM Component Protection (VMCP) might not terminate and restart the affected virtual machines.

Problem

When an All Paths Down (APD) or Permanent Device Loss (PDL) failure occurs and a datastore becomes inaccessible, VMCP might not terminate and restart the affected virtual machines.

Cause

In an APD or PDL failure situation, VMCP might not terminate a virtual machine for the following reasons:

  • The VM is not protected by vSphere HA at the time of the failure.
  • VMCP is disabled for the virtual machine.
Furthermore, if the failure is an APD, VMCP might not terminate a VM for any of the following reasons:
  • The APD failure is corrected before the VM is terminated.
  • Capacity is insufficient on the hosts with which the virtual machine is compatible.
  • During a network partition or isolation, the host affected by the APD failure cannot query the primary host for available capacity. In this case, vSphere HA defers to the user policy and terminates the VM only if the VM Component Protection setting is aggressive.
  • vSphere HA terminates APD-affected VMs only after the following timeouts expire:
    • APD timeout (default 140 seconds).
    • APD failover delay (default 180 seconds). For faster recovery, this can be set to 0.
      Note: Based on these default values, vSphere HA terminates the affected virtual machine after 320 seconds (APD timeout of 140 seconds + APD failover delay of 180 seconds).
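
The timing in the note above can be checked with a short calculation. This is only an illustrative sketch: the function and constant names are hypothetical and do not correspond to any vSphere API or setting name; only the default values (140 and 180 seconds) come from the text.

```python
# Illustrative arithmetic only; these names are hypothetical, not vSphere settings.
APD_TIMEOUT_S = 140         # default APD timeout, per the text above
APD_FAILOVER_DELAY_S = 180  # default APD failover delay, per the text above


def seconds_until_termination(apd_timeout_s=APD_TIMEOUT_S,
                              failover_delay_s=APD_FAILOVER_DELAY_S):
    """Return the total wait before vSphere HA terminates an APD-affected VM."""
    if apd_timeout_s < 0 or failover_delay_s < 0:
        raise ValueError("timeouts must be non-negative")
    # Both timers must expire, one after the other, so the waits add up.
    return apd_timeout_s + failover_delay_s


print(seconds_until_termination())                    # defaults
print(seconds_until_termination(failover_delay_s=0))  # failover delay set to 0
```

With the defaults the result is 320 seconds. With the failover delay set to 0, only the 140-second APD timeout remains.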

Solution

To address this issue, check and adjust any of the following:

  • Capacity available to restart the virtual machine
  • User-configured timeouts and delays
  • User settings affecting VM termination
  • VM Component Protection policy
  • Host monitoring or VM restart priority, which must be enabled
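
As a troubleshooting aid, the APD-related conditions listed under Cause can be written as a simple checklist. The sketch below is a hypothetical model, not vSphere code: the class, its fields, and the function are invented for illustration, and the logic encodes only what this topic states (for example, that during a partition or isolation the VM is terminated only under the aggressive policy).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ApdScenario:
    """Hypothetical snapshot of one VM during an APD failure (not a vSphere type)."""
    ha_protected: bool                    # VM protected by vSphere HA at failure time
    vmcp_enabled: bool                    # VMCP enabled for this VM
    apd_cleared_before_termination: bool  # APD corrected before the timeouts expired
    can_query_primary: bool               # False during network partition or isolation
    capacity_available: bool              # compatible hosts have capacity for a restart
    aggressive_policy: bool               # VM Component Protection setting is aggressive


def reasons_vm_not_terminated(s: ApdScenario) -> List[str]:
    """Return the reasons from this topic that would stop VMCP from terminating the VM."""
    reasons = []
    if not s.ha_protected:
        reasons.append("VM is not protected by vSphere HA")
    if not s.vmcp_enabled:
        reasons.append("VMCP is disabled for this VM")
    if s.apd_cleared_before_termination:
        reasons.append("APD was corrected before termination")
    if s.can_query_primary:
        if not s.capacity_available:
            reasons.append("insufficient capacity on compatible hosts")
    elif not s.aggressive_policy:
        # Capacity cannot be confirmed, so the user policy decides.
        reasons.append("primary host unreachable and policy is not aggressive")
    return reasons


isolated = ApdScenario(True, True, False, False, False, False)
print(reasons_vm_not_terminated(isolated))
```

An empty list means that none of the listed conditions applies, so the remaining wait is governed by the APD timeout and the APD failover delay.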