In a DRS-activated cluster, you may observe a large number of DRS-initiated VM migrations that target only some of the hosts in the cluster.

DRS scans hosts in the cluster every minute for possible recommendations as a part of its load balancing workflow. The outcome relies on the results of compatibility checks of powered-on VMs against the hosts in the cluster. If there are compatibility constraints that narrow a VM's compatible set of hosts to only certain hosts in the cluster, then DRS will try to satisfy these constraints by migrating that VM to one of the compatible hosts.
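The effect of a narrowed compatible set can be illustrated with a toy model. This is not DRS code: the function name `migration_target` and the host names are hypothetical, and the selection rule is deliberately simplistic. It only shows the behavior described above, where a VM whose current host has dropped out of its compatible set becomes a candidate to move to one of the hosts that remain in it.

```python
from typing import Optional, Set


def migration_target(current_host: str, compatible_hosts: Set[str]) -> Optional[str]:
    """Return a hypothetical destination host, or None if no move is needed or possible."""
    if current_host in compatible_hosts:
        # The VM is still compatible with where it runs: nothing to satisfy.
        return None
    if not compatible_hosts:
        # No compatible host exists, so there is nowhere to migrate to.
        return None
    # Real DRS weighs load and other constraints; this toy model simply
    # picks the first compatible host in alphabetical order.
    return sorted(compatible_hosts)[0]


# esx-01 became incompatible for this VM, so the move targets a compatible host.
print(migration_target("esx-01", {"esx-02", "esx-03"}))  # esx-02
```

When many VMs share the same reduced compatible set, every resulting migration lands on the same few hosts, which matches the symptom described at the top of this article.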

The constraints typically come from two sources: static user configuration and runtime state changes in the cluster. This article focuses on runtime state changes, which users may not expect.

Runtime state changes that affect the compatibility between powered-on VMs and hosts fall into one of the following areas.

vSphere High Availability

In an HA-activated cluster, all hosts are expected to have a healthy HA status. If a host does not have a healthy HA status at some point, it generates a compatibility failure during the VM compatibility check. Examples of such situations are: HA agent unreachable, HA agent isolated, and HA agent partitioned. See "Troubleshooting vSphere HA Host States" for more information.

These state changes are typically accompanied by the following event in vCenter Server:

"vSphere HA agent on a host has an error"

Storage Accessibility

If a VM cannot access its configuration file (VMX), virtual disks (VMDK), or swap file from its current host, it fails the compatibility check and the current host is treated as incompatible. If a different host in the cluster can still access these files, DRS may try to migrate the VM to that host. The outcome of such a migration depends on whether the VM's VMX file is still accessible from its current host: if the VM has lost access only to its VMDK files, the migration can succeed; if it has lost access to its VMX file, the migration can fail.

Network Accessibility

In environments with NSX-T, the NSX component status can go down on some hosts, or on all hosts at different times. In vCenter Server versions earlier than 7.0 Update 2, this could lead to compatibility check failures between VMs and the affected hosts.

Resolution

  1. If you plan to perform an operation that could lead to one of the runtime state changes discussed above, temporarily set DRS to manual mode to avoid undesired migrations. Alternatively, temporarily set the DRS advanced option VmsPerLBIteration to 0. This tells DRS to scan no VMs during its load-balancing workflow, so it generates no migration recommendations.
  2. Starting with vCenter Server 7.0 Update 1, DRS provides an advanced option, CompatCheckTransientFailureTimeSeconds, that tolerates a powered-on VM's incompatibility with its current host for a predefined period. Configure this option to avoid undesired migrations caused by transient incompatibility.
Note: The default value of this option is 600 (10 minutes), which means DRS does not move a VM off its current host solely because of incompatibility with that host if the incompatibility lasts less than 10 minutes. The maximum value of this option is 3600 (60 minutes).

Starting with vCenter Server 7.0 Update 3 and 8.0 Update 1, this option can also be set to -1, which prevents DRS from moving a VM off its current host because of incompatibility with that host.

Starting with vCenter Server 8.0 Update 3, the default value of this option is -1.
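
The semantics of CompatCheckTransientFailureTimeSeconds described above can be summarized in a few lines. The following is a minimal sketch, not the DRS implementation: the function name `may_move_for_incompatibility` is hypothetical, and treating an incompatibility that lasts exactly as long as the configured value as "no longer tolerated" is an assumption, since this article only states what happens when it lasts less than that.

```python
MAX_TOLERANCE_SECONDS = 3600  # maximum value stated in this article
NEVER_MOVE = -1               # supported since 7.0 Update 3 and 8.0 Update 1


def may_move_for_incompatibility(incompatible_for_seconds: float,
                                 tolerance_seconds: int = 600) -> bool:
    """Return True if DRS may move a VM solely because its current host is incompatible.

    The default of 600 reflects releases before 8.0 Update 3, where the default became -1.
    """
    if tolerance_seconds == NEVER_MOVE:
        # Incompatibility with the current host never triggers a move.
        return False
    if tolerance_seconds > MAX_TOLERANCE_SECONDS:
        raise ValueError("value exceeds the documented maximum of 3600 seconds")
    # Assumption: the boundary case (equal durations) counts as no longer tolerated.
    return incompatible_for_seconds >= tolerance_seconds


print(may_move_for_incompatibility(120))        # False: shorter than 10 minutes
print(may_move_for_incompatibility(900))        # True: longer than 10 minutes
print(may_move_for_incompatibility(86400, -1))  # False: never move
```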

To set a DRS advanced option, perform the following steps in the vSphere Client:

  1. Right-click the DRS cluster and click Settings > vSphere DRS > Edit > Advanced Options > Add
  2. In the Option column, enter the option name.
  3. Click in the Value column to enter the desired value, then click OK for this setting to take effect.
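
The same setting can also be scripted. The sketch below uses pyVmomi, the Python bindings for the vSphere API, which this article does not mention; treat it as one possible approach rather than the documented procedure. The helper names `drs_option_pairs` and `apply_drs_options` are hypothetical, passing option values as strings is an assumption, and obtaining the `cluster` object (a `vim.ClusterComputeResource` from an authenticated session) is not shown.

```python
from typing import Dict, List, Tuple


def drs_option_pairs(options: Dict[str, int]) -> List[Tuple[str, str]]:
    """Convert {option name: value} into (key, value-as-string) pairs, sorted by name."""
    return [(name, str(value)) for name, value in sorted(options.items())]


def apply_drs_options(cluster, options: Dict[str, int]):
    """Start a reconfigure task that sets DRS advanced options on `cluster`.

    `cluster` is assumed to be a vim.ClusterComputeResource. The vCenter task
    is returned; waiting for it to finish is left to the caller.
    """
    # Imported lazily so drs_option_pairs() can be used without pyVmomi installed.
    from pyVmomi import vim

    option_values = [
        vim.option.OptionValue(key=key, value=value)
        for key, value in drs_option_pairs(options)
    ]
    spec = vim.cluster.ConfigSpecEx(
        drsConfig=vim.cluster.DrsConfigInfo(option=option_values)
    )
    # modify=True applies the spec incrementally rather than replacing
    # the whole cluster configuration.
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)


# Example: the pairs that would be sent to pause load-balancing scans.
print(drs_option_pairs({"VmsPerLBIteration": 0}))  # [('VmsPerLBIteration', '0')]
```

A call such as `apply_drs_options(cluster, {"CompatCheckTransientFailureTimeSeconds": -1})` would then correspond to adding that option through the Advanced Options dialog.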