vSphere 6.0 adds support for vMotion of MSCS clustered virtual machines.

Prerequisites for vMotion support:

  • vMotion is supported only for a cluster of virtual machines across physical hosts (CAB) with pass-through RDMs.
  • The vMotion network must be a 10Gbps Ethernet link. A 1Gbps Ethernet link is not supported for vMotion of MSCS virtual machines.
  • vMotion is supported for Windows Server 2008 SP2 and later releases. Windows Server 2003 is not supported.
  • The MSCS cluster heartbeat time-out must be modified to allow 10 missed heartbeats.
  • The virtual hardware version for the MSCS virtual machine must be version 11 or later.

Modifying the MSCS heartbeat time-out:

Failover cluster nodes use the network to send heartbeat packets to other nodes of the cluster. If a node does not receive a response from another node for a specified period of time, the cluster removes the node from cluster membership. By default, a guest cluster node is considered down if it does not respond within 5 seconds. Other nodes that are members of the cluster will take over any clustered roles that were running on the removed node.

An MSCS virtual machine can stall for a few seconds during vMotion. If the stall time exceeds the heartbeat time-out interval, the guest cluster considers the node down, which can lead to unnecessary failover. To allow leeway and make the guest cluster more tolerant, modify the heartbeat time-out interval to allow 10 missed heartbeats. The property that controls the number of allowed missed heartbeats is SameSubnetThreshold. Change it from its default value to 10 by running the following command from any one of the participating MSCS cluster nodes: cluster <cluster-name> /prop SameSubnetThreshold=10:DWORD.
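The effect of raising the threshold can be illustrated with a minimal sketch. It uses a simplified model in which the time-out interval is the heartbeat delay multiplied by the allowed number of missed heartbeats. The function name and the 7-second stall are hypothetical values chosen for illustration; they are not measurements of an actual vMotion operation.

```python
def node_considered_down(stall_seconds, delay_seconds, threshold):
    """Return True if a stall outlasts the heartbeat time-out interval.

    Simplified model: time-out interval = delay between heartbeats
    multiplied by the number of heartbeats that may be missed.
    """
    timeout_seconds = delay_seconds * threshold
    return stall_seconds > timeout_seconds


# Hypothetical stall length, for illustration only.
STALL_SECONDS = 7
# Heartbeat delay of 1 second, the default described in the text.
DELAY_SECONDS = 1

# Default threshold of 5 missed heartbeats.
down_with_default = node_considered_down(STALL_SECONDS, DELAY_SECONDS, threshold=5)
# Threshold raised to 10, as the text recommends.
down_with_ten = node_considered_down(STALL_SECONDS, DELAY_SECONDS, threshold=10)

print("Default threshold (5):", down_with_default)
print("Raised threshold (10):", down_with_ten)
```

Under these assumptions, the 7-second stall exceeds the default 5-second interval and would cause the node to be removed, but it stays within the 10-second interval that results from setting SameSubnetThreshold to 10.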

You can also adjust other properties to control the workload tolerance for failover. Delay controls how often heartbeats are sent between the clustered nodes. The default setting is 1 second and the maximum setting is 2 seconds. Set the SameSubnetDelay value to 1. Threshold controls how many consecutive heartbeats can be missed before a node considers its partner unavailable and triggers the failover process. The default threshold is 5 heartbeats and the maximum is 120 heartbeats. Together, delay and threshold determine the total elapsed time during which clustered Windows nodes can lose communication before a failover is triggered. When the clustered nodes are in different subnets, the corresponding properties are CrossSubnetDelay and CrossSubnetThreshold. Set the CrossSubnetDelay value to 2 and the CrossSubnetThreshold value to 10.
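As a worked example of how delay and threshold combine, the sketch below multiplies the recommended values for each case. It takes the delay values in seconds, as the text describes them; the dictionary layout and function name are assumptions made for this illustration and are not part of any cluster tool.

```python
# Recommended settings from the text; delay is taken to be in seconds.
RECOMMENDED_SETTINGS = {
    "same subnet": {"delay": 1, "threshold": 10},   # SameSubnetDelay, SameSubnetThreshold
    "cross subnet": {"delay": 2, "threshold": 10},  # CrossSubnetDelay, CrossSubnetThreshold
}


def tolerance_seconds(delay, threshold):
    """Total time nodes can lose communication before failover is triggered."""
    return delay * threshold


# Compute the tolerated communication loss for each case.
tolerances = {
    name: tolerance_seconds(**settings)
    for name, settings in RECOMMENDED_SETTINGS.items()
}

for name, seconds in tolerances.items():
    print(f"{name}: {seconds} seconds")
```

With these values, nodes in the same subnet tolerate 10 seconds of lost communication and nodes in different subnets tolerate 20 seconds, compared with 5 seconds under the default delay of 1 and threshold of 5.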