The NSX Advanced Load Balancer Controller and Service Engines can be deployed in an environment with VMware vSphere high availability (HA) and Distributed Resource Scheduler (DRS) features enabled. This topic describes the high availability considerations that apply when vSphere HA or DRS acts on these VMs, or when a live vMotion is performed.

About VMware vSphere High Availability

VMware vSphere High Availability delivers the availability required by most applications running in virtual machines, independent of the operating system and applications running in it. High Availability provides uniform, cost-effective fail-over protection against hardware and operating system outages within your virtualized IT environment.

For more information on High Availability, see High Availability Options.

VMware vSphere Distributed Resource Scheduler

VMware DRS allows the grouping of hosts into resource clusters to separate the computing needs of different business units. VMware vSphere clusters allow you to:

  • Provide highly available resources to your workloads

  • Balance workloads for optimal performance

  • Scale and manage computing resources without service disruption

For more details on Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM), see DRS-DPM.

About VMware vSphere vMotion

VMware vSphere vMotion is a zero-downtime live migration of workloads from one server to another. During the migration, the application keeps running and you retain access to the systems you need. For more details on vMotion, see vSphere vMotion.

Note:

Deployments must adhere to recommendations provided by VMware for configuring VMware HA, DRS, and vSAN features.

Deploying the NSX Advanced Load Balancer in VMware HA Enabled Hosts

The NSX Advanced Load Balancer supports Controller clusters and Service Engines when deploying on hosts with VMware HA enabled.

Deployment Prerequisites

If a VMware Cluster is configured with vSphere HA and enabled with dedicated failover hosts, the same set of hosts must be configured in the Service Engine Group properties in the Host Exclude List field.
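The prerequisite above can be checked with a small comparison. The following is a minimal sketch, assuming the dedicated failover hosts and the contents of the Host Exclude List field have already been exported as plain lists of host names. The function name and the sample host names are hypothetical and are not part of any NSX Advanced Load Balancer or vSphere API.

```python
def missing_from_exclude_list(failover_hosts, host_exclude_list):
    """Return the dedicated failover hosts that are absent from the
    Service Engine Group Host Exclude List.

    Both arguments are plain lists of host names (an assumption of this
    sketch). Names are compared case-insensitively.
    """
    excluded = {name.strip().lower() for name in host_exclude_list}
    return sorted(
        host for host in failover_hosts
        if host.strip().lower() not in excluded
    )


# Hypothetical host names, for illustration only.
failover_hosts = ["esx-03.lab.local", "esx-04.lab.local"]
host_exclude_list = ["ESX-03.lab.local"]

for host in missing_from_exclude_list(failover_hosts, host_exclude_list):
    print(f"Add {host} to the Host Exclude List of the Service Engine Group")
```

An empty result means every dedicated failover host is already excluded.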

Observations when a Service Engine is marked down due to a Host Failure

The following observations and considerations apply when the NSX Advanced Load Balancer is integrated with vCenter and a Service Engine is marked down due to a host failure:

  • Existing Service Engines are available in the Service Engine Group and have capacity.

    • If an SE fails and existing Service Engines are available to host the virtual services, the Controller immediately programs the virtual services on those SEs. This is standard functionality, irrespective of the VMware HA configuration.

  • Service Engines in the Service Engine Group do not have capacity.

    • When the Controller is unable to deploy another Service Engine because the maximum SE capacity has been reached in the SE Group and no existing SEs have additional capacity, vSphere HA will restart the failed Service Engine VMs on a different ESXi host. As per the NSX Advanced Load Balancer test qualifications, this operation takes 2-3 minutes.

When the Controller needs to deploy another Service Engine (in write access mode) as the existing SEs do not have capacity, the Controller will initiate deployment of a new Service Engine VM. As per the NSX Advanced Load Balancer test qualifications, this operation takes 4-5 minutes.

  • Service Engine workloads running on VMware DRS enabled cluster.

    • The vCenter administrator must ensure that all ESXi hosts that are part of the VMware DRS enabled cluster and will host Service Engine workloads have the following configuration:

      • vSwitch0 created with no Physical Adapters attached (Internal only switch).

      • vSwitch0 must have the Virtual Machine Port Group for a Standard Switch PG created with the name set to Avi Internal.
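The host configuration requirements for DRS enabled clusters can be expressed as a simple validation. The following is a minimal sketch that assumes the host networking inventory has already been collected into plain dictionaries. The record layout (`name`, `vswitches`, `physical_adapters`, `port_groups`) is hypothetical and does not correspond to a vCenter API schema.

```python
REQUIRED_SWITCH = "vSwitch0"
REQUIRED_PORT_GROUP = "Avi Internal"


def check_host(host):
    """Return a list of problems for one host record (empty means compliant).

    `host` is a dictionary in a hypothetical layout:
    {"name": str, "vswitches": [{"name": str,
                                 "physical_adapters": [str],
                                 "port_groups": [str]}]}
    """
    switch = next(
        (s for s in host.get("vswitches", []) if s["name"] == REQUIRED_SWITCH),
        None,
    )
    if switch is None:
        return [f"{host['name']}: {REQUIRED_SWITCH} not found"]

    problems = []
    # The switch must be internal only, so no physical adapters may be attached.
    if switch.get("physical_adapters"):
        problems.append(f"{host['name']}: {REQUIRED_SWITCH} has physical adapters attached")
    if REQUIRED_PORT_GROUP not in switch.get("port_groups", []):
        problems.append(f"{host['name']}: port group '{REQUIRED_PORT_GROUP}' is missing")
    return problems


# Hypothetical inventory, for illustration only.
hosts = [
    {"name": "esx-01", "vswitches": [
        {"name": "vSwitch0", "physical_adapters": [], "port_groups": ["Avi Internal"]}]},
    {"name": "esx-02", "vswitches": [
        {"name": "vSwitch0", "physical_adapters": ["vmnic0"], "port_groups": []}]},
]

for host in hosts:
    for problem in check_host(host):
        print(problem)
```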

Live vMotion Migrations

The behavior with proactive vMotion migration of Controllers and Service Engines is the same as described in the Deploying NSX Advanced Load Balancer in vSphere DRS Enabled section below.

vSphere HA for SE

The virtual service operational availability and the time taken for switchover in case of a vSphere HA failover for an SE are as follows:

  • Case 1: Virtual service slots are available in another SE (the virtual service is already scaled out).

    • Virtual Service Placement/Switchover Time: The virtual service switches over to the available SE. The time it takes for the switchover is as described in the SE-to-SE Failure Detection Method topic in the VMware NSX Advanced Load Balancer Configuration Guide.

    • Virtual Service Recovery by Controller or vSphere HA: The virtual service recovers before vSphere HA brings up the SE.

  • Case 2: Virtual service slots are available in another SE (the virtual service is not scaled out).

    • Virtual Service Placement/Switchover Time: The virtual service switch to the other SE takes around a minute, based on the default Controller-to-SE detection time.

    • Virtual Service Recovery by Controller or vSphere HA: The virtual service recovers before vSphere HA brings up the SE.

  • Case 3: Virtual service slots are NOT available in other SEs in the SE Group.

    • Virtual Service Placement/Switchover Time: Detection time is the same as in case 1. The Controller spins up a new SE, based on SE group capacity, and places the virtual service on it.

    • Virtual Service Recovery by Controller or vSphere HA: The virtual service is placed on the new SE. The approximate time is 3 to 5 minutes.
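The three cases above reduce to a lookup on two conditions: whether other SEs in the SE Group have free virtual service slots, and whether the virtual service is already scaled out. The following is a minimal sketch of that mapping. The function name is hypothetical, and the returned strings only restate the descriptions given above; they are not values reported by the Controller.

```python
def expected_recovery(slots_available, scaled_out):
    """Map the two conditions to (recovery mechanism, approximate time).

    slots_available: other SEs in the SE Group have free virtual service slots.
    scaled_out: the virtual service already runs on more than one SE.
    """
    if slots_available and scaled_out:
        return (
            "switchover to an SE that already hosts the virtual service",
            "per the SE-to-SE failure detection method",
        )
    if slots_available:
        return (
            "Controller places the virtual service on another SE",
            "around a minute (default Controller-to-SE detection time)",
        )
    # No free slots: scale-out state does not change the outcome.
    return (
        "Controller spins up a new SE and places the virtual service on it",
        "approximately 3 to 5 minutes",
    )


for slots, scaled in [(True, True), (True, False), (False, False)]:
    mechanism, duration = expected_recovery(slots, scaled)
    print(f"slots={slots}, scaled_out={scaled}: {mechanism}; {duration}")
```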

Deploying NSX Advanced Load Balancer in vSphere DRS Enabled

  • Controller Clusters:

    NSX Advanced Load Balancer supports the Controller clusters when deployed on hosts with vSphere DRS enabled.

    vSphere DRS ensures that the Controller VMs are available during the vMotion of a Controller VM node.

    Note:

    There might be a momentary loss of real-time metrics data, and added latency on API calls to the Controller, during the live vMotion window.

  • Service Engines:

    The NSX Advanced Load Balancer supports Service Engines while deploying on hosts with vSphere DRS enabled.

    • Deployment Prerequisites: Using the Level-5 (Aggressive) migration level is not recommended, because high CPU alarms in the normal course of operation can trigger vMotion of Service Engines. For more information on high CPU utilization for Service Engines as reported by the hypervisor, see the Disparity in CPU Utilization topic in the VMware NSX Advanced Load Balancer Monitoring and Operability Guide.

  • Impact on Application (Data-Plane) Traffic:

    The following data is based on the qualification of NSX Advanced Load Balancer in a representative test topology:

    • vMotion results in data-plane reconfiguration at the hypervisor level. The observations of the NSX Advanced Load Balancer with various application protocols are as follows:

      • In case of TCP based applications, the TCP protocol’s retry mechanism resolves any lost packets, without impacting the application. No traffic loss was observed.

      • In case of UDP or ICMP based applications, there is a possibility of traffic failures during a small time window. Traffic loss of 15-30 ms was observed.

    • During vMotion, the Controllers and other Service Engines can register a momentary loss of data-plane heartbeats to the Service Engine being vMotioned. This manifests as the following events, generated in succession and visible on the Controller:

      • SE_DP_HB_FAILED, followed by

      • SE_DP_HB_RECOVERED

    This sequence of events can be ignored when it has occurred during the time of vMotion and has subsequently recovered.
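The rule above can be applied to an exported event list when reviewing events after a migration. The following is a minimal sketch, assuming events are available as `(timestamp, se_name, event_id)` tuples and that the vMotion start and end times are known. The tuple layout, the function name, and the sample data are hypothetical; only the two event names come from this topic.

```python
from datetime import datetime, timedelta

FAILED = "SE_DP_HB_FAILED"
RECOVERED = "SE_DP_HB_RECOVERED"


def ignorable_heartbeat_flaps(events, vmotion_start, vmotion_end):
    """Return (se_name, failed_at, recovered_at) for each heartbeat failure
    that began inside the vMotion window and was later followed by a
    recovery event for the same Service Engine.
    """
    pending = {}      # se_name -> timestamp of the unresolved failure
    ignorable = []
    for timestamp, se_name, event_id in sorted(events):
        if event_id == FAILED:
            pending[se_name] = timestamp
        elif event_id == RECOVERED and se_name in pending:
            failed_at = pending.pop(se_name)
            if vmotion_start <= failed_at <= vmotion_end:
                ignorable.append((se_name, failed_at, timestamp))
    return ignorable


# Hypothetical sample data, for illustration only.
t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    (t0 + timedelta(seconds=5), "se-1", FAILED),
    (t0 + timedelta(seconds=6), "se-1", RECOVERED),
    (t0 + timedelta(minutes=5), "se-2", FAILED),  # outside the window, no recovery
]
for se_name, failed_at, recovered_at in ignorable_heartbeat_flaps(
        events, t0, t0 + timedelta(minutes=1)):
    print(f"{se_name}: heartbeat loss at {failed_at}, recovered at {recovered_at}")
```

A failure that falls outside the window, or that has no matching recovery event, is not returned and still needs investigation.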