Service Engine Failure Detection

Failure detection is essential in achieving Service Engine high availability.

NSX Advanced Load Balancer relies on various methods to detect Service Engine failures, as listed:

Controller-to-SE Failure Detection Method
SE-to-SE Failure Detection Method
BGP-Router-to-SE Failure Detection Method

Controller-to-SE Failure Detection Method

In all deployments, the NSX Advanced Load Balancer Controller sends heartbeat messages to all Service Engines in all groups under its control once every 10 seconds. If a specific SE does not respond to six consecutive heartbeat messages, the Controller concludes that the SE is DOWN and moves all of its virtual services to new SEs.
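The counting rule above can be sketched in a few lines of Python. This is an illustrative model only, not the Controller's implementation: the class and function names are hypothetical, and only the 10-second interval and the six-miss limit come from the text.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_S = 10  # from the text: one heartbeat every 10 seconds
MISS_LIMIT = 6             # from the text: six consecutive missed heartbeats


@dataclass
class SeHeartbeatTracker:
    """Hypothetical per-SE counter of consecutive missed heartbeats."""
    consecutive_misses: int = 0
    is_down: bool = False

    def record(self, responded: bool) -> bool:
        """Record one heartbeat result; return True if the SE is considered down."""
        if responded:
            # Any response resets the run of consecutive misses.
            self.consecutive_misses = 0
            self.is_down = False
        else:
            self.consecutive_misses += 1
            if self.consecutive_misses >= MISS_LIMIT:
                self.is_down = True
        return self.is_down


def approx_detection_seconds() -> int:
    """Rough time to declare an SE down: interval multiplied by miss limit."""
    return HEARTBEAT_INTERVAL_S * MISS_LIMIT
```

With these values, a silent SE is declared down after roughly 6 x 10 = 60 seconds, which is why the faster datapath heartbeat described later matters for scaled-out virtual services.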

When vSphere High Availability is enabled and the Controller detects that a vSphere host failure has occurred, the SEs on that host transition to OPER_PARTITIONED or OPER_DOWN before six consecutive heartbeats are missed.

SEs (on the failed host) which have operational virtual services transition to OPER_PARTITIONED state.
SEs (on the failed host) which do not have any operational virtual services transition to OPER_DOWN state.

SE-to-SE Failure Detection Method

In the above-mentioned Controller-to-SE failure detection method, the Controller detects a Service Engine failure by sending periodic heartbeat messages over the management interface. However, this method will not detect datapath failures for the data interfaces on SEs.

To provide holistic failure detection, the Service Engine datapath heartbeat mechanism was devised, in which the Service Engines send periodic heartbeat messages to each other over the data interfaces.

By default, this communication is set to standard mode. It can also be configured for aggressive mode, as discussed in the Enabling Aggressive Mode using CLI section.

Service Engine Datapath Communication Modes

Depending on the Service Engine deployment, the three modes available for SE-to-SE inter-process communication are as discussed below:

Custom EtherTypes

This is the default mode applicable when the Service Engines are in the same subnet. The EtherTypes used are:

ETHERTYPE_AVI_IPC 0XA1C0
ETHERTYPE_AVI_MACINMAC 0XA1C1
ETHERTYPE_AVI_MACINMAC_TXONLY 0XA1C2

IP Encapsulation

This mode is applicable when the infrastructure does not permit the custom EtherTypes to pass through. Even in this mode, it is assumed that the Service Engines are in the same subnet. This is the default mode for AWS.

Configure IP encapsulation by using the se_ip_encap_ipc X command.

The following example shows how to configure IP encapsulation using the CLI.

#shell
Login: admin
Password: 
[GB-slough-cam:cd-avi-cntrl1]: > configure serviceengineproperties
[GB-slough-cam:cd-avi-cntrl1]: seproperties> se_bootup_properties
[GB-slough-cam:cd-avi-cntrl1]: seproperties:se_bootup_properties> se_ip_encap_ipc 1
[GB-slough-cam:cd-avi-cntrl1]: seproperties:se_bootup_properties> save
[GB-slough-cam:cd-avi-cntrl1]: seproperties:> save
[GB-slough-cam:cd-avi-cntrl1]: > reboot serviceengine <IP 1>
[GB-slough-cam:cd-avi-cntrl1]: > reboot serviceengine <IP 2>

Note:

For changes to the se_ip_encap_ipc command to be effective, reboot all Service Engines in the Service Engine group.

The IP protocols used in this mode are:

IPPROTO_AVI_IPC 73
IPPROTO_AVI_MACINMAC 97
IPPROTO_AVI_MACINMAC_TX 63

IP packets

This mode is applicable when the Service Engines are in different subnets. The IP packet destined to the destination Service Engine’s interface IP is sent to the next-hop router. The IP protocols used in this mode are:

IPPROTO_AVI_IPC_L3 75
IPPROTO_AVI_MACINMAC 97
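As a reading aid, the three modes and their identifiers can be summarized in a small Python lookup. This is a hypothetical sketch for the reader, not how a Service Engine selects its mode: the enum, the dictionary, and `pick_mode` are illustrative names, while the EtherType and IP protocol values are copied from the lists above.

```python
from enum import Enum


class IpcMode(Enum):
    CUSTOM_ETHERTYPES = "custom_ethertypes"
    IP_ENCAPSULATION = "ip_encapsulation"
    IP_PACKETS = "ip_packets"


# Identifier values copied from the lists in this section.
MODE_IDENTIFIERS = {
    IpcMode.CUSTOM_ETHERTYPES: {
        "ETHERTYPE_AVI_IPC": 0xA1C0,
        "ETHERTYPE_AVI_MACINMAC": 0xA1C1,
        "ETHERTYPE_AVI_MACINMAC_TXONLY": 0xA1C2,
    },
    IpcMode.IP_ENCAPSULATION: {
        "IPPROTO_AVI_IPC": 73,
        "IPPROTO_AVI_MACINMAC": 97,
        "IPPROTO_AVI_MACINMAC_TX": 63,
    },
    IpcMode.IP_PACKETS: {
        "IPPROTO_AVI_IPC_L3": 75,
        "IPPROTO_AVI_MACINMAC": 97,
    },
}


def pick_mode(same_subnet: bool, ethertypes_permitted: bool) -> IpcMode:
    """Map the deployment conditions described above to a mode (illustrative)."""
    if not same_subnet:
        return IpcMode.IP_PACKETS
    if ethertypes_permitted:
        return IpcMode.CUSTOM_ETHERTYPES
    return IpcMode.IP_ENCAPSULATION
```

For example, `pick_mode(same_subnet=True, ethertypes_permitted=False)` corresponds to the IP encapsulation case, where protocol numbers 73, 97, and 63 would need to be allowed between the Service Engines.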

BGP-Router-to-SE Failure Detection Method

With BGP configured, the SE-to-SE failure detection is augmented as described below:

Bidirectional Forwarding Detection (BFD) detects SE failures and prompts the router not to use the route to the failed SE for flow load balancing.
Routers detect SE failures using BGP protocol timers.

Failure Detection Algorithm

Consider an SE group on which a virtual service has been scaled out. The sequence followed for failure detection is as explained below:

The virtual service's primary SE sends periodic heartbeat messages to all of the virtual service's secondary SEs.
If an SE repeatedly fails to respond, the primary SE suspects that the SE may be down.
A notification is sent to NSX Advanced Load Balancer Controller indicating a possible SE failure.
NSX Advanced Load Balancer Controller sends a sequence of echo messages to confirm if the suspected Service Engine is indeed down.
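The two-stage sequence above (suspicion by the primary SE, then confirmation by the Controller) can be modeled as follows. This is a minimal sketch under stated assumptions: the function names and the returned state strings are hypothetical, the miss limits are parameters rather than product defaults, and the replies are supplied as plain lists of booleans instead of real network probes.

```python
from typing import Iterable


def consecutive_misses_reached(replies: Iterable[bool], limit: int) -> bool:
    """Return True once `limit` probes in a row have gone unanswered."""
    misses = 0
    for replied in replies:
        misses = 0 if replied else misses + 1
        if misses >= limit:
            return True
    return False


def detect_failure(hb_replies: Iterable[bool], echo_replies: Iterable[bool],
                   hb_miss_limit: int, echo_miss_limit: int) -> str:
    """Model the sequence: primary SE suspects, then the Controller confirms."""
    # Stage 1: the primary SE watches datapath heartbeat replies.
    if not consecutive_misses_reached(hb_replies, hb_miss_limit):
        return "healthy"
    # Stage 2: the primary SE has notified the Controller, which sends
    # echo messages to the suspected SE to confirm the failure.
    if consecutive_misses_reached(echo_replies, echo_miss_limit):
        return "down"
    return "suspected_only"
```

In this model an SE that stops answering datapath heartbeats but still answers the Controller's echo messages ends up as `suspected_only` and is not declared down.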

Based on the time frame and frequency of heartbeat messages sent across the Service Engines, the modes of operation are standard and aggressive. The algorithm for both modes is the same, with a difference in frequency and time frame, as explained below:

The primary SE sends heartbeat messages to each secondary SE at a configured interval, for instance, every 100 milliseconds. A run of consecutive unanswered heartbeats indicates that the given SE could be down. According to the settings shown in the second column of the table below, the primary SE suspects a secondary SE to be down if:
- 10 consecutive heartbeat messages fail over one second (standard), or
- 10 consecutive heartbeat messages fail over one second (aggressive). This threshold is the same in both modes by default, but it can be made more aggressive with the configuration parameters described later in this section.

As soon as the primary SE suspects that a secondary SE is down, it notifies the NSX Advanced Load Balancer Controller, which then sends echo messages to the suspect. According to the settings shown in the third column, the Controller declares the suspect down after:
- Four consecutive echo messages fail over eight seconds (standard), or
- Two consecutive echo messages fail over four seconds (aggressive).

By summing the values in the second and third columns, the Controller reaches a failure conclusion within nine seconds under standard settings, and within five seconds under aggressive settings.

The time taken to detect Service Engine failure based on SE-DP heartbeat failure is as follows:


Detection Mode	SE-SE HB Messaging	Controller-SE Echo Messages	Total Time for Failure Detection
Standard Mode	HB period: 100 ms; 10 consecutive failures	Echo period: 2 seconds; 4 consecutive failures	1 + 8 = 9 seconds
Aggressive Mode	HB period: 100 ms; 10 consecutive failures	Echo period: 2 seconds; 2 consecutive failures	1 + 4 = 5 seconds

Failure detection as fast as two seconds can be achieved with the following configuration. However, this is recommended only for bare-metal environments; in virtualized environments, it leads to false positives.

The following serviceengineproperties fields set the aggressive timeout values:

configure serviceengineproperties
se_runtime_properties
|   dp_aggressive_hb_frequency                    | 100 milliseconds               |
|   dp_aggressive_hb_timeout_count                | 5                              |
se_agent_properties
|   controller_echo_rpc_aggressive_timeout        | 500 milliseconds               |
|   controller_echo_miss_aggressive_limit         | 3                              |
Enabling Aggressive Mode using CLI

Service Engine failure detection can be set to aggressive mode only through the CLI, as explained below.

Log in to the shell prompt for NSX Advanced Load Balancer Controller and enter the following commands under the chosen Service Engine group:

[admin:1-Controller-2]: > configure serviceenginegroup AA-SE-Group
[admin:1-Controller-2]: serviceenginegroup> aggressive_failure_detection
[admin:1-Controller-2]: serviceenginegroup> save

Verify the settings using the following show command:

[admin:1-Controller-2]: > show serviceenginegroup AA-SE-Group  | grep aggressive

| aggressive_failure_detection   | True

Service Engine File System is Read-only

In some instances, a Service Engine's file system becomes read-only. This affects Service Engine functionality and can result in unexplained failures. The read-only state can be caused by a faulty physical disk or by the Service Engine running out of disk space. This condition can only be detected by analyzing the Service Engine syslog files.
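Because detection relies on reading syslog, a small log scan can help when triaging a suspect Service Engine. The Python sketch below is an assumption-laden example, not a documented tool: the function name is hypothetical, and the patterns are generic Linux messages (the ext4 remount notice and the standard error strings for a read-only file system and a full disk), not messages confirmed for Service Engine logs.

```python
import re
from typing import Iterable, List

# Assumed patterns: generic Linux messages, not confirmed SE-specific strings.
READ_ONLY_PATTERNS = [
    re.compile(r"Remounting filesystem read-only", re.IGNORECASE),
    re.compile(r"Read-only file system", re.IGNORECASE),
    re.compile(r"No space left on device", re.IGNORECASE),
]


def find_read_only_hints(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any of the assumed patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in READ_ONLY_PATTERNS)
    ]
```

To use it, pass the lines of a syslog file copied from the Service Engine, for example `find_read_only_hints(open(path))`, where `path` is wherever the file was saved. An empty result does not prove that the file system is healthy, since the actual messages may differ.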