The behavior of the NSX Advanced Load Balancer after a failure in a GSLB deployment depends on where the failure occurs.

The failure can occur:

  1. At the leader site or at one of the follower sites

  2. For the entire site or only for the site's NSX Advanced Load Balancer Controller

Follower Site Failures

Note:

In our follower-site failure examples, we focus on infrastructure deployed in Santa Clara (the leader site), Chicago, and New York (the NY-1 and NY-2 sites).


Full-Site Failure

Consider a full-site failure at the NY-1 follower site.

  1. Santa Clara (the leader) and Chicago are active sites and therefore detect the failure.

  2. Administrative changes to the GSLB configuration continue to be possible on the leader, but they will not propagate to the NY-1 site.

  3. Both control-plane and data-plane health monitors will mark NY-1's GS members as Down. For more information, see GSLB Health Monitors.

  4. DNS service for the GSLB configuration remains operational at the two surviving sites (Santa Clara and Chicago).

  5. Global application service continues on the surviving sites (Santa Clara, Chicago, and NY-2).

Partial-Site Failure

If only the NSX Advanced Load Balancer Controller at the NY-1 site fails, the SEs continue to serve applications in headless mode.

  1. The leader and Chicago Controllers detect the failure using their control-plane health monitors.

  2. Any administrative changes made on the leader do not propagate to the NY-1 site.

  3. Data-plane health monitors running in Santa Clara and Chicago continue to perceive NY-1's members as Up.

  4. DNS service for the GSLB configuration remains operational at all three sites (because it comes from SEs, none of which have failed).

  5. Global application service continues on all four sites (Santa Clara, Chicago, NY-1, and NY-2).

Follower Site Recovery

The following holds true for recovery from either a full-site or a partial-site failure.

  1. The leader Controller in Santa Clara detects connectivity to the (newly rebooted) follower Controller at NY-1 and pushes the latest GSLB configuration to it.

  2. Other active sites likewise detect successful connectivity to the NY-1 follower Controller as a result of their control-plane health monitors.

  3. If the data-plane never went down (partial-site failure), no further action is required.

  4. If data-plane monitors for NY-1's GS members had been configured and had previously marked those members as Down, NY-1's members are marked Up and traffic to them resumes only after those data-plane monitors once again perceive good health (see the sketch below).

Leader Site Failures


Full-Site Failure

Consider a full-site failure at the Santa Clara leader site.



  1. As they are active sites, both Chicago and NY-1 detect the failure.

  2. No administrative changes to the GSLB configuration can be made.

  3. Both control-plane and data-plane health monitors mark Santa Clara's GS members as Down.

  4. DNS service for the GSLB configuration remains operational at the two surviving active sites (Chicago and NY-1).

  5. Global application service continues on the three surviving sites (Chicago, NY-1, and NY-2).

Partial-Site Failure

If only the NSX Advanced Load Balancer Controller at the Santa Clara site fails, the site’s SEs continue to serve applications in headless mode.

  1. As they are active sites, both Chicago and NY-1 detect the Controller failure using their control-plane health monitors.

  2. No administrative changes to the GSLB configuration can be made.

  3. Data-plane health monitors running in Chicago and NY-1 continue to perceive Santa Clara's members as Up.

  4. DNS service for the GSLB configuration remains operational at Santa Clara, Chicago, and NY-1.

  5. Global application service continues on all sites.

Leader Site Change

In neither of the above leader-site failure scenarios is a new leader automatically designated; there is no automatic re-election process. Instead, an optional, manual promotion can be initiated at any active follower site, either to restore the ability to make GSLB configuration changes or for site maintenance. Both Chicago and NY-1 qualify as potential leaders while Santa Clara's Controller is down. From either follower site, the steps are:

  1. A GSLB administrator logs into the follower site's Controller and commands it to become the new leader. Until it becomes the leader, it is the leader-designate.

  2. A take-over message is propagated to all other NSX Advanced Load Balancer sites, apprising them of the change in command.

  3. If the old leader (Santa Clara) comes back up, it assumes the role of a follower due to the take-over message queued for transmission while it was down.



GSLB followers trust all configuration commands from their leader. Although the command to become a leader might be considered a configuration command, it is highly privileged and never sent from another site. Rather, it requires an admin with the appropriate credentials to log into the Controller being promoted to the leadership role.

For more information, see Detaching GSLB Site from Unresponsive Leader Site.

Network Partitioning

Network partitioning occurs due to failures or outages in the internet or VPN infrastructure between the sites. During a partition, each site updates GS member state based on its own control-plane and data-plane health monitors. The two parts of the network act as independent, mutually exclusive subnetworks.



Hence, each site responds to DNS queries using only the member virtual services to which it has connectivity. For example, DNS queries to Santa Clara would resolve to vip-1, whereas DNS queries to Chicago or NY-1 would lead clients to either vip-3 or vip-6.

Santa Clara remains the leader during the network outage, notwithstanding its inability to access the other two NSX Advanced Load Balancer sites. No new leader is automatically elected on the other network partition, nor is it required that there be a leader.

When an active site within such a leaderless partition is promoted to the leadership role, the UUID of that site's Controller and the newly generated Gslb.view_id parameter are shared with the other followers within the partition. If a network repair causes a leader outside the partition to re-attempt GSLB configuration changes on sites within the partition, those attempts are rejected due to the view_id clash. This prevents the confusion that would otherwise arise from followers obeying configuration commands from more than one leader.

Site Configuration Errors

Errors in site configuration related to IP addresses and credentials show up when the site information is saved. Some sample errors are described below.

Authentication Failure

The username and password for the admin of the Boston site in this example can be unique to that site, or the same credentials can be used at all NSX Advanced Load Balancer GSLB sites.

Max Retry Login Failure

Appropriately authenticated individuals log into a leader to perform GSLB-related functions, such as reading the GSLB configuration or making changes to it. In addition, behind the scenes, the leader GSLB site robotically logs into follower GSLB sites to pass on configuration changes that can only be initiated from the leader. In both cases, a login-attempt lockout rule might be in force, whereby a certain number of failures locks the administrative account out for a specified number of minutes (30 by default).

Redress

When defining a new GSLB configuration or adding a GSLB site to an existing configuration, one specifies account credentials to be associated with the site. It is a best practice to define the same GSLB administrative account (for example, gslbadmin) for all participating GSLB sites. By associating a user-account profile that does not lock accounts out (for example, No-Lockout-User-Account-Profile) with that account, one can eliminate max retry login failures.

To track robotic actions separately from those of GSLB administrator personnel, assign staff members individual IDs of their own.

HTTP 400 Error

There are several GSLB contexts in which a 400 error might occur. The following example illustrates one possible restriction: an NSX Advanced Load Balancer site can participate in exactly one GSLB configuration. Invitations to join a second are rejected.