Server flapping, or bouncing between up and down, is a common issue. Generally, server flapping is caused by the server reaching or slightly exceeding the health monitor's maximum allowed response time.
To verify whether a server is flapping, check the specific server's analytics page within the pool. Enable the Alerts and System Events overlay icons on the main chart to show server up and down events over the selected time period. The page also lists the failed health monitors.
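If the up and down events are exported from the analytics page, the number of state changes can be counted with a short script. The following is a minimal sketch, not a product API: the `count_flaps` helper, the event tuple layout, and the sample timestamps are all hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta


def count_flaps(events, window):
    """Count up/down state changes within the trailing time window.

    events: list of (timestamp, state) tuples sorted by time, where
    state is "UP" or "DOWN" (a hypothetical export format).
    window: timedelta measured back from the most recent event.
    """
    if not events:
        return 0
    cutoff = events[-1][0] - window
    recent = [state for ts, state in events if ts >= cutoff]
    # A flap is any change of state between two consecutive events.
    return sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)


# Illustrative data only: a server that changes state four times in 8 minutes.
start = datetime(2024, 1, 1, 12, 0, 0)
events = [
    (start, "UP"),
    (start + timedelta(minutes=2), "DOWN"),
    (start + timedelta(minutes=3), "UP"),
    (start + timedelta(minutes=7), "DOWN"),
    (start + timedelta(minutes=8), "UP"),
]
transitions = count_flaps(events, timedelta(minutes=10))
print(f"state changes in the last 10 minutes: {transitions}")
```

A server with several state changes in a short window is a reasonable candidate for the timing comparison described next.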
Compare the response times from the server to the health monitor's configured receive timeout window. If the failures can be attributed to these timers, you can use the following steps to resolve the issue:
Add additional servers — This will not help if the slowdown is due to a backend database, but for servers that are simply busy or overloaded, this can be a quick and permanent fix.
Increase the health monitor's receive timeout window — The timeout value can be 1-300 seconds. The timeout value must always be shorter than the send interval for the health monitor.
Raise the number of successful checks required, and decrease the number of failed checks allowed — This will ensure the server is not brought back into the rotation as quickly, potentially giving it more time to handle the processes that are causing the slow response.
Change the connection ramp-up (if using the least connections load-balancing algorithm) — Servers can be susceptible to receiving too many connections too quickly when first brought up. For instance, if one server has 1 connection and the rest have 100 connections, then under the least connections algorithm, the new server should get the next 99 connections. This can easily overwhelm the server, leaving a flash crowd of connections that must be handled by the remaining servers and causing a domino effect. You can configure the connection ramp-up feature on the Advanced tab of the pool's configuration. The feature gradually increases the percentage of new connections sent to a new server. Increasing the ramp-up time can be beneficial if you are seeing a cascading failure of servers.
Set the maximum number of connections per server — This option, configurable on the Advanced tab of the pool configuration, ensures that servers are not overloaded and can handle connections at optimal speed.
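The timing comparison and the least connections arithmetic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's API: the function names, the sample response times, and the 4-second timeout and 10-second send interval are hypothetical. Only the 1-300 second range, the rule that the timeout must be shorter than the send interval, and the 1 versus 100 connections scenario come from the text.

```python
def timeout_failures(response_times, receive_timeout):
    """Return the response times (seconds) that reached or exceeded the timeout."""
    return [t for t in response_times if t >= receive_timeout]


def validate_monitor(receive_timeout, send_interval):
    """Check the two constraints stated above; return a list of problems."""
    problems = []
    if not 1 <= receive_timeout <= 300:
        problems.append("receive timeout must be between 1 and 300 seconds")
    if receive_timeout >= send_interval:
        problems.append("receive timeout must be shorter than the send interval")
    return problems


def least_connections_burst(new_server_conns, other_server_conns):
    """Connections a new server receives before it matches its least-loaded peer.

    Simplified model: without ramp-up, least connections keeps choosing the
    new server until its connection count equals the smallest count among
    the other servers.
    """
    return max(0, min(other_server_conns) - new_server_conns)


# Hypothetical measurements for one server, in seconds.
samples = [2.1, 3.9, 4.0, 4.3, 2.8]
slow = timeout_failures(samples, receive_timeout=4)
print(f"{len(slow)} of {len(samples)} checks reached the timeout: {slow}")

# An empty list means the hypothetical settings satisfy both constraints.
print(validate_monitor(receive_timeout=4, send_interval=10))

# The scenario from the text: 1 connection on the new server, 100 on the rest.
burst = least_connections_burst(1, [100, 100, 100])
print(f"new server receives the next {burst} connections")
```

If most failed checks cluster at or just above the timeout, raising the receive timeout (while keeping it below the send interval) is the natural first adjustment. A large burst value suggests that a longer ramp-up time or a per-server connection limit may help.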