FAQs on Database Replication

This section discusses the FAQs and resolutions on Database Replication.

What is observed when a leader node disconnects from the NSX Advanced Load Balancer cluster?

Following are the observations when a leader node disconnects from the NSX Advanced Load Balancer cluster:

The leader node restarts and breaks the NSX Advanced Load Balancer cluster due to a lost quorum.
One of the follower nodes posts a message that its DB replication is not complete and that it therefore cannot become the leader.
The other follower node posts that its DB replication from the leader is complete and becomes the leader node.

What happens when none of the followers is able to complete database replication in an NSX Advanced Load Balancer cluster?

On the follower node, the following process is used to replicate the database:
Check if streaming replication of the database can be enabled. In general, it is expected to succeed as NSX Advanced Load Balancer allows for a reasonable number of WAL record files to satisfy this. If this fails or if it is the first time a follower node is added to the cluster, the follower node does a full backup, which usually takes time.
Once streaming replication is set up, NSX Advanced Load Balancer checks every minute that the replication is still occurring. If it detects a failure five consecutive times, it declares that the replication has failed and attempts a full replication.
When it attempts a full replication, NSX Advanced Load Balancer always makes a full copy of the entire database data directory into the next directory. Only when the copy is complete does it move this to the current directory for the database. If the full copy of the database files fails partway, the node still has a valid current directory.
When a node decides to do a full backup, it is essential to record this so that the node does not become a leader if the leader fails during this time. With this scheme, a follower node must be in sync before it can take over as leader. If the full sync fails, the safety mechanism prevents the node from taking over as leader, but with manual intervention, you can promote one of the nodes as the leader.
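
The monitoring rule described above (a check every minute, with five consecutive failures triggering a full replication) can be sketched as a small counter. This is an illustrative model only, not the product's implementation: the class name `ReplicationMonitor`, the method `record_check`, and the returned action strings are hypothetical, and the assumption that a successful check resets the failure count follows from the word "consecutive" rather than from any documented behavior.

```python
from dataclasses import dataclass

# Number of consecutive failed checks before a full replication is attempted,
# as stated in the text above.
FAILURE_THRESHOLD = 5


@dataclass
class ReplicationMonitor:
    """Hypothetical model of the per-minute streaming replication check."""

    consecutive_failures: int = 0
    full_replication_pending: bool = False

    def record_check(self, streaming_ok: bool) -> str:
        """Record the result of one check and return the resulting action."""
        if streaming_ok:
            # Assumption: any successful check resets the failure streak.
            self.consecutive_failures = 0
            return "streaming"

        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            # Replication is declared failed; a full replication is attempted
            # and the node must not become leader until it completes.
            self.consecutive_failures = 0
            self.full_replication_pending = True
            return "full_replication"
        return "retry"
```

Calling `record_check(False)` five times in a row returns `"retry"` four times and then `"full_replication"`; a single `record_check(True)` in between starts the count again.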

Do the cluster_mgr logs show the sync-up every minute once a full replication is attempted?

Logs for the full replication process can be found in /var/lib/avi/log/postgres_service_main.log and postgres_service_metrics.log.

What are the common causes of replication failure?

Persistent connectivity issues can cause replication failures. Streaming replication always succeeds when postgres on the follower node can recover from its checkpoint and is behind the leader. In some instances, when postgres is not brought down gracefully, it may not be able to replicate from the leader. This forces a full backup from the leader.

Which directory is the ‘current’ directory and which directory is the ‘next’ directory during the database replication process?

For the config database: next is /var/lib/postgresql/9.3/main_backup and current is /var/lib/postgresql/9.3/main/.

For the metrics database: next is /var/lib/postgresql/9.3/pg_metrics/metrics_backup and current is /var/lib/postgresql/9.3/pg_metrics/metrics.
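
The role of the two directories can be illustrated with a short sketch: the full copy goes into the next directory first, and the current directory is replaced only after the copy has finished, so a failed copy leaves the current directory intact. This is a sketch under stated assumptions, not the product's code: the function name `full_copy_then_swap` and the temporary `.old` suffix are hypothetical, a local directory copy stands in for the transfer from the leader, and the example is meant to be run against scratch directories rather than the real postgres paths.

```python
import os
import shutil


def full_copy_then_swap(source_dir: str, next_dir: str, current_dir: str) -> None:
    """Copy source_dir into next_dir, then promote next_dir to current_dir.

    If the copy raises, current_dir has not been touched and remains valid.
    """
    next_dir = next_dir.rstrip(os.sep)
    current_dir = current_dir.rstrip(os.sep)

    # Discard any partial copy left behind by an earlier failed attempt.
    shutil.rmtree(next_dir, ignore_errors=True)

    # Stage the full copy. An exception here leaves current_dir as it was.
    shutil.copytree(source_dir, next_dir)

    # Promote the staged copy only after it is complete.
    old_dir = current_dir + ".old"  # hypothetical temporary name
    shutil.rmtree(old_dir, ignore_errors=True)
    if os.path.exists(current_dir):
        os.rename(current_dir, old_dir)
    os.rename(next_dir, current_dir)
    shutil.rmtree(old_dir, ignore_errors=True)
```

The design point is that the expensive, failure-prone step (the copy) never writes into the directory the database is using; only the cheap rename step touches it.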

What are the symptoms of replication failures? Are they only noticeable during leader election? Are internal events generated for such events?

DB REPLICATION FAILED events are observed on the NSX Advanced Load Balancer when it detects replication failures.

Is there any relation to cluster convergence? If so, what are the recent improvements on this in detail?

When NSX Advanced Load Balancer detects replication failure, a full backup is triggered. As a part of this, it touches a file in /var/lib/avi/etc to indicate that the replication has not been completed. So, the specific node does not participate in becoming a leader if there is a convergence. Also, the state for this service goes to COPYING_DB_FROM_LEADER, which will move the node state from CLUSTER_ACTIVE to CLUSTER_STARTING. The overall cluster state is not observed as CLUSTER_HA_ACTIVE.
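
The marker-file safeguard described above can be modeled as follows. This is a minimal sketch, not the actual mechanism: the source does not name the file that is touched in /var/lib/avi/etc, so `MARKER_NAME` is a made-up placeholder, and all function names are hypothetical. Removing the marker when the backup completes, and treating every other service state as CLUSTER_ACTIVE, are simplifying assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical file name. The source only says that a file is touched in
# /var/lib/avi/etc; it does not give the name.
MARKER_NAME = "db_replication_incomplete"


def begin_full_backup(etc_dir: Path) -> None:
    """Record that replication on this node has not been completed."""
    (etc_dir / MARKER_NAME).touch()


def finish_full_backup(etc_dir: Path) -> None:
    """Assumption: the marker is cleared once the full backup succeeds."""
    (etc_dir / MARKER_NAME).unlink(missing_ok=True)


def can_become_leader(etc_dir: Path) -> bool:
    """A node with the marker present stays out of leader election."""
    return not (etc_dir / MARKER_NAME).exists()


def node_state(service_state: str) -> str:
    """Map the service state to the node state named in the text.

    Assumption: any state other than COPYING_DB_FROM_LEADER is treated as
    healthy here, which is a simplification.
    """
    if service_state == "COPYING_DB_FROM_LEADER":
        return "CLUSTER_STARTING"
    return "CLUSTER_ACTIVE"
```

Pointing `etc_dir` at a scratch directory, `can_become_leader` returns `False` between `begin_full_backup` and `finish_full_backup`, which mirrors why a node in the middle of a full backup does not take over during a convergence.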

Do recent versions of NSX Advanced Load Balancer still use ZooKeeper? If not, why does it still run on NSX Advanced Load Balancer and what has changed?

Recent versions of NSX Advanced Load Balancer do not use ZooKeeper (ZK). The zookeeper process still runs for backward compatibility. When NSX Advanced Load Balancer is upgraded from an earlier version to 17.2, the Controllers must publish the leadership information in ZK for the SEs that are not yet on 17.2. Once the upgrade is complete, the SEs use the new scheme for leadership notifications. ZK will be removed once all upgrades start from 17.2.x.