The controllers must operate on low-latency disks. The cluster requires the disk storage system on each node to have a peak write latency of less than 300 ms and a mean write latency of less than 100 ms.
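
As a rough check of whether a given datastore can meet these numbers, a small probe that times synchronous writes can be run on a test VM backed by the same storage. The following is a minimal sketch, not a VMware utility; the file path, write size, and sample count are arbitrary assumptions.

    # Minimal sketch: time small fsync-ed writes and compare the peak and
    # mean latency against the 300 ms / 100 ms requirements above.
    import os
    import time

    SAMPLES = 100
    TEST_FILE = "/tmp/latency_probe.bin"   # assumed scratch path on the storage under test

    latencies_ms = []
    with open(TEST_FILE, "wb") as probe:
        for _ in range(SAMPLES):
            start = time.monotonic()
            probe.write(os.urandom(4096))   # 4 KiB write
            probe.flush()
            os.fsync(probe.fileno())        # force the write to disk
            latencies_ms.append((time.monotonic() - start) * 1000)
    os.remove(TEST_FILE)

    peak = max(latencies_ms)
    mean = sum(latencies_ms) / len(latencies_ms)
    print(f"peak={peak:.1f} ms (limit 300), mean={mean:.1f} ms (limit 100)")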

Problem

  • A deployed NSX Controller is disconnected from a controller cluster.
  • Unable to gather any controller logs because the disk partition is full.
  • If the storage system does not meet these requirements, the cluster can become unstable and cause system downtime.
  • TCP listeners that apply to a functioning NSX Controller no longer appear in the output of the show network connections of-type tcp command.
  • The disconnected controller attempts to join the cluster using an all-zeroes UUID, which is not valid.
  • The show control-cluster history command displays a message similar to:

    INFO.20150530-000550.1774:D0530 13:25:29.452639 1983 zookeeper_client.cc:774] Zookeeper client disconnected!

  • The output of the show log cloudnet/cloudnet_java-zookeeper*.log command in the NSX Controller console contains entries similar to:
    cloudnet_java-zookeeper.20150530-000550.1806.log-2015-05-30
    	 13:25:07,382 47956539 [SyncThread:1] WARN
    	 org.apache.zookeeper.server.persistence.FileTxnLog - fsync-ing the write ahead
    	 log in SyncThread:1 took 3219ms which will adversely effect operation latency.
    	 See the ZooKeeper troubleshooting guide
  • The NSX Controller logs contain entries similar to:
    D0525 13:46:07.185200 31975
    	 rpc-broker.cc:369] Registering address resolution for: 20.5.1.11:7777
    D0525 13:46:07.185246 31975
    	 rpc-tcp.cc:548] Handshake complete, both peers support the same
    	 protocol
    D0525 13:46:07.197654 31975
    	 rpc-tcp.cc:1048] Rejecting a connection from peer
    	 20.5.1.11:42195/ef447643-f05d-4862-be2c-35630df39060, cluster
    	 9f7ea8ff-ab80-4c0c-825e-628e834aa8a5, which doesn't match our cluster
    	 (00000000-0000-0000-0000-000000000000)
    D0525 13:46:07.222869 31975
    	 rpc-tcp.cc:1048] Rejecting a connection from peer
    	 20.5.1.11:42195/ef447643-f05d-4862-be2c-35630df39060, cluster
    	 9f7ea8ff-ab80-4c0c-825e-628e834aa8a5, which doesn't match our cluster
    	 (00000000-0000-0000-0000-000000000000)

Cause

This issue occurs due to slow disk performance, which adversely impacts the NSX Controller cluster.

  • Check for slow disks by looking for fsync messages in the /var/log/cloudnet/cloudnet_java-zookeeper log file. If an fsync takes more than one second, ZooKeeper logs an fsync warning message, which is a good indication that the disk is too slow. VMware recommends dedicating a Logical Unit Number (LUN) specifically to the control-cluster and/or moving the storage array closer to the control-cluster in terms of latency. (A log-scanning sketch follows this list.)
  • Read and write latency measurements feed a moving average (5 seconds by default), which is used to trigger an alert when the latency limit is breached. The alert is cleared after the average drops back below the low watermark. By default, the high watermark is set to 200 ms and the low watermark to 100 ms. Use the show disk-latency-alert config command to view the current configuration, and the set disk-latency-alert commands to change it, as in this example (a sketch of the watermark behavior follows this list):
    enabled=True   low-wm=51      high-wm=150
    nsx-controller # set disk-latency-alert enabled yes
    nsx-controller # set disk-latency-alert low-wm 100
    nsx-controller # set disk-latency-alert high-wm 200
  • Use the GET /api/2.0/vdn/controller/<controller-id>/systemStats REST API to fetch the latency alert status of a controller node.
  • Use the GET /api/2.0/vdn/controller REST API to see whether a disk latency alert has been detected on a controller node. (A sample API query follows this list.)
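
To locate the slow-fsync warnings mentioned above without reading the logs by hand, the ZooKeeper log files can be scanned for the warning text. The following is a minimal sketch that assumes the log location and warning format shown earlier in this article.

    # Minimal sketch: report every fsync warning slower than one second in the
    # cloudnet ZooKeeper logs (assumed location and message format).
    import glob
    import re

    PATTERN = re.compile(r"fsync-ing the write ahead log.*took (\d+)ms")

    for path in glob.glob("/var/log/cloudnet/cloudnet_java-zookeeper*.log*"):
        with open(path, errors="replace") as log_file:
            for line in log_file:
                match = PATTERN.search(line)
                if match and int(match.group(1)) > 1000:   # slower than one second
                    print(f"{path}: fsync took {match.group(1)} ms")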
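
The watermark behavior described above amounts to a moving average with hysteresis: the alert is raised when the average crosses the high watermark and is cleared only after the average falls back below the low watermark. The following minimal sketch illustrates that logic; the five-sample window stands in for the 5-second window, and the watermark values are the defaults cited in this article.

    # Minimal sketch of the disk-latency alert hysteresis described above.
    from collections import deque

    HIGH_WM_MS = 200          # default high watermark
    LOW_WM_MS = 100           # default low watermark
    window = deque(maxlen=5)  # stand-in for the 5-second moving-average window
    alert_active = False

    def record_latency(sample_ms):
        """Add one write-latency sample; return True while the alert is raised."""
        global alert_active
        window.append(sample_ms)
        average = sum(window) / len(window)
        if not alert_active and average > HIGH_WM_MS:
            alert_active = True    # breached the high watermark
        elif alert_active and average < LOW_WM_MS:
            alert_active = False   # recovered below the low watermark
        return alert_active

    for sample in (80, 250, 400, 380, 90, 60, 40, 30, 20, 10):
        print(f"sample={sample} ms  alert={record_latency(sample)}")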
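
Both REST calls above can be issued against NSX Manager with any HTTP client. The following minimal sketch uses Python's requests library; the manager address, credentials, and controller ID are placeholders.

    # Minimal sketch: query NSX Manager for controller statistics and the
    # controller list, which carry the disk latency alert information.
    import requests

    NSX_MANAGER = "https://nsx-manager.example.local"   # placeholder NSX Manager address
    AUTH = ("admin", "password")                        # placeholder credentials
    CONTROLLER_ID = "controller-1"                      # placeholder controller ID

    stats = requests.get(
        f"{NSX_MANAGER}/api/2.0/vdn/controller/{CONTROLLER_ID}/systemStats",
        auth=AUTH, verify=False)
    controllers = requests.get(
        f"{NSX_MANAGER}/api/2.0/vdn/controller", auth=AUTH, verify=False)

    print(stats.status_code, stats.text)
    print(controllers.status_code, controllers.text)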

Solution

  1. Deploy NSX Controllers on low-latency disks.
  2. Give each controller its own disk storage server. Do not share the same disk storage server between two controllers.

What to do next

For more information on how to view alerts, refer to View Disk Latency Alerts.