This guide accompanies the one on networking and focuses on troubleshooting of network connections.
For connections that use TLS there is an additional guide on troubleshooting TLS.
Troubleshooting of network connectivity issues is a broad topic. There are entire books written about it. This guide explains a methodology and widely available networking tools that help narrow most common issues down efficiently.
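Networking protocols are layered, and so are the problems with them. An effective troubleshooting strategy typically uses the process of elimination to pinpoint the issue (or multiple issues), starting at higher levels. Specifically for messaging technologies, the following steps are often effective and sufficient:
- Verify client configuration
- Verify server configuration
- Inspect server logs
- Verify hostname resolution
- Verify what TCP ports are used and whether they are accessible
- Verify IP routing
- If needed, capture and analyze network traffic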
These steps, when performed in sequence, usually help identify the root cause of the vast majority of networking issues. Troubleshooting tools and techniques for levels lower than the Internet (networking) layer are outside of the scope of this guide.
Certain problems only happen in environments with a high degree of connection churn. Client connections can be inspected using the management UI. It is also possible to inspect all TCP connections of a node and their state. That information collected over time, combined with server logs, will help detect connection churn, file descriptor exhaustion and related issues.
All developers and operators have been there: typos, outdated values, issues in provisioning tools, mixed up public and private key paths, and so on. Step one is to double check application and client library configuration.
Verifying server configuration helps prove that RabbitMQ is running with the expected set of settings related to networking. It also verifies that the node is actually running. Here are the recommended steps:
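- Make sure the node is running, using rabbitmq-diagnostics status
- Inspect listeners using rabbitmq-diagnostics listeners or the listeners section of rabbitmq-diagnostics status
- Inspect the effective configuration using rabbitmq-diagnostics environment
Note that in older RabbitMQ versions, the status and environment commands were only available as part of rabbitmqctl: rabbitmqctl status and so on. In modern versions either tool can be used to run those commands, but rabbitmq-diagnostics is what most documentation guides recommend.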
The listeners section will look something like this:
    Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
    Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
    Interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
    Interface: [::], port: 15672, protocol: http, purpose: HTTP API
    Interface: [::], port: 15671, protocol: https, purpose: HTTP API over TLS (HTTPS)
    Interface: [::], port: 1883, protocol: mqtt, purpose: MQTT
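In the above example, there are 6 TCP listeners on the node:
- Inter-node and CLI tool communication on port 25672
- AMQP 0-9-1 and AMQP 1.0 listener for non-TLS connections on port 5672
- AMQP 0-9-1 and AMQP 1.0 listener for TLS-enabled connections on port 5671
- HTTP API listeners on ports 15672 (HTTP) and 15671 (HTTPS)
- MQTT listener for non-TLS connections on port 1883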
In second example, there are 4 TCP listeners on the node:
All listeners are bound to all available interfaces.
Inspecting TCP listeners used by a node helps spot non-standard port configuration, protocol plugins (e.g. MQTT) that are supposed to be configured but aren't, cases when the node is limited to only a few network interfaces, and so on. If a port is not on the listener list it means the node cannot accept any connections on it.
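One way to apply this check is to compare the ports a node reports with the ports the applications expect. The following is a minimal sketch, not an official tool: it assumes listener lines have the "Interface: ..., port: N, protocol: ..." shape shown above, and the function name missing_listener_ports and the port list are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads listener lines on stdin and prints every
# expected port (passed as arguments) that is absent from the listing.
# Assumes each line contains "port: <number>," as in the example output above.
missing_listener_ports() {
    listing=$(cat)
    for port in "$@"; do
        case "$listing" in
            *"port: ${port},"*) ;;          # port is listed, nothing to report
            *) printf '%s\n' "$port" ;;     # port is missing from the listing
        esac
    done
}

# Usage against a live node (assumes rabbitmq-diagnostics is on PATH):
#   rabbitmq-diagnostics listeners | missing_listener_ports 5672 5671 1883
```

An empty result means every expected port has a listener. Any port printed is one the node cannot accept connections on.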
RabbitMQ nodes will log key client connection lifecycle events. A TCP connection must be successfully established and at least 1 byte of data must be sent by the peer for a connection to be considered (and logged as) accepted.
From this point, connection handshake and negotiation proceeds as defined by the specification of the messaging protocol used, e.g. AMQP 0-9-1, AMQP 1.0 or MQTT.
If no events are logged, this means that either there were no successful inbound TCP connections or they sent no data.
Besides hostname resolution and IP routing issues, TCP port inaccessibility for outside connections is a common reason for failing client connections. telnet is a commonly used, very minimalistic tool for testing TCP connections to a particular hostname and port.
The following example uses telnet to connect to host localhost on port 5672. A node with stock defaults is running on localhost and nothing blocks access to the port, so the connection succeeds. 12345 is then entered as input, followed by Enter. This data is sent to the node on the opened connection. Since 12345 is not a correct AMQP 0-9-1 or AMQP 1.0 protocol header, the server closes the TCP connection:
    telnet localhost 5672
    # => Trying ::1...
    # => Connected to localhost.
    # => Escape character is '^]'.
    12345 # enter this and hit Enter to send
    # => AMQP Connection closed by foreign host.
Once the telnet connection succeeds, use Control + ] and then Control + D to quit it.
The following example connects to localhost on port 5673. The connection fails (refused by the OS) since there is no process listening on that port.
    telnet localhost 5673
    # => Trying ::1...
    # => telnet: connect to address ::1: Connection refused
    # => Trying 127.0.0.1...
    # => telnet: connect to address 127.0.0.1: Connection refused
    # => telnet: Unable to connect to remote host
Failed or timing out telnet connections strongly suggest there's a proxy, load balancer or firewall that blocks incoming connections on the target port. It could also be that the RabbitMQ process is not running on the target node or uses a non-standard port. Those scenarios should be eliminated at the step that double checks server listener configuration.
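Where telnet is not installed, a similar reachability test can be sketched with the /dev/tcp pseudo-device built into bash. This is a hedged illustration rather than a recommended tool: the function name tcp_port_open and the 3 second limit are hypothetical choices, and the sketch assumes both bash and the coreutils timeout command are available on the host.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if a TCP connection to host:port can be
# opened within 3 seconds, non-zero otherwise.
# Assumes bash (for /dev/tcp) and the coreutils `timeout` command are available.
tcp_port_open() {
    host=$1
    port=$2
    # Inside the bash -c script, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

if tcp_port_open localhost 5672; then
    echo "port 5672 is reachable"
else
    echo "port 5672 is not reachable (closed, filtered or timed out)"
fi
```

Like telnet, this only proves that a TCP connection can be opened. It says nothing about whether the peer speaks a messaging protocol.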
There's a great number of firewall, proxy and load balancer tools and products. iptables is a commonly used firewall tool on Linux. There is no shortage of iptables tutorials on the Web.
The following example uses lsof to display OS processes that listen on port 5672 and use IPv4:
sudo lsof -n -i4TCP:5672 | grep LISTEN
Similarly, for programs that use IPv6:
sudo lsof -n -i6TCP:5672 | grep LISTEN
On port 1883:
sudo lsof -n -i4TCP:1883 | grep LISTEN
sudo lsof -n -i6TCP:1883 | grep LISTEN
If the above commands produce no output then no local OS processes listen on the given port.
The following example uses ss to display listening TCP sockets that use IPv4 and their OS processes:
sudo ss --tcp -f inet --listening --numeric --processes
Similarly, for TCP sockets that use IPv6:
sudo ss --tcp -f inet6 --listening --numeric --processes
For the list of ports used by RabbitMQ and its various plugins, see above. Generally all ports used for external connections must be allowed by the firewalls and proxies.
rabbitmq-diagnostics listeners and rabbitmq-diagnostics status can be used to list enabled listeners and their ports on a RabbitMQ node.
Messaging protocols supported by RabbitMQ use TCP and require IP routing between clients and RabbitMQ hosts to be functional. There are several tools and techniques that can be used to verify IP routing between two hosts. traceroute and ping are two common options available for many operating systems. Most routing table inspection tools are OS-specific.
Note that ping uses ICMP, and traceroute uses ICMP or UDP probes depending on the implementation, while RabbitMQ client libraries and inter-node connections use TCP. Therefore a successful ping run alone does not guarantee successful client connectivity.
Both traceroute and ping have Web-based and GUI tools built on top.
All network activity can be inspected, filtered and analyzed using a traffic capture. tcpdump and its GUI sibling Wireshark are the industry standards for capturing traffic, filtering and analysis. Both support all protocols supported by RabbitMQ. See the Using Wireshark with RabbitMQ guide for an overview.
For connections that use TLS there is a separate guide on troubleshooting TLS.
When adopting TLS it is important to make sure that clients use the correct port to connect (see the list of ports above) and that they are instructed to use TLS (perform TLS upgrade). A client that is not configured to use TLS will successfully connect to a TLS-enabled server port but its connection will then time out since it never performs the TLS upgrade that the server expects.
A TLS-enabled client connecting to a non-TLS enabled port will successfully connect and try to perform a TLS upgrade which the server does not expect, thus triggering a protocol parser exception. Such exceptions will be logged by the server.
The following example uses netstat to list all TCP connection sockets regardless of their state and interface. IP addresses will be displayed as numbers instead of being resolved to domain names, and ports as numbers instead of service names. The name of the program that owns each socket will be printed as well.
sudo netstat --all --numeric --tcp --programs
Both inbound (client, peer nodes, CLI tools) and outgoing (peer nodes, Federation links and Shovels) connections can be inspected this way.
Combining connection information from the management UI or CLI tools with that from netstat or ss can help troubleshoot misbehaving applications, application instances and client libraries.
High connection churn (lots of connections opened and closed after a brief period of time) can lead to resource exhaustion. It is therefore important to be able to identify such scenarios.
netstat and ss are the most popular options for inspecting TCP connections. A lot of connections in the TIME_WAIT state is a likely symptom of high connection churn. Lots of connections in states other than ESTABLISHED might also be a symptom worth investigating.
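To see how connections are distributed across states, the output of ss can be summarized per state. The sketch below is only an illustration: the function name count_tcp_states is hypothetical, and it assumes the state is the first column and the first line is a header, which is the case when ss is asked for TCP sockets only.

```shell
#!/bin/sh
# Hypothetical helper: reads `ss --tcp --all --numeric` style output on stdin
# and prints the number of sockets per TCP state, most frequent first.
# Assumes the state is the first column and the first line is a header.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' |
        sort -k1,1nr -k2,2
}

# Usage on a Linux host (note that ss prints TIME-WAIT and ESTAB where
# netstat prints TIME_WAIT and ESTABLISHED):
#   ss --tcp --all --numeric | count_tcp_states
```

Running this periodically and comparing the counts over time makes a growing number of sockets in TIME-WAIT easier to spot.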
Evidence of short-lived connections can be found in RabbitMQ log files. Here is an example of such a connection, which lasted only a few milliseconds:
    2018-06-17 16:23:29.851 [info] <0.634.0> accepting AMQP connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672)
    2018-06-17 16:23:29.853 [info] <0.634.0> connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
    2018-06-17 16:23:29.855 [info] <0.634.0> closing AMQP connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
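A rough measure of churn can be derived from such log entries by counting how many connections were accepted in each minute. The following is a minimal sketch under stated assumptions: the function name accepted_per_minute is hypothetical, and it assumes every log line starts with a date and a time in the "YYYY-MM-DD HH:MM:SS" shape shown above, which may differ between RabbitMQ versions and logging configurations.

```shell
#!/bin/sh
# Hypothetical helper: reads RabbitMQ log lines on stdin and prints how many
# AMQP connections were accepted in each minute, as "date HH:MM count".
# Assumes lines start with "YYYY-MM-DD HH:MM:SS..." as in the sample above.
accepted_per_minute() {
    awk '/accepting AMQP connection/ { print $1, substr($2, 1, 5) }' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Usage (the log file path is a placeholder, it varies by installation):
#   accepted_per_minute < /path/to/rabbit.log
```

Minutes with an unusually high count are good candidates for a closer look at which applications opened those connections.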