Are your 502 errors the result of your infrastructure, TAS for VMs, or an app? This article helps you understand and debug these errors.
In your deployment, 502 errors can come from any of the following:
If you are unsure of the source of 502 errors, see General Debugging Steps below.
Some general debugging steps for any issue resulting in 502 errors are as follows:
Gather the Gorouter logs and Diego Cell logs at the time of the incident. To SSH into the router VM, see Advanced Troubleshooting with the BOSH CLI. To download the router VM logs from Ops Manager, see Monitoring VMs.
Review the logs and consider the following:
Consider the following:
The following table describes the log levels supported by Gorouter. The log level is set to debug
and is not configurable.
Message | Description | Examples |
---|---|---|
fatal |
Gorouter is unable to handle any requests due to a fatal error. | Gorouter cannot bind to its TCP port, a TAS for VMs component has published invalid data to Gorouter. |
error |
An unexpected error has occurred. | Gorouter failed to fetch token from UAA service. |
info |
An expected event has occurred. | Gorouter started or exited, Gorouter has begun to prune routes for stale droplets. |
debug |
A lower-level event has occurred. | Route registration, route unregistration. |
This section section provides a sample Gorouter log entry and explanation of the contents.
[2017-02-01 22:54:08+0000] {"log_level":0,"timestamp":1485989648.0895808,"message":"endpoint-registered","source":"vcap.Gorouter.registry","data":{"uri":"0-*.login.bosh-lite.com","backend":"10.123.0.134:8080","modification_tag":{"guid":"","index":0}}}
Property | Description |
---|---|
log_level |
Logging level of the message |
timestamp |
Epoch time of the log |
message |
Content of the log entry |
source |
Gorouter function that initiated the log entry |
data |
Additional information that varies based on the message |
This section provides details about Gorouter access logs.
Gorouter generates an access log in the following format when it receives a request:
<Request Host> - [<Start Date>] "<Request Method> <Request URL> <Request Protocol>" <Status Code> <Bytes Received> <Bytes Sent> "<Referrer>" "<User-Agent>" <Remote Address> <Backend Address> x_forwarded_for:"<X-Forwarded-For>" x_forwarded_proto:"<X-Forwarded-Proto>" vcap_request_id:<X-Vcap-Request-ID> response_time:<Response Time> gorouter_time:<Gorouter Time> app_id:<Application ID> app_index:<Application Index> x_cf_routererror:<X-Cf-RouterError> <Extra Headers>
Gorouter access logs are also redirected to syslog.
See the list below for more information about the Gorouter access log fields:
The following are optional fields: Status Code
, Response Time
, Application ID
, Application Index
, X-Cf-RouterError
, and Extra Headers
.
If the access log lacks a Status Code
, Response Time
, Application ID
, Application Index
, or X-Cf-RouterError
, the corresponding field shows -
.
Response Time
is the total time it takes for the request to go through the Gorouter to the app and for the response to travel back through the Gorouter. This includes the time that the request spends traversing the network to the app and back again to the Gorouter. It also includes the time the app spends forming a response.
Gorouter Time
is the total time it takes for the request to go through the Gorouter initially plus the time it takes for the response to travel back through the Gorouter. This does not include the time the request spends traversing the network to the app. This also does not include the time the app spends forming a response.
X-Cf-RouterError
is populated if the Gorouter encounters an error. The returned values can help distinguish whether a non-2xx response code is due to an error in the Gorouter or the back end. For more information on the possible errors, see the Diagnose App Errors section.
This section describes the basic structure of Gorouter logs and how to diagnose Gorouter errors.
If Gorouter cannot connect to the app container, you might see this error in the gorouter.log
:
[2018-07-05 17:59:10+0000] {"log_level":3,"timestamp":1530813550.92134,"message": "backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint": {"ApplicationId":"","Addr":"10.0.32.15:60099","Tags":null,"RouteServiceUrl":""}, "error":"dial tcp 10.0.32.15:60099: getsockopt: connection refused"}}
If TCP cannot make an initial connection to the backend, Gorouter retries TCP dial errors up to three times. If it still fails, Gorouter returns a 502 to the client and writes to the access.log
.
Any of the following can cause connection errors between Gorouter and the app container:
If Gorouter successfully dials the endpoint but an error occurs, you might see the following:
read: connection reset by peer
errors. When these errors occur, the Gorouter returns error to the client, marks backend ineligible, and does not retry another back end.
If a TCP connection can be established (with Envoy), but the app itself does not send a complete http response, Gorouter retries the request on different app instances if any exist.
A request is considered idempotent under the following circumstances (see also Gorouter source code):
Request body is empty AND (Request method is one of “GET”, “HEAD”, “OPTIONS”, “TRACE” OR Request Header
X-Idempotency-Key
orIdempotency-Key
is set)
The most likely scenario for this to occur is during creation of a new app instance, while the app is not yet completely started, but Envoy is already accepting connections.
TLS handshake errors. When these errors occur, the Gorouter retries up to three times. If it still fails, Gorouter can return a 502. These errors appear similar to the following in the gorouter.log
, and a 502 error is logged in the access.log
:
[2018-07-05 18:20:54+0000] {"log_level":3,"timestamp":1530814854.4359834,"message":" backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint": {"ApplicationId":"","Addr":"10.0.16.17:61002","Tags":null,"RouteServiceUrl":""}, "error":"x509:certificate is valid for 53079ca3-c4fe-4910-78b9-c1a6, not xxx"}}
A stale route occurs when Gorouter contains out-of-date route information for a backend app. In nearly all cases, stale routes are self-correcting.
When Gorouter detects that it is sending traffic to the wrong app, it prunes that backend app from its route table and terminates the connection.
When a route is unmapped or when an app container is deleted because the app is deleted or moved, a deregister message is sent to Gorouter. This message tells Gorouter to delete the route mapping to that container.
If Gorouter does not receive this deregister message, the route is now considered stale. Gorouter still attempts to send traffic to the app.
You are more likely to have stale routes when the following are true:
The following procedure helps you identify stale routes:
cf routes
for all spaces and ensure the route is only mapped to the intended apps. Sometimes there can be multiple routes using the same hostname and domain, but with different paths. If the domain is shared, check all orgs as well./var/vcap/jobs/gorouter/bin/retrieve-local-routes | jq .
address
and private_instance_id
.Cross-reference the Gorouter routes table entry with actual Long-Running Processes (LRPs):
cfdot actual-lrps | jq .'
Important The CAPI endpoint GET /v3/processes/:guid/stats
provides information about the host and ports the app is using, and is intended for use by app developers. This endpoint does not provide complete information for operators. Operators should use the cfdot
CLI on the Diego Cell to view the LRPs directly and all at once.
In most cases, you do not have to fix stale routes. Stale routes are normal in the routing table.
Gorouter fixes stale routes on its own when SSL verification is enabled. When Gorouter attempts to send traffic to a stale endpoint, it recognizes that the endpoint is stale. Gorouter prunes the stale route and then tries another endpoint. Stale routes are only a problem if an end user receives errors.
The following procedure helps you fix stale routes:
Important Using TLS to verify app identity depends on SSL verification. If you deactivate SSL verification, there is no way to avoid misrouting.
Use this table when you are debugging Gorouter errors. The table lists error types, status codes, and indicates if the Gorouter retries the errored request.
For each error, there is a backend-endpoint-failure
log entry in gorouter.log
and an error message in gorouter.err.log
. Additionally, the access.log
records the request status codes. For more information, see the Gorouter documentation on GitHub.
If the request is one that can be retried, Gorouter makes up to three attempts.
Error Type | Status Code | Can be re‑tried? | Source of Issue | Evidence |
---|---|---|---|---|
Dial | 502 | Yes | App or Platform | Logs with error dial tcp |
AttemptedTLSWith NonTLSBackend |
525 | Yes | Platform | Logs with error tls: first record does not look like a TLS handshake or backend_tls_handshake_failed metric increments |
HostnameMismatch | 503 | Yes | Platform | Logs with error x509: certificate is valid for < x >, not < y > or backend_invalid_id metric increments |
UntrustedCert | 526 | Yes | Platform | Logs with error prefix x509: certificate signed by unknown authority or backend_invalid_tls_cert metric increments |
RemoteFailedCertCheck | 496 | Yes | Platform | Logs with error remote error: tls: bad certificate |
ContextCancelled | 499 | No | Client/App | Logs with error context canceled This status code appears in logs only. It is never returned to clients as it occurs when the downstream client closes the connection before Gorouter responds. |
RemoteHandshakeFailure | 525 | Yes | Platform | Logs with error remote error: tls: handshake failure and backend_tls_handshake_failed metric increments |
ExpiredOrNotYetValidCertFailure | 502 | Yes | Platform | Logs with error x509: certificate has expired or is not yet valid . For example, this error can occur if the Diego Cell clock drifts. |
unknown | 502 | No | App or Platform | If the error is not one of the types above, then 502 is the default response. Gorouter tracks 502 errors with the gorouter.bad_gateways metric. For more information, see Router Error: 502 Bad Gateway. |
This section describes app-related 502 errors.
If 502 errors only occur in specific app instances and not all app instances on the platform, it is likely an app-related error. The app might be overloaded, unresponsive, or unable to connect to the database.
If all apps are experiencing 502 errors, then it could either be a platform issue, such as a misconfiguration, or an app issue, such as all apps being unable to connect to an upstream database.
If the Gorouter has back end keep-alive connections enabled, 502 errors can occur due to a race condition when the app instance keep-alive idle timeout is less than 90 seconds. For more information, see Gorouter Back End Keep-Alive Connections.
Note: Gorouter does not retry any error response returned by the app.
In the case that Gorouter encounters an error connecting to an application backend, the X-CF-RouterError
header is populated to help distinguish the origin of a non-2xx response code.
The value of the X-Cf-Routererror
header can be one of the following:
Value | Description |
---|---|
invalid_cf_app_instance_header | The provided value for the X-Cf-App-Instance header does not match the required format of APP_GUID:INSTANCE_ID . |
empty_host | The value for the Host header is empty, or the Host header is equivalent to the remote address. Some LB’s optimistically set the Host header value with their IP address when there is no value present. |
unknown_route | The desired route does not exist in the gorouter’s route table. |
no_endpoints | There is an entry in the route table for the desired route, but there are no healthy endpoints available. |
Connection Limit Reached | The backends associated with the route have reached their max number of connections. The max connection number is set via the spec property router.backends.max_conns . |
route_service_unsupported | Route services are not enabled or WebSockets requests are bound to route services. You can configure route services using the spec property router.route_services_secret . If the property is empty, route services are deactivated. |
endpoint_failure | The registered endpoint for the desired route failed to handle the request. |