@rbreslow rbreslow commented Sep 24, 2020

Overview

Keep-alive, when enabled, enables the load balancer to reuse back-end connections until the keep-alive timeout expires. To ensure that the load balancer is responsible for closing the connections to your instance, make sure that the value you set for the HTTP keep-alive time is greater than the idle timeout setting configured for your load balancer.

The default keep-alive timeout for the Node http.Server is 5 seconds. The default idle timeout for the ALB is 60 seconds.

The ALB opens connections to the target and holds them for reuse, the Node http.Server closes an idle connection after its 5-second keep-alive timeout, and the ALB then tries to send a request over the closed connection, which triggers a 502.
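To make the mismatch concrete, here is a minimal sketch that reads the keep-alive timeout off an unstarted Node server and compares it with the ALB idle timeout. The helper name `targetOutlivesLoadBalancer` and the constant are illustrative, not part of this PR, and the 60-second value assumes the ALB default has not been changed.

```javascript
const http = require("http");

// Default ALB idle timeout, as stated above (60 seconds). Assumed unchanged.
const ALB_IDLE_TIMEOUT_MS = 60 * 1000;

// Hypothetical helper: true when the target holds idle connections open
// longer than the load balancer does, so the load balancer is always the
// side that closes them.
function targetOutlivesLoadBalancer(keepAliveTimeoutMs, lbIdleTimeoutMs) {
  return keepAliveTimeoutMs > lbIdleTimeoutMs;
}

// Create a server without starting it, only to read its default settings.
const server = http.createServer();

console.log("keepAliveTimeout (ms):", server.keepAliveTimeout);
console.log(
  "safe behind ALB:",
  targetOutlivesLoadBalancer(server.keepAliveTimeout, ALB_IDLE_TIMEOUT_MS)
);
```

With the defaults described above, the second line should report that the server is not safe behind the ALB.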

HTTP 502: Bad gateway

Possible causes:
. . .
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer.

We can see more evidence of this by looking at the ALB access logs with Athena:

SELECT elb_status_code,
       request_processing_time,
       target_processing_time,
       response_processing_time,
       request_url
FROM stg_alb_logs
WHERE elb_status_code LIKE '5%';
[Screenshot: Athena 2020-09-23 14-35-24]

response_processing_time is set to -1 if the load balancer can't send the request to a target. This can happen if the target closes the connection before the idle timeout or if the client sends a malformed request.
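As a rough illustration of how that signal could be applied to exported query results, the sketch below filters rows down to 502s where response_processing_time is -1. The row shape, the helper name `looksLikeTargetClosedConnection`, and the sample rows are all made up for this example; they are not real log data. Since -1 can also indicate a malformed client request, a match is a hint rather than proof.

```javascript
// Hypothetical row shape: one object per ALB access log entry, with fields
// named after the columns selected in the Athena query.
function looksLikeTargetClosedConnection(row) {
  return (
    String(row.elb_status_code) === "502" &&
    Number(row.response_processing_time) === -1
  );
}

// Made-up sample rows for illustration only; not taken from any real log.
const sampleRows = [
  { elb_status_code: "502", response_processing_time: -1, request_url: "https://example.test/a" },
  { elb_status_code: "502", response_processing_time: 0.001, request_url: "https://example.test/b" },
  { elb_status_code: "504", response_processing_time: -1, request_url: "https://example.test/c" },
];

// Keep only the rows that match the "target closed the connection" pattern.
const suspects = sampleRows.filter(looksLikeTargetClosedConnection);
console.log(suspects.map((row) => row.request_url));
```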

This PR sets the keep-alive timeout to be higher than the ALB timeout so that the ALB is responsible for prompting the closure of individual TCP connections to the server.
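For readers who want the shape of the change without opening the diff, here is a minimal sketch under stated assumptions. The 66000 ms headersTimeout matches the value in the diff; the 65000 ms keepAliveTimeout is an assumed value chosen to sit between the 60-second ALB idle timeout and the headers timeout, and the `applyTimeouts` helper is hypothetical.

```javascript
const http = require("http");

// Default ALB idle timeout (60 s). Assumed unchanged for this sketch.
const ALB_IDLE_TIMEOUT_MS = 60000;
// Assumed value: anything above the ALB idle timeout would satisfy the rule.
const KEEP_ALIVE_TIMEOUT_MS = 65000;
// Kept higher than the keep-alive timeout because of
// https://github.com/nodejs/node/issues/27363
const HEADERS_TIMEOUT_MS = 66000;

// Hypothetical helper that applies both timeouts to an http.Server.
function applyTimeouts(server) {
  server.keepAliveTimeout = KEEP_ALIVE_TIMEOUT_MS;
  server.headersTimeout = HEADERS_TIMEOUT_MS;
  return server;
}

// The server is created but not started; call server.listen(port) to use it.
const server = applyTimeouts(
  http.createServer((req, res) => {
    res.end("ok");
  })
);

console.log(server.keepAliveTimeout, server.headersTimeout);
```

The ordering is the point here: ALB idle timeout < keepAliveTimeout < headersTimeout, so the ALB is always the side that closes an idle connection.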

See:

Resolves #428

Checklist

  • Description of PR is in an appropriate section of CHANGELOG.md and grouped with similar changes, if possible

Testing Instructions

I'm not sure how to replicate the failures we saw earlier. I've tried exercising the staging application and wasn't able to generate any 502s today before or after applying the fix (see the Jenkins deploy). Additionally, the prd_alb_logs table in Athena does not contain any 502s.

I think the evidence makes it clear we should apply this either way, and we can be on the lookout for more 502s as we gather more data.

@rbreslow rbreslow self-assigned this Sep 24, 2020
Comment on lines 28 to 30
// Ensure the headersTimeout is set higher than the keepAliveTimeout due to
// this nodejs regression bug: https://github.com/nodejs/node/issues/27363
server.headersTimeout = 66000;
Contributor Author


This was resolved and backported to Node v12 in nodejs/node#34131. However, it doesn't look like the fix will be accessible until Node v12.18.5 drops (compare https://github.com/nodejs/node/commits/v12.x with https://github.com/nodejs/node/commits/v12.x-staging).

@rbreslow rbreslow marked this pull request as ready for review September 24, 2020 18:22

@hectcastro hectcastro left a comment


The application server timeout changes appear to be working as an effective 502 deterrent. I added a simple k6 load test in 4827e9a and used it against an instance of the application with and without the changes.

When the timeout changes were not applied, 502s were present in the Athena ALB request logs. When the timeout changes were applied, 502s were not present in the logs. 👍

rbreslow and others added 2 commits September 28, 2020 09:53
Sets the keep-alive timeout to be higher than the ALB timeout so that
the ALB is responsible for prompting the closure of individual TCP
connections to the server.

I manually filled out a 50-district PA project and exported it as a HAR
file. From there, I used the har-to-k6 converter to produce a k6 load test.
Lastly, I created a few parameters to make the test more reusable:

- JWT_AUTH_TOKEN: A DistrictBuilder JWT authentication token
- DB_PROJECT_ID: A DistrictBuilder project UUID (PA; 50 districts)
- DB_DOMAIN: The DistrictBuilder instance domain where the project above
  resides
@rbreslow rbreslow force-pushed the feature/jrb/adjust-timeouts branch from 4827e9a to eabff24 Compare September 28, 2020 13:54
@rbreslow rbreslow merged commit 1b8b1db into develop Sep 28, 2020
@rbreslow rbreslow deleted the feature/jrb/adjust-timeouts branch September 28, 2020 13:55
@coalkettler

@rbreslow Chiming in to say that this was really useful to read and that Adam Crowder article offers a great explanation. Good stuff! 🕵️

