Set keep-alive timeout higher than ALB idle timeout #448
Overview
The default keep-alive timeout for the Node http.Server is 5 seconds. The default idle timeout for the ALB is 60 seconds.
The ALB keeps connections open for reuse. The Node http.Server closes an idle connection after its 5-second keep-alive timeout, and the ALB then tries to send a request over the closed connection, which triggers a 502.
We can see more evidence of this by looking at the ALB access logs with Athena:
Details
This PR sets the keep-alive timeout to be higher than the ALB timeout so that the ALB is responsible for prompting the closure of individual TCP connections to the server.
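As a rough illustration of the change described above, here is a minimal sketch. It is not the actual diff from this PR: the `createServer` helper, the `ALB_IDLE_TIMEOUT_MS` constant, and the 5-second margin are hypothetical names and values chosen for the example, and it assumes the ALB idle timeout is left at its 60-second default. Setting `headersTimeout` slightly above `keepAliveTimeout` is an extra precaution that is often suggested for some Node versions, not something this PR states.

```javascript
const http = require('http');

// Assumption: the ALB idle timeout is left at its default of 60 seconds.
const ALB_IDLE_TIMEOUT_MS = 60 * 1000;

// Hypothetical helper: builds an http.Server whose keep-alive timeout
// outlasts the ALB idle timeout, so the ALB closes idle connections first.
function createServer(handler) {
  const server = http.createServer(handler);

  // Node's default is 5 seconds; raise it above the ALB idle timeout.
  // The 5-second margin is an illustrative choice.
  server.keepAliveTimeout = ALB_IDLE_TIMEOUT_MS + 5 * 1000;

  // Assumption: keep headersTimeout above keepAliveTimeout as a precaution.
  server.headersTimeout = server.keepAliveTimeout + 1000;

  return server;
}
```

With these values the server holds idle connections for 65 seconds, so a connection the ALB still considers reusable should not have been closed by Node in the meantime.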
See:
Resolves #428
Checklist
CHANGELOG.md and grouped with similar changes, if possible
Testing Instructions
I'm not sure how to replicate the failures we saw earlier. I've tried exercising the staging application and wasn't able to generate any 502s today, before or after applying the fix (see the Jenkins deploy). Additionally, the prd_alb_logs table in Athena does not contain any 502s.
I think the evidence makes it clear we should apply this either way, and we can be on the lookout for more 502s as we gather more data.