Description
[Observed by a customer]
What happened
A DNS error on a major cloud provider caused a large commit at a user site to fail. Had the operation been retried, it would have succeeded.
The user received the error below. A human can deduce that an unresponsive DNS server (UDP port 53) caused the operation to fail. A machine cannot, and has no idea what to do with an Internal Server Error (ISE).
[2023-10-99 99:99:99.999] {foo.py} ERROR - Internal Server Error: {"message":"commit: close writer ns=s3://AWS-BUCKET/ metarange id=fff555: failed closing metarange writer: sstable store (fff555): adapter put s3://AWS-BUCKET/ fff555: Put \"https://AWS-BUCKET.s3.amazonaws.com/_lakefs/fff555\": dial tcp: lookup AWS-BUCKET.s3.amazonaws.com on 172.31.99.99:53: read udp 10.0.99.99:48835-\u003e172.31.99.99:53: i/o timeout"}
What we should do
We should either:
- Retry operations; or
- Give sufficient information in the API response to allow the caller to retry the operation.
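
As a rough illustration of the second option, here is a minimal Python sketch of an error body that carries a retry hint. The `retryable` field, the `error_body` helper, and the choice of exception types treated as transient are all assumptions made for illustration. None of them are part of the current lakeFS API, and the server itself is not written in Python.

```python
import json
import socket

# Exception types treated as transient in this sketch.
# This list is an assumption, not lakeFS policy.
TRANSIENT_ERRORS = (socket.gaierror, TimeoutError, ConnectionError)


def error_body(exc: Exception) -> str:
    """Build a JSON error body with a hypothetical `retryable` hint."""
    return json.dumps({
        "message": str(exc),  # human-readable message, as today
        "retryable": isinstance(exc, TRANSIENT_ERRORS),  # hypothetical field
    })
```

With something like this, a DNS lookup timeout would be reported with `"retryable": true`, while a validation error would not, so client code would not need to parse the message text.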
Preferences
I prefer adding information to the API response: an optional response field (JSON body or header), as well as a readable message, that user code can use to trigger a retry. As a server, lakeFS has no idea how hard it should try vs. how quickly it should fail. By passing the decision on to the user along with more information, we let them choose. This could be handled automatically -- by wrapping expensive commit operations in a backoff -- or it could be handled manually -- by the user reading the message and deciding whether it is worth retrying the operation.
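
The automatic handling could look roughly like the Python sketch below. It assumes the client surfaces the server's hint as a `retryable` attribute on a hypothetical `ApiError` exception. Neither `ApiError` nor `with_backoff` exists in the lakeFS client today, and the attempt count and delays are arbitrary.

```python
import random
import time


class ApiError(Exception):
    """Hypothetical client-side error carrying the server's retry hint."""

    def __init__(self, message, retryable=False):
        super().__init__(message)
        self.retryable = retryable


def with_backoff(op, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call op(), retrying with exponential backoff while errors are marked retryable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ApiError as err:
            # Give up on non-retryable errors or once attempts are exhausted.
            if not err.retryable or attempt == max_attempts:
                raise
            # Exponential delay, capped at max_delay, with jitter to avoid retry storms.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

A caller would wrap the expensive operation, for example `with_backoff(lambda: do_commit())`, where `do_commit` stands in for whatever client call performs the commit. The caller, not the server, picks `max_attempts` and the delays, which matches the point that only the user knows how hard to try.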