[graveler] A single failed range and metarange operation can fail a commit or merge #6766

@arielshaqed

Description

[Observed by a customer]

What happened

A DNS error at a major cloud provider caused a large commit at a user site to fail. Had the failed operation been retried, it would have succeeded.

The user received the error below. A human can deduce that an unresponsive DNS server (UDP port 53) caused an operation to fail. A machine cannot, however, and has no idea what to do with an Internal Server Error (ISE).

[2023-10-99 99:99:99.999] {foo.py} ERROR - Internal Server Error: {"message":"commit: close writer ns=s3://AWS-BUCKET/ metarange id=fff555: failed closing metarange writer: sstable store (fff555): adapter put s3://AWS-BUCKET/ fff555: Put \"https://AWS-BUCKET.s3.amazonaws.com/_lakefs/fff555\": dial tcp: lookup AWS-BUCKET.s3.amazonaws.com on 172.31.99.99:53: read udp 10.0.99.99:48835-\u003e172.31.99.99:53: i/o timeout"}

What we should do

We should either:

  • Retry operations; or
  • Give sufficient information in the API response to allow the caller to retry the operation.

Preferences

I prefer adding information to the API response: an optional response field (JSON body or header) that user code can use to trigger a retry, as well as a readable message. As a server, lakeFS has no idea how hard it should try vs. how quickly it should fail. By passing the problem on to the user, we give them more information and let them decide. This could be handled automatically -- by wrapping expensive commit operations in a backoff -- or it could be handled manually -- by the user reading the message and deciding that it may be worth retrying the operation.
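To make the "automatic" option concrete, here is a minimal client-side sketch of wrapping a commit in exponential backoff. Everything in it is an assumption for illustration: the `retryable` field in the error body does not exist in the lakeFS API today, and `CommitError`, `is_retryable`, and `commit_with_backoff` are hypothetical names, not part of any lakeFS client.

```python
import random
import time


class CommitError(Exception):
    """Hypothetical client-side error carrying the parsed API error body."""

    def __init__(self, body):
        super().__init__(body.get("message", ""))
        self.body = body


def is_retryable(body):
    # "retryable" is an assumed field name; the real API would have to define it.
    return bool(body.get("retryable", False))


def commit_with_backoff(do_commit, max_attempts=5, base_delay=1.0,
                        max_delay=30.0, sleep=time.sleep):
    """Call do_commit(), retrying with exponential backoff and full jitter.

    Retries only when the error body marks the failure as retryable;
    any other error, or the final failed attempt, is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return do_commit()
        except CommitError as err:
            if not is_retryable(err.body) or attempt == max_attempts:
                raise
            # Upper bound doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The caller chooses `max_attempts` and the delays, which is the point of the proposal: the client knows how long it can afford to wait, while the server only reports whether a retry might help.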
