Proposal for running lakeFS on top of a key/value store #2466
Conversation
arielshaqed
left a comment
Neat!
| - PostgreSQL | ||
| - MySQL | ||
| - Embedded Pebble/RocksDB (great option for a "quickstart" environment?) | ||
| - MongoDB | ||
| - AWS DynamoDB | ||
| - FoundationDB | ||
| - Azure Cosmos | ||
| - Azure Blob Store | ||
| - Google BigTable | ||
| - Google Spanner | ||
| - Google Cloud Storage | ||
| - HBase | ||
| - Cassandra (or compatible systems such as ScyllaDB) | ||
| - Raft (embedded, or server implementations such as Consul/ETCd) | ||
| - Persistent Redis (Redis Enterprise, AWS MemoryDB) | ||
| - Simple in-memory tree (for simplifying and speeding up tests?) |
Can we include a list (or multiple lists) of required primitives for the KV store?
Added! thanks.
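For concreteness, here is a minimal Go sketch of the kind of primitive set under discussion. All names are illustrative assumptions, not the actual lakeFS interface; later sketches in this conversation reuse them.

```go
// Package kv sketches the primitives a pluggable KV adapter might
// need to expose. Hypothetical names, for illustration only.
package kv

import (
	"context"
	"errors"
)

// ErrPredicateFailed is returned by SetIf when the stored value does
// not match the predicate.
var ErrPredicateFailed = errors.New("predicate failed")

// Entry is a single key/value pair returned by a scan.
type Entry struct {
	Key   []byte
	Value []byte
}

// Iterator walks entries in ascending key order.
type Iterator interface {
	Next() bool
	Entry() *Entry
	Err() error
	Close()
}

// Store is a guess at the minimal primitive set: point reads and
// writes, a CAS-style conditional set, delete, and an ordered scan
// from a key prefix.
type Store interface {
	Get(ctx context.Context, key []byte) ([]byte, error)
	Set(ctx context.Context, key, value []byte) error
	// SetIf writes value only if the currently stored value equals
	// valuePredicate (compare-and-swap); otherwise it returns
	// ErrPredicateFailed.
	SetIf(ctx context.Context, key, value, valuePredicate []byte) error
	Delete(ctx context.Context, key []byte) error
	Scan(ctx context.Context, prefix []byte) (Iterator, error)
}
```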
design/metadata_kv/index.md
Outdated
| We add an additional field to each `Branch` object: In addition to the existing `staging_token`, we add an array of strings named `sealed_tokens`. | ||
|
|
||
| 1. get branch, find current `staging_token` | ||
| 1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`. |
IIUC this is SetIf on condition that the existing staging_token (the one being pushed onto sealed_tokens) is unchanged?
Made it clear that SetIf() is actually a CAS operation on the entire branch (staging_token, sealed_tokens and commit_id).
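To make the whole-branch CAS concrete, here is a minimal sketch of sealing the staging token, reusing the hypothetical `Store` above (the `Branch` JSON layout and the uuid dependency are assumptions):

```go
import (
	"context"
	"encoding/json"

	"github.com/google/uuid"
)

// Branch is the entire value stored under the branch key. Because
// SetIf compares the serialized bytes, the CAS covers commit_id,
// staging_token and sealed_tokens together.
type Branch struct {
	CommitID     string   `json:"commit_id"`
	StagingToken string   `json:"staging_token"`
	SealedTokens []string `json:"sealed_tokens"`
}

// startCommit seals the current staging token: push it onto
// sealed_tokens and install a fresh one, with SetIf acting as a CAS
// against the branch value read at the start.
func startCommit(ctx context.Context, store Store, key []byte) (*Branch, error) {
	prev, err := store.Get(ctx, key)
	if err != nil {
		return nil, err
	}
	var b Branch
	if err := json.Unmarshal(prev, &b); err != nil {
		return nil, err
	}
	b.SealedTokens = append(b.SealedTokens, b.StagingToken)
	b.StagingToken = uuid.NewString()
	next, err := json.Marshal(&b)
	if err != nil {
		return nil, err
	}
	// Fails with ErrPredicateFailed if *any* part of the branch
	// changed since the Get above.
	if err := store.SetIf(ctx, key, next, prev); err != nil {
		return nil, err // caller may retry or surface the conflict
	}
	return &b, nil
}
```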
design/metadata_kv/index.md
Outdated
| 1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`. | ||
| 1. take the list of sealed tokens, and using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit | ||
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. | ||
| 1. If `SetIf()` fails, this means another commit is happenning/has happened on the same branch. Can either retry or let the caller know. |
| 1. If `SetIf()` fails, this means another commit is happenning/has happened on the same branch. Can either retry or let the caller know. | |
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. |
Fixed, thanks!
design/metadata_kv/index.md
Outdated
| 1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`. | ||
| 1. take the list of sealed tokens, and using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit | ||
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. | ||
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. |
What is this SetIf on? AFAICT just staging_token or just commit_id are not enough to provide ordering in the face of concurrent commits. It might be enough if the entire value is unchanged, or possibly staging_token and sealed_tokens.
This was indeed unclear - made it explicit that SetIf() is actually a CAS operation on the entire branch (staging_token, sealed_tokens and commit_id).
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. | ||
|
|
||
| An important property here is delegating safety vs liveness to a single optimistic `SetIf()` operation: if a commit fails somewhere along the process, a subsequent commit would simply pick up where the failed one left off, | ||
| adding the current staging_token into the set of changes to apply. In an environment where there aren't many concurrent commits on the same branch, and where commits mostly succeed - the size of `sealed_tokens` would be relatively small. As an optimization, compaction strategies could be added to merge tokens together, but this *might not be necessary*, at least not for use cases we're familiar with. |
I am less optimistic (sorry) here. Suppose that we have an automated commit, once every X seconds (X ~ 1s). Typically everything works really well and we're happy. I would claim that this is a good way to implement streaming input: write to branch input, every X seconds commit and on success merge to main. (It is actually hard to do something else! E.g. writing to branch input-NNNNNN with a running commit counter NNNNNN pushes this huge problem onto the writers, who now need essentially to resolve the same issue that we are resolving here.)
Now suppose that our database (or maybe that part of our object store) has a hiccup and a commit takes >X seconds. Then sealed_tokens starts growing. AFAIU this requires every concurrent commit to fail (there is no point in retrying if sealed_tokens has grown since the last call; it will never go back to what we expect), so typically only the last concurrent commit will succeed. So until some time after recovery, not only do we have no commits, but also reads get progressively slower.
It may be OK to leave this open in the first version(s), but I do want to document this limitation while we have it.
Added a suggestion on how to address it. If you have other strategies that could be used here, let's add them!
Thing is, we run into trouble even if we never retry commits (so exponential backoff doesn't solve the issue)! Every commit tries and fails, growing sealed_tokens. As long as there's a nice fugue (commit starts by growing sealed_tokens, then runs until after the next commit starts) then no commit succeeds, and sealed_tokens grows, and reads take longer and longer.
I believe the only way around this will be to trim prefixes of sealed_tokens. E.g. accept a commit if the end of its sealed_tokens overlaps the start of the current sealed_tokens, and trim the shared bit.
Example
sealed_tokens: 1,2,3; staging_token: 4
Process A starts to commit tokens 1,2,3,4.
sealed_tokens: 1,2,3,4; staging_token: 5
Process B starts to commit tokens 1,2,3,4,5.
sealed_tokens: 1,2,3,4,5; staging_token: 6
Process C starts to commit tokens 1,2,3,4,5,6.
sealed_tokens: 1,2,3,4,5,6; staging_token: 7
Process A ends committing: it tries for a "fast-path commit", but since staging_token is no longer 5 that fails. Instead it reads the branch (or uses the value returned by Set-If, if we want to require such an interface). Now 1,2,3,4 is a prefix of 1,2,3,4,5,6, so A tries again to commit the remaining sealed_tokens 5,6 and a staging_token of 7. That works, so A succeeded!
sealed_tokens: 5,6; staging_token: 7
Process C ends committing: its fast-path commit works (staging_token is 7), so it drops all sealed_tokens.
sealed_tokens: (); staging_token: 7
Process B ends committing: its fast-path commit doesn't work, and it has an empty prefix, so it knows it has already succeeded.
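The trimming rule in this example fits in a small pure function. A sketch (hypothetical helper; assumes Go 1.21 for `slices.Equal` and the `min` builtin):

```go
import "slices"

// trimCommitted applies the overlap rule from the example: if some
// prefix of the branch's current sealed_tokens is a suffix of the
// tokens this commit just materialized, that prefix is already
// committed and can be dropped. ok=false means no nonempty overlap,
// i.e. everything this commit covered was already committed.
func trimCommitted(current, committed []string) (trimmed []string, ok bool) {
	for n := min(len(current), len(committed)); n > 0; n-- {
		if slices.Equal(current[:n], committed[len(committed)-n:]) {
			return current[n:], true
		}
	}
	return current, false
}
```

In the example, process A would call `trimCommitted([1,2,3,4,5,6], [1,2,3,4])` and get `([5,6], true)`; process B ends with no nonempty overlap and knows it has already succeeded.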
🔓 Lock freedom 🔓
Let's talk about how to make this lock-free (in the sense of non-blocking algorithms).
Why is this important? Consider a loaded system where commits are not even lock-free (i.e., it is possible to make no progress for an arbitrarily long period of time). For instance, let's say we have a commit every second. For some reason we have a delay (maybe the KV store went down for a while) and now we have 60 commits queued. They all race to finish the commit (by writing the old staging token with a shorter new sealed tokens queue) with incoming new commits (which try to write a new staging token and push a longer sealed tokens queue). Now there is no progress guarantee: we can have more and more new commits admitted. Request timeouts don't really help here, because new commit requests arrive at the same rate at which timeouts occur. So this is at best a steady state -- and commits finish by a random walk, i.e. we recover in time quadratic in the number of pending commits.
We can only talk about lock-free (and wait-free) operations for constant-time operations, so let's separate the "write metarange" operation (which is trivially concurrent but takes time linear in the number of changes) from the "update commit" operation (which can run in constant time). Also we cannot assume that the KV store is lock-free, so for our purposes we assume either that it is, or we count time in number of operations on the KV. (This is the same as any discussion of nonblocking operations, which invariably assume the memory is lock-free even though it probably is not...)
Becoming lock-free
With the process above, commits are still not lock-free. But we can make them so! The key is to have a "bag" of pending commits -- some data structure with Put, UpdateIfUnchanged, and GetAny operations. For instance, PUTs just write a value to key zzz_<bag-id>_<random>, and GETANYs just read the first key with prefix zzz_<bag-id>_. (A roughly-ordered queue is an even better choice, and we can do that too... but that is not necessary for correctness here, so this comment is too narrow for heuristics to put in such a queue...).
Now when trying (or possibly retrying) to commit, just add the desired commit record to the bag for that branch, and periodically probe to see if the commit succeeded. Meanwhile, every start commit and every finish-commit probe first repeatedly runs GetAny and tries to commit, until the bag is empty.
Now, if there is a pool of pending commits, then after all start-commits finish every committer (including new committers) is trying to end a commit! The only way a committer has to retry is that another committer succeeded, so we continue to make progress until the bag is (momentarily) empty.
Becoming wait-free
Wait-freedom is (much) harder. I think we could "just" use a queue as a bag above... but that is quite a powerful operation to expect our KV store to have, so it involves either implementing a lock-free queue (algorithms exist) or using a KV that has one (Redis).
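A sketch of the "bag" over the same hypothetical `Store` (key layout invented for illustration; GetAny is just "first key under the prefix"):

```go
// pendingBag is a best-effort bag of pending commit records layered
// on plain KV operations. Ordering is neither guaranteed nor needed
// for correctness.
type pendingBag struct {
	store  Store
	prefix string // e.g. "zzz_<bag-id>_" per the comment above
}

// Put writes the record under a random key and returns the key so
// the committer can poll whether its commit landed.
func (b *pendingBag) Put(ctx context.Context, record []byte) (string, error) {
	key := b.prefix + uuid.NewString()
	if err := b.store.Set(ctx, []byte(key), record); err != nil {
		return "", err
	}
	return key, nil
}

// GetAny returns some pending record, or ok=false when the bag is
// (momentarily) empty.
func (b *pendingBag) GetAny(ctx context.Context) (key string, record []byte, ok bool, err error) {
	it, err := b.store.Scan(ctx, []byte(b.prefix))
	if err != nil {
		return "", nil, false, err
	}
	defer it.Close()
	if !it.Next() {
		return "", nil, false, it.Err()
	}
	e := it.Entry()
	return string(e.Key), e.Value, true, nil
}
```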
design/metadata_kv/index.md
Outdated
| #### Writer flow | ||
|
|
||
| 1. Read the branch's existing staging token: currently, we batch requests for a short duration and amortize the cost across requests. | ||
| 1. Write to the staging token received |
Talking F2F I now understand that we store the path not so much in the object store as in the KV store. Can we specify this here, too?
Indeed - specified.
design/metadata_kv/index.md
Outdated
| It is important to understand that the current pessimistic approach locks the branch for the entire duration of the commit. | ||
| This takes time proportional to the amount of changes to be committed and is unbounded. All writes to a branch are blocked for that period of time. | ||
|
|
||
| With the optimistic version, writes are retried for a bounded period of time: the time between committer's initial branch read, and the first SetIf() that replaces the `staging_token`. |
| With the optimistic version, writes are retried for a bounded period of time: the time between committer's initial branch read, and the first SetIf() that replaces the `staging_token`. | |
| With the optimistic version, writes are retried for a bounded period of time: the time between committer's initial branch read, and the first `SetIf()` that replaces the `staging_token`. |
Thanks, Fixed.
arielshaqed
left a comment
Asking for more details about SetIf.
| // SetIf returns an ErrPredicateFailed error if the valuePredicate passed | ||
| // doesn't match the currently stored value. | ||
| SetIf(key, value, valuePredicate []byte) error |
This might allow a very powerful but complex predicate -- not clear because we don't document it anywhere. Or is this a CAS? Because CAS is weak, but hard to use efficiently.
Ideally we would break the predicate down into a much weaker language, allowing more efficient implementation on many KVs. Or support CAS and document it.
This is a simple CAS. Documented.
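Since SetIf is plain compare-and-swap, the usual caller pattern is a read-modify-CAS loop. A sketch, using the same hypothetical `Store` and `ErrPredicateFailed` as above:

```go
import "errors"

// casUpdate applies update to the value at key, retrying while other
// writers win the race: the CAS predicate is simply the value read.
func casUpdate(ctx context.Context, store Store, key []byte, update func(prev []byte) ([]byte, error)) error {
	for {
		prev, err := store.Get(ctx, key)
		if err != nil {
			return err
		}
		next, err := update(prev)
		if err != nil {
			return err
		}
		switch err := store.SetIf(ctx, key, next, prev); {
		case err == nil:
			return nil
		case !errors.Is(err, ErrPredicateFailed):
			return err
		}
		// Lost the race: re-read and try again.
	}
}
```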
design/metadata_kv/index.md
Outdated
| 1. get branch, find current `staging_token` | ||
| 1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`. | ||
| 1. take the list of sealed tokens, and using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit | ||
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. |
When is it safe to clear the staged entries of a particular staging token? After a commit has ended, there could be ongoing reads still reading entries from the staging token, so we can't clear them right away.
That's a very good question - the answer to "when is it safe" would be "never" :)
I revised the reader flow to address this, also using optimistic concurrency control. Interested in hearing your opinion on the suggested solution (also, I'm not sure I did a good job explaining it).
design/metadata_kv/index.md
Outdated
|
|
||
| #### Writer flow | ||
|
|
||
| 1. Read the branch's existing staging token: currently, we batch requests for a short duration and amortize the cost across requests. |
As far as I understand, this flow can cause a single write to appear in more than one commit. In particular, it can't be described as an idempotent operation in (3) below. Not saying this is necessarily bad but I'm finding it hard to describe it in terms of a consistency model.
In the following example, write 001 appears on commit a, is reverted by commit b, and then appears again as a new write on commit c.
Write 001 start – to staging token A
Commit a for token A
Write 002 start – to staging token B
Commit b for token B
Write 002 finalize: staging token was changed – write again to token C
Write 001 finalize: staging token was changed – write again to token C
Commit c for token C
Update: talked f2f with @arielshaqed, who explained that this "limitation" was discussed by the team. @ozkatz can you please add an explanation about it?
Great minds do think alike I guess. Thanks @arielshaqed for the description (should we maybe add it to this PR?)!
…from stale staging_token(s)
arielshaqed
left a comment
Neat! I think you addressed a different issue from the one that I had with periodic commits, so I blathered on a bit about the difference and how we could solve that one.
I do NOT think that this blocks an initial revision. (However, having written it down, I now think that it is simple enough to do immediately.)
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. | ||
|
|
||
| An important property here is delegating safety vs liveness to a single optimistic `SetIf()` operation: if a commit fails somewhere along the process, a subsequent commit would simply pick up where the failed one left off, | ||
| adding the current staging_token into the set of changes to apply. In an environment where there aren't many concurrent commits on the same branch, and where commits mostly succeed - the size of `sealed_tokens` would be relatively small. As an optimization, compaction strategies could be added to merge tokens together, but this *might not be necessary*, at least not for use cases we're familiar with. |
@arielshaqed I like this idea a lot - and it is simple.
|
|
||
| * A successful commit holds the contents of some listing of the entire keyspace. | ||
|
|
||
| * Mutating operations (commits and writes) may succeed or fail. When they fail, their contents |
This is sort of a non-guarantee: failing mutating operations are not guaranteed to be invisible.
I feel this item is mostly redundant since it's true for pretty much all object stores and databases that I can think of. perhaps @arielshaqed can clarify?
Everybody is right! If you know the state after a mutation then you already have consistency -- that's the C. If the mutation failed then you've lost either availability (A) or partition-freedom (P). Now if we drop A then the client might never return after a mutation failed, and if you drop P then the client is no longer connected. In both cases that means the client cannot know what happened after a failed mutation unless it retries forever on failure (and potentially never succeeds).
That's why it's true of all stores.
I would prefer to keep this explicit non-guarantee. It reminds readers that we are not requiring an ACID DB such as Postgres.
|
|
||
| ### Graveler Metadata - Commits, Tags, Repositories | ||
|
|
||
| These are simpler entities - commits and tags are immutable and could potentially be stored on the object store and not in the key/value store (they are cheap either way, so it doesn't make much of a difference). |
Tags are mutable, they can be deleted and recreated.
Fair enough. I do think they could be treated as immutable when discussing concurrency (i.e. it's ok to assume low or no contention when writing a single tag entry) - so less prone to read-modify-write issues.
Can we spell this out? E.g. something like
When tags are mutated, *no* ordering guarantees apply.
| 1. Read the branch's existing staging token: if branch exists in the cache, use it! Otherwise, do an amortized read (see [above](#caching-branch-pointers-and-amortized-reads)) and cache the result for a very short duration. | ||
| 1. Write to the staging token received - this is another key/value record (e.g. `"graveler/staging/${repoId}/${stagingToken}/${path}"`) | ||
| 1. Read the branch's existing staging token **again**. This is always an amortized read, not a cache read. If we get the same `staging_token` - great, no commit has *started while writing* the record, return success to the user. For a system with low contention between writes and commits, this will be the usual case. | ||
| 1. If the `staging_token` *has* changed - **retry the operation**. If the previous write made it in time to be included in the commit, we'll end up writing a record with the same identity - an idempotent operation. |
Suggestion: instead of retrying the operation - fail the operation. The user can then retry.
Pros:
- Avoid strange consistency issues like the ones mentioned above. Changes appear only once in the history, unless the user explicitly retries.
- Simpler to understand and implement
Con:
- Increased chance of a failed operation being persisted.
Also, in any case, this operation cannot be described as "idempotent".
Changes appear only once in the history anyway - what are the strange consistency issues?
Writing the exact same entry again is idempotent: the identity field written would be exactly the same, meaning it won't show as a diff between versions. Just like in Git: if you overwrite a local file with the exact same content, it's a no-op; you won't see it as a diff even though you modified the file.
I was referring to the banana problem - the retry will not be writing the same identity: writing again will cause the identity to change, so it will not be idempotent.
And, the same write will appear twice in the history.
OK, I understand I need to better phrase where this retry happens:
this is a top-level graveler operation. It doesn't delegate the retry back to the caller (catalog), so it is done using the same value that was given to it (i.e. the same serialized entry and the same calculated identity).
I do think that we have a B-A-N-A problem here. If I try to write A & N to the same path and try committing twice while doing it, the value could very well be:
B - before anything was done.
A - in the first commit.
N - in the second commit.
A - staging area after the second commit.
So the A was "written twice" for a single write operation.
I think that the design assumes that no other value is being written to the same path concurrently, hence the mistake.
I shall add the sealed_tokens prefix removal suggestion separately.
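For reference, the writer flow quoted above as a Go sketch, continuing the same hypothetical names (`readStagingToken` stands in for the amortized/batched branch read; imports from the earlier sketches assumed):

```go
import "fmt"

// readStagingToken reads the branch and extracts staging_token. In
// practice this read is batched/amortized as described above.
func readStagingToken(ctx context.Context, store Store, branchKey []byte) (string, error) {
	raw, err := store.Get(ctx, branchKey)
	if err != nil {
		return "", err
	}
	var b Branch
	if err := json.Unmarshal(raw, &b); err != nil {
		return "", err
	}
	return b.StagingToken, nil
}

// writeEntry writes under the current staging token, then re-reads
// the token and retries if a commit sealed it mid-write. The retry
// reuses the same serialized entry (same identity), which is what
// makes the repeated write a no-op in terms of diffs.
func writeEntry(ctx context.Context, store Store, repoID string, branchKey []byte, path string, entry []byte) error {
	for {
		token, err := readStagingToken(ctx, store, branchKey)
		if err != nil {
			return err
		}
		key := fmt.Sprintf("graveler/staging/%s/%s/%s", repoID, token, path)
		if err := store.Set(ctx, []byte(key), entry); err != nil {
			return err
		}
		again, err := readStagingToken(ctx, store, branchKey) // never from cache
		if err != nil {
			return err
		}
		if again == token {
			return nil // common case under low write/commit contention
		}
		// A commit started while we wrote; write again to the new token.
	}
}
```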
arielshaqed
left a comment
I suggest we add (this) explicit suggestion for "prefix match" with an explanation of why it ensures progress.
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch key/value pair again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. | ||
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. |
The following ensures liveness for the concurrent progressive commits scenario in this comment. I think we must have it to ensure progress. (To be clear: the current design without it is perfectly correct - no mistakes will be made in this scenario - but it cannot guarantee progress: time to completion of a single commit is unbounded, and the commit may never happen.)
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch key/value pair again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. | |
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. | |
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform a `SetIfPrefixMatch()` operation (below) to update the branch key/value pair again: replace its commit ID with the new value, and remove the prefix of the branch's `sealed_tokens` that matches the _suffix_ of this commit's `sealed_tokens`. These are precisely the tokens that have materialized into the new commit. | |
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. | |
| The `SetIfPrefixMatch()` operation is defined at the metadata level (_not_ the KV store), and can use the existing `SetIf` operation of the KV store. It performs the following atomically / optimistically / optimistically with retries: Given a `committed_sealed_tokens`: | |
| 1. If there is no nonempty prefix of the current branch `sealed_tokens` that is a suffix of `committed_sealed_tokens`: return as "no work to be done". | |
| 1. Otherwise, set `new_sealed_tokens` to be the current `sealed_tokens` with this matching prefix removed. | |
| 1. Update the branch value to replace its commit ID with the new commit ID, and its `sealed_tokens` with `new_sealed_tokens`. | |
| We can implement `SetIfPrefixMatch` by being optimistic about the current branch `sealed_tokens`. It might not always work, but for any concurrent executions _at least one will terminate_, which ensures global progress. Now for any execution, if we need to retry then the length of the prefix must be shorter, so any execution also progresses. |
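A sketch of how the suggested `SetIfPrefixMatch()` could sit on top of the plain CAS, reusing `trimCommitted` from the example above (hypothetical names throughout, as before):

```go
// ErrNoWork reports that no nonempty prefix matched: the tokens this
// commit covered were already materialized by another commit.
var ErrNoWork = errors.New("SetIfPrefixMatch: no work to be done")

// setIfPrefixMatch retries optimistically: read the branch, drop the
// already-committed overlap from sealed_tokens, and CAS the result
// in. If the CAS loses, the remaining overlap can only be shorter on
// the next round, and among concurrent callers at least one CAS
// always lands - which is the progress argument above.
func setIfPrefixMatch(ctx context.Context, store Store, key []byte, newCommitID string, committed []string) error {
	for {
		prev, err := store.Get(ctx, key)
		if err != nil {
			return err
		}
		var b Branch
		if err := json.Unmarshal(prev, &b); err != nil {
			return err
		}
		trimmed, ok := trimCommitted(b.SealedTokens, committed)
		if !ok {
			return ErrNoWork
		}
		b.CommitID = newCommitID
		b.SealedTokens = trimmed
		next, err := json.Marshal(&b)
		if err != nil {
			return err
		}
		err = store.SetIf(ctx, key, next, prev)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrPredicateFailed) {
			return err
		}
		// Branch changed under us; re-read and retry.
	}
}
```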
| 1. Write to the staging token received - this is another key/value record (e.g. `"graveler/staging/${repoId}/${stagingToken}/${path}"`) | ||
| 1. Read the branch's existing staging token **again**. This is always an amortized read, not a cache read. If we get the same `staging_token` - great, no commit has *started while writing* the record, return success to the user. For a system with low contention between writes and commits, this will be the usual case. | ||
| 1. If the `staging_token` *has* changed - **retry the operation**. If the previous write made it in time to be included in the commit, we'll end up writing a record with the same identity - an idempotent operation. | ||
|
|
| ##### Optimization: Detect successful commits | |
| We can start without this optimization and add it later. However it can greatly help avoid meltdowns when many writes compete with commits. That's important because writes -- unlike commits with the staging queue overlap above -- are not wait-free or even lock-free. (In the presence of repeated concurrent commits they can suffer from livelock.) It is **not** sufficient to guarantee wait-freedom; it just makes things a bit better when things go bad. | |
| When a staging token changes during a write, instead of immediately retrying, _look up the physical address in the commit for the original staging token or for some successive token_. If it is found there, then it *was* committed concurrently and no rewrite is necessary. |
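The optimization might look like this sketch, where `CommitLookup` is an invented stand-in for looking the path up in the commit's metarange:

```go
import "bytes"

// CommitLookup finds an entry's identity by path within a commit;
// a stand-in for lakeFS's committed-metadata reader.
type CommitLookup func(ctx context.Context, commitID, path string) (identity []byte, found bool, err error)

// skipRewrite reports whether a write whose staging token was sealed
// mid-flight already made it into the given commit, in which case
// the writer does not need to retry.
func skipRewrite(ctx context.Context, lookup CommitLookup, commitID, path string, identity []byte) (bool, error) {
	got, found, err := lookup(ctx, commitID, path)
	if err != nil || !found {
		return false, err
	}
	return bytes.Equal(got, identity), nil
}
```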