Proposal for running lakeFS on top of a key/value store #2466
Conversation
arielshaqed
left a comment
Neat!
| - PostgreSQL | ||
| - MySQL | ||
| - Embedded Pebble/RocksDB (great option for a "quickstart" environment?) | ||
| - MongoDB | ||
| - AWS DynamoDB | ||
| - FoundationDB | ||
| - Azure Cosmos | ||
| - Azure Blob Store | ||
| - Google BigTable | ||
| - Google Spanner | ||
| - Google Cloud Storage | ||
| - HBase | ||
| - Cassandra (or compatible systems such as ScyllaDB) | ||
| - Raft (embedded, or server implementations such as Consul/ETCd) | ||
| - Persistent Redis (Redis Enterprise, AWS MemoryDB) | ||
| - Simple in-memory tree (for simplifying and speeding up tests?) |
Can we include a list (or multiple lists) of required primitives for the KV store?
Added! thanks.
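For concreteness, here is a minimal Go sketch of the kind of primitive set under discussion. All names are illustrative assumptions, not the actual lakeFS interface; later sketches in this conversation reuse them.

```go
// Package kv sketches the primitives a pluggable KV adapter might
// need to expose. Hypothetical names, for illustration only.
package kv

import (
	"context"
	"errors"
)

// ErrPredicateFailed is returned by SetIf when the stored value does
// not match the predicate.
var ErrPredicateFailed = errors.New("predicate failed")

// Entry is a single key/value pair returned by a scan.
type Entry struct {
	Key   []byte
	Value []byte
}

// Iterator walks entries in ascending key order.
type Iterator interface {
	Next() bool
	Entry() *Entry
	Err() error
	Close()
}

// Store is a guess at the minimal primitive set: point reads and
// writes, a CAS-style conditional set, delete, and an ordered scan
// from a key prefix.
type Store interface {
	Get(ctx context.Context, key []byte) ([]byte, error)
	Set(ctx context.Context, key, value []byte) error
	// SetIf writes value only if the currently stored value equals
	// valuePredicate (compare-and-swap); otherwise it returns
	// ErrPredicateFailed.
	SetIf(ctx context.Context, key, value, valuePredicate []byte) error
	Delete(ctx context.Context, key []byte) error
	Scan(ctx context.Context, prefix []byte) (Iterator, error)
}
```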
design/metadata_kv/index.md
Outdated
| We add an additional field to each `Branch` object: In addition to the existing `staging_token`, we add an array of strings named `sealed_tokens`. | ||
|
|
||
| 1. get branch, find current `staging_token` | ||
| 1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`. |
IIUC this is SetIf on condition that the existing staging_token (the one being pushed onto sealed_tokens) is unchanged?
Made it clear that SetIf() is actually a CAS operation on the entire branch (staging_token, sealed_tokens and commit_id).
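To make the whole-branch CAS concrete, here is a minimal sketch of sealing the staging token, reusing the hypothetical `Store` above (the `Branch` JSON layout and the uuid dependency are assumptions):

```go
import (
	"context"
	"encoding/json"

	"github.com/google/uuid"
)

// Branch is the entire value stored under the branch key. Because
// SetIf compares the serialized bytes, the CAS covers commit_id,
// staging_token and sealed_tokens together.
type Branch struct {
	CommitID     string   `json:"commit_id"`
	StagingToken string   `json:"staging_token"`
	SealedTokens []string `json:"sealed_tokens"`
}

// startCommit seals the current staging token: push it onto
// sealed_tokens and install a fresh one, with SetIf acting as a CAS
// against the branch value read at the start.
func startCommit(ctx context.Context, store Store, key []byte) (*Branch, error) {
	prev, err := store.Get(ctx, key)
	if err != nil {
		return nil, err
	}
	var b Branch
	if err := json.Unmarshal(prev, &b); err != nil {
		return nil, err
	}
	b.SealedTokens = append(b.SealedTokens, b.StagingToken)
	b.StagingToken = uuid.NewString()
	next, err := json.Marshal(&b)
	if err != nil {
		return nil, err
	}
	// Fails with ErrPredicateFailed if *any* part of the branch
	// changed since the Get above.
	if err := store.SetIf(ctx, key, next, prev); err != nil {
		return nil, err // caller may retry or surface the conflict
	}
	return &b, nil
}
```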
design/metadata_kv/index.md
Outdated
| 1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`. | ||
| 1. take the list of sealed tokens, and using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit | ||
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. | ||
| 1. If `SetIf()` fails, this means another commit is happenning/has happened on the same branch. Can either retry or let the caller know. |
| 1. If `SetIf()` fails, this means another commit is happenning/has happened on the same branch. Can either retry or let the caller know. | |
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. |
Fixed, thanks!
design/metadata_kv/index.md
Outdated
| 1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`. | ||
| 1. take the list of sealed tokens, and using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit | ||
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. | ||
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. |
What is this SetIf on? AFAICT just staging_token or just commit_id are not enough to provide ordering in the face of concurrent commits. It might be enough if the entire value is unchanged, or possibly staging_token and sealed_tokens.
This was indeed unclear - made it explicit that SetIf() is actually a CAS operation on the entire branch (staging_token, sealed_tokens and commit_id).
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. | ||
|
|
||
| An important property here is delegating safety vs liveness to a single optimistic `SetIf()` operation: if a commit fails somewhere along the process, a subsequent commit would simply pick up where the failed one left off, | ||
| adding the current staging_token into the set of changes to apply. In an environment where there aren't many concurrent commits on the same branch, and where commits mostly succeed - the size of `sealed_tokens` would be relatively small. As an optimization, compaction strategies could be added to merge tokens together, but this *might not be necessary*, at least not for use cases we're familiar with. |
I am less optimistic (sorry) here. Suppose that we have an automated commit, once every X seconds (X ~ 1s). Typically everything works really well and we're happy. I would claim that this is a good way to implement streaming input: write to branch input, every X seconds commit and on success merge to main. (It is actually hard to do something else! E.g. writing to branch input-NNNNNN with a running commit counter NNNNNN pushes this huge problem onto the writers, who now need essentially to resolve the same issue that we are resolving here.)
Now suppose that our database (or maybe that part of our object store) has a hiccup and a commit takes >X seconds. Then sealed_tokens starts growing. AFAIU this requires every concurrent commit to fail (there is no point in retrying if sealed_tokens has grown since the last call; it will never go back to what we expect), so typically only the last concurrent commit will succeed. So until some time after recovery, not only do we have no commits, but also reads get progressively slower.
It may be OK to leave this open in the first version(s), but I do want to document this limitation while we have it.
Added a suggestion on how to address it. If you have other strategies that could be used here, let's add them!
Thing is, we run into trouble even if we never retry commits (so exponential backoff doesn't solve the issue)! Every commit tries and fails, growing sealed_tokens. As long as there's a nice fugue (commit starts by growing sealed_tokens, then runs until after the next commit starts) then no commit succeeds, and sealed_tokens grows, and reads take longer and longer.
I believe the only way around this will be to trim prefixes of sealed_tokens. E.g. accept a commit if the end of its sealed_tokens overlaps the start of the current sealed_tokens, and trim the shared bit.
Example
sealed_tokens: 1,2,3; staging_token: 4
Process A starts to commit tokens 1,2,3,4.
sealed_tokens: 1,2,3,4; staging_token: 5
Process B starts to commit tokens 1,2,3,4,5.
sealed_tokens: 1,2,3,4,5; staging_token: 6
Process C starts to commit tokens 1,2,3,4,5,6.
sealed_tokens: 1,2,3,4,5,6; staging_token: 7
Process A ends committing: it tries for a "fast-path commit", but since staging_token is no longer 5 that fails. Instead it reads the branch (or uses the value returned by Set-If, if we want to require such an interface). Now 1,2,3,4 is a prefix of 1,2,3,4,5,6, so A tries again to commit the remaining sealed_tokens 5,6 and a staging_token of 7. That works, so A succeeded!
sealed_tokens: 5,6; staging_token: 7
Process C ends committing: its fast-path commit works (staging_token is 7), so it drops all sealed_tokens.
sealed_tokens: (); staging_token: 7
Process B ends committing: its fast-path commit doesn't work, and it has an empty prefix, so it knows it has already succeeded.
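The trimming rule in this example fits in a small pure function. A sketch (hypothetical helper; assumes Go 1.21 for `slices.Equal` and the `min` builtin):

```go
import "slices"

// trimCommitted applies the overlap rule from the example: if some
// prefix of the branch's current sealed_tokens is a suffix of the
// tokens this commit just materialized, that prefix is already
// committed and can be dropped. ok=false means no nonempty overlap,
// i.e. everything this commit covered was already committed.
func trimCommitted(current, committed []string) (trimmed []string, ok bool) {
	for n := min(len(current), len(committed)); n > 0; n-- {
		if slices.Equal(current[:n], committed[len(committed)-n:]) {
			return current[n:], true
		}
	}
	return current, false
}
```

In the example, process A would call `trimCommitted([1,2,3,4,5,6], [1,2,3,4])` and get `([5,6], true)`; process B ends with no nonempty overlap and knows it has already succeeded.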
🔓 Lock freedom 🔓
Let's talk about how to make this lock-free (in the sense of non-blocking algorithms).
Why is this important? Consider a loaded system where commits are not even lock-free (i.e., it is possible to make no progress for an arbitrarily long period of time). For instance, let's say we have a commit every second. For some reason we have a delay (maybe the KV store went down for a while) and now we have 60 commits queued. They all race to finish the commit (by writing the old staging token with a shorter new sealed tokens queue) with incoming new commits (which try to write a new staging token and push a longer sealed tokens queue). Now there is no progress guarantee: we can have more and more new commits admitted. Request timeouts don't really help here, because new commit requests arrive at the same rate at which timeouts occur. So this is at best a steady state -- and commits finish by a random walk, i.e. we recover in time quadratic in the number of pending commits.
We can only talk about lock-free (and wait-free) operations for constant-time operations, so let's separate the "write metarange" operation (which is trivially concurrent but takes time linear in the number of changes) from the "update commit" operation (which can run in constant time). Also we cannot assume that the KV store is lock-free, so for our purposes we assume either that it is, or we count time in number of operations on the KV. (This is the same as any discussion of nonblocking operations, which invariably assume the memory is lock-free even though it probably is not...)
Becoming lock-free
With the process above, commits are still not lock-free. But we can make them so! The key is to have a "bag" of pending commits -- some data structure with Put, UpdateIfUnchanged, and GetAny operations. For instance, PUTs just write a value to key zzz_<bag-id>_<random>, and GETANYs just read the first key with prefix zzz_<bag-id>_. (A roughly-ordered queue is an even better choice, and we can do that too... but that is not necessary for correctness here, so this comment is too narrow for heuristics to put in such a queue...).
Now when trying (or possibly retrying) to commit, just add the desired commit record to the bag for that branch, and periodically probe to see if the commit succeeded. Meanwhile, every start commit and every finish-commit probe first repeatedly runs GetAny and tries to commit, until the bag is empty.
Now, if there is a pool of pending commits, then after all start-commits finish every committer (including new committers) is trying to end a commit! The only way a committer has to retry is that another committer succeeded, so we continue to make progress until the bag is (momentarily) empty.
Becoming wait-free
Wait-freedom is (much) harder. I think we could "just" use a queue as a bag above... but that is quite a powerful operation to expect our KV store to have, so it involves either implementing a lock-free queue (algorithms exist) or using a KV that has one (Redis).
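A sketch of the "bag" over the same hypothetical `Store` (key layout invented for illustration; GetAny is just "first key under the prefix"):

```go
// pendingBag is a best-effort bag of pending commit records layered
// on plain KV operations. Ordering is neither guaranteed nor needed
// for correctness.
type pendingBag struct {
	store  Store
	prefix string // e.g. "zzz_<bag-id>_" per the comment above
}

// Put writes the record under a random key and returns the key so
// the committer can poll whether its commit landed.
func (b *pendingBag) Put(ctx context.Context, record []byte) (string, error) {
	key := b.prefix + uuid.NewString()
	if err := b.store.Set(ctx, []byte(key), record); err != nil {
		return "", err
	}
	return key, nil
}

// GetAny returns some pending record, or ok=false when the bag is
// (momentarily) empty.
func (b *pendingBag) GetAny(ctx context.Context) (key string, record []byte, ok bool, err error) {
	it, err := b.store.Scan(ctx, []byte(b.prefix))
	if err != nil {
		return "", nil, false, err
	}
	defer it.Close()
	if !it.Next() {
		return "", nil, false, it.Err()
	}
	e := it.Entry()
	return string(e.Key), e.Value, true, nil
}
```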
design/metadata_kv/index.md
Outdated
| #### Writer flow | ||
|
|
||
| 1. Read the branch's existing staging token: currently, we batch requests for a short duration and amortize the cost across requests. | ||
| 1. Write to the staging token received |
Talking F2F I now understand that we store the path not so much in the object store as in the KV store. Can we specify this here, too?
Indeed - specified.
design/metadata_kv/index.md
Outdated
| It is important to understand that the current pessimistic approach locks the branch for the entire duration of the commit. | ||
| This takes time proportional to the amount of changes to be committed and is unbounded. All writes to a branch are blocked for that period of time. | ||
|
|
||
| With the optimistic version, writes are retried for a bounded period of time: the time between committer's initial branch read, and the first SetIf() that replaces the `staging_token`. |
| With the optimistic version, writes are retried for a bounded period of time: the time between committer's initial branch read, and the first SetIf() that replaces the `staging_token`. | |
| With the optimistic version, writes are retried for a bounded period of time: the time between committer's initial branch read, and the first `SetIf()` that replaces the `staging_token`. |
Thanks, Fixed.
arielshaqed
left a comment
Asking for more details about SetIf.
| // SetIf returns an ErrPredicateFailed error if the valuePredicate passed | ||
| // doesn't match the currently stored value. | ||
| SetIf(key, value, valuePredicate []byte) error |
This might allow a very powerful but complex predicate -- not clear because we don't document it anywhere. Or is this a CAS? Because CAS is weak, but hard to use efficiently.
Ideally we would break the predicate down into a much weaker language, allowing more efficient implementation on many KVs. Or support CAS and document it.
This is a simple CAS. Documented.
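Since SetIf is plain compare-and-swap, the usual caller pattern is a read-modify-CAS loop. A sketch, using the same hypothetical `Store` and `ErrPredicateFailed` as above:

```go
import "errors"

// casUpdate applies update to the value at key, retrying while other
// writers win the race: the CAS predicate is simply the value read.
func casUpdate(ctx context.Context, store Store, key []byte, update func(prev []byte) ([]byte, error)) error {
	for {
		prev, err := store.Get(ctx, key)
		if err != nil {
			return err
		}
		next, err := update(prev)
		if err != nil {
			return err
		}
		switch err := store.SetIf(ctx, key, next, prev); {
		case err == nil:
			return nil
		case !errors.Is(err, ErrPredicateFailed):
			return err
		}
		// Lost the race: re-read and try again.
	}
}
```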
design/metadata_kv/index.md
Outdated
| 1. get branch, find current `staging_token` | ||
| 1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`. | ||
| 1. take the list of sealed tokens, and using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit | ||
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. |
When is it safe to clear the staged entries of a particular staging token? After a commit has ended, there could be ongoing reads still reading entries from the staging token, so we can't clear them right away.
That's a very good question - the answer to "when is it safe" would be "never" :)
I revised the reader flow to address this, also using optimistic concurrency control. Interested in hearing your opinion on the suggested solution (also, I'm not sure I did a good job explaining it).
design/metadata_kv/index.md
Outdated
|
|
||
| #### Writer flow | ||
|
|
||
| 1. Read the branch's existing staging token: currently, we batch requests for a short duration and amortize the cost across requests. |
As far as I understand, this flow can cause a single write to appear in more than one commit. In particular, it can't be described as an idempotent operation in (3) below. Not saying this is necessarily bad but I'm finding it hard to describe it in terms of a consistency model.
In the following example, write 001 appears on commit a, is reverted by commit b, and then appears again as a new write on commit c.
Write 001 start – to staging token A
Commit a for token A
Write 002 start – to staging token B
Commit b for token B
Write 002 finalize: staging token was changed – write again to token C
Write 001 finalize: staging token was changed – write again to token C
Commit c for token C
Update: talked f2f with @arielshaqed, who explained that this "limitation" was discussed by the team. @ozkatz can you please add an explanation about it?
Great minds do think alike I guess. Thanks @arielshaqed for the description (should we maybe add it to this PR?)!
…from stale staging_token(s)
arielshaqed
left a comment
Neat! I think you addressed a different issue from the one that I had with periodic commits, so I blathered on a bit about the difference and how we could solve that one.
I do NOT think that this blocks an initial revision. (However, having written it down, I now think that it is simple enough to do immediately.)
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. | ||
|
|
||
| An important property here is delegating safety vs liveness to a single optimistic `SetIf()` operation: if a commit fails somewhere along the process, a subsequent commit would simply pick up where the failed one left off, | ||
| adding the current staging_token into the set of changes to apply. In an environment where there aren't many concurrent commits on the same branch, and where commits mostly succeed - the size of `sealed_tokens` would be relatively small. As an optimization, compaction strategies could be added to merge tokens together, but this *might not be necessary*, at least not for use cases we're familiar with. |
@arielshaqed I like this idea a lot - and it is simple.
|
|
||
| * A successful commit holds the contents of some listing of the entire keyspace. | ||
|
|
||
| * Mutating operations (commits and writes) may succeed or fail. When they fail, their contents |
This is sort of a non-guarantee: failing mutating operations are not guaranteed to be invisible.
I feel this item is mostly redundant since it's true for pretty much all object stores and databases that I can think of. perhaps @arielshaqed can clarify?
Everybody is right! If you know the state after a mutation then you already have consistency -- that's the C. If the mutation failed then you've lost either availability (A) or partition-freedom (P). Now if we drop A then the client might never return after a mutation failed, and if you drop P then the client is no longer connected. In both cases that means the client cannot know what happened after a failed mutation unless it retries forever on failure (and potentially never succeeds).
That's why it's true of all stores.
I would prefer to keep this explicit non-guarantee. It reminds readers that we are not requiring an ACID DB such as Postgres.
|
|
||
| ### Graveler Metadata - Commits, Tags, Repositories | ||
|
|
||
| These are simpler entities - commits and tags are immutable and could potentially be stored on the object store and not in the key/value store (they are cheap either way, so it doesn't make much of a difference). |
Tags are mutable, they can be deleted and recreated.
Fair enough. I do think they could be treated as immutable when discussing concurrency (i.e. it's ok to assume low or no contention when writing a single tag entry) - so less prone to read-modify-write issues.
Can we spell this out? E.g. something like
When tags are mutated, *no* ordering guarantees apply.
| 1. Read the branch's existing staging token: if branch exists in the cache, use it! Otherwise, do an amortized read (see [above](#caching-branch-pointers-and-amortized-reads)) and cache the result for a very short duration. | ||
| 1. Write to the staging token received - this is another key/value record (e.g. `"graveler/staging/${repoId}/${stagingToken}/${path}"`) | ||
| 1. Read the branch's existing staging token **again**. This is always an amortized read, not a cache read. If we get the same `staging_token` - great, no commit has *started while writing* the record, return success to the user. For a system with low contention between writes and commits, this will be the usual case. | ||
| 1. If the `staging_token` *has* changed - **retry the operation**. If the previous write made it in time to be included in the commit, we'll end up writing a record with the same identity - an idempotent operation. |
Suggestion: instead of retrying the operation - fail the operation. The user can then retry.
Pros:
- Avoid strange consistency issues like the ones mentioned above. Changes appear only once in the history, unless the user explicitly retries.
- Simpler to understand and implement
Con:
- Increased chance of a failed operation being persisted.
Also, in any case, this operation cannot be described as "idempotent".
Changes appear only once in the history anyway - what are the strange consistency issues?
Writing the exact same entry again is idempotent: the identity field written would be exactly the same, meaning it won't show as a diff between versions. Just like in Git: if you overwrite a local file with the exact same content, it's a no-op; you won't see it as a diff even though you modified the file.
I was referring to the banana problem - the retry will not be writing the same identity: writing again will cause the identity to change, so it will not be idempotent.
And, the same write will appear twice in the history.
OK, I understand I need to better phrase where this retry happens:
this is a top-level graveler operation. It doesn't delegate the retry back to the caller (catalog), so it is done using the same value that was given to it (i.e. the same serialized entry and the same calculated identity).
I do think that we have a B-A-N-A problem here. If I try to write A & N to the same path and try committing twice while doing it, the value could very well be:
B - before anything was done.
A - in the first commit.
N - in the second commit.
A - staging area after the second commit.
So the A was "written twice" for a single write operation.
I think that the design assumes that no other value is being written to the same path concurrently, hence the mistake.
I shall add the sealed_tokens prefix removal suggestion separately.
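For reference, the writer flow quoted above as a Go sketch, continuing the same hypothetical names (`readStagingToken` stands in for the amortized/batched branch read; imports from the earlier sketches assumed):

```go
import "fmt"

// readStagingToken reads the branch and extracts staging_token. In
// practice this read is batched/amortized as described above.
func readStagingToken(ctx context.Context, store Store, branchKey []byte) (string, error) {
	raw, err := store.Get(ctx, branchKey)
	if err != nil {
		return "", err
	}
	var b Branch
	if err := json.Unmarshal(raw, &b); err != nil {
		return "", err
	}
	return b.StagingToken, nil
}

// writeEntry writes under the current staging token, then re-reads
// the token and retries if a commit sealed it mid-write. The retry
// reuses the same serialized entry (same identity), which is what
// makes the repeated write a no-op in terms of diffs.
func writeEntry(ctx context.Context, store Store, repoID string, branchKey []byte, path string, entry []byte) error {
	for {
		token, err := readStagingToken(ctx, store, branchKey)
		if err != nil {
			return err
		}
		key := fmt.Sprintf("graveler/staging/%s/%s/%s", repoID, token, path)
		if err := store.Set(ctx, []byte(key), entry); err != nil {
			return err
		}
		again, err := readStagingToken(ctx, store, branchKey) // never from cache
		if err != nil {
			return err
		}
		if again == token {
			return nil // common case under low write/commit contention
		}
		// A commit started while we wrote; write again to the new token.
	}
}
```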
arielshaqed
left a comment
I suggest we add (this) explicit suggestion for "prefix match" with an explanation of why it ensures progress.
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch key/value pair again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. | ||
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. |
The following ensures liveness for the concurrent progressive commits scenario in this comment. I think we must have it to ensure progress. (To be clear: the current design without it is perfectly correct - no mistakes will be made in this scenario - but it cannot guarantee progress: time to completion of a single commit is unbounded, and the commit may never happen.)
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch key/value pair again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit. | |
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. | |
| 1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform a `SetIfPrefixMatch()` operation (below) to update the branch key/value pair again: replace its commit ID with the new value, and remove the prefix of the branch's `sealed_tokens` that matches the _suffix_ of this commit's `sealed_tokens`. These are precisely the tokens that have materialized into the new commit. | |
| 1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know. | |
| The `SetIfPrefixMatch()` operation is defined at the metadata level (_not_ the KV store), and can use the existing `SetIf` operation of the KV store. It performs the following atomically / optimistically / optimistically with retries: Given a `committed_sealed_tokens`: | |
| 1. If there is no nonempty prefix of the current branch `sealed_tokens` that is a suffix of `committed_sealed_tokens`: return as "no work to be done". | |
| 1. Otherwise, set `new_sealed_tokens` to be the current `sealed_tokens` with this matching prefix removed. | |
| 1. Update the branch value to replace its commit ID with the new commit ID, and its `sealed_tokens` with `new_sealed_tokens`. | |
| We can implement `SetIfPrefixMatch` by being optimistic about the current branch `sealed_tokens`. It might not always work, but for any concurrent executions _at least one will terminate_, which ensures global progress. Now for any execution, if we need to retry then the length of the prefix must be shorter, so any execution also progresses. |
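A sketch of how the suggested `SetIfPrefixMatch()` could sit on top of the plain CAS, reusing `trimCommitted` from the example above (hypothetical names throughout, as before):

```go
// ErrNoWork reports that no nonempty prefix matched: the tokens this
// commit covered were already materialized by another commit.
var ErrNoWork = errors.New("SetIfPrefixMatch: no work to be done")

// setIfPrefixMatch retries optimistically: read the branch, drop the
// already-committed overlap from sealed_tokens, and CAS the result
// in. If the CAS loses, the remaining overlap can only be shorter on
// the next round, and among concurrent callers at least one CAS
// always lands - which is the progress argument above.
func setIfPrefixMatch(ctx context.Context, store Store, key []byte, newCommitID string, committed []string) error {
	for {
		prev, err := store.Get(ctx, key)
		if err != nil {
			return err
		}
		var b Branch
		if err := json.Unmarshal(prev, &b); err != nil {
			return err
		}
		trimmed, ok := trimCommitted(b.SealedTokens, committed)
		if !ok {
			return ErrNoWork
		}
		b.CommitID = newCommitID
		b.SealedTokens = trimmed
		next, err := json.Marshal(&b)
		if err != nil {
			return err
		}
		err = store.SetIf(ctx, key, next, prev)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrPredicateFailed) {
			return err
		}
		// Branch changed under us; re-read and retry.
	}
}
```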
| 1. Write to the staging token received - this is another key/value record (e.g. `"graveler/staging/${repoId}/${stagingToken}/${path}"`) | ||
| 1. Read the branch's existing staging token **again**. This is always an amortized read, not a cache read. If we get the same `staging_token` - great, no commit has *started while writing* the record, return success to the user. For a system with low contention between writes and commits, this will be the usual case. | ||
| 1. If the `staging_token` *has* changed - **retry the operation**. If the previous write made it in time to be included in the commit, we'll end up writing a record with the same identity - an idempotent operation. | ||
|
|
| ##### Optimization: Detect successful commits | |
| We can start without this optimization and add it later. However it can greatly help avoid meltdowns when many writes compete with commits. That's important because writes -- unlike commits with the staging queue overlap above -- are not wait-free or even lock-free. (In the presence of repeated concurrent commits they can suffer from livelock.) It is **not** sufficient to guarantee wait-freedom; it just makes things a bit better when things go bad. | |
| When a staging token changes during a write, instead of immediately retrying, _look up the physical address in the commit for the original staging token or for some successive token_. If it is found there, then it *was* committed concurrently and no rewrite is necessary. |
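The optimization might look like this sketch, where `CommitLookup` is an invented stand-in for looking the path up in the commit's metarange:

```go
import "bytes"

// CommitLookup finds an entry's identity by path within a commit;
// a stand-in for lakeFS's committed-metadata reader.
type CommitLookup func(ctx context.Context, commitID, path string) (identity []byte, found bool, err error)

// skipRewrite reports whether a write whose staging token was sealed
// mid-flight already made it into the given commit, in which case
// the writer does not need to retry.
func skipRewrite(ctx context.Context, lookup CommitLookup, commitID, path string, identity []byte) (bool, error) {
	got, found, err := lookup(ctx, commitID, path)
	if err != nil || !found {
		return false, err
	}
	return bytes.Equal(got, identity), nil
}
```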