Conversation

@ozkatz (Collaborator) commented Sep 13, 2021

No description provided.

@arielshaqed (Contributor) left a comment
Neat!

Comment on lines 54 to 69
- PostgreSQL
- MySQL
- Embedded Pebble/RocksDB (great option for a "quickstart" environment?)
- MongoDB
- AWS DynamoDB
- FoundationDB
- Azure Cosmos
- Azure Blob Store
- Google BigTable
- Google Spanner
- Google Cloud Storage
- HBase
- Cassandra (or compatible systems such as ScyllaDB)
- Raft (embedded, or server implementations such as Consul/ETCd)
- Persistent Redis (Redis Enterprise, AWS MemoryDB)
- Simple in-memory tree (for simplifying and speeding up tests?)
Contributor:

Can we include a list (or multiple lists) of required primitives for the KV store?

Collaborator (Author):

Added! thanks.

We add a field to each `Branch` object: in addition to the existing `staging_token`, an array of strings named `sealed_tokens`.

1. get branch, find current `staging_token`
1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`.
Contributor:

IIUC this is a SetIf that succeeds only if the existing `staging_token` (the one pushed onto `sealed_tokens`) is unchanged?

Collaborator (Author):

Made it clear that SetIf() is actually a CAS operation on the entire branch (staging_token, sealed_tokens and commit_id).

1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`.
1. take the list of sealed tokens, and using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit
1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit.
1. If `SetIf()` fails, this means another commit is happenning/has happened on the same branch. Can either retry or let the caller know.
Contributor:

Suggested change
1. If `SetIf()` fails, this means another commit is happenning/has happened on the same branch. Can either retry or let the caller know.
1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know.

Collaborator (Author):

Fixed, thanks!

1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`.
1. take the list of sealed tokens, and using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit
1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit.
1. If `SetIf()` fails, this means another commit is happenning/has happened on the same branch. Can either retry or let the caller know.
Contributor:

What is this SetIf on? AFAICT just staging_token or just commit_id are not enough to provide ordering in the face of concurrent commits. It might be if the entire value is unchanged, or possibly staging_token and sealed_tokens.

Collaborator (Author):

This was indeed unclear - made it explicit that SetIf() is actually a CAS operation on the entire branch (staging_token, sealed_tokens and commit_id).

1. If `SetIf()` fails, this means another commit is happenning/has happened on the same branch. Can either retry or let the caller know.

An important property here is delegating safety vs liveness to a single optimistic `SetIf()` operation: if a commit fails somewhere along the process, a subsequent commit would simply pick up where the failed one left off,
adding the current staging_token into the set of changes to apply. In an environment where there aren't many concurrent commits on the same branch, and where commits mostly succeed, the size of `sealed_tokens` would stay relatively small. As an optimization, compaction strategies could be added to merge tokens together, but this *might not be necessary*, at least not for the use cases we're familiar with.
Contributor:

I am less optimistic (sorry) here. Suppose that we have an automated commit, once every X seconds (X ~ 1s). Typically everything works really well and we're happy. I would claim that this is a good way to implement streaming input: write to branch input, every X seconds commit and on success merge to main. (It is actually hard to do something else! E.g. writing to branch input-NNNNNN with a running commit counter NNNNNN pushes this huge problem onto the writers, who now need essentially to resolve the same issue that we are resolving here.)

Now suppose that our database (or maybe that part of our object store) has a hiccup and a commit takes >X seconds. Then sealed_tokens starts growing. AFAIU this requires every concurrent commit to fail (there is no point in retrying if sealed_tokens has grown since the last call; it will never go back to what we expect). So typically only the last concurrent commit will succeed. So until some time after recovery, not only do we have no commits, but reads also get progressively slower.

It may be OK to leave this open in the first version(s), but I do want to document this limitation while we have it.

Collaborator (Author):

Added a suggestion on how to address it. If you have other strategies that could be used here, let's add them!

Contributor:

Thing is, we run into trouble even if we never retry commits (so exponential backoff doesn't solve the issue)! Every commit tries and fails, growing sealed_tokens. As long as there's a nice fugue (commit starts by growing sealed_tokens, then runs until after the next commit starts) then no commit succeeds, and sealed_tokens grows, and reads take longer and longer.

I believe the only way around this will be to trim prefixes of sealed_tokens. E.g. accept a commit if the end of its sealed_tokens overlaps the start of the current sealed_tokens, and trim the shared bit.

Example

sealed_tokens: 1,2,3; staging_token: 4

Process A starts to commit tokens 1,2,3,4.

sealed_tokens: 1,2,3,4; staging_token: 5

Process B starts to commit tokens 1,2,3,4,5.

sealed_tokens: 1,2,3,4,5; staging_token: 6

Process C starts to commit tokens 1,2,3,4,5,6.

sealed_tokens: 1,2,3,4,5,6; staging_token: 7

Process A ends committing: it tries for a "fast-path commit", but since staging_token is no longer 5 that fails. Instead it reads the branch (or uses the value returned by Set-If, if we want to require such an interface). Now 1,2,3,4 is a prefix of 1,2,3,4,5,6, so A tries again to commit the remaining sealed_tokens 5,6 and a staging_token of 7. That works, so A succeeded!

sealed_tokens: 5,6; staging_token: 7

Process C ends committing: its fast-path commit works (staging_token is 7), so it drops all sealed_tokens.

sealed_tokens: (); staging_token: 7

Process B ends committing: its fast-path commit doesn't work, and it has an empty prefix, so it knows it has already succeeded.
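The trimming rule walked through above could be sketched as a single helper: a finishing commit completes if a suffix of the tokens it materialized matches a prefix of the branch's current sealed tokens, which it then trims (the fast path falls out as the case where everything matches). Names are hypothetical, not lakeFS code:

```go
package main

import "fmt"

// equalSlices reports whether two string slices are element-wise equal.
func equalSlices(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// trimCommitted returns the branch's current sealed tokens with the prefix
// materialized by this commit removed. ok == false means no overlap: there
// is nothing left to do (the commit was absorbed by another committer).
func trimCommitted(current, committed []string) (remaining []string, ok bool) {
	n := len(committed)
	if len(current) < n {
		n = len(current)
	}
	// look for the longest prefix of current that is a suffix of committed
	for k := n; k > 0; k-- {
		if equalSlices(current[:k], committed[len(committed)-k:]) {
			return current[k:], true
		}
	}
	return nil, false
}

func main() {
	// Process A committed tokens 1..4 while the branch moved on to sealed 1..6:
	rem, ok := trimCommitted(
		[]string{"1", "2", "3", "4", "5", "6"},
		[]string{"1", "2", "3", "4"})
	fmt.Println(rem, ok) // [5 6] true

	// Process B finishes after everything was already absorbed: empty overlap.
	_, ok = trimCommitted([]string{}, []string{"1", "2", "3", "4", "5"})
	fmt.Println(ok) // false
}
```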

Contributor:

🔓 Lock freedom 🔓

Let's talk about how to make this lock-free (in the sense of non-blocking algorithms).

Why is this important? Consider a loaded system where commits are not even lock-free (i.e., it is possible to make no progress for an arbitrarily long period of time). For instance, let's say we have a commit every second. For some reason we have a delay (maybe the KV store went down for a while) and now we have 60 commits queued. They all race to finish the commit (by writing the old staging token with a shorter new sealed tokens queue) with incoming new commits (which try to write a new staging token and push a longer sealed tokens queue). Now there is no progress guarantee; we can have more and more new commits admitted. Request timeouts don't really help here, because new commit requests arrive at the same rate at which timeouts occur. So this is at best a steady state -- and commits finish by a random walk, i.e. we recover in time quadratic in the number of pending commits.

We can only talk about lock-free (and wait-free) operations for constant-time operations, so let's separate the "write metarange" operation (which is trivially concurrent but takes time linear in the number of changes) from the "update commit" (which can run in constant time). Also we cannot assume that the KV store is lock-free, so for our purposes we assume either that it is or that we count time in number of operations on the KV. (This is the same as any discussion of nonblocking operations, which invariably assume the memory is lock-free even though it probably is not...)

Becoming lock-free

With the process above, commits are still not lock-free. But we can make them so! The key is to have a "bag" of pending commits -- some data structure with Put, UpdateIfUnchanged, and GetAny operations. For instance, PUTs just write a value to key zzz_<bag-id>_<random>, and GETANYs just read the first key with prefix zzz_<bag-id>_. (A roughly-ordered queue is an even better choice, and we can do that too... but that is not necessary for correctness here, so this comment is too narrow for heuristics to put in such a queue...).

Now when trying (or possibly retrying) to commit, just add the desired commit record to the bag for that branch, and periodically probe to see if the commit succeeded. Meanwhile, every start commit and every finish-commit probe first repeatedly runs GetAny and tries to commit, until the bag is empty.

Now, if there is a pool of pending commits, then after all start-commits finish every committer (including new committers) is trying to end a commit! The only way a committer has to retry is that another committer succeeded, so we continue to make progress until the bag is (momentarily) empty.
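The "bag" described above needs only Put, GetAny, and Delete, which an ordered KV can supply with a prefix scan. A toy sketch (key layout and names are assumptions for illustration, not a lakeFS API):

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"strings"
)

// Bag is a sketch of the pending-commits bag over a generic ordered KV.
type Bag struct {
	kv     map[string]string // stands in for the KV store
	prefix string            // e.g. "zzz_<bag-id>_"
}

// Put writes the pending commit record under a random key with the bag prefix.
func (b *Bag) Put(record string) string {
	key := fmt.Sprintf("%s%08d", b.prefix, rand.Intn(100000000))
	b.kv[key] = record
	return key
}

// GetAny returns the first key with the bag prefix; an ordered KV can do
// this with a single range read instead of the full scan used here.
func (b *Bag) GetAny() (key, record string, ok bool) {
	keys := make([]string, 0, len(b.kv))
	for k := range b.kv {
		if strings.HasPrefix(k, b.prefix) {
			keys = append(keys, k)
		}
	}
	if len(keys) == 0 {
		return "", "", false
	}
	sort.Strings(keys)
	return keys[0], b.kv[keys[0]], true
}

// Delete removes a completed commit record from the bag.
func (b *Bag) Delete(key string) { delete(b.kv, key) }

func main() {
	bag := &Bag{kv: map[string]string{}, prefix: "zzz_mainbag_"}
	k := bag.Put("commit-record-1")
	_, rec, ok := bag.GetAny()
	fmt.Println(rec, ok) // commit-record-1 true
	bag.Delete(k)
	_, _, ok = bag.GetAny()
	fmt.Println(ok) // false
}
```

Committers then loop: GetAny, try to finish that commit, Delete on success, until the bag is momentarily empty.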

Becoming wait-free

Wait-freedom is (much) harder. I think we could "just" use a queue as a bag above... but that is quite a powerful operation to expect our KV store to have, so it involves either implementing a lock-free queue (algorithms exist) or using a KV that has one (Redis).

#### Writer flow

1. Read the branch's existing staging token: currently, we batch requests for a short duration and amortize the cost across requests.
1. Write to the staging token received
Contributor:

Talking F2F I now understand that we store the path not so much in the object store as in the KV store. Can we specify this here, too?

Collaborator (Author):

Indeed - specified.

It is important to understand that the current pessimistic approach locks the branch for the entire duration of the commit.
This takes time proportional to the number of changes being committed and is unbounded. All writes to a branch are blocked for that period of time.

With the optimistic version, writes are retried for a bounded period of time: the time between committer's initial branch read, and the first SetIf() that replaces the `staging_token`.
Contributor:

Suggested change
With the optimistic version, writes are retried for a bounded period of time: the time between committer's initial branch read, and the first SetIf() that replaces the `staging_token`.
With the optimistic version, writes are retried for a bounded period of time: the time between committer's initial branch read, and the first `SetIf()` that replaces the `staging_token`.

Collaborator (Author):

Thanks, fixed.

@arielshaqed (Contributor) left a comment

Asking for more details about SetIf.

Comment on lines 44 to 46
// SetIf returns an ErrPredicateFailed error if the valuePredicate passed
// doesn't match the currently stored value.
SetIf(key, value, valuePredicate []byte) error
Contributor:

This might allow a very powerful but complex predicate -- not clear because we don't document it anywhere. Or is this a CAS? Because CAS is weak, but hard to use efficiently.

Ideally we would break the predicate down into a much weaker language, allowing more efficient implementation on many KVs. Or support CAS and document it.

Collaborator (Author):

This is a simple CAS; documented.
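Under the CAS reading, the contract could be illustrated with a toy in-memory stand-in (this mirrors the documented semantics only; it is not the lakeFS implementation):

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

var ErrPredicateFailed = errors.New("kv: predicate failed")

// Store is a toy in-memory KV illustrating SetIf as compare-and-swap.
type Store struct {
	data map[string][]byte
}

func (s *Store) Set(key string, value []byte) {
	s.data[key] = value
}

// SetIf succeeds only if the currently stored value is byte-equal to
// valuePredicate; a nil predicate is taken here to mean "key must be absent"
// (an assumption for this sketch).
func (s *Store) SetIf(key string, value, valuePredicate []byte) error {
	cur, exists := s.data[key]
	if valuePredicate == nil {
		if exists {
			return ErrPredicateFailed
		}
	} else if !bytes.Equal(cur, valuePredicate) {
		return ErrPredicateFailed
	}
	s.data[key] = value
	return nil
}

func main() {
	s := &Store{data: map[string][]byte{}}
	s.Set("k", []byte("v1"))
	fmt.Println(s.SetIf("k", []byte("v2"), []byte("v1"))) // succeeds
	fmt.Println(s.SetIf("k", []byte("v3"), []byte("v1"))) // fails: predicate mismatch
}
```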

1. get branch, find current `staging_token`
1. use `SetIf()` to update the branch (if not modified by another process): push existing `staging_token` into `sealed_tokens`, set new uuid as `staging_token`.
1. take the list of sealed tokens, and using the [`CombinedIterator()`](https://github.com/treeverse/lakeFS/blob/master/pkg/graveler/combined_iterator.go#L11), turn them into a single iterator to be applied on top of the existing commit
1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit.
Contributor:

When is it safe to clear the staged entries of a particular staging token? After a commit has ended, there could be ongoing reads still reading entries from the staging token, so we can't clear them right away.

Collaborator (Author):

That's a very good question - the answer to "when is it safe" would be "never" :)
I revised the reader flow to address this, also using optimistic concurrency control. Interested in hearing your opinion on the suggested solution (also, I'm not sure I did a good job explaining it).


#### Writer flow

1. Read the branch's existing staging token: currently, we batch requests for a short duration and amortize the cost across requests.
Contributor:

As far as I understand, this flow can cause a single write to appear in more than one commit. In particular, it can't be described as an idempotent operation in (3) below. Not saying this is necessarily bad but I'm finding it hard to describe it in terms of a consistency model.

In the following example, write 001 appears on commit a, reverted by commit b and then appears again as a new write on commit c.

Write 001 start – to staging token A
Commit a for token A
Write 002 start – to staging token B
Commit b for token B
Write 002 finalize: staging token was changed – write again to token C
Write 001 finalize: staging token was changed – write again to token C
Commit c for token C

Contributor:

Update: talked f2f with @arielshaqed, who explained that this "limitation" was discussed by the team. @ozkatz can you please add an explanation about it?

Contributor:

#2480 (created after F2F'ing with @ozkatz ) is my attempt to describe what guarantees we do give.

Collaborator (Author):

Great minds do think alike I guess. Thanks @arielshaqed for the description (should we maybe add it to this PR?)!

@arielshaqed (Contributor) left a comment

Neat! I think you addressed a different issue from the one that I had with periodic commits, so I blathered on a bit about the difference and how we could solve that one.

I do NOT think that this blocks an initial revision. (However, having written it down, I now think that it is simple enough to do immediately.)


@ozkatz (Collaborator, Author) commented Sep 22, 2021

Neat! I think you addressed a different issue from the one that I had with periodic commits, so I blathered on a bit about the difference and how we could solve that one.

I do NOT think that this blocks an initial revision. (However, having written it down, I now think that it is simple enough to do immediately.)

@arielshaqed I like this idea a lot - and it is simple.
Can you contribute it to this PR? Preferably as another sub-section so it's independent (just thinking ahead to how we might break down this change into smaller bits if/when we decide we want to go for it)


* A successful commit holds the contents of some listing of the entire keyspace.

* Mutating operations (commits and writes) may succeed or fail. When they fail, their contents
Contributor:

This is sort of a non-guarantee: failing mutating operations are not guaranteed to not be visible.

Collaborator (Author):

I feel this item is mostly redundant since it's true for pretty much all object stores and databases that I can think of. Perhaps @arielshaqed can clarify?

Contributor:

Everybody is right! If you know the state after a mutation then you already have consistency -- that's the C. If the mutation failed then you've lost either availability (A) or partition-freedom (P). Now if we drop A then the client might never return after a mutation failed, and if you drop P then the client is no longer connected. In both cases that means the client cannot know what happened after a failed mutation unless it retries forever on failure (and potentially never succeeds).

That's why it's true of all stores.

I would prefer to keep this explicit non-guarantee. It reminds readers that we are not requiring an ACID DB such as Postgres.


### Graveler Metadata - Commits, Tags, Repositories

These are simpler entities - commits and tags are immutable and could potentially be stored on the object store and not in the key/value store (they are cheap either way, so it doesn't make much of a difference).
Contributor:

Tags are mutable, they can be deleted and recreated.

Collaborator (Author):

Fair enough. I do think they could be treated as immutable when discussing concurrency (i.e. it's OK to assume low or no contention when writing a single tag entry) - so less prone to read-modify-write issues.

Contributor:

Can we spell this out? E.g. something like

When tags are mutated, *no* ordering guarantees apply.

@arielshaqed arielshaqed requested a review from eden-ohana October 6, 2021 08:53
1. Read the branch's existing staging token: if branch exists in the cache, use it! Otherwise, do an amortized read (see [above](#caching-branch-pointers-and-amortized-reads)) and cache the result for a very short duration.
1. Write to the staging token received - this is another key/value record (e.g. `"graveler/staging/${repoId}/${stagingToken}/${path}"`)
1. Read the branch's existing staging token **again**. This is always an amortized read, not a cache read. If we get the same `staging_token` - great, no commit has *started while writing* the record, return success to the user. For a system with low contention between writes and commits, this will be the usual case.
1. If the `staging_token` *has* changed - **retry the operation**. If the previous write made it in time to be included in the commit, we'll end up writing a record with the same identity - an idempotent operation.
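The writer flow above (read token, write record, re-read, retry on change) could be sketched as follows. The store interface and key layout are stand-ins for the real amortized branch read and KV write; all names are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// fakeStore stands in for the branch pointer and the staging KV.
type fakeStore struct {
	stagingToken string
	records      map[string]string
}

func (s *fakeStore) getStagingToken() string { return s.stagingToken }

func (s *fakeStore) putStagingRecord(token, path, entry string) {
	// key layout as in the design: graveler/staging/${repoId}/${stagingToken}/${path}
	s.records["graveler/staging/repo1/"+token+"/"+path] = entry
}

// writeEntry retries until the staging token observed before and after the
// KV write match, i.e. no commit sealed the token mid-write. Retries reuse
// the exact same entry, so a duplicate write carries the same identity.
func writeEntry(s *fakeStore, path, entry string, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		token := s.getStagingToken()           // step 1: read staging token
		s.putStagingRecord(token, path, entry) // step 2: write the record
		if s.getStagingToken() == token {      // step 3: re-read and compare
			return nil // no commit started while writing
		}
		// step 4: token changed - retry the whole operation
	}
	return errors.New("too much contention with commits")
}

func main() {
	s := &fakeStore{stagingToken: "t1", records: map[string]string{}}
	fmt.Println(writeEntry(s, "a/b", "entry", 3)) // <nil>
}
```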
@johnnyaug (Contributor) commented Oct 6, 2021:

Suggestion: instead of retrying the operation - fail the operation. The user can then retry.

Pros:

  • Avoid strange consistency issues like the ones mentioned above. Changes appear only once in the history, unless the user explicitly retries.
  • Simpler to understand and implement

Con:

  • Increased chance of a failed operation being persisted.

Contributor:

Also, in any case, this operation cannot be described as "idempotent".

Collaborator (Author):

Changes appear only once in history anyway, what are the strange consistency issues?

Writing the exact same entry again is idempotent: the identity field written would be exactly the same, meaning it won't show as a diff between versions. Just like in Git: if you overwrite a local file with the exact same content, it's a no-op; you won't see it as a diff even though you modified the file.
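To make the idempotence argument concrete: assuming (for illustration only) that an entry's identity is derived from its serialized content, a retried write of the same entry produces an identical identity, so it cannot show up as a diff:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// identity derives an entry's identity from its content. Deriving it as a
// content hash is an assumption made here for illustration, not necessarily
// how lakeFS computes identities.
func identity(entry []byte) [32]byte { return sha256.Sum256(entry) }

func main() {
	first := identity([]byte(`{"path":"a/b","size":42}`))
	retry := identity([]byte(`{"path":"a/b","size":42}`)) // same serialized entry
	fmt.Println(first == retry)                           // true: retry is a no-op in a diff
}
```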

@johnnyaug (Contributor) commented Oct 7, 2021:

I was referring to the banana problem - the retry will not be writing the same identity again; a concurrent write to the same path can cause the identity to change, so it will not be idempotent.
And, the same write will appear twice in the history.

Collaborator (Author):

OK, I understand I need to better phrase where this retry happens:

this is a top-level graveler operation. It doesn't delegate the retry back to the caller (catalog), so it is done using the same value that was given to it (i.e. the same serialized entry and same calculated identity)

Contributor:

I do think that we have a B-A-N-A problem here. If I try to write A & N to the same path, and two commits happen while doing it, the value could very well be:

  1. B - before anything was done.
  2. A - in the first commit.
  3. N - in the second commit.
  4. A - staging area after the second commit.

So the A was "written twice" for a single write operation.

I think that the design assumes that no other value is being written to the same path concurrently, hence the mistake.

@arielshaqed (Contributor) left a comment

I shall add the sealed_tokens prefix removal suggestion separately.



@arielshaqed (Contributor) left a comment

I suggest we add (this) explicit suggestion for "prefix match" with an explanation of why it ensures progress.

Comment on lines 195 to 196
1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch key/value pair again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit.
1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know.
Contributor:

The following ensures liveness for the concurrent progressive commits scenario in this comment. I think we must have it to ensure progression. (To be clear: the current design without it is perfectly correct: no mistakes will be made in this scenario. But it cannot guarantee progression: time to completion of a single commit is unbounded, and the commit may never happen.)

Suggested change
1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform another `SetIf()` that updates the branch key/value pair again: replacing its commit ID with the new value, and clearing `sealed_tokens`, as these have materialized into the new commit.
1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know.
1. Once the commit has been persisted (metaranges and ranges stored in object store, commit itself stored to KV using `Set()`), perform a `SetIfPrefixMatch()` operation (below) to update the branch key/value pair again: replace its commit ID with the new value, and remove the _suffix_ of this commit's `sealed_tokens` that matches the prefix of the branch's `sealed_tokens`. These are precisely the prefixes that have materialized into the new commit.
1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know.
The `SetIfPrefixMatch()` operation is defined at the metadata level (_not_ the KV store), and can use the existing `SetIf` operation of the KV store. It performs the following atomically (optimistically, with retries), given a `committed_sealed_tokens`:
1. If no nonempty prefix of the current branch `sealed_tokens` is a suffix of `committed_sealed_tokens`: return as "no work to be done".
1. Otherwise, set `new_sealed_tokens` to the branch's `sealed_tokens` with this matching overlap removed: the overlap has materialized into the new commit, and only the remaining tokens are still pending.
1. Update the branch value to replace its commit ID with the new commit ID, and its `sealed_tokens` with `new_sealed_tokens`.
We can implement `SetIfPrefixMatch` by being optimistic about the current branch `sealed_tokens`. It might not always work, but for any set of concurrent executions _at least one will terminate_, which ensures global progress. And whenever an execution does need to retry, the length of the matching prefix must be shorter, so each individual execution also progresses.
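To make the overlap rule concrete, here is a minimal Python sketch of the `SetIfPrefixMatch()` idea over an in-memory branch record. All names are illustrative stand-ins, not lakeFS APIs, and the real operation would run as a `SetIf` (CAS) against the KV store rather than mutating an object in place:

```python
class Branch:
    """Illustrative in-memory stand-in for the branch record (hypothetical)."""
    def __init__(self, commit_id, staging_token, sealed_tokens):
        self.commit_id = commit_id
        self.staging_token = staging_token
        self.sealed_tokens = list(sealed_tokens)

def longest_matching_prefix(branch_sealed, committed_sealed):
    # Longest nonempty prefix of the branch's sealed_tokens that is a
    # suffix of the tokens this commit materialized; 0 if none.
    for n in range(len(branch_sealed), 0, -1):
        if n <= len(committed_sealed) and committed_sealed[-n:] == branch_sealed[:n]:
            return n
    return 0

def set_if_prefix_match(branch, new_commit_id, committed_sealed):
    # In the real system this whole update runs as one SetIf (CAS) on the
    # branch record, retried with a fresh read on contention.
    n = longest_matching_prefix(branch.sealed_tokens, committed_sealed)
    if n == 0:
        return False  # no work to be done
    branch.commit_id = new_commit_id
    # Drop the materialized overlap; tokens sealed after this commit
    # started remain pending for the next commit.
    branch.sealed_tokens = branch.sealed_tokens[n:]
    return True
```

For example, if a commit sealed `["s1"]` while a later commit pushed `s2`, the branch holds `["s1", "s2"]`; finishing the first commit removes only `s1` and leaves `s2` sealed.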

1. Write to the staging token received - this is another key/value record (e.g. `"graveler/staging/${repoId}/${stagingToken}/${path}"`)
1. Read the branch's existing staging token **again**. This is always an amortized read, not a cache read. If we get the same `staging_token` - great, no commit has *started while writing* the record, return success to the user. For a system with low contention between writes and commits, this will be the usual case.
1. If the `staging_token` *has* changed - **retry the operation**. If the previous write made it in time to be included in the commit, we'll end up writing a record with the same identity - an idempotent operation.
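The write/re-read/retry loop above can be sketched as follows, using a plain dict as a hypothetical KV store; the key layout follows the example paths in the proposal, but everything else is illustrative:

```python
def write_entry(kv, repo_id, branch_id, path, entry, max_retries=3):
    """Sketch of the writer flow: stage under the current staging token,
    then re-read the token to detect a racing commit (names illustrative)."""
    branch_key = f"graveler/branches/{repo_id}/{branch_id}"
    for _ in range(max_retries):
        token = kv[branch_key]["staging_token"]
        # Write the staged entry under the current staging token.
        kv[f"graveler/staging/{repo_id}/{token}/{path}"] = entry
        # Re-read the branch's staging token: if unchanged, no commit
        # started while we were writing, and the entry will be picked up
        # by the next commit.
        if kv[branch_key]["staging_token"] == token:
            return True
        # A commit raced us; retry. Rewriting the same identity under the
        # new token is idempotent.
    return False
```

In the common low-contention case the loop runs exactly once: one staging write plus two branch reads.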

Contributor

Suggested change
##### Optimization: Detect successful commits
We can start without this optimization and add it later. However, it can greatly help avoid meltdowns when many writes compete with commits. That's important because writes -- unlike commits with the staging queue overlap above -- are not lock-free or even wait-free (in the presence of repeated concurrent commits they can suffer from livelock). It is **not** sufficient to guarantee wait-freedom; it just makes things a bit better when things go bad.
When a staging token changes during a write, instead of immediately retrying, _look up the physical address in the commit for the original staging token or for some successive token_. If it is found there, then it *was* committed concurrently and no rewrite is necessary.
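A tiny sketch of that check, in Python; `committed_lookup` stands in for reading the new commit's metarange (a hypothetical callable, not a real lakeFS API):

```python
def was_committed(committed_lookup, path, physical_address):
    """Optimization sketch: after the staging token changes mid-write,
    check whether the entry we wrote landed in the new commit before
    retrying. Returns True iff the committed entry matches ours."""
    entry = committed_lookup(path)
    return entry is not None and entry["physical_address"] == physical_address
```

If this returns True, the concurrent commit already included our write and no rewrite (or retry) is necessary.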

1. If `SetIf()` fails, this means another commit is happening/has happened on the same branch. Can either retry or let the caller know.

An important property here is delegating safety vs. liveness to a single optimistic `SetIf()` operation: if a commit fails somewhere along the process, a subsequent commit simply picks up where the failed one left off, adding the current `staging_token` into the set of changes to apply. In an environment without many concurrent commits on the same branch, where commits mostly succeed, the size of `sealed_tokens` stays relatively small. As an optimization, compaction strategies could be added to merge tokens together, but this *might not be necessary*, at least not for the use cases we're familiar with.
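The finalizing `SetIf()` described above (swap the commit ID, clear `sealed_tokens`, fail if the branch changed) can be sketched like this, again over a dict standing in for the KV store and with CAS simulated by comparing against the snapshot we read:

```python
def finish_commit(kv, branch_key, expected_branch, new_commit_id):
    """Sketch of the final SetIf step of a commit: atomically swap the
    branch's commit ID and clear sealed_tokens, failing if the branch
    record changed since we read it (CAS over the whole record)."""
    current = kv[branch_key]
    if current != expected_branch:
        return False  # another commit raced us; retry or report to caller
    kv[branch_key] = {
        "commit_id": new_commit_id,
        "staging_token": current["staging_token"],
        "sealed_tokens": [],  # materialized into the new commit
    }
    return True
```

A real KV store would perform the compare-and-set server-side (e.g. a conditional write) rather than in application code.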
Contributor

🔓 Lock freedom 🔓

Let's talk about how to make this lock-free (in the sense of non-blocking algorithms).

Why is this important? Consider a loaded system where commits are not even lock-free (i.e., it is possible to make no progress for an arbitrarily long period of time). For instance, say we have a commit every second. For some reason we have a delay (maybe the KV store went down for a while) and now we have 60 commits queued. They all race to finish the commit (by writing the old staging token with a shorter new sealed tokens queue) against incoming new commits (which try to write a new staging token and push a longer sealed tokens queue). Now there is no progress guarantee; we can have more and more new commits admitted. Request timeouts don't really help here, because new commit requests arrive at the same rate at which timeouts occur. So this is at best a steady state -- commits finish by a random walk, i.e. we recover in time quadratic in the number of pending commits.

We can only talk about lock-free (and wait-free) properties for constant-time operations, so let's separate the "write metarange" operation (which is trivially concurrent but takes time linear in the number of changes) from the "update commit" operation (which can run in constant time). Also, we cannot assume that the KV store is lock-free, so for our purposes we either assume that it is or count time in the number of operations on the KV. (This is the same as any discussion of nonblocking algorithms, which invariably assume the memory is lock-free even though it probably is not...)

Becoming lock-free

With the process above, commits are still not lock-free. But we can make them so! The key is to have a "bag" of pending commits -- some data structure with `Put`, `UpdateIfUnchanged`, and `GetAny` operations. For instance, `Put` just writes a value to key `zzz_<bag-id>_<random>`, and `GetAny` just reads the first key with prefix `zzz_<bag-id>_`. (A roughly-ordered queue is an even better choice, and we can do that too... but that is not necessary for correctness here, so this comment is too narrow for the heuristics to put in such a queue...)

Now when trying (or retrying) to commit, just add the desired commit record to the bag for that branch, and periodically probe to see if the commit succeeded. Meanwhile, every start-commit and every finish-commit probe first repeatedly runs `GetAny` and tries to commit, until the bag is empty.

Now, if there is a pool of pending commits, then after all start-commits finish every committer (including new committers) is trying to end a commit! The only way a committer is forced to retry is if another committer succeeded, so we continue to make progress until the bag is (momentarily) empty.
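The bag described above can be sketched over any sorted KV store; this Python version uses a dict and a full key scan where a real store would use a range scan. The `zzz_<bag-id>_<random>` key layout follows the comment; everything else is illustrative:

```python
import uuid

class PendingCommitBag:
    """Sketch of a 'bag' of pending commits over a sorted KV store.
    put() writes under a random key below the bag's prefix; get_any()
    returns the first key/value under that prefix (layout illustrative)."""
    def __init__(self, kv, bag_id):
        self.kv = kv
        self.prefix = f"zzz_{bag_id}_"

    def put(self, commit_record):
        key = self.prefix + uuid.uuid4().hex
        self.kv[key] = commit_record
        return key

    def get_any(self):
        # A real KV store would do a prefix range scan instead of
        # sorting all keys.
        for key in sorted(self.kv):
            if key.startswith(self.prefix):
                return key, self.kv[key]
        return None  # bag is (momentarily) empty

    def remove(self, key):
        self.kv.pop(key, None)
```

A committer would `put()` its record, then loop: `get_any()`, try to finish that commit, `remove()` it on success, until the bag drains.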

Becoming wait-free

Wait-freedom is (much) harder. I think we could "just" use a queue as a bag above... but that is quite a powerful operation to expect our KV store to have, so it involves either implementing a lock-free queue (algorithms exist) or using a KV that has one (Redis).

@ozkatz ozkatz added the exclude-changelog PR description should not be included in next release changelog label Mar 17, 2022
@ozkatz ozkatz marked this pull request as ready for review March 17, 2022 15:52
@ozkatz ozkatz requested review from itaiad200 and nopcoder March 17, 2022 15:52
@ozkatz ozkatz merged commit 26fffdf into master Mar 17, 2022
@ozkatz ozkatz deleted the proposal/lakef-on-kv branch March 17, 2022 17:01
arielshaqed added a commit that referenced this pull request Mar 22, 2022
This handles these comments on the original design review proposal:

* #2466#discussion_r713713932

* #2466#discussion_r725608233 (this one actually goes into _more_ detail
  than the current proposal -- reviewers please let me know if you prefer
  the phrasing there!)
arielshaqed added a commit that referenced this pull request Mar 31, 2022
This handles these comments on the original design review proposal:

* #2466#discussion_r713713932

* #2466#discussion_r725608233 (this one actually goes into _more_ detail
  than the current proposal -- reviewers please let me know if you prefer
  the phrasing there!)
Labels: area/DB (improvements or additions to the DB), exclude-changelog (PR description should not be included in next release changelog), proposal
