KV3 - graveler - Discovery and Planning #3566

Conversation
design/open/kv3-execution-plan.md (Outdated)
> ## Iterators
> The following iterators are implemented and used by `pkg/graveler`:
> * `branch_iterator` - iterates branches by repository. Has an option to order by ID (trivial) or by CommitID (Used by GC. How do we support that?)
I don't understand why anyone would like to sort anything based on a randomly generated string (commit ID) - maybe it's not really needed?
I don't understand this either, and I am not sure where GC looks up a branch by commit ID. Indeed, given that a commit belongs to multiple branches, I am not sure what this lookup would do.
Assuming "ID" here means "name" -- otherwise we may have difficulty paginating a lengthy branches display in the GUI.
@arielshaqed - "ID" does mean "name"
@johnnyaug - can you please explain the purpose of sorting branches for iteration by the associated commitID?
Used here
Thanks 🙏
@Itai-David, sure: this iterator is then used as input for a GCStartingPointIterator. This iterator assumes the branch heads are sorted by commit ID. This way, it can go over branch heads side by side with other commit IDs that are also sorted, and compare them.
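The side-by-side walk described here can be sketched as follows (Python for brevity; the real `GCStartingPointIterator` is Go code in lakeFS, and the function name below is purely illustrative). The point is that a globally sorted merge is only possible because each input stream is itself sorted by commit ID:

```python
import heapq

def gc_starting_points(branch_heads, other_commits):
    """Walk two commit-ID streams side by side, each already sorted
    lexicographically, yielding a single sorted stream. A hypothetical
    stand-in for the assumption GCStartingPointIterator makes about
    its inputs."""
    # heapq.merge only yields a globally sorted result because each
    # input is sorted -- hence the sort requirement on the branch iterator.
    yield from heapq.merge(branch_heads, other_commits)
```

If the branch-head stream were unsorted, the merge would have to buffer everything, which is exactly the in-memory fallback discussed later in this thread.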
Thank you @johnnyaug.
@itaiad200 @arielshaqed - I looked at the GC code and it looks like we either need to support that with our Graveler over KV impl, or we will need to change the way GC traverses commits.
I think the first can be supported by maintaining a `repo/<REPO_NAME>/commit/<COMMIT_ID>/branch/<BRANCH_NAME>` key. These keys will have to be constantly removed and added as commits take place, but we do have all the info at the time of the commit, so I don't think it is too difficult.
WDYT?
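A minimal sketch of how such an index could be kept current on each commit (Python over a plain dict standing in for the KV store; the key layout is the one suggested in this comment, and the function names are hypothetical):

```python
def commit_branch_index_key(repo, commit_id, branch):
    # Key layout from the suggestion above (illustrative only).
    return f"repo/{repo}/commit/{commit_id}/branch/{branch}"

def on_commit(kv, repo, branch, old_head, new_head):
    """Move the branch's index entry from the old head commit to the
    new one. All the needed info is available at commit time."""
    if old_head is not None:
        kv.pop(commit_branch_index_key(repo, old_head, branch), None)
    kv[commit_branch_index_key(repo, new_head, branch)] = b""
```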
Commits don't have branches.
If we try to attach branches to them, it will all end in tears.
I am not sure we need to fetch commits in sorted order to merge them. Let's define the algorithm and then see about speeding it up. I'm thinking maybe a Bloom filter on ancestors of each commit?
@arielshaqed I'm not sure I fully understand how a Bloom Filter can help here, but I guess this requires a bit of redesign of the GC commits-gathering algorithm (?)
If that is so, I think it should be out of scope for this KV milestone - I think we should "force" our graveler over KV impl to provide all the functionality currently supported by graveler over SQL, in terms of correctness, even at the painful price of performance regression, and handle any significant algorithmic change in another scope. WDYT?
For that purpose, I think defining a commitID to branchName(s) mapping (as a key?) can provide what we need, at least as a first step (regardless of the (very) wrong conception that commits "have" or "belong to" branch(es))
@arielshaqed and I discussed it a bit more and I think we now understand why it's needed for GCStartingPointIterator. We can have the same commitID order for a branch iterator on top of the kv implementation by simply fetching all branches and sorting them in memory. It will not require introducing any new indexing in KV. A counter argument would be memory consumption, but it's used by an algorithm that already stores all commits in memory, so there should not be any degradation.
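The fetch-all-then-sort approach can be sketched as follows (Python; `Branch` is an illustrative stand-in for the graveler branch record, not the actual type):

```python
from dataclasses import dataclass

@dataclass
class Branch:
    name: str
    commit_id: str

def branches_by_commit_id(branches):
    """Sort all branches of a repo in memory by their head commit ID,
    with the branch name as a tie-breaker -- no extra KV index needed."""
    return sorted(branches, key=lambda b: (b.commit_id, b.name))
```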
Added to the doc - support of iteration by commitID by sorting branches in memory
> The following iterators are implemented and used by `pkg/graveler`:
> * `branch_iterator` - iterates branches by repository. Has an option to order by ID (trivial) or by CommitID (Used by GC. How do we support that?)
> * `commit_iterator` - iterates a commit ancestry by running BFS. Not an `iterator` in the `KV` sense. Should be supported as is with `KV` `Get`
> * `commit_ordered_iterator` - iterates commits of a repo, ordered by commit ID (regular `KV` `Scan` supports that). However - has an option to iterate over not-first-parent commits only (e.g. commits that are not a first parent of any other commit). Probably could (and should) be implemented by a specific `Next()` for that case
Needs more details. The list of not-first-parent commits can always change since you may introduce a new child to any commit at any time. I wonder how we use this list. And why would the ordering of CommitID matter?
I find it hard to come up with an iterator on top of the kv implementation that would cover these requirements.
These commits are used side by side with the branch-by-commitID list, as explained by @johnnyaug, hence the order by commitID for that list too.
Regarding the not-first-parent attribute, its usage and the fact it is subject to change - is it any different from the way it is used today, over SQL? I see that the decision of whether a commit is not-first-parent is made at the time of query execution, but this can also change by the time the GC algorithm gets there. Am I wrong?
I guess I'm missing some context on that algorithm - @johnnyaug, can you please help here?
You can think of the "not first parent" thing as a generalization of tree leaves. If you take the commit tree and drop all connections (edges) that are not first-parent, these commits are the leaves of this tree.
This is used to find all potential starting points for scanning the tree towards garbage collection.
In the end, this doesn't need to be a consistent list: when you find a commit that is not a first parent, you can safely return it, regardless of whether the resulting list as a whole makes sense.
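The "leaves after dropping non-first-parent edges" definition can be sketched as follows (Python; the parents-map representation is an assumption for illustration, not the graveler data model):

```python
def not_first_parent_commits(parents_of):
    """parents_of maps commit ID -> list of parent IDs, where index 0 is
    the first parent. Returns the commits that are not the first parent
    of any other commit -- the leaves of the first-parent-only tree."""
    first_parents = {ps[0] for ps in parents_of.values() if ps}
    return sorted(c for c in parents_of if c not in first_parents)
```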
> I find it hard to come up with an iterator on top of the kv implementation that would cover these requirements.

@itaiad200 - I absolutely agree with that last sentence. Maybe we can maintain a live set of not-first-parent commits and modify it as commits occur, i.e. each new commit is by default not-first-parent, and for each further operation that makes it a first parent, we try to remove that commit from the list (possibly already removed, hence the "try"). (Question: Can a commit become not-first-parent if it is already a first-parent?)
I can think of a couple of ways to implement this list, each option with its own drawbacks:
- We can use a `not_first_parent/<COMMIT_ID>` key. This will require searching the list for each commit we modify, but will make it easier to list all not-first-parent commits, or
- We can "flag" the commit key itself, i.e. `commit/<COMMIT_ID>` is accompanied by `commit/<COMMIT_ID>/not_first_parent` (not sure this works well with our scan logic, right?), or, for the same purpose, add an `isFirstParent` attribute to the commit entry. However, I don't think this option supports the use case.

What other options do we have?
> In the end, this doesn't need to be a consistent list: when you find a commit that is not a first parent, you can safely return it, regardless of whether the resulting list as a whole makes sense.

Thank you for this input @johnnyaug.
What happens if I find a not-first-parent commit, return it as a starting point, and before the GC continues, a commit is created over that commit, making it a first parent? Is that possible?
I think that we might be overworking this. The fact that something is efficiently possible on PostgreSQL does not mean that we have to do it efficiently in the KV world and in the same way.
Also please note that the proposal to add parameters to Next() simply makes it not an iterator. I would argue that the iterator abstraction adds nothing here.
We will not be able to manage this list of "not first-parent commits" correctly and efficiently. The question is really, do we need to? We will probably be scanning all commits, or all commits up to a previously-processed set of "safe" commits, on each GC. We could process not-first commits "just" by scanning all of these. (I am starting to think that the commits logic for GC should also move off of lakeFS and into the Spark cluster. @johnnyaug WDYT?)
I'm totally open to it - the not-first-parent thing is a means to an end. We just need to start scanning the tree somewhere.
I'm aiming at preserving the current behavior with KV support, as a first step, and improve that later, based on our needs. I think introducing the KV aspect is enough of a load already, and would like to stay away from redesigning our GC as a side effect.
Even though it will not be as efficient, I think it can work, at least as a first step. Moreover, since it is the GC, I think we can loosen up the efficiency requirements, again - at least as a first step.
@arielshaqed @johnnyaug - WDYT?
@arielshaqed - either way we are not changing the interface of `Next()`
@Itai-David I understood that we will be reading everything to memory, applying the same logic that was used in Postgres, thus implementing the same iterators as today. If so, what are the remaining open questions?
design/open/kv3-execution-plan.md (Outdated)
> * `staging/iterator` - iterates over a specific `staging_token` (as prefix) by order of key. Should be good with `KV` `Scan`
>
> ## Deleting a Repository
> Unlike other entities' delete, which are standalone, deleting a repository results in deleting all branches, commits and tags correlated to that repository. While being protected by a transaction for `SQL` we need to handle failures for `KV`. One option that comes in mind is to flag the repository as `"being_deleted"` (maybe a `KV` entry for `being_deleted` repos, that can be scanned), clean everything, and if successful remove the flag. Another option is to remove the repository entry first, and deploy a background GC that scans for branches/commits/tags that are related to a non-exist repository and delete them. Other ideas?
Interesting. The problem with flagging the repository is that now you would have to check it with every call. For example, trying to get a Tag would first need to check that flag which is stored somewhere else.
I think the best we can do here is to delete it in an order that would leave a consistent state in every possible failure. One way could be: Tags, Staging Tokens (as in workspaces), Branches (all except for the default), update the default branch commit to "NotACommit", Commits and finally the default branch and the repo.
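The proposed order can be captured as data (Python sketch; the step names are illustrative, not lakeFS APIs). The property being aimed for is that stopping after any prefix of the list leaves a repository that is still self-consistent, because the repo entry and default branch go last:

```python
def repo_delete_plan(repo):
    """Ordered deletion steps; aborting after any prefix should leave
    a consistent (if partially emptied) repository behind."""
    return [
        ("delete_tags", repo),
        ("delete_staging_tokens", repo),
        ("delete_branches_except_default", repo),
        ("point_default_branch_at_sentinel", repo),  # e.g. "NotACommit"
        ("delete_commits", repo),
        ("delete_default_branch", repo),
        ("delete_repo_entry", repo),
    ]
```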
If a repository (probably with a prefix) is a partition key, then you can delete a repository by deleting everything inside the partition. And you can keep a list of "repositories being deleted" (or an object inside the partition of each repository) and keep retrying to delete those repositories.
Right @arielshaqed, but now every time you try to get an object from that repo, you would have to query that list.
I like the usage of repository as partition. It makes sense in the way that each scan operation (in that case, for deleting everything) is partition-bounded either way, plus, I think the locality of a partition (supported by DynamoDB, as an example) works in our favor here (be it performance-wise or GDPR-wise).
We will need to maintain the repository entry itself outside of the partition, for repositories listing, correct?
I think we can "mark" a repository as being-deleted by simply removing its entry from the repositories list, and once successful, start removing everything in the relevant partition.
Any objection to going with that?
All Itais are right. There are two issues here: immediately stop using a repository, and eventually delete it.
To delete a repository, mark it as deleted on its value. So now when you fetch it you can see it's deleted (but not yet removed). If this becomes too expensive, we'll need to put a single cache in front of the kv.
To ensure it gets cleaned up, add it to the repositories-to-delete list. And periodically treeverse that list to clean up.
Please note that with current path definitions - and as @eden-ohana mentioned, there's a risk of accidentally seeping into different entities when scanning on repositories
@N-o-Z - that was fixed with the recent update to the keys
design/open/kv3-execution-plan.md (Outdated)
> ## Suggested Key Schema
> Note: This is an initial suggestion that seems to support the required iterators. It tries to encapsulate the immutable parts of identifying an object (e.g. a branch can be identified by its name, but not by a commit, as it (a) changes and (b) a commit can be shared by branches. a repo and a tag has one:many relation, so a tag key contains its repo etc.) It is totally open to discussion and improvements.
> * Repository - `repo/<REPO_NAME>`
> * Branch - `repo/<REPO_NAME>/branch<BRANCH_NAME>`
Missing a `/`?
I would change the naming convention to plural - 'branches', 'commits' etc. Consistent with other packages
Shouldn't `repo/<REPO_NAME>` have a different name, so that listing repositories doesn't also list the branches (`repo/<REPO_NAME>/branch<BRANCH_NAME>`)?
@itaiad200 - Yup. Fixed
@N-o-Z - done
@eden-ohana - Good catch, thank you. Fixed
design/open/kv3-execution-plan.md (Outdated)
> * Branch - `repo/<REPO_NAME>/branch<BRANCH_NAME>`
> * Commit - `repo/<REPO_NAME>/commit/<COMMIT_ID>`
> * Tag - `repo/<REPO_NAME>/tag/<TAG_ID>`
> * Staged Obj - `repo/<REPO_NAME>/staging_token/<STAGING_TOKEN>/key/<OBJ_KEY>`
I pray for 0 double slash bugs 🙏
I think staging token is per branch, should add branch prefix
@eden-ohana - agree. Added under the new schema
@itaiad200 - double slash is the obvious one. I added an option for triple slashes... My new suggestion removes the redundant keywords such as `key` and `branch` from anywhere except the key prefix
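One way to keep double (or triple) slashes out by construction is to validate segments at the single place keys are built (Python sketch; the helper name is hypothetical):

```python
def kv_key(*segments):
    """Join key segments with '/', rejecting empty or slash-containing
    segments so malformed input can't produce double slashes or seep
    into a sibling prefix."""
    for s in segments:
        if not s or "/" in s:
            raise ValueError(f"bad key segment: {s!r}")
    return "/".join(segments)
```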
design/open/kv3-execution-plan.md (Outdated)
> Moreover, as these are quite exhaustive operations, may be used as part of performance verification
>
> ## Suggested Key Schema
> Note: This is an initial suggestion that seems to support the required iterators. It tries to encapsulate the immutable parts of identifying an object (e.g. a branch can be identified by its name, but not by a commit, as it (a) changes and (b) a commit can be shared by branches. a repo and a tag has one:many relation, so a tag key contains its repo etc.) It is totally open to discussion and improvements.
I don't think it covers the special iterators above...
Correct - we are still discussing it. I believe we should support these with additional keys, and strive to remove them as an improvement iteration out of the scope of this milestone (and probably out of the scope of KV altogether, as suggested by @N-o-Z)
design/open/kv3-execution-plan.md (Outdated)
> # Execution Plan
> * Agree on keys schema (see here after) - [#3567](https://github.com/treeverse/lakeFS/issues/3567)
> * Supporting `KV` along side `SQL` (same as was done for previous modules)
> * Step 1: Naive implementation with locking - `lakeFS` on KV can start with empty DB and run all graveler operations correctly, although performance may be degraded and exhaustive operations (concurrent commits and uploads, anything else?) should not be considered - [#3568](https://github.com/treeverse/lakeFS/issues/3568)
This step alone is quite big. It means reimplementing Graveler in a single step. I think we should support a hybrid mode, i.e. if it makes sense, move just the tags, then just the staging areas, then commits, finally branches.
Are you suggesting breaking it into smaller steps within the same deliverable, which has mainly administrative aspects, or creating a deliverable for each sub-step (e.g. a working, mergeable graveler with hybrid support)?
I'm not sure what the latter means in terms of effort and overhead to get such a hybrid working.
If we look at it, it is a big step, work wise, but I believe the auth implementation was bigger in terms of both size and complexity. Moreover, since test coverage is excellent, it should converge much quicker IMHO.
WDYT?
If Naive implementation with locking means replacing all graveler to work on top of the kv interface and adding a locking mechanism to support that, then I think it's too big. I also think auth was way smaller than that, and was also too big for a single step/PR which is what we did.
Added to doc - implementing an initial skeleton with no real KV access (delegating all DB-related calls to the DB graveler, as we discussed) and deriving follow-up implementation tasks
design/open/kv3-execution-plan.md (Outdated)
> * `graveler_staging_kv` - used by `pkg/graveler/staging`
>
> `graveler_repositories`, `graveler_commits` and `graveler_tags` are immutable and does not update entries, and so should be straight forward to implement over our `KV` (`graveler_commits` actually performs update for commit generation, but this is only valid for `refs-restore` which can only take place on a `bare` repo and so it is of no concern)</br>
> `graveler_branches` is currently being protected by a lock, during commit actions, and in order to support lockless commits with `KV`, the [following design](https://github.com/treeverse/lakeFS/blob/b3204edad00f88f8eb98524ad940fde96e02ab0a/design/open/metadata_kv/index.md#graveler-metadata---branches-and-staged-writes) is proposed. However, as a first step we can implement it using a naive table lock, just to get it going, or better, a `branch-key` based locking, to prevent staging during commits</br>
I agree with this approach.
Generally I think the lock-less feature is a post-KV feature
I agree!
Perfectly doable to start without lockfree commits. It will roughly mirror the intended current semantics -- which indeed starve commits. Also the actual current semantics are broken and can lose writes, so just locking (correctly) is an improvement!
However we do need lockfree commits in order to support many streaming ingestion use-cases.
Looks like we're aligned on this one
Created #3586 to fully support lock-free commits, out of KV3 scope
design/open/kv3-execution-plan.md (Outdated)
> Unlike other entities' delete, which are standalone, deleting a repository results in deleting all branches, commits and tags correlated to that repository. While being protected by a transaction for `SQL` we need to handle failures for `KV`. One option that comes in mind is to flag the repository as `"being_deleted"` (maybe a `KV` entry for `being_deleted` repos, that can be scanned), clean everything, and if successful remove the flag. Another option is to remove the repository entry first, and deploy a background GC that scans for branches/commits/tags that are related to a non-exist repository and delete them. Other ideas?
>
> ## Testing
> `pkg/graveler/ref` and `pkg/graveler/staging` currently has a **very high** test coverage. This should make it easier to trust the transition to `KV`. However, one notable aspect that is the absence of `benchmarks`. As `graveler` will introduce new algorithms to support lockless commits, there is a risk of performance degradation which need to be identified
Since you already stated that lock-less commits are out of scope for KV, I think this part is irrelevant for this doc
Out of KV3 scope, or out of KV scope all together?
I think we need a wider range of opinions here
@eden-ohana @itaiad200 - WDYT?
Let's be more specific - what are the things we think can degrade? There might be paths we would like to benchmark now. Graveler is pretty wide, so if you have any specific paths you think we need to benchmark, let's discuss.
Added to the doc and to #3571 the comments we discussed earlier, regarding which scenarios are interesting and should be benchmarked
design/open/kv3-execution-plan.md (Outdated)
> * Agree on keys schema (see here after) - [#3567](https://github.com/treeverse/lakeFS/issues/3567)
> * Supporting `KV` along side `SQL` (same as was done for previous modules)
> * Step 1: Naive implementation with locking - `lakeFS` on KV can start with empty DB and run all graveler operations correctly, although performance may be degraded and exhaustive operations (concurrent commits and uploads, anything else?) should not be considered - [#3568](https://github.com/treeverse/lakeFS/issues/3568)
> * Step 2: Lockless Commits - as described above - [#3569](https://github.com/treeverse/lakeFS/issues/3569)
I seem to have misunderstood the previous section - I really think lock-free commits are not part of the KV scope. I believe it's best to first focus on the transition to KV with the current implementation and only work on the lock-free feature after that
I can definitely live with lock-free out of KV scope
> * Staged Obj - `repo/<REPO_NAME>/staging_token/<STAGING_TOKEN>/key/<OBJ_KEY>`
>
> # Execution Plan
> * Agree on keys schema (see here after) - [#3567](https://github.com/treeverse/lakeFS/issues/3567)
What I'm missing is the data modelling for the KV store. It can either be part of #3567 or a new issue. But it is required as part of the common code base that we all work over. In auth we didn't do it well enough, and as a result there were many changes during implementation and technical debt left.
Data modeling added to doc. As graveler pretty much uses the SQL DB as a KV store, there is not much of a difference, except for a few things we will need to add
N-o-Z
left a comment
Thanks @Itai-David,
I have 2 main concerns:
- Missing data models
- Adding the lock free commits as part of the KV milestone (maybe this can be discussed further)
Other than that - it seems like a very thorough plan!
arielshaqed
left a comment
Thanks!
My chief concern is that the GC calls will continue to work with reasonable performance. I would like to understand the proposed iterators better to be sure of this...
design/open/kv3-execution-plan.md
Outdated
`graveler_repositories`, `graveler_commits` and `graveler_tags` are immutable and does not update entries, and so should be straight forward to implement over our `KV` (`graveler_commits` actually performs update for commit generation, but this is only valid for `refs-restore` which can only take place on a `bare` repo and so it is of no concern)</br>
`graveler_branches` is currently being protected by a lock, during commit actions, and in order to support lockless commits with `KV`, the [following design](https://github.com/treeverse/lakeFS/blob/b3204edad00f88f8eb98524ad940fde96e02ab0a/design/open/metadata_kv/index.md#graveler-metadata---branches-and-staged-writes) is proposed. However, as a first step we can implement it using a naive table lock, just to get it going, or better, a `branch-key` based locking, to prevent staging during commits</br>
`graveler_staging_kv` creates entries based on the current branch `staging_token` and is protected from creating entries during a commit under the same lock. This should also be supported lockless-ly, as referred to by the above proposal, and can also be solved with naive locking, as a first step</br>
- `graveler_staging_kv` creates entries based on the current branch `staging_token` and is protected from creating entries during a commit under the same lock. This should also be supported lockless-ly, as referred to by the above proposal, and can also be solved with naive locking, as a first step</br>
+ `graveler_staging_kv` creates entries based on the current branch `staging_token` and is protected from creating entries during a commit under the same lock. This should also be supported lockfree-ly, as referred to by the above proposal, and can also be solved with naive locking, as a first step</br>
The term is "lock-free", not "lock-less".
Rephrased
design/open/kv3-execution-plan.md
Outdated
## Iterators
The following iterators are implemented and used by `pkg/graveler`:
* `branch_iterator` - iterates branches by repository. Has an option to order by ID (trivial) or by CommitID (Used by GC. How do we support that?)
I don't understand this either, and I am not sure where GC looks up a branch by commit ID. Indeed, given that a commit belongs to multiple branches, I am not sure what this lookup would do.
Assuming "ID" here means "name" -- otherwise we may have difficulty paginating a lengthy branches display in the GUI.
design/open/kv3-execution-plan.md
Outdated
* `staging/iterator` - iterates over a specific `staging_token` (as prefix) by order of key. Should be good with `KV` `Scan`

## Deleting a Repository
Unlike other entities' delete, which are standalone, deleting a repository results in deleting all branches, commits and tags correlated to that repository. While being protected by a transaction for `SQL` we need to handle failures for `KV`. One option that comes in mind is to flag the repository as `"being_deleted"` (maybe a `KV` entry for `being_deleted` repos, that can be scanned), clean everything, and if successful remove the flag. Another option is to remove the repository entry first, and deploy a background GC that scans for branches/commits/tags that are related to a non-exist repository and delete them. Other ideas?
If a repository (probably with a prefix) is a partition key, then you can delete a repository by deleting everything inside the partition. And you can keep a list of "repositories being deleted" (or an object inside the partition of each repository) and keep retrying to delete those repositories.
(Also cool that this is really happening!)
design/open/kv3-execution-plan.md
Outdated
Note: This is an initial suggestion that seems to support the required iterators. It tries to encapsulate the immutable parts of identifying an object (e.g. a branch can be identified by its name, but not by a commit, as it (a) changes and (b) a commit can be shared by branches. a repo and a tag has one:many relation, so a tag key contains its repo etc.) It is totally open to discussion and improvements.
* Repository - `repo/<REPO_NAME>`
* Branch - `repo/<REPO_NAME>/branch<BRANCH_NAME>`
* Commit - `repo/<REPO_NAME>/commit/<COMMIT_ID>`
REPO_NAME == repository ID
BRANCH_NAME == branch ID
Correct. These are the same. I tried to maintain the SQL names, just for simplicity of this doc
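For illustration only, the suggested schema above could be built with trivial key helpers. The `/branch/` separator is an assumption (the quoted draft writes `branch<BRANCH_NAME>` without a slash, which looks like a typo); all function names here are hypothetical:

```go
// Hypothetical key builders for the suggested schema; separators and
// segment names are assumptions from this discussion, not lakeFS code.
package main

import "fmt"

func repoKey(repoID string) string { return "repo/" + repoID }

func branchKey(repoID, branchID string) string {
	return repoKey(repoID) + "/branch/" + branchID
}

func commitKey(repoID, commitID string) string {
	return repoKey(repoID) + "/commit/" + commitID
}

func tagKey(repoID, tagID string) string {
	return repoKey(repoID) + "/tag/" + tagID
}

func main() {
	fmt.Println(branchKey("myRepo", "main")) // repo/myRepo/branch/main
}
```

Because every key shares the `repo/<REPO_NAME>` prefix, a plain prefix `Scan` on `repo/<REPO_NAME>` would also return branches, commits and tags, which is exactly the listing concern raised below.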
design/open/kv3-execution-plan.md
Outdated
## Suggested Key Schema
Note: This is an initial suggestion that seems to support the required iterators. It tries to encapsulate the immutable parts of identifying an object (e.g. a branch can be identified by its name, but not by a commit, as it (a) changes and (b) a commit can be shared by branches. a repo and a tag has one:many relation, so a tag key contains its repo etc.) It is totally open to discussion and improvements.
* Repository - `repo/<REPO_NAME>`
* Branch - `repo/<REPO_NAME>/branch<BRANCH_NAME>`
Shouldn't `repo/<REPO_NAME>` have a different prefix, so that listing repositories doesn't also list the branches (`repo/<REPO_NAME>/branch<BRANCH_NAME>`)?
design/open/kv3-execution-plan.md
Outdated
* Branch - `repo/<REPO_NAME>/branch<BRANCH_NAME>`
* Commit - `repo/<REPO_NAME>/commit/<COMMIT_ID>`
* Tag - `repo/<REPO_NAME>/tag/<TAG_ID>`
* Staged Obj - `repo/<REPO_NAME>/staging_token/<STAGING_TOKEN>/key/<OBJ_KEY>`
I think the staging token is per branch; the key should include a branch prefix.
design/open/kv3-execution-plan.md
Outdated
* Repository - `repo/<REPO_NAME>`
* Branch - `repo/<REPO_NAME>/branch<BRANCH_NAME>`
* Commit - `repo/<REPO_NAME>/commit/<COMMIT_ID>`
* Tag - `repo/<REPO_NAME>/tag/<TAG_ID>`
The tag path should save its commit ID.
As part of the key?
It will be in the value
design/open/kv3-execution-plan.md
Outdated
* `graveler_staging_kv` - used by `pkg/graveler/staging`

`graveler_repositories`, `graveler_commits` and `graveler_tags` are immutable and does not update entries, and so should be straight forward to implement over our `KV` (`graveler_commits` actually performs update for commit generation, but this is only valid for `refs-restore` which can only take place on a `bare` repo and so it is of no concern)</br>
`graveler_branches` is currently being protected by a lock, during commit actions, and in order to support lockless commits with `KV`, the [following design](https://github.com/treeverse/lakeFS/blob/b3204edad00f88f8eb98524ad940fde96e02ab0a/design/open/metadata_kv/index.md#graveler-metadata---branches-and-staged-writes) is proposed. However, as a first step we can implement it using a naive table lock, just to get it going, or better, a `branch-key` based locking, to prevent staging during commits</br>
After reading it once again with the help of @arielshaqed and @N-o-Z comments, I think I have a better idea of what you were aiming for here. A naive table lock means that we would be using the same locking technique we have for the current postgres DB, right? Well - that cannot be achieved with KV. It's not part of the KV interface and adding it to the current 3 implementations we have is harder than just straightforward use the mechanisms described in the original proposal.
I think by now we are arguing about different words to describe the exact same thing. I would strongly encourage that we simply stop using the word "lock", CAS-type operations ("SetIf") like those used here are at a lower level and easier to use directly. The term "lock" is overloaded and has many different meaning. (Note that "lock-free" does not talk about locks, intentionally, so we can continue to use that term.)
Let's be concrete. I call the following "the naïve way" to perform a protected action on a branch:
1. Get the original branch state `o`.
2. (Maybe perform some actions, they might all be lost in the last step.)
3. Compute a new branch state `n`.
4. `SetIf` the branch state to `n` IF it is still `o`.
5. (If setting failed and some other actions were performed, maybe clean them up.)
This type of process is sometimes loosely called "optimistic locking".
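The get/compute/`SetIf` cycle described above can be sketched as follows. This is a toy in-memory stand-in, not the real lakeFS KV interface; `kvStore`, `setIf` and `updateBranch` are illustrative names:

```go
// Sketch of the "naïve way" above: read state o, compute state n, then
// CAS-write n only if the stored value is still o; retry on loss.
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errPredicateFailed = errors.New("predicate failed")

type kvStore struct {
	mu   sync.Mutex
	data map[string]string
}

// setIf writes value only if the current value still equals expected (CAS).
func (s *kvStore) setIf(key, value, expected string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[key] != expected {
		return errPredicateFailed
	}
	s.data[key] = value
	return nil
}

func (s *kvStore) get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

// updateBranch retries the get/compute/setIf cycle until it wins the race.
func updateBranch(s *kvStore, key string, compute func(old string) string) string {
	for {
		old := s.get(key)    // o: original branch state
		next := compute(old) // n: new branch state
		if err := s.setIf(key, next, old); err == nil {
			return next
		}
		// Lost the race: another writer moved the branch; retry.
	}
}

func main() {
	s := &kvStore{data: map[string]string{"repo/main": "c0"}}
	fmt.Println(updateBranch(s, "repo/main", func(old string) string { return old + "-c1" }))
}
```

Note that steps 2 and 5 (side actions and their cleanup) are exactly what the loop body would wrap around `compute`.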
Actually, what I had in mind for naive and temporary locking is locking of some common resource, on a logical level, not related to KV.
Both staging and commit operations (on the same branch) mutually block each other (this is the case, right? any other operations need to be blocked), by locking a common resource (mutex?), just for the correctness of things.
We can reduce the contention by having NumLocks (say 2^10) such locks, where each operation locks branchID%NumLocks, letting several operations (up to NumLocks; not sure how to calculate the average) run in parallel.
Any reason why that should not work as a naive, temporary yet correct, solution?
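A rough sketch of this NumLocks idea, assuming a fixed pool of in-process mutexes with branches hashed onto shards (lock count, hash choice and all names here are illustrative):

```go
// Sharded branch locks: branchID hashes to one of numLocks mutexes, so
// commit and staging operations on the same branch serialize, while
// branches on different shards proceed in parallel.
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numLocks = 1 << 10 // 2^10 locks, as suggested above

var branchLocks [numLocks]sync.Mutex

// lockFor maps a branch ID to its shard deterministically.
func lockFor(branchID string) *sync.Mutex {
	h := fnv.New32a()
	h.Write([]byte(branchID))
	return &branchLocks[h.Sum32()%numLocks]
}

// withBranchLock runs fn while holding the branch's shard lock.
func withBranchLock(branchID string, fn func()) {
	l := lockFor(branchID)
	l.Lock()
	defer l.Unlock()
	fn()
}

func main() {
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// All writers target "main", so increments are serialized.
			withBranchLock("main", func() { counter++ })
		}()
	}
	wg.Wait()
	fmt.Println(counter)
}
```

The obvious limitation, which the thread also touches on, is that in-process mutexes only protect a single lakeFS instance; cross-instance correctness still needs the KV-level `SetIf`.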
…phrasing lockless to lock-free
After a discussion with @itaiad200 and @N-o-Z, we came to the following understandings. Unless anything is missing, wrong or unacceptable, I will update the doc to reflect these, so we can move forward:
itaiad200
left a comment
Almost there
design/open/kv3-execution-plan.md
Outdated
### Repository
A repository is identified by a unique name (type `RepositoryID`, a `string`) and contains its storage name space (type `StorageNamespace`, a string), default branch ID (type `BranchID`, a string) and its creation date (type `Time`).
Repositories are iterable, and so all repositories should exist in a common partition (`graveler`)
As this design suggests, among other things, that each repository should be represented in a dedicated partition and that all other repository related entities will exist under this partition, and since we would like to use the existence/inexistence of a repository entry, to identify a valid/deleted repository, respectively, the repository entry should also include a random `guid`, which, when concatenated with the repo name, will be the partition key for that repository entities. That way it will be easy to distinguish between a partition related to "aRepo" that was already deleted and a partition related to a new "aRepo" that currently exist, as their partition names differ by the `guid` (`aRepo-guid1` vs `aRepo-guid2`).
I get a request for aRepo, how do I know to look for aRepo-guid2 in the KV?
You perform `GetRepository` by `aRepo`. That will be translated to `kv.Get(partition = "graveler", key = "repos/aRepo")`.
Upon success, you will get a `Repository` object that has, among other things, a `my_guid` (or a similar name, TBD) field set to `"guid2"`.
This value will be autogenerated upon repository creation and will initially be used to create the repository with the default branch and initial commit.
Does this make sense?
Nope :) You're referring to a situation where lakeFS has two repositories with the same name. How would lakeFS know to delete everything under aRepo-guid2 and not aRepo-guid1?
design/open/kv3-execution-plan.md
Outdated
## Deleting a Repository
Unlike other entities' delete, which are standalone, deleting a repository results in deleting all branches, commits and tags correlated to that repository. While being protected by a transaction for `SQL` we need to handle failures for `KV`. One option that comes in mind is to flag the repository as `"being_deleted"` (maybe a `KV` entry for `being_deleted` repos, that can be scanned), clean everything, and if successful remove the flag. Another option is to remove the repository entry first, and deploy a background GC that scans for branches/commits/tags that are related to a non-exist repository and delete them. Other ideas?
**Update:** It looks like most (if not all) `graveler`'s operations, perform a repository existence check by issuing a call to `GetRepository`, which performs a DB access. If that is indeed the case, we can easily "mark" a repository as deleted by removing the repository entry from the common partition `graveler`, thus making it unreachable. We can later iterate all entries under this repository partition and delete them one by one, preceding the branch entry deletion by iteratively deleting all relevant staging partitions. Failures can be handled by a background garbage collector (which, BTW, can handle the initial deletion too), but that is only for space saving purposes and should have no effect on the deleted repository itself, once the repository entry is deleted. A point to consider - what happens if a repository entry is deleted, the data in the relevant repository partition fails to delete, and a new repository with the same name is created? I think this is solvable by adding a guid to the repository and use it as the partition key (as suggested in the `Repository` entity above)
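The lifecycle described here can be sketched with a toy in-memory KV keyed by (partition, key). `Repository`, `partitionOf`, `deleteRepository` and the `guid` values are illustrative stand-ins for the real lakeFS types:

```go
// Sketch of the repository deletion order above: delete the repository
// entry first (making its partition unreachable), then best-effort delete
// the partition; leftovers only cost space and can be swept later.
package main

import "fmt"

type Repository struct {
	Name     string
	UniqueID string // generated at creation; the "guid" from the discussion
}

// store maps partition -> key -> value.
var store = map[string]map[string]string{}

func put(partition, key, value string) {
	if store[partition] == nil {
		store[partition] = map[string]string{}
	}
	store[partition][key] = value
}

// partitionOf derives the dedicated partition for a repository's entities;
// a deleted-and-recreated repo gets a fresh UniqueID, so its stale
// partition can never be confused with the live one.
func partitionOf(r Repository) string { return r.Name + "/" + r.UniqueID }

func deleteRepository(r Repository) {
	delete(store["graveler"], "repos/"+r.Name) // step 1: repo unreachable
	delete(store, partitionOf(r))              // step 2: space cleanup
}

func main() {
	r := Repository{Name: "aRepo", UniqueID: "guid2"}
	put("graveler", "repos/"+r.Name, r.UniqueID)
	put(partitionOf(r), "branch/main", "c0")

	deleteRepository(r)
	_, live := store["graveler"]["repos/aRepo"]
	fmt.Println(live, len(store[partitionOf(r)]))
}
```

If step 2 fails midway, a recreated "aRepo" would get a new UniqueID and hence a different partition, which is exactly the point raised in the last sentence above.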
There's no need to "Update" - it is not checked in yet.
In general, let's bring this document to reflect what we decide on and the next steps. We can have alternatives that were offered and even rejected, but currently it feels more like a discussion log than an agreed-upon plan (e.g. "Other ideas?", "Update", options that have been ruled out appear on top, etc.)
Fixed
design/open/kv3-execution-plan.md
Outdated
| * Repository Deletion - as suggested, maintaining all repository relevant entries under a unique partition will make it easy to delete as an entire partition. The repository key itself is outside of the partition (in a general common partition, to support listing) and will be deleted first. Once deleted, the entire partition will become unavailable, making the data state consistent either ways | ||
|
|
||
| ## Update Commits, staging and lock-free | ||
| we will follow @arielshaqed suggestion to prevent commits overriding each other by using `KV`'s `SetIf` on the branch update, making sure no racing commit is already finished and being overridden. In case a commit fails to finish due to another commit "beating" it to the finish line - a retry should be attempted. This implementation will require to support multiple staging tokens, as proposed by [lakeFS on KV design proposal](https://github.com/treeverse/lakeFS/blob/b3204edad00f88f8eb98524ad940fde96e02ab0a/design/open/metadata_kv/index.md#committer-flow). Lock-free commits will be handled out of the scope of KV3 milestone |
Actually, Ariel's suggestion was the lock-free commits
So we are actually implementing the fully-blown lock-free commits as part of KV3?
Eventually figured it out. By Ariel's suggestion, I was referring to this comment and you were referring to #3091
Modified the doc to clarify that
design/open/kv3-execution-plan.md
Outdated
| * Supporting `KV` along side `SQL` (same as was done for previous modules) | ||
| * Step 1: Building a initial `KV graveler` and a `DB graveler`. `KV graveler` should delegate all calls to `DB gravler` (no `KV` usage at all for step) in order to allow partial `KV` support. | ||
| [#3568](https://github.com/treeverse/lakeFS/issues/3568) was modified to reflect that, and should result with continuation tasks for the `KV` support steps within | ||
| * Step 2: Basic implementation of `KV` usage within the `KV graveler` from the previous item. Tasks to be defined as part of [#3568](https://github.com/treeverse/lakeFS/issues/3568) |
Why not define them now? They're the most important part of this plan!
Agreed. Will add these
Done
Added #3590 (out of KV3 scope) for failed repository deletion cleanup
itaiad200
left a comment
Approved, just need to break down step 2 to concrete tasks.
…miting approach to use
Thank you
design/open/kv3-execution-plan.md
Outdated
### Repository
A repository is identified by a unique name (type `RepositoryID`, a `string`) and contains its storage name space (type `StorageNamespace`, a string), default branch ID (type `BranchID`, a string) and its creation date (type `Time`).
Repositories are iterable, and so all repositories should exist in a common partition (`graveler`)
As this design suggests, among other things, that each repository should be represented in a dedicated partition and that all other repository related entities will exist under this partition, and since we would like to use the existence/inexistence of a repository entry, to identify a valid/deleted repository, respectively, the repository entry should also include a random `guid`, which, when concatenated with the repo name, will be the partition key for that repository entities. That way it will be easy to distinguish between a partition related to "aRepo" that was already deleted and a partition related to a new "aRepo" that currently exist, as their partition names differ by the `guid` (`aRepo-guid1` vs `aRepo-guid2`).
I have to say I'm not a fan of this solution, but nevertheless, why not add the guid to the path instead of concatenating it to the repo name? e.g. instead of aRepo-guid1/branches/... do aRepo/guid1/branches/...
@N-o-Z the repo with guid is the partition key and not part of the path.
Nevertheless, the use of `-` and concatenation was illustrative. Changed the doc to reflect `aRepo/guid1` etc.
N-o-Z
left a comment
Thanks @Itai-David. It's really hard to understand the data models for each entity from a textual description. I think it will be much more readable if you just model it.
arielshaqed
left a comment
Thanks!
This PR mentions GUIDs, which AFAIK is just another term for UUID. I cannot overstate how much I dislike these: they are a really long representation, hard for me to identify and distinguish by sight, support versions that are pretty much useless, require a particular text format which is meaningless for the version commonly used, and cannot be configured to decrease or increase collision resistance.
I would much prefer to use a nanoid. :-)
The following iterators are implemented and used by `pkg/graveler`:
* `branch_iterator` - iterates branches by repository. Has an option to order by ID (trivial) or by CommitID (Used by GC. How do we support that?)
* `commit_iterator` - iterates a commit ancestry by running BFS. Not an `iterator` in the `KV` sense. Should be supported as is with `KV` `Get`
* `commit_ordered_iterator` - iterates commits of a repo, ordered by commit ID (regular `KV` `Scan` supports that). However - has an option to iterate over not-first-parent commits only (e.g. commits that are not a first parent of any other commit). Probably could (and should) be implemented by a specific `Next()` for that case
I think that we might be overworking this. The fact that something is efficiently possible on PostgreSQL does not mean that we have to do it efficiently in the KV world and in the same way.
Also please note that the proposal to add parameters to Next() simply makes it not an iterator. I would argue that the iterator abstraction adds nothing here.
We will not be able to manage this list of "not first-parent commits" correctly and efficiently. The question is really, do we need to? We will probably be scanning all commits, or all commits up to a previously-processed set of "safe" commits, on each GC. We could process not-first commits "just" by scanning all of these. (I am starting to think that the commits logic for GC should also move off of lakeFS and into the Spark cluster. @johnnyaug WDYT?)
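One way to realize "scanning all of these" without parameterizing `Next()` is to compute the not-first-parent set in a plain pass over the scanned commits and filter with it. A hedged sketch with toy types (not the real graveler structures):

```go
// Filter commits that are not the first parent of any other commit by
// scanning the whole set once, instead of a parameterized iterator Next().
package main

import "fmt"

type commit struct {
	ID          string
	FirstParent string // "" for root commits
}

// notFirstParentSet returns the IDs of commits that no other commit in
// the scanned set names as its first parent.
func notFirstParentSet(all []commit) map[string]bool {
	isFirstParent := map[string]bool{}
	for _, c := range all {
		if c.FirstParent != "" {
			isFirstParent[c.FirstParent] = true
		}
	}
	out := map[string]bool{}
	for _, c := range all {
		if !isFirstParent[c.ID] {
			out[c.ID] = true
		}
	}
	return out
}

func main() {
	// c1 <- c2 <- c3, and c4 also has c2 as first parent:
	all := []commit{{"c1", ""}, {"c2", "c1"}, {"c3", "c2"}, {"c4", "c2"}}
	keep := notFirstParentSet(all)
	for _, c := range all {
		if keep[c.ID] {
			fmt.Println(c.ID) // c3 and c4 are not anyone's first parent
		}
	}
}
```

This trades memory proportional to the commit count for a simple unparameterized scan, which seems consistent with the "just scan all of these" direction above.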
* Tags support - [#3593](https://github.com/treeverse/lakeFS/issues/3593)
* Commits support (not including optimistic locking commit flow) - [#3594](https://github.com/treeverse/lakeFS/issues/3594)
* Staging support - [#3595](https://github.com/treeverse/lakeFS/issues/3595)
* Step 3: Optimistic Lock Commits, as described above - a working commit algorithm that prevents commits from overriding each other, by using `KV`'s `SetIf` to set the branch's commit, and failing (and retrying) to commit if `SetIf` fails, as described above. [#3569](https://github.com/treeverse/lakeFS/issues/3569)
Consider retrying commits on the client side. It will make it easier to provide different retry strategies for different clients.
@arielshaqed - are you referring to explicit retries by the user, or implicit retries by the (various) clients? I believe it is the former?
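Client-side retries could look something like the sketch below, where the server fails fast when `SetIf` loses the race and each client picks its own strategy (here, capped exponential backoff). `errCommitConflict` and `tryCommit` are hypothetical names, not the real lakeFS client API:

```go
// Client-side commit retry with exponential backoff, as a sketch of the
// suggestion above; error names and signatures are assumptions.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errCommitConflict = errors.New("commit conflict: branch moved")

// commitWithRetry retries a conflicting commit up to maxAttempts times,
// doubling the wait between attempts; non-conflict errors return at once.
func commitWithRetry(tryCommit func() error, maxAttempts int) error {
	backoff := 10 * time.Millisecond
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = tryCommit(); !errors.Is(err, errCommitConflict) {
			return err // success, or a non-retryable error
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}

func main() {
	attempts := 0
	err := commitWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return errCommitConflict // lose the SetIf race twice
		}
		return nil
	}, 5)
	fmt.Println(err == nil, attempts)
}
```

Keeping the loop on the client side means a batch job and an interactive UI can choose different attempt counts and backoffs against the same server behavior.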
Co-authored-by: N-o-Z <[email protected]>
Co-authored-by: arielshaqed <[email protected]>
Co-authored-by: arielshaqed <[email protected]>
Thanks @arielshaqed
@N-o-Z the thing is that other than the new
@N-o-Z - added the agreed code snippets. Please re-review
design/open/kv3-execution-plan.md
Outdated
Need to consider using pointers for both DefaultBranchID and RepositoryToken as we need to support bare repositories
Bare repos have the same default branch logic as non-bare repos (i.e. they "hold" the default branch ID == name upon creation). The difference is that the branch object is not really created. I think it would be better to leave it as is and apply the same logic.
N-o-Z
left a comment
Thanks for making these changes!
It is much clearer to me now.
One comment you might want to take into consideration
Force-pushed from 4d74191 to ac994e1
closes #3531