KV3 - graveler - Discovery and Planning #3566
# General
## Tables
Graveler uses 5 DB tables:
* `graveler_repositories`, `graveler_branches`, `graveler_commits`, `graveler_tags` - used by `pkg/graveler/ref`
* `graveler_staging_kv` - used by `pkg/graveler/staging`

`graveler_repositories`, `graveler_commits` and `graveler_tags` are immutable and do not update entries, and so should be straightforward to implement over our `KV` (`graveler_commits` actually performs an update for commit generation, but this is only valid for `refs-restore`, which can only take place on a `bare` repo, and so it is of no concern).

`graveler_branches` is currently protected by a lock during commit actions, and in order to support lockless commits with `KV`, the [following design](https://github.com/treeverse/lakeFS/blob/b3204edad00f88f8eb98524ad940fde96e02ab0a/design/open/metadata_kv/index.md#graveler-metadata---branches-and-staged-writes) is proposed. However, as a first step we can implement it using a naive table lock, just to get it going, or better, a `branch-key` based locking, to prevent staging during commits.

`graveler_staging_kv` creates entries based on the current branch `staging_token` and is protected from creating entries during a commit under the same lock. This should also be supported lockless-ly, as referred to by the above proposal, and can also be solved with naive locking, as a first step.
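As a rough illustration of the interim `branch-key` based locking mentioned above, here is a minimal, single-process sketch using a per-branch mutex. All names here are hypothetical; a real multi-instance deployment would need the lock enforced through the KV store or the lock-free design referenced above.

```go
package ref

import "sync"

// branchLocker serializes commits and staging writes per branch.
// This is a hypothetical, single-process sketch; it is not the actual
// lakeFS locking mechanism.
type branchLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // key: "<repoID>/<branchID>"
}

func newBranchLocker() *branchLocker {
	return &branchLocker{locks: map[string]*sync.Mutex{}}
}

// lockKey returns (creating if needed) the mutex for a specific branch.
func (l *branchLocker) lockKey(repoID, branchID string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	key := repoID + "/" + branchID
	if _, ok := l.locks[key]; !ok {
		l.locks[key] = &sync.Mutex{}
	}
	return l.locks[key]
}

// withBranchLock runs fn while holding the per-branch lock, so a commit and a
// staging write on the same branch cannot interleave.
func (l *branchLocker) withBranchLock(repoID, branchID string, fn func() error) error {
	m := l.lockKey(repoID, branchID)
	m.Lock()
	defer m.Unlock()
	return fn()
}
```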
Suggested change: "This should also be supported lockless-ly ..." → "This should also be supported lockfree-ly ...".
The term is "lock-free", not "lock-less".
Rephrased
I don't understand why anyone would like to sort anything based on a randomly generated string (commit ID) - maybe it's not really needed?
I don't understand this either, and I am not sure where GC looks up a branch by commit ID. Indeed, given that a commit belongs to multiple branches, I am not sure what this lookup would do.
Assuming "ID" here means "name" -- otherwise we may have difficulty paginating a lengthy branches display in the GUI.
@arielshaqed - "ID" does mean "name"
@johnnyaug - can you please explain the purpose of sorting branches by the associated commit ID for iteration?
Used here
Thanks 🙏
@Itai-David, sure: this iterator is then used as input for a GCStartingPointIterator. This iterator assumes the branch heads are sorted by commit ID. This way, it can go over branch heads side by side with other commit IDs that are also sorted, and compare them.
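To illustrate the side-by-side traversal described above, here is a minimal sketch (hypothetical types, not the actual `GCStartingPointIterator` code) of walking two lexicographically sorted commit-ID lists together, which is only possible because both inputs are sorted by commit ID:

```go
package main

import "fmt"

// mergeSorted walks two sorted commit-ID slices side by side, yielding each
// distinct ID once - the property the starting-point iteration relies on when
// branch heads are sorted by commit ID.
func mergeSorted(branchHeads, otherCommits []string) []string {
	var out []string
	i, j := 0, 0
	for i < len(branchHeads) || j < len(otherCommits) {
		switch {
		case j >= len(otherCommits) || (i < len(branchHeads) && branchHeads[i] < otherCommits[j]):
			out = append(out, branchHeads[i])
			i++
		case i >= len(branchHeads) || otherCommits[j] < branchHeads[i]:
			out = append(out, otherCommits[j])
			j++
		default: // equal IDs - emit once, advance both
			out = append(out, branchHeads[i])
			i, j = i+1, j+1
		}
	}
	return out
}

func main() {
	fmt.Println(mergeSorted([]string{"a1", "c3"}, []string{"b2", "c3", "d4"}))
	// Output: [a1 b2 c3 d4]
}
```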
Thank you @johnnyaug.
@itaiad200 @arielshaqed - I looked at the GC code and it looks like we either need to support that with our Graveler over KV impl, or we will need to change the way GC traverses commits.
I think the first can be supported by maintaining a `repo/<REPO_NAME>/commit/<COMMIT_ID>/branch/<BRANCH_NAME>` key. These keys will have to be constantly removed and added as commits take place, but we do have all the info at the time of the commit, so I don't think it is too difficult.
WDYT?
Commits don't have branches.
If we try to attach branches to them, it will all end in tears.
I am not sure we need to fetch commits in sorted order to merge them. Let's define the algorithm and then see about speeding it up. I'm thinking maybe a Bloom filter on ancestors of each commit?
@arielshaqed I'm not sure I fully understand how a Bloom filter can help here, but I guess this requires a bit of a redesign of the GC commit-gathering algorithm (?)
If that is so, I think it should be out of scope for this KV milestone - I think we should "force" our graveler over KV impl to provide all the functionality currently supported by graveler over SQL, in terms of correctness, even at the painful price of a performance regression, and handle any significant algorithmic change in another scope. WDYT?
For that purpose, I think defining a commitId to branchName(s) mapping (as a key?) can provide what we need, at least as a first step (regardless of the (very) wrong conception that commits "have" or "belong to" branches)
@arielshaqed and I discussed it a bit more and I think we now understand why it's needed for GCStartingPointIterator. We can have the same commitID order for a branch iterator on top of the kv implementation by simply fetching all branches and sorting them in memory. It will not require introducing any new indexing in KV. A counter-argument would be memory consumption, but it's used by an algorithm that already stores all commits in memory, so there should not be any degradation.
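A minimal sketch of that in-memory approach (the `Branch` struct and function here are hypothetical stand-ins, not the actual lakeFS types):

```go
package main

import (
	"fmt"
	"sort"
)

// Branch is a hypothetical minimal representation of a branch record.
type Branch struct {
	ID       string // branch name
	CommitID string // current head commit
}

// sortedByCommitID returns the branches ordered by their head commit ID,
// matching the order GCStartingPointIterator expects, without any extra KV
// index - everything is loaded and sorted in memory.
func sortedByCommitID(branches []Branch) []Branch {
	out := make([]Branch, len(branches))
	copy(out, branches)
	sort.Slice(out, func(i, j int) bool { return out[i].CommitID < out[j].CommitID })
	return out
}

func main() {
	branches := []Branch{
		{ID: "main", CommitID: "c9"},
		{ID: "dev", CommitID: "a1"},
		{ID: "feature", CommitID: "b4"},
	}
	for _, b := range sortedByCommitID(branches) {
		fmt.Println(b.CommitID, b.ID)
	}
}
```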
Added to the doc - support of iteration by commitID by sorting branches in memory
Needs more details. The list of not-first-parent commits can always change, since you may introduce a new child to any commit at any time. I wonder how we use this list, and why the ordering by CommitID would matter.
I find it hard to come up with an iterator on top of the kv implementation that would cover these requirements.
These commits are used side by side with the branch-by-commitID list, as explained by @johnnyaug, hence the order by commitID for that list too.
Regarding the not-first-parent attribute, its usage and the fact that it is subject to change - is it any different from the way it is used today, over SQL? I see that the decision of whether a commit is not-first-parent or not is made at the time of query execution, but this can also have changed by the time the GC algorithm gets there. Am I wrong?
I guess I'm missing some context on that algorithm - @johnnyaug, can you please help here?
You can think of the "not first parent" thing as a generalization of tree leaves. If you take the commit tree and drop all connections (edges) that are not first-parent, these commits are the leaves of this tree.
This is used to find all potential starting points for scanning the tree towards garbage collection.
In the end, this doesn't need to be a consistent list: when you find a commit that is not a first parent, you can safely return it, regardless of whether the resulting list as a whole makes sense.
> I find it hard to come up with an iterator on top of the kv implementation that would cover these requirements.
@itaiad200 - I absolutely agree with that last sentence. Maybe we can maintain a live set of not-first-parent commits and modify it as commits occur, i.e. each new commit is by default not-first-parent, and for each further operation that makes it a first parent, we try to remove that commit from the list (it may already have been removed, hence the "try"). (Question: can a commit become not-first-parent again once it is already a first parent?)
I can think of a couple of ways to implement this list, each option with its own drawbacks:
- We can use a key such as `not_first_parent/<COMMIT_ID>`. This will require searching the list for each commit we modify, but will make it easier to list all not-first-parent commits, or
- We can "flag" the commit key itself, i.e. commit `commit/<COMMIT_ID>` is accompanied by `commit/<COMMIT_ID>/not_first_parent` (not sure this works well with our scan logic, right?), or, for the same purpose, add an `isFirstParent` attribute to the commit entry. However, I don't think this option supports the use case.
What other options do we have?
> In the end, this doesn't need to be a consistent list: when you find a commit that is not a first parent, you can safely return it, regardless of whether the resulting list as a whole makes sense.
Thank you for this input @johnnyaug.
What happens if I find a not-first-parent commit, return it as a starting point, and before the GC continues, a commit is created on top of that commit, making it a first parent? Is that possible?
I think that we might be overworking this. The fact that something is efficiently possible on PostgreSQL does not mean that we have to do it efficiently in the KV world and in the same way.
Also please note that the proposal to add parameters to Next() simply makes it not an iterator. I would argue that the iterator abstraction adds nothing here.
We will not be able to manage this list of "not first-parent commits" correctly and efficiently. The question is really, do we need to? We will probably be scanning all commits, or all commits up to a previously-processed set of "safe" commits, on each GC. We could process not-first commits "just" by scanning all of these. (I am starting to think that the commits logic for GC should also move off of lakeFS and into the Spark cluster. @johnnyaug WDYT?)
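As a rough sketch of that "just scan all of these" direction (hypothetical minimal types, not the actual GC code): given all commits in memory, the not-first-parent set can be derived in a single pass, with no list maintained in the KV store.

```go
package main

import "fmt"

// Commit is a hypothetical minimal commit record; Parents[0] is the first parent.
type Commit struct {
	ID      string
	Parents []string
}

// notFirstParent returns the IDs of commits that are nobody's first parent -
// the "leaves" of the tree obtained by keeping only first-parent edges.
func notFirstParent(commits []Commit) []string {
	isFirstParent := map[string]bool{}
	for _, c := range commits {
		if len(c.Parents) > 0 {
			isFirstParent[c.Parents[0]] = true
		}
	}
	var out []string
	for _, c := range commits {
		if !isFirstParent[c.ID] {
			out = append(out, c.ID)
		}
	}
	return out
}

func main() {
	commits := []Commit{
		{ID: "a"},                              // root
		{ID: "b", Parents: []string{"a"}},      // a is b's first parent
		{ID: "c", Parents: []string{"a"}},      // a is also c's first parent
		{ID: "m", Parents: []string{"b", "c"}}, // merge: b is first parent, c is not
	}
	fmt.Println(notFirstParent(commits)) // [c m]
}
```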
I'm totally open to it - the not-first-parent thing is a means to an end. We just need to start scanning the tree somewhere.
I'm aiming at preserving the current behavior with KV support, as a first step, and improving it later, based on our needs. I think introducing the KV aspect is enough of a load already, and I would like to stay away from redesigning our GC as a side effect.
Even though it will not be as efficient, I think it can work, at least as a first step. Moreover, since it is the GC, I think we can loosen up the efficiency requirements, again - at least as a first step.
@arielshaqed @johnnyaug - WDYT?
@arielshaqed - either way, we are not changing the interface of Next()
@Itai-David I understood that we will be reading everything to memory, applying the same logic that was used in Postgres, thus implementing the same iterators as today. If so, what are the remaining open questions?
Interesting. The problem with flagging the repository is that now you would have to check it with every call. For example, trying to get a Tag would first need to check that flag which is stored somewhere else.
I think the best we can do here is to delete it in an order that would leave a consistent state under every possible failure. One way could be: Tags, Staging Tokens (as in workspaces), Branches (all except for the default), update the default branch commit to "NotACommit", Commits, and finally the default branch and the repo.
If a repository (probably with a prefix) is a partition key, then you can delete a repository by deleting everything inside the partition. And you can keep a list of "repositories being deleted" (or an object inside the partition of each repository) and keep retrying to delete those repositories.
Right @arielshaqed, but now every time you try to get an object from that repo, you would have to query that list.
I like the usage of repository as partition. It makes sense in that each scan operation (in this case, for deleting everything) is partition-bounded either way; plus, I think the locality of a partition (supported by DynamoDB, as an example) seems to work to our benefit with this one (be it performance-wise or GDPR-wise).
We will need to maintain the repository entry itself outside of the partition, for repositories listing, correct?
I think we can "mark" a repository as being-deleted by simply removing its entry from the repositories list, and once that is successful, start removing everything in the relevant partition.
Any objection to going with that?
All Itais are right. There are two issues here: immediately stop using a repository, and eventually delete it.
To delete a repository, mark it as deleted on its value. So now when you fetch it you can tell it is deleted (but not yet removed). If this becomes too expensive, we'll need to put a single cache in front of the kv.
To ensure it gets cleaned up, add it to the repositories-to-delete list, and periodically traverse that list to clean up.
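A minimal sketch of that flow against a hypothetical KV interface (the `Store` methods, key layout, and tombstone convention below are assumptions for illustration, not the actual lakeFS KV API):

```go
package sketch

import (
	"context"
	"strings"
)

// Store is a hypothetical minimal KV interface assumed for this sketch.
type Store interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error
	Delete(ctx context.Context, key string) error
	// ScanPrefix returns all keys under the given prefix.
	ScanPrefix(ctx context.Context, prefix string) ([]string, error)
}

// markRepoDeleted flags the repository record so every subsequent fetch can
// tell it is deleted (but not yet removed), and registers it for cleanup.
func markRepoDeleted(ctx context.Context, kv Store, repoID string) error {
	// Hypothetical convention: a tombstone value on the repo entry plus a
	// "to delete" entry that the periodic cleaner traverses.
	if err := kv.Set(ctx, "repo/"+repoID, []byte("deleted")); err != nil {
		return err
	}
	return kv.Set(ctx, "repos-to-delete/"+repoID, nil)
}

// cleanupRepo deletes everything under the repository's partition and then
// drops the repo entry and its to-delete marker. Safe to retry on failure.
func cleanupRepo(ctx context.Context, kv Store, repoID string) error {
	keys, err := kv.ScanPrefix(ctx, "repo/"+repoID+"/")
	if err != nil {
		return err
	}
	for _, k := range keys {
		if !strings.HasPrefix(k, "repo/"+repoID+"/") {
			continue // defensive: never seep into other entities
		}
		if err := kv.Delete(ctx, k); err != nil {
			return err
		}
	}
	if err := kv.Delete(ctx, "repo/"+repoID); err != nil {
		return err
	}
	return kv.Delete(ctx, "repos-to-delete/"+repoID)
}
```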
Please note that with the current path definitions - and as @eden-ohana mentioned - there's a risk of accidentally seeping into different entities when scanning repositories
@N-o-Z - that was fixed with the recent update to the keys
Since you already stated that lock-less commits are out of scope for KV, I think this part is irrelevant for this doc
Out of KV3 scope, or out of KV scope altogether?
I think we need a wider range of opinions here
@eden-ohana @itaiad200 - WDYT?
Let's be more specific - what are the things we think can degrade? There might be paths we would like to benchmark now. Graveler is pretty wide, so if you have any specific paths you think we need to benchmark, let's discuss.
Added to the doc and to #3571, per the comments we discussed earlier regarding which scenarios are interesting and should be benchmarked
I don't think it covers the special iterators above.
Correct - we are still discussing it. I believe we should support these by additional keys, and strive to remove them in an improvements iteration out of the scope of this milestone (and probably out of the scope of KV altogether, as suggested by @N-o-Z)
Missing a /?
I would change the naming convention to plural - 'branches', 'commits' etc. Consistent with other packages
Shouldn't `repo/<REPO_NAME>` have a different name, so that listing repositories doesn't also list the branches (`repo/<REPO_NAME>/branch/<BRANCH_NAME>`)?
@itaiad200 - Yup. Fixed
@N-o-Z - done
@eden-ohana - Good catch, thank you. Fixed
REPO_NAME == repository ID
BRANCH_NAME == branch ID
Correct. These are the same. I tried to maintain the SQL names, just for simplicity of this doc
tag path should save its commit ID
As part of the key?
It will be in the value
I pray for 0 double slash bugs 🙏
I think the staging token is per branch; we should add a branch prefix
@eden-ohana - agree. Added under the new schema
@itaiad200 - double slash is the obvious one; I added an option for triple slashes... My new suggestion removes redundant keywords such as `key` and `branch` from everywhere except the key prefix
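To make the double/triple-slash concern concrete, here is a small sketch of a key builder that joins segments and rejects empty or slash-containing parts. The segment names in the usage comments are illustrative assumptions, not the final schema.

```go
package sketch

import (
	"fmt"
	"strings"
)

// joinKey builds a KV key from path segments, guarding against the classic
// double-slash bugs: empty segments and segments containing '/' are rejected.
// Note: logical object keys that legitimately contain '/' (e.g. staged paths)
// would need separate handling.
func joinKey(segments ...string) (string, error) {
	for _, s := range segments {
		if s == "" {
			return "", fmt.Errorf("empty key segment")
		}
		if strings.Contains(s, "/") {
			return "", fmt.Errorf("key segment %q contains '/'", s)
		}
	}
	return strings.Join(segments, "/"), nil
}

// Illustrative usage (names are assumptions, not the final schema):
//   joinKey("repositories", repoID)                     -> "repositories/my-repo"
//   joinKey("repositories", repoID, "branches", branch) -> "repositories/my-repo/branches/main"
```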
What I'm missing is the data modeling for the KV store. It can either be part of #3567 or a new issue, but it is required as part of the common code base that we all work over. In auth we didn't do it well enough, and as a result there were many changes during implementation and technical debt left behind.
Data modeling added to the doc. As graveler pretty much uses the SQL DB as a KV store, there is not much of a difference, except for a few things we will need to add
This step alone is quite big. It means reimplementing Graveler in a single step. I think we should support a hybrid mode, i.e. if it makes sense, move just the tags, then just the staging areas, then commits, and finally branches.
Are you suggesting breaking it into smaller steps within the same deliverable, which has mainly administrative aspects, or creating a deliverable for each sub-step (e.g. a working, mergeable graveler with hybrid support)?
I'm not sure what the latter means in terms of effort and the overhead of getting such a hybrid working.
If we look at it, it is a big step, work-wise, but I believe the auth implementation was bigger in terms of both size and complexity. Moreover, since test coverage is excellent, it should converge much quicker IMHO.
WDYT?
If "naive implementation with locking" means moving all of graveler to work on top of the kv interface and adding a locking mechanism to support that, then I think it's too big. I also think auth was way smaller than that, and it was also too big for a single step/PR, which is what we did.
Added to doc - implementing an initial skeleton with no real KV access (delegating all DB-related calls to the DB graveler, as we discussed) and deriving follow-up implementation tasks
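A minimal sketch of what such a delegating skeleton could look like (the `RefManager` interface shown here is a stand-in assumption, not the actual graveler interface):

```go
package sketch

import "context"

// RefManager is a stand-in for the graveler interface being migrated;
// only two methods are shown for brevity.
type RefManager interface {
	GetBranch(ctx context.Context, repoID, branchID string) (commitID string, err error)
	SetBranch(ctx context.Context, repoID, branchID, commitID string) error
}

// KVRefManager is the initial KV skeleton: it satisfies the same interface but
// delegates every call to the existing DB-backed implementation, so the wiring
// can be merged first and real KV access introduced method by method.
type KVRefManager struct {
	db RefManager // the current DB graveler implementation
}

func NewKVRefManager(db RefManager) *KVRefManager {
	return &KVRefManager{db: db}
}

func (m *KVRefManager) GetBranch(ctx context.Context, repoID, branchID string) (string, error) {
	// TODO(kv): replace with a KV lookup; delegate to the DB graveler for now.
	return m.db.GetBranch(ctx, repoID, branchID)
}

func (m *KVRefManager) SetBranch(ctx context.Context, repoID, branchID, commitID string) error {
	// TODO(kv): replace with a KV write; delegate to the DB graveler for now.
	return m.db.SetBranch(ctx, repoID, branchID, commitID)
}
```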
I seem to have misunderstood the previous section - I really think lock-free commits are not part of the KV scope. I believe it's best to first focus on the transition to KV with the current implementation, and only after that work on the lock-free feature
I can definitely live with lock-free out of KV scope
I agree with this approach.
Generally I think the lock-less feature is a post-KV feature
I agree!
Perfectly doable to start without lock-free commits. It will roughly mirror the intended current semantics -- which indeed starve commits. Also, the actual current semantics are broken and can lose writes, so just locking (correctly) is an improvement!
However, we do need lock-free commits in order to support many streaming ingestion use-cases.
Looks like we're aligned on this one
Created #3586 to fully support lock-free commits, out of KV3 scope