design/open/kv3-execution-plan.md (merged; 93 additions)

# General
## Tables
Graveler uses 5 DB tables:
* `graveler_repositories`, `graveler_branches`, `graveler_commits`, `graveler_tags` - used by `pkg/graveler/ref`
* `graveler_staging_kv` - used by `pkg/graveler/staging`

`graveler_repositories`, `graveler_commits` and `graveler_tags` are immutable and do not update entries, and so should be straightforward to implement over our `KV` (`graveler_commits` actually performs an update for commit generation, but this is only valid for `refs-restore`, which can only take place on a `bare` repo, and so it is of no concern).<br/>
`graveler_branches` is currently protected by a lock during commit actions. In order to support lock-free commits with `KV`, the [following design](https://github.com/treeverse/lakeFS/blob/b3204edad00f88f8eb98524ad940fde96e02ab0a/design/open/metadata_kv/index.md#graveler-metadata---branches-and-staged-writes) is proposed. However, as a first step we can implement it using a naive table lock, just to get it going, or better, a `branch-key`-based lock, to prevent staging during commits.<br/>
`graveler_staging_kv` creates entries based on the current branch `staging_token` and is protected from creating entries during a commit under the same lock. This should also be supported lock-free, as described in the above proposal, and can also be solved with naive locking as a first step.<br/>
Note: All the above tables (along with the late `auth_users`) are also used by `pkg/diagnostics` for diagnostics collection (RO).

## Entities
### Repository
A repository is identified by a unique name (type `RepositoryID`, a `string`) and contains its storage name space (type `StorageNamespace`, a string), default branch ID (type `BranchID`, a string) and its creation date (type `Time`).
Repositories are iterable, and so all repositories should exist in a common partition (`graveler`).
This design suggests, among other things, that each repository be represented by a dedicated partition, and that all other repository-related entities exist under that partition. Since we would like to use the existence/inexistence of a repository entry to identify a valid/deleted repository, respectively, the repository entry should also include a random `guid` which, combined with the repo name, will be the partition key for that repository's entities. That way it is easy to distinguish between a partition related to an "aRepo" that was already deleted and a partition related to a new "aRepo" that currently exists, as their partition names differ by the `guid` (`aRepo/guid1` vs `aRepo/guid2`).
> **Contributor:** I get a request for `aRepo`, how do I know to look for `aRepo-guid2` in the KV?

> **Contributor Author (@itaidavid, Jun 29, 2022):** You perform `GetRepository` by `aRepo`; that will be translated to `kv.Get(partition = "graveler", key = "repos/aRepo")`.
> Upon success, you will get a `Repository` object that has, among other things, a `my_guid` (or similar name, TBD) field set to `"guid2"`.
> This value will be autogenerated upon repository creation and will initially be used to create the repository with the default branch and initial commit.
> Does this make sense?

> **Contributor:** Nope :) You're referring to a situation where lakeFS has two repositories with the same name. How would lakeFS know to delete everything under `aRepo-guid2` and not `aRepo-guid1`?

> **Member:** I have to say I'm not a fan of this solution, but nevertheless, why not add the guid to the path instead of concatenating it to the repo name? E.g. instead of `aRepo-guid1/branches/...` do `aRepo/guid1/branches/...`

> **Contributor Author:** @N-o-Z the repo with guid is the partition key and not part of the path.
> Nevertheless, the use of `-` and concatenation was illustrative. Changed the doc to reflect `aRepo/guid1` etc.

Having said that, it is still essential to verify and agree that each data access essentially implies a fetch of the repository entry, in order to (a) verify its existence and (b) get the correct `guid` to build the partition key.
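A minimal sketch of that lookup flow, with illustrative names (`Repository` fields and `repoPartition` are assumptions, not the actual graveler types or kv accessors):

```go
package main

import (
	"fmt"
	"time"
)

// Repository is the entry stored under the common "graveler" partition.
// Field names are illustrative; the guid field name is still TBD.
type Repository struct {
	StorageNamespace string
	DefaultBranchID  string
	CreationDate     time.Time
	UniqueID         string // random guid generated at repository creation
}

// repoPartition builds the partition key for all of this repository's
// entities: "aRepo" with UniqueID "guid2" maps to partition "aRepo/guid2".
func repoPartition(name string, r *Repository) string {
	return fmt.Sprintf("%s/%s", name, r.UniqueID)
}

func main() {
	// In real code the entry would come from kv.Get("graveler", "repos/aRepo"),
	// which both verifies existence and yields the guid.
	repo := &Repository{UniqueID: "guid2", DefaultBranchID: "main", CreationDate: time.Now()}
	fmt.Println(repoPartition("aRepo", repo)) // aRepo/guid2
}
```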
### Branch
A branch is identified by its unique name **within** a repository, hence the combination `<REPO_NAME>`+`<BRANCH_NAME>` uniquely identifies a branch in the system. Since all repository entities (except for the aforementioned Repository entity itself) belong to the repository partition, the Branch will be identified by its unique name within that partition.
Other than that, a branch contains its current commit ID (type `CommitID`, a string) and its current staging token (type `StagingToken`, a string).
In order to support the `commit flow` proposed in the [lakeFS on KV design](https://github.com/treeverse/lakeFS/blob/b3204edad00f88f8eb98524ad940fde96e02ab0a/design/open/metadata_kv/index.md#committer-flow), a branch will also have the proposed `sealed_token` field - an array of `StagingToken`s, as specified in the design doc.
### Tag
A tag is identified by its unique ID (type `TagID`, a string) within a repository, hence it is a unique tag identifier within the repository partition. It contains a commit ID.
### Commit
A commit is identified by a unique commit ID within a repository. It contains the following commit data: version (type `CommitVersion`, an int), committer (string), commit message (string), metarange ID (type `MetaRangeID`, a string), creation date (type `Time`), parent commits (type `CommitParents`, an array of `CommitID`), metadata (string => string mapping) and generation (int).
As a commit exists within a repository context, it can be identified by its unique ID (Commit ID) within the repository partition.
### Staged Object
Uniquely identified by a staging token and the object path within the branch. It contains the object data (type `DBEntry`) serialized as `protobuf`, and the checksum of this data. It looks like we can simply reuse the same `DBEntry` type as-is.
In order to minimize bottlenecks in our implementation, each staging token will be represented by a designated partition, where the `staging_token` acts as the partition key and the staged object paths as keys within it. Note that since `staging_token` is unique within a branch (moreover, it contains the repository name and the branch name as prefixes), it is an appropriate partition key.
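Putting the entity descriptions above together, the structs below sketch one possible modeling; the field names are taken from the descriptions and are illustrative, not the actual graveler definitions.

```go
package model

import "time"

// Branch is keyed by branches/<BRANCH_NAME> inside the repository partition.
type Branch struct {
	CommitID     string
	StagingToken string
	SealedTokens []string // the proposed sealed_token array of StagingTokens
}

// Commit is keyed by commits/<COMMIT_ID> inside the repository partition.
type Commit struct {
	Version      int
	Committer    string
	Message      string
	MetaRangeID  string
	CreationDate time.Time
	Parents      []string // CommitIDs
	Metadata     map[string]string
	Generation   int
}

// Tag is keyed by tags/<TAG_ID> inside the repository partition.
type Tag struct {
	CommitID string
}
```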

## Iterators
The following iterators are implemented and used by `pkg/graveler`:
* `branch_iterator` - iterates branches by repository. Has an option to order by ID (trivial) or by CommitID.
Iterating by commit ID will be implemented by fetching all the branches for the repository and sorting by CommitID in memory.
* `commit_iterator` - iterates a commit's ancestry by running BFS. Not an `iterator` in the `KV` sense. Should be supported as-is with `KV` `Get`.
* `commit_ordered_iterator` - iterates commits of a repo, ordered by commit ID (regular `KV` `Scan` supports that). However, it has an option to iterate over not-first-parent commits only (i.e. commits that are not the first parent of any other commit). This could probably be implemented by a specific `Next()` for that case (but see the discussion, and the in-memory sketch after this list).
> **Contributor:** Needs more details. The list of not-first-parent commits can always change, since you may introduce a new child to any commit at any time. I wonder how we use this list? And why would the ordering of CommitID matter?
>
> I find it hard to come up with an iterator on top of the kv implementation that would cover these requirements.

> **Contributor Author:** These commits are used side by side with the branch-by-commitID list, as explained by @johnnyaug, hence the order by commitID for that list too.
> Regarding the not-first-parent attribute, its usage and the fact it is susceptible to change - is it any different from the way it is used today, over SQL? I see that the decision of whether a commit is not-first-parent is made at query execution time, but this can also change by the time the GC algorithm gets there. Am I wrong?
> I guess I'm missing some context on that algorithm - @johnnyaug, can you please help here?

> **Contributor:** You can think of the "not first parent" thing as a generalization of tree leaves. If you take the commit tree and drop all connections (edges) that are not first-parent, these commits are the leaves of this tree.
>
> This is used to find all potential starting points for scanning the tree towards garbage collection.
>
> In the end, this doesn't need to be a consistent list: when you find a commit that is not a first parent, you can safely return it, regardless of whether the resulting list as a whole makes sense.

> **Contributor Author (@itaidavid, Jun 27, 2022):**
>> I find it hard to come up with an iterator on top of the kv implementation that would cover these requirements.
>
> @itaiad200 - I absolutely agree with that last sentence. Maybe we can maintain a live set of not-first-parent commits and modify it as commits occur, i.e. each new commit is by default not-first-parent, and for each further operation that makes it a first parent, we try to remove that commit from the list (possibly already removed, hence the "try"). (Question: can a commit become not-first-parent if it is already a first parent?)
> I can think of a couple of ways to implement this list, each option with its own drawbacks:
>
> * We can use keys like `not_first_parent/<COMMIT_ID>`. This will require searching the list for each commit we modify, but will make it easier to list all not-first-parent commits, or
> * We can "flag" the commit key itself, i.e. commit `commit/<COMMIT_ID>` is accompanied by `commit/<COMMIT_ID>/not_first_parent` (not sure this works well with our scan logic, right?), or, for the same purpose, add an `isFirstParent` attribute to the commit entry. However, I don't think this option supports the use case.
>
> What other options do we have?

> **Contributor Author:**
>> In the end, this doesn't need to be a consistent list: when you find a commit that is not a first parent, you can safely return it, regardless of whether the resulting list as a whole makes sense.
>
> Thank you for this input @johnnyaug.
> What happens if I find a not-first-parent commit, return it as a starting point, and before the GC continues, a commit is created over that commit, making it a first parent? Is that possible?

> **Contributor:** I think that we might be overworking this. The fact that something is efficiently possible on PostgreSQL does not mean that we have to do it efficiently in the KV world and in the same way.
>
> Also please note that the proposal to add parameters to `Next()` simply makes it not an iterator. I would argue that the iterator abstraction adds nothing here.
>
> We will not be able to manage this list of "not first-parent commits" correctly and efficiently. The question is really, do we need to? We will probably be scanning all commits, or all commits up to a previously-processed set of "safe" commits, on each GC. We could process not-first commits "just" by scanning all of these. (I am starting to think that the commits logic for GC should also move off of lakeFS and into the Spark cluster. @johnnyaug WDYT?)

> **Contributor:** I'm totally open to it - the not-first-parent thing is a means to an end. We just need to start scanning the tree somewhere.

> **Contributor Author:** I'm aiming at preserving the current behavior with KV support, as a first step, and improving it later, based on our needs. I think introducing the KV aspect is enough of a load already, and I would like to stay away from redesigning our GC as a side effect.
> Even though it will not be as efficient, I think it can work, at least as a first step. Moreover, since it is the GC, I think we can loosen the efficiency requirements, again - at least as a first step.
> @arielshaqed @johnnyaug - WDYT?

> **Contributor Author:** @arielshaqed - either way we are not changing the interface of `Next()`.

> **Contributor:** @Itai-David I understood that we will be reading everything to memory, applying the same logic that was used in Postgres, thus implementing the same iterators as today. If so, what are the remaining open questions?

* `repository_iterator`, `tag_iterator` - simple iterators ordered by ID. The standard `KV` `Scan` should be sufficient.
* `staging/iterator` - iterates over a specific `staging_token` (as prefix) by order of key. Should be fine with `KV` `Scan`.
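Following the discussion above (read all commits into memory and apply the same logic used over Postgres), a minimal sketch of the not-first-parent filter; the `commit` type and function name here are illustrative, not the actual graveler iterator API:

```go
package gcsketch

// commit is a minimal illustrative stand-in for the real graveler commit record.
type commit struct {
	ID      string
	Parents []string // Parents[0] is the first parent
}

// notFirstParent returns, in input order, the commits that are not the first
// parent of any other commit - the "leaves" of the first-parent tree, used as
// GC scan starting points. The result need not be a consistent list: a commit
// may gain a child later, which is acceptable per the discussion above.
func notFirstParent(commits []commit) []commit {
	isFirstParent := make(map[string]bool, len(commits))
	for _, c := range commits {
		if len(c.Parents) > 0 {
			isFirstParent[c.Parents[0]] = true
		}
	}
	var out []commit
	for _, c := range commits { // commits arrive ordered by commit ID from Scan
		if !isFirstParent[c.ID] {
			out = append(out, c)
		}
	}
	return out
}
```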

## Deleting a Repository
Unlike deletion of other entities, which is standalone, deleting a repository results in deleting all branches, commits and tags correlated to that repository. While this is protected by a transaction in `SQL`, we need to handle failures for `KV`. One option that comes to mind is to flag the repository as `"being_deleted"` (maybe a `KV` entry for `being_deleted` repos, that can be scanned), clean everything, and if successful remove the flag. Another option is to remove the repository entry first, and deploy a background GC that scans for branches/commits/tags related to a non-existent repository and deletes them.
> **Contributor:** Interesting. The problem with flagging the repository is that now you would have to check it with every call. For example, trying to get a Tag would first need to check that flag, which is stored somewhere else.
> I think the best we can do here is to delete it in an order that would leave a consistent state on every possible failure. One way could be: Tags, Staging Tokens (as in workspaces), Branches (all except for the default), update the default branch commit to "NotACommit", Commits, and finally the default branch and the repo.

> **Contributor:** If a repository (probably with a prefix) is a partition key, then you can delete a repository by deleting everything inside the partition. And you can keep a list of "repositories being deleted" (or an object inside the partition of each repository) and keep retrying to delete those repositories.

> **Contributor:** Right @arielshaqed, but now every time you try to get an object from that repo, you would have to query that list.

> **Contributor Author:** I like the usage of repository as partition. It makes sense in that each scan operation (in this case, for deleting everything) is partition-bounded either way; plus, I think the locality of a partition (supported by DynamoDB, for example) works in our favor here (be it performance-wise or GDPR-wise).
> We will need to maintain the repository entry itself outside of the partition, for repositories listing, correct?
> I think we can "mark" a repository as being-deleted by simply removing its entry from the repositories list, and once successful, start removing everything in the relevant partition.
> Any objection to going with that?

> **Contributor:** All Itais are right. There are two issues here: immediately stop using a repository, and eventually delete it.
>
> To delete a repository, mark it as deleted on its value. So now when you fetch it you can see it's deleted (but not yet removed). If this becomes too expensive, we'll need to put a single cache in front of the kv.
>
> To ensure it gets cleaned up, add it to the repositories-to-delete list, and periodically traverse that list to clean up.

> **Member:** Please note that with the current path definitions - and as @eden-ohana mentioned - there's a risk of accidentally seeping into different entities when scanning on repositories.

> **Contributor Author:** @N-o-Z - that was fixed with the recent update to the keys.

It looks like most (if not all) of `graveler`'s operations perform a repository existence check by issuing a call to `GetRepository`, which performs a DB access. If that is indeed the case, we can easily "mark" a repository as deleted by removing the repository entry from the common partition `graveler`, thus making it unreachable. We can later iterate all entries under this repository partition and delete them one by one, preceding the branch entry deletion by iteratively deleting all relevant staging partitions. Failures can be handled by a background garbage collector (which, BTW, can handle the initial deletion too), but that is only for space-saving purposes and should have no effect on the deleted repository itself once the repository entry is deleted. A point to consider - what happens if a repository entry is deleted, the data in the relevant repository partition fails to delete, and a new repository with the same name is created? This is solvable by adding a guid to the repository and using it as the partition key (as suggested in the `Repository` entity above).
> **Contributor:** There's no need to "Update" - it is not checked in yet.

> **Contributor:** In general, let's bring this document to reflect what we decide on and the next steps. We can have alternatives that were offered and even rejected, but currently it feels more like a discussion log than an agreed-upon plan (e.g. "Other ideas?", "Update", options that have been ruled out appearing on top, etc.)

> **Contributor Author:** Fixed

Another option, suggested by @itaiad200, is to perform the deletion in an order that preserves consistency under every possible failure - e.g. staging partitions need to be deleted before the branch, so as not to leave any hanging in case of failure; commits cannot be deleted before branches, as this may leave a branch pointing to a deleted commit; the default branch cannot be deleted until the repository is deleted; etc. Part of the discovery should define the correct order.
Issue [#3570](https://github.com/treeverse/lakeFS/issues/3570) should handle this decision
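A sketch of the "remove the repository entry first" flow described above, assuming a hypothetical `Store` interface (the actual lakeFS kv API differs):

```go
package deletesketch

// Store is a hypothetical KV interface, not the actual lakeFS kv API.
type Store interface {
	Delete(partition, key string) error
	Scan(partition, prefix string) ([]string, error) // returns keys under prefix
}

// deleteRepository removes the repository entry from the common partition
// first, making the repository unreachable even if the cleanup below fails
// midway; cleanup can then be retried by a background GC, for space
// reclamation only.
func deleteRepository(s Store, name, guid string) error {
	if err := s.Delete("graveler", "repos/"+name); err != nil {
		return err
	}
	// Best-effort cleanup of the repository partition (branches, commits,
	// tags). Staging partitions referenced by each branch would be cleaned
	// the same way before deleting the branch entry itself.
	partition := name + "/" + guid
	keys, err := s.Scan(partition, "")
	if err != nil {
		return err
	}
	for _, k := range keys {
		if err := s.Delete(partition, k); err != nil {
			return err // safe to retry later; the repo entry is already gone
		}
	}
	return nil
}
```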

## Testing
`pkg/graveler/ref` and `pkg/graveler/staging` currently have **very high** test coverage. This should make it easier to trust the transition to `KV`. However, one notable gap is the absence of `benchmarks`. As `graveler` will introduce new algorithms to support lock-free commits, there is a risk of performance degradation, which needs to be identified.
A decision is to be made regarding the required benchmarks - which scenarios do we really want to test. One example is multiple parallel commits, which will definitely fail each other until all retries succeed. While this is an exhaustive scenario that stresses performance to an edge, it might not be as interesting as a "real life" scenario: we can predict that it will show performance degradation for our not-lock-free commit algorithm, but that is probably not interesting. Another example is a single long commit with multiple staging operations taking place at the same time. This scenario is much more common in real-life usage, and a performance regression here is something we had better discover sooner rather than later. This note was added to [#3571](https://github.com/treeverse/lakeFS/issues/3571).
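A skeleton for the "long commit with concurrent staging" scenario; `commitBranch` and `stageObject` are hypothetical stand-ins for the real graveler calls:

```go
package benchsketch

import (
	"fmt"
	"sync"
	"testing"
)

// Hypothetical stand-ins for the real graveler operations under benchmark.
func commitBranch()          { /* one long-running commit */ }
func stageObject(key string) { /* stage a single object */ }

// BenchmarkStagingDuringCommit measures staging throughput while a single
// long commit is in flight - the "real life" scenario described above.
func BenchmarkStagingDuringCommit(b *testing.B) {
	for i := 0; i < b.N; i++ {
		done := make(chan struct{})
		go func() {
			commitBranch()
			close(done)
		}()
		var wg sync.WaitGroup
		for w := 0; w < 16; w++ { // concurrent writers staging objects
			wg.Add(1)
			go func(w int) {
				defer wg.Done()
				for j := 0; j < 100; j++ {
					stageObject(fmt.Sprintf("obj-%d-%d", w, j))
				}
			}(w)
		}
		wg.Wait()
		<-done
	}
}
```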

## Dump/Restore Refs
Uses the `graveler` ref-store for dumping a whole repository (branches, commits and tags) and restoring it to a bare repository. As there are no direct DB accesses, this should go seamlessly, but it definitely needs to be verified.<br/>
Moreover, as these are quite exhaustive operations, they may be used as part of performance verification.

## Suggested Key Schema
**Note:** This is an initial suggestion that seems to support the required iterators. It tries to encapsulate the immutable parts of identifying an object (e.g. a branch can be identified by its name, but not by a commit, as (a) the commit changes and (b) a commit can be shared by branches; a repo and a tag have a one-to-many relation, so a tag key contains its repo, etc.). It is totally open to discussion and improvements.
We will use multiple partitions: one general partition, named TBD (`graveler`), and another partition for each repository, named after that repository. The idea is to isolate the entire repository data, as each operation is in any case bounded to within a specific repository. Repositories themselves will be in the `graveler` partition, to allow listing of all repositories. As it is no longer needed, the `<REPO_NAME>` prefix can be removed from all keys under the `<REPO_NAME>` partition.
* Partition `graveler`
  * Repository - `repos/<REPO_NAME>`

* Partition `<REPO_NAME>`
  * Branch - `branches/<BRANCH_NAME>`
  * Commit - `commits/<COMMIT_ID>`
  * Tag - `tags/<TAG_ID>`

* Partition `<STAGING_TOKEN>` // Note: the token contains the repo name and branch name
  * Staged Obj - `key/<OBJ_KEY>`
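Expressed as Go helpers, the schema looks like the sketch below; the helper names are hypothetical, each returning the `(partition, key)` pair used for the KV access.

```go
package keysketch

// Hypothetical key-construction helpers matching the schema above.
func repoKey(repoName string) (partition, key string) {
	return "graveler", "repos/" + repoName
}

func branchKey(repoPartition, branchName string) (string, string) {
	return repoPartition, "branches/" + branchName
}

func commitKey(repoPartition, commitID string) (string, string) {
	return repoPartition, "commits/" + commitID
}

func tagKey(repoPartition, tagID string) (string, string) {
	return repoPartition, "tags/" + tagID
}

func stagedObjectKey(stagingToken, objKey string) (string, string) {
	// the staging token itself is the partition; it already embeds the
	// repository and branch names as prefixes
	return stagingToken, "key/" + objKey
}
```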

### Key Schema Support of Required Functionalities
**Note:** Only covering non-trivial functionalities.
* `commit_iterator` - this is not an iterator in the `KV` sense, but rather an iteration algorithm where a commit is scanned and its parents are queued for later scanning. There is no real `KV` iteration. Instead, a `Get` is used for each commit, by its ID, which is trivially supported by the above **Commit** key.
* `commit_ordered_iterator` - scans by order of commit ID, which is the natural behavior of our `KV iterator` implementation, and will be supported trivially. The option to scan only `not-first-parent` commits is TBD (still being discussed).
* `branch_iterator` - sorted by branch name: trivial. Sorted by commitID: TBD (still being discussed).
* Repository Deletion - as suggested, maintaining all repository-relevant entries under a unique partition makes it easy to delete as an entire partition. The repository key itself is outside of the partition (in a general common partition, to support listing) and will be deleted first. Once deleted, the entire partition becomes unreachable, leaving the data state consistent either way.

## Commits, Staging and Lock-Free Updates
We will follow @arielshaqed's suggestion to prevent commits overriding each other by using `KV`'s `SetIf` on the branch update, making sure no racing commit has already finished and is being overridden. In case a commit fails to finish due to another commit "beating" it to the finish line, a retry should be attempted. This implementation will require support for multiple staging tokens, as proposed by the [lakeFS on KV design proposal](https://github.com/treeverse/lakeFS/blob/b3204edad00f88f8eb98524ad940fde96e02ab0a/design/open/metadata_kv/index.md#committer-flow). Fully lock-free commits will be handled out of the scope of the KV3 milestone. A sketch of the `SetIf`-based flow appears after the discussion below.
> **Contributor:** Actually, Ariel's suggestion was the lock-free commits.

> **Contributor Author:** So we are actually implementing the full-blown lock-free commits as part of KV3?

> **Contributor Author:** Eventually figured it out. By "Ariel's suggestion" I was referring to this comment, and you were referring to #3091.
> Modified the doc to clarify that.
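A sketch of the optimistic-locking commit update described above, assuming a hypothetical `Store` with a compare-and-set `SetIf` (the real kv package's types and signatures differ):

```go
package commitsketch

import "errors"

// ErrPredicateFailed signals that the branch changed since it was read,
// i.e. a racing commit won. Hypothetical stand-in for the real kv error.
var ErrPredicateFailed = errors.New("predicate failed")

type entry struct {
	Value     []byte
	Predicate any // opaque version token captured at read time
}

// Store is a hypothetical KV interface with a compare-and-set primitive.
type Store interface {
	Get(partition, key string) (*entry, error)
	SetIf(partition, key string, value []byte, predicate any) error
}

// commitBranch retries the commit until its SetIf on the branch entry
// succeeds, so a finished racing commit is never silently overridden.
func commitBranch(s Store, partition, branch string, doCommit func(old []byte) ([]byte, error)) error {
	for {
		e, err := s.Get(partition, "branches/"+branch)
		if err != nil {
			return err
		}
		// Seal the staging token, write the metarange, build the new branch
		// record - all the actual commit work happens in doCommit.
		newVal, err := doCommit(e.Value)
		if err != nil {
			return err
		}
		err = s.SetIf(partition, "branches/"+branch, newVal, e.Predicate)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrPredicateFailed) {
			return err
		}
		// Another commit beat us to the finish line - retry from fresh state.
	}
}
```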


# Execution Plan
* Agree on key schema (see **Suggested Key Schema** above) - [#3567](https://github.com/treeverse/lakeFS/issues/3567)
> **Member:** What I'm missing is the data modelling for the KV store. It can either be part of #3567 or a new issue, but it is required as part of the common code base that we all work over. In auth we didn't do it well enough, and as a result there were many changes during implementation and technical debt was left.

> **Contributor Author:** Data modeling added to the doc. As graveler pretty much uses the SQL DB as a KV store, there is not much of a difference, except for a few things we will need to add.

* Supporting `KV` alongside `SQL` (same as was done for previous modules)
* Step 1: Building an initial `KV graveler` and a `DB graveler`. `KV graveler` should delegate all calls to `DB graveler` (no `KV` usage at all at this step) in order to allow partial `KV` support.
[#3568](https://github.com/treeverse/lakeFS/issues/3568) was modified to reflect that, and should result in continuation tasks for the `KV` support steps within.
* Step 2: Basic implementation of `KV` usage within the `KV graveler` from the previous item. Tasks to be defined as part of [#3568](https://github.com/treeverse/lakeFS/issues/3568)
> **Contributor:** Why not define them now? They're the most important part of this plan!

> **Contributor Author:** Agreed. Will add these.

> **Contributor Author:** Done

* Step 3: Optimistic-Lock Commits, as described above - a working commit algorithm that prevents commits from overriding each other by using `KV`'s `SetIf` to set the branch's commit, failing (and retrying) the commit if `SetIf` fails. [#3569](https://github.com/treeverse/lakeFS/issues/3569)
> **Contributor:** Consider retrying commits on the client side. It will make it easier to provide different retry strategies for different clients.

> **Contributor Author:** @arielshaqed - are you referring to explicit retries by the user, or implicit retries by the (various) clients? I believe it is the former?

* Step 4: Decide and implement a working solution for **Deleting a Repository** - [#3570](https://github.com/treeverse/lakeFS/issues/3570)
(Steps 3 and 4 are independent)
* Benchmarks for common operations - to be executed manually, on both `SQL` and `KV`, to verify performance is satisfactory - [#3571](https://github.com/treeverse/lakeFS/issues/3571)
* Migration of existing `SQL` data to `KV`. Most of the tables are trivial to migrate. I believe `graveler_branches` migration should also be trivial, with no specific consideration for `sealed_tokens`, but that needs to be verified - [#3572](https://github.com/treeverse/lakeFS/issues/3572)
* KV Migration Test is, obviously, included
* Add `graveler` migration to `Esti` Migration Tests - [#3573](https://github.com/treeverse/lakeFS/issues/3573)
* Optional - Dump/Restore Refs tests for `SQL` and `KV` - [#3574](https://github.com/treeverse/lakeFS/issues/3574)
* Optional - `pkg/diagnostics` support of KV. Should be trivial - [#3575](https://github.com/treeverse/lakeFS/issues/3575)