Merged (changes from 3 commits)

design/open/kv3-execution-plan.md - 52 additions, 0 deletions

# General
## Tables
Graveler uses 5 DB tables:
* `graveler_repositories`,`graveler_branches`,`graveler_commits`,`graveler_tags` - used by `pkg/graveler/ref`
* `graveler_staging_kv` - used by `pkg/graveler/staging`

`graveler_repositories`, `graveler_commits` and `graveler_tags` are immutable and do not update entries, so they should be straightforward to implement over our `KV` (`graveler_commits` actually performs an update for commit generation, but this is only valid for `refs-restore`, which can only take place on a `bare` repo, and so it is of no concern)<br/>
`graveler_branches` is currently protected by a lock during commit actions. In order to support lockless commits with `KV`, the [following design](https://github.com/treeverse/lakeFS/blob/b3204edad00f88f8eb98524ad940fde96e02ab0a/design/open/metadata_kv/index.md#graveler-metadata---branches-and-staged-writes) is proposed. However, as a first step we can implement it using a naive table lock, just to get it going, or better, `branch-key`-based locking, to prevent staging during commits<br/>
**Member:**

I agree with this approach.
Generally, I think the lock-less feature is a post-KV feature

**Contributor:**

I agree!

Perfectly doable to start without lockfree commits. It will roughly mirror the intended current semantics -- which indeed starve commits. Also the actual current semantics are broken and can lose writes, so just locking (correctly) is an improvement!

However we do need lockfree commits in order to support many streaming ingestion use-cases.

**Contributor Author:**

Looks like we're aligned on this one

**Contributor Author:**

Created #3586 to fully support lock-free commits, out of KV3 scope

**Contributor:**

After reading it once again with the help of @arielshaqed and @N-o-Z comments, I think I have a better idea of what you were aiming for here. A naive table lock means that we would be using the same locking technique we have for the current postgres DB, right? Well - that cannot be achieved with KV. It's not part of the KV interface, and adding it to the current 3 implementations we have is harder than just straightforwardly using the mechanisms described in the original proposal.

**Contributor:**

I think by now we are arguing about different words to describe the exact same thing. I would strongly encourage that we simply stop using the word "lock"; CAS-type operations ("SetIf") like those used here are at a lower level and easier to use directly. The term "lock" is overloaded and has many different meanings. (Note that "lock-free" does not talk about locks, intentionally, so we can continue to use that term.)

Let's be concrete. I call the following "the naïve way" to perform a protected action on a branch:

1. Get the original branch state `o`.
2. (Maybe perform some actions; they might all be lost in the last step.)
3. Compute a new branch state `n`.
4. `SetIf` the branch state to `n` IF it is still `o`.
   * (If setting failed and some other actions were performed, maybe clean them up.)

This type of process is sometimes loosely called "optimistic locking".
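In Go, this "naïve way" can be sketched as a retry loop around a compare-and-swap. This is a minimal sketch: the `store` type and `SetIf` signature below are illustrative stand-ins, not the real lakeFS KV interface.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical in-memory stand-in for the KV store; the real interface
// and SetIf signature in lakeFS may differ -- this is only a sketch.
var ErrPredicateFailed = errors.New("kv: predicate failed")

type store map[string]string

func (s store) Get(k string) string { return s[k] }

// SetIf writes next only if the stored value still equals expected (CAS).
func (s store) SetIf(k, next, expected string) error {
	if s[k] != expected {
		return ErrPredicateFailed
	}
	s[k] = next
	return nil
}

// commitBranch runs the naive protected update described above:
// get o, compute n, SetIf the state to n IF it is still o, retry on failure.
func commitBranch(s store, branch string, compute func(string) string) string {
	for {
		o := s.Get(branch) // 1. original branch state o
		n := compute(o)    // 3. new branch state n
		if err := s.SetIf(branch, n, o); err == nil { // 4. CAS
			return n
		}
		// Predicate failed: another writer won the race; retry from the top.
	}
}

func main() {
	s := store{"branch/main": "commit-1"}
	fmt.Println(commitBranch(s, "branch/main", func(o string) string { return o + "+staged" }))
	// prints "commit-1+staged"
}
```

A concurrent writer that races this loop simply forces one extra iteration; no writer can silently lose its update, which is exactly what "optimistic locking" buys here.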

**Contributor Author:**

Actually, what I had in mind for naive and temporary locking is locking some common resource at a logical level, not related to KV.
Both staging and commit operations (on the same branch) mutually block each other (this is the case, right? any other operations need to be blocked), by locking a common resource (a mutex?), just for the correctness of things.
We can reduce the contention by having NumLocks (say 2^10) such locks, with each operation locking branchID % NumLocks, letting several operations (up to NumLocks; not sure how to calculate the average) run in parallel.
Any reason why that should not work as a naive, temporary, yet correct solution?

`graveler_staging_kv` creates entries based on the current branch `staging_token` and is protected from creating entries during a commit under the same lock. This should also be supported lockless-ly, as referred to by the above proposal, and can also be solved with naive locking, as a first step<br/>
**Contributor:**

Suggested change
`graveler_staging_kv` creates entries based on the current branch `staging_token` and is protected from creating entries during a commit under the same lock. This should also be supported lockless-ly, as referred to by the above proposal, and can also be solved with naive locking, as a first step</br>
`graveler_staging_kv` creates entries based on the current branch `staging_token` and is protected from creating entries during a commit under the same lock. This should also be supported lockfree-ly, as referred to by the above proposal, and can also be solved with naive locking, as a first step</br>

The term is "lock-free", not "lock-less".

**Contributor Author:**

Rephrased

Note: All the above tables (along with the late `auth_users`) are also used by `pkg/diagnostics` for diagnostics collection (RO).

## Iterators
The following iterators are implemented and used by `pkg/graveler`:
* `branch_iterator` - iterates branches by repository. Has an option to order by ID (trivial) or by CommitID (Used by GC. How do we support that?)
**Contributor:**

I don't understand why anyone would want to sort anything based on a randomly generated string (commit ID) - maybe it's not really needed?

**Contributor:**

I don't understand this either, and I am not sure where GC looks up a branch by commit ID. Indeed, given that a commit belongs to multiple branches, I am not sure what this lookup would do.

Assuming "ID" here means "name" -- otherwise we may have difficulty paginating a lengthy branches display in the GUI.

**Contributor Author:**

@arielshaqed - "ID" does mean "name"

**Contributor Author:**

@johnnyaug - can you please explain the purpose of sorting branches for iteration, by the associated commitID?
Used here
Thanks 🙏

**Contributor:**

@Itai-David, sure: this iterator is then used as input for a GCStartingPointIterator. This iterator assumes the branch heads are sorted by commit ID. This way, it can go over branch heads side by side with other commit IDs that are also sorted, and compare them.

**Contributor Author:**

Thank you @johnnyaug.
@itaiad200 @arielshaqed - I looked at the GC code and it looks like we either need to support that with our Graveler over KV impl, or we will need to change the way GC traverses commits.
I think the first can be supported by maintaining a `repo/<REPO_NAME>/commit/<COMMIT_ID>/branch/<BRANCH_NAME>` key. These keys will have to be constantly removed and added as commits take place, but we do have all the info at the time of the commit, so I don't think it is too difficult.
WDYT?

**Contributor:**

Commits don't have branches.
If we try to attach branches to them, it will all end in tears.
I am not sure we need to fetch commits in sorted order to merge them. Let's define the algorithm and then see about speeding it up. I'm thinking maybe a Bloom filter on ancestors of each commit?

**Contributor Author:**

@arielshaqed I'm not sure I fully understand how a Bloom filter can help here, but I guess this requires a bit of a redesign of the GC commit-gathering algorithm (?)
If so, I think it should be out of scope for this KV milestone - I think we should "force" our graveler-over-KV impl to provide all the functionality currently supported by graveler over SQL, in terms of correctness, even at the painful price of a performance regression, and handle any significant algorithmic change in another scope. WDYT?
For that purpose, I think defining a commitId to branchName(s) mapping (as a key?) can provide what we need, at least as a first step (regardless of the (very) wrong conception that commits "have" or "belong to" branch(es))

**Contributor:**

@arielshaqed and I discussed it a bit more, and I think we now understand why it's needed for GCStartingPointIterator. We can have the same commitID order for a branch iterator on top of the kv implementation by simply fetching all branches and sorting them in memory. It will not require introducing any new indexing in KV. A counter-argument would be memory consumption, but it's used by an algorithm that already stores all commits in memory, so there should not be any degradation.

**Contributor Author:**

Added to the doc - support of iteration by commitID by sorting branches in memory
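The agreed approach - fetch all branches and re-sort by commit ID in memory - might look roughly like this. The `Branch` struct here is a simplified stand-in for the graveler type, not the actual definition.

```go
package main

import (
	"fmt"
	"sort"
)

// Branch is a simplified stand-in for the graveler branch record.
type Branch struct {
	ID       string // branch name; a KV Scan naturally yields this order
	CommitID string
}

// listBranchesByCommitID sketches the approach above: take the branches
// as scanned (ordered by ID) and re-sort the slice by CommitID in memory,
// which is the order GCStartingPointIterator expects.
func listBranchesByCommitID(branches []Branch) []Branch {
	out := append([]Branch(nil), branches...) // copy; leave the scan order intact
	sort.Slice(out, func(i, j int) bool { return out[i].CommitID < out[j].CommitID })
	return out
}

func main() {
	byCommit := listBranchesByCommitID([]Branch{
		{ID: "dev", CommitID: "c9"},
		{ID: "feature-x", CommitID: "c1"},
		{ID: "main", CommitID: "c5"},
	})
	for _, b := range byCommit {
		fmt.Println(b.CommitID, b.ID)
	}
}
```

As noted in the thread, memory use is bounded by the number of branches, which the GC algorithm already holds in memory anyway.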

* `commit_iterator` - iterates a commit ancestry by running BFS. Not an `iterator` in the `KV` sense. Should be supported as is with `KV` `Get`
* `commit_ordered_iterator` - iterates commits of a repo, ordered by commit ID (regular `KV` `Scan` supports that). However - has an option to iterate over not-first-parent commits only (e.g. commits that are not a first parent of any other commit). Probably could (and should) be implemented by a specific `Next()` for that case
**Contributor:**

Needs more details. The list of not-first-parent commits can always change, since you may introduce a new child to any commit at any time. I wonder how we use this list? And why would the ordering by commit ID matter?

I find it hard to come up with an iterator on top of the kv implementation that would cover these requirements.

**Contributor Author:**

These commits are used side by side with the branch-by-commitID list, as explained by @johnnyaug, hence the order by commitID for that list too.
Regarding the not-first-parent attribute, its usage, and the fact that it is subject to change - is it any different from the way it is used today, over SQL? I see that the decision of whether a commit is not-first-parent is made at the time of query execution, but this can also change by the time the GC algorithm gets there. Am I wrong?
I guess I miss some context on that algorithm - @johnnyaug, can you please help here?

**Contributor:**

You can think of the "not first parent" thing as a generalization of tree leaves. If you take the commit tree and drop all connections (edges) that are not first-parent, these commits are the leaves of this tree.

This is used to find all potential starting points for scanning the tree towards garbage collection.

In the end, this doesn't need to be a consistent list: when you find a commit that is not a first parent, you can safely return it, regardless of whether the resulting list as a whole makes sense.

**Contributor Author (@itaidavid, Jun 27, 2022):**

> I find it hard to come up with an iterator on top of the kv implementation that would cover these requirements.

@itaiad200 - I absolutely agree with that last sentence. Maybe we can maintain a live set of not-first-parent commits and modify it as commits occur, i.e. each new commit is by default not-first-parent, and for each further operation which makes it a first parent, we try to remove that commit from the list (possibly already removed, hence the try). (Question: Can a commit become not-first-parent if it is already a first parent?)
I can think of a couple of ways to implement this list, each option with its own drawbacks:

* We can use some `not_first_parent/<COMMIT_ID>` key. This will require searching the list for each commit we modify, but will make it easier to list all not-first-parent commits, or
* We can "flag" the commit key itself, i.e. the commit key `commit/<COMMIT_ID>` is accompanied by `commit/<COMMIT_ID>/not_first_parent` (not sure this works well with our scan logic, right?), or, for the same purpose, add an `isFirstParent` attribute to the commit entry. However, I don't think this option supports the use case

What other options do we have?

**Contributor Author:**

> In the end, this doesn't need to be a consistent list: when you find a commit that is not a first parent, you can safely return it, regardless of whether the resulting list as a whole makes sense.

Thank you for this input @johnnyaug.
What happens if I find a not-first-parent commit, return it as a starting point, and before the GC continues, a commit is created on top of that commit, making it a first parent? Is that possible?

**Contributor:**

I think that we might be overworking this. The fact that something is efficiently possible on PostgreSQL does not mean that we have to do it efficiently in the KV world and in the same way.

Also please note that the proposal to add parameters to Next() simply makes it not an iterator. I would argue that the iterator abstraction adds nothing here.

We will not be able to manage this list of "not first-parent commits" correctly and efficiently. The question is really, do we need to? We will probably be scanning all commits, or all commits up to a previously-processed set of "safe" commits, on each GC. We could process not-first commits "just" by scanning all of these. (I am starting to think that the commits logic for GC should also move off of lakeFS and into the Spark cluster. @johnnyaug WDYT?)

**Contributor:**

I'm totally open to it - the not-first-parent thing is a means to an end. We just need to start scanning the tree somewhere.

**Contributor Author:**

I'm aiming at preserving the current behavior with KV support, as a first step, and improve that later, based on our needs. I think introducing the KV aspect is enough of a load already, and would like to stay away from redesigning our GC as a side effect.
Even though it will not be as efficient, I think it can work, at least as a first step. Moreover, since it is the GC, I think we can loosen up the efficiency requirements, again - at least as a first step.
@arielshaqed @johnnyaug - WDYT?

**Contributor Author:**

@arielshaqed - either way, we are not changing the interface of Next()

**Contributor:**

@Itai-David I understood that we will be reading everything to memory, applying the same logic that was used in Postgres, thus implementing the same iterators as today. If so, what are the remaining open questions?

* `repository_iterator`, `tag_iterator` - simple iterators ordered by ID. `KV` standard `Scan` should be sufficient
* `staging/iterator` - iterates over a specific `staging_token` (as prefix) by order of key. Should be good with `KV` `Scan`
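As a sketch of the `commit_iterator` point above: BFS over a commit's ancestry needs only point lookups, never an ordered scan. The `Commit` struct and the in-memory map below are illustrative stand-ins for the real KV-backed store.

```go
package main

import "fmt"

// Commit is a minimal stand-in for the graveler commit record; in the
// real store each lookup below would be a KV Get by commit key.
type Commit struct {
	ID      string
	Parents []string
}

// ancestry runs a commit_iterator-style BFS over a commit's ancestry.
// No ordered Scan is needed: BFS only ever follows parent pointers.
func ancestry(commits map[string]Commit, start string) []string {
	seen := map[string]bool{start: true}
	queue := []string{start}
	var order []string
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		order = append(order, id)
		for _, p := range commits[id].Parents { // one Get per parent hop
			if !seen[p] {
				seen[p] = true
				queue = append(queue, p)
			}
		}
	}
	return order
}

func main() {
	commits := map[string]Commit{
		"c3": {ID: "c3", Parents: []string{"c2"}},
		"c2": {ID: "c2", Parents: []string{"c1"}},
		"c1": {ID: "c1"},
	}
	fmt.Println(ancestry(commits, "c3")) // prints [c3 c2 c1]
}
```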

## Deleting a Repository
Unlike other entity deletions, which are standalone, deleting a repository results in deleting all branches, commits and tags correlated to that repository. While this is protected by a transaction in `SQL`, we need to handle failures for `KV`. One option that comes to mind is to flag the repository as `"being_deleted"` (maybe a `KV` entry for `being_deleted` repos, that can be scanned), clean everything, and if successful remove the flag. Another option is to remove the repository entry first, and deploy a background GC that scans for branches/commits/tags related to a non-existent repository and deletes them. Other ideas?
**Contributor:**

Interesting. The problem with flagging the repository is that now you would have to check it with every call. For example, trying to get a Tag would first need to check that flag which is stored somewhere else.
I think the best we can do here is to delete it in an order that would leave a consistent state in every possible failure. One way could be: Tags, Staging Tokens (as in workspaces), Branches (all except for the default), update the default branch commit to "NotACommit", Commits and finally the default branch and the repo.

**Contributor:**

If a repository (probably with a prefix) is a partition key, then you can delete a repository by deleting everything inside the partition. And you can keep a list of "repositories being deleted" (or an object inside the partition of each repository) and keep retrying to delete those repositories.

**Contributor:**

Right @arielshaqed, but now every time you try to get an object from that repo, you would have to query that list.

**Contributor Author:**

I like the usage of a repository as a partition. It makes sense in that each scan operation (in this case, for deleting everything) is partition-bounded either way; plus, I think the locality of a partition (supported by DynamoDB, for example) works in our benefit here (be it performance-wise or GDPR-wise).
We will need to maintain the repository entry itself outside of the partition, for repository listing, correct?
I think we can "mark" a repository as being-deleted by simply removing its entry from the repositories list, and once successful, start removing everything in the relevant partition
Any objection to going with that?

**Contributor:**

All Itais are right. There are two issues here: immediately stop using a repository, and eventually delete it.

To delete a repository, mark it as deleted on its value. So now when you fetch it you can see it's deleted (but not yet removed). If this becomes too expensive, we'll need to put a single cache in front of the kv.

To ensure it gets cleaned up, add it to the repositories-to-delete list. And periodically treeverse that list to clean up.

**Member:**

Please note that, with the current path definitions - and as @eden-ohana mentioned - there's a risk of accidentally sweeping different entities into a scan on repositories

**Contributor Author:**

@N-o-Z - that was fixed with the recent update to the keys
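The deletion flow converged on above (remove the repository from listings first, track the pending cleanup, then delete the repository's partition) can be sketched as follows. The key layout and the `deleted-repos/` pending list are assumptions for illustration, not the final design.

```go
package main

import (
	"fmt"
	"strings"
)

// kv stands in for the real store; the "deleted-repos/" pending-cleanup
// list and the exact key layout are assumptions for illustration only.
type kv map[string]string

// deleteRepo follows the flow discussed above:
// 1. remove the repository entry, so it no longer appears in listings;
// 2. record it in a pending-deletion list;
// 3. delete everything under the repository's prefix (its "partition")
//    -- this step can be retried later if the process dies mid-way;
// 4. drop the pending-deletion record once the partition is empty.
func deleteRepo(db kv, name string) {
	delete(db, "repo/"+name)       // step 1
	db["deleted-repos/"+name] = "" // step 2
	prefix := "repo/" + name + "/"
	for k := range db { // step 3: a prefix Scan in a real KV store
		if strings.HasPrefix(k, prefix) {
			delete(db, k)
		}
	}
	delete(db, "deleted-repos/"+name) // step 4
}

func main() {
	db := kv{
		"repo/r1":             "{...}",
		"repo/r1/branch/main": "{...}",
		"repo/r1/tag/v1":      "{...}",
		"repo/r2":             "{...}",
	}
	deleteRepo(db, "r1")
	fmt.Println(len(db)) // prints 1: only repo/r2 remains
}
```

A crash between steps 2 and 4 leaves the repository invisible to listings but recorded in the pending list, so a periodic sweep can resume step 3 safely.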


## Testing
`pkg/graveler/ref` and `pkg/graveler/staging` currently have **very high** test coverage. This should make it easier to trust the transition to `KV`. However, one notable gap is the absence of `benchmarks`. As `graveler` will introduce new algorithms to support lockless commits, there is a risk of performance degradation which needs to be identified
**Member:**

Since you already stated that lock-less commits are out of scope for KV, I think this part is irrelevant for this doc

**Contributor Author:**

Out of KV3 scope, or out of KV scope altogether?
I think we need a wider range of opinions here
@eden-ohana @itaiad200 - WDYT?

**Contributor:**

Let's be more specific - what are the things we think can degrade? There might be paths we would like to benchmark now. Graveler is pretty wide, so if you have any specific paths you think we need to benchmark, let's discuss.

**Contributor Author:**

Added to the doc and to #3571, following the comments we discussed earlier regarding which scenarios are interesting and should be benchmarked


## Dump/Restore Refs
Uses the `graveler` ref-store for dumping a whole repository (branches, commits and tags) and restoring it to a bare repository. As there are no direct DB accesses, this should go seamlessly, but it definitely needs to be verified<br/>
Moreover, as these are quite exhaustive operations, they may be used as part of performance verification

## Suggested Key Schema
Note: This is an initial suggestion that seems to support the required iterators. It tries to encapsulate the immutable parts of identifying an object (e.g. a branch can be identified by its name, but not by a commit, as the commit (a) changes and (b) can be shared by branches; a repo and a tag have a one-to-many relation, so a tag key contains its repo, etc.). It is totally open to discussion and improvements.
**Contributor:**

I don't think it covers the special iterators above.

**Contributor Author:**

Correct - we are still discussing it. I believe we should support these with additional keys, and strive to remove them in an improvements iteration out of the scope of this milestone (and probably out of the scope of KV altogether, as suggested by @N-o-Z)

* Repository - `repo/<REPO_NAME>`
* Branch - `repo/<REPO_NAME>/branch<BRANCH_NAME>`
**Contributor:**

Missing a /?

**Member:**

I would change the naming convention to plural - 'branches', 'commits' etc. Consistent with other packages

**Contributor:**

Shouldn't `repo/<REPO_NAME>` have a different name, so that listing repositories doesn't also list the branches (`repo/<REPO_NAME>/branch<BRANCH_NAME>`)?

**Contributor Author:**

@itaiad200 - Yup. Fixed
@N-o-Z - done
@eden-ohana - Good catch, thank you. Fixed

* Commit - `repo/<REPO_NAME>/commit/<COMMIT_ID>`
**Contributor:**

REPO_NAME == repository ID
BRANCH_NAME == branch ID

**Contributor Author:**

Correct. These are the same. I tried to maintain the SQL names, just for simplicity of this doc

* Tag - `repo/<REPO_NAME>/tag/<TAG_ID>`
**Contributor:**

tag path should save its commit ID

**Contributor Author:**

As part of the key?
It will be in the value

* Staged Obj - `repo/<REPO_NAME>/staging_token/<STAGING_TOKEN>/key/<OBJ_KEY>`
**Contributor:**

I pray for 0 double slash bugs 🙏

**Contributor:**

I think the staging token is per branch; we should add a branch prefix

**Contributor Author:**

@eden-ohana - agree. Added under the new schema
@itaiad200 - double slash is the obvious one. I added an option for triple slashes... My new suggestion removes redundant keywords such as `key` and `branch` from everywhere except the key prefix
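For illustration, hypothetical helpers that build keys per the suggested schema. The separators and the branch-scoped staging token are still under discussion in this thread, so these formats are a sketch, not the final layout.

```go
package main

import "fmt"

// Hypothetical key builders matching the suggested schema above; names
// and formats are illustrative only, not lakeFS APIs.
func repoKey(repo string) string { return fmt.Sprintf("repo/%s", repo) }

func branchKey(repo, branch string) string {
	return fmt.Sprintf("repo/%s/branch/%s", repo, branch)
}

func commitKey(repo, commitID string) string {
	return fmt.Sprintf("repo/%s/commit/%s", repo, commitID)
}

func tagKey(repo, tagID string) string {
	return fmt.Sprintf("repo/%s/tag/%s", repo, tagID)
}

func stagedObjKey(repo, stagingToken, objKey string) string {
	return fmt.Sprintf("repo/%s/staging_token/%s/key/%s", repo, stagingToken, objKey)
}

func main() {
	fmt.Println(branchKey("my-repo", "main"))
	fmt.Println(stagedObjKey("my-repo", "st-01", "data/part-0001"))
}
```

Centralizing key construction in one place like this also makes the double-slash concern raised above testable in one spot.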


# Execution Plan
* Agree on keys schema (see hereafter) - [#3567](https://github.com/treeverse/lakeFS/issues/3567)
**Member:**

What I'm missing is the data modelling for the KV store. It can either be part of #3567 or a new issue, but it is required as part of the common code base that we all work over. In auth we didn't do it well enough, and as a result there were many changes during implementation and technical debt was left.

**Contributor Author:**

Data modeling added to the doc. As graveler pretty much uses the SQL DB as a KV store, there is not much of a difference, except for a few things we will need to add

* Supporting `KV` alongside `SQL` (same as was done for previous modules)
* Step 1: Naive implementation with locking - `lakeFS` on KV can start with empty DB and run all graveler operations correctly, although performance may be degraded and exhaustive operations (concurrent commits and uploads, anything else?) should not be considered - [#3568](https://github.com/treeverse/lakeFS/issues/3568)
**Contributor:**

This step alone is quite big. It means reimplementing Graveler in a single step. I think we should support a hybrid mode, i.e. if it makes sense, move just the tags, then just the staging areas, then commits, and finally branches.

**Contributor Author:**

Are you suggesting breaking it into smaller steps within the same deliverable, which has mainly administrative aspects, or creating a deliverable for each sub-step (e.g. a working, mergeable graveler with hybrid support)?
I'm not sure what the latter means in terms of effort and overhead to get such a hybrid working

If we look at it, it is a big step, work-wise, but I believe the auth implementation was bigger in terms of both size and complexity. Moreover, since test coverage is excellent, it should converge much quicker IMHO
WDYT?

**Contributor:**

If "naive implementation with locking" means replacing all of graveler to work on top of the kv interface and adding a locking mechanism to support that, then I think it's too big. I also think auth was way smaller than that, and it was also too big for a single step/PR, which is what we did.

**Contributor Author:**

Added to doc - implementing an initial skeleton with no real KV access (delegating all DB-related calls to the DB graveler, as we discussed) and deriving follow-up implementation tasks

* Step 2: Lockless Commits - as described above - [#3569](https://github.com/treeverse/lakeFS/issues/3569)
**Member:**

I seem to have misunderstood the previous section - I really think lock-free commits are not part of the KV scope. I believe it's best to first focus on the transition to KV with the current implementation, and only afterwards work on the lock-free feature

**Contributor Author:**

I can definitely live with lock-free out of KV scope

* Step 3: Decide and implement a working solution for **Deleting a Repository** - [#3570](https://github.com/treeverse/lakeFS/issues/3570)
(Steps 2 and 3 are independent)
* Benchmarks for common operations - to be executed manually, on both `SQL` and `KV` to verify performance are satisfactory - [#3571](https://github.com/treeverse/lakeFS/issues/3571)
* Migration of existing `SQL` data to `KV`. Most of the tables are trivial to migrate. I believe `graveler_branches` migration should also be trivial, with no specific consideration for `sealed_tokens`, but that needs to be verified - [#3572](https://github.com/treeverse/lakeFS/issues/3572)
* KV Migration Test is, obviously, included
* Add `graveler` migration to `Esti` Migration Tests - [#3573](https://github.com/treeverse/lakeFS/issues/3573)
* Optional - Dump/Restore Refs tests for `SQL` and `KV` - [#3574](https://github.com/treeverse/lakeFS/issues/3574)
* Optional - `pkg/diagnostics` support of KV. Should be trivial - [#3575](https://github.com/treeverse/lakeFS/issues/3575)