Design "pack files" for Graveler committed data #5613

@arielshaqed

Description

Why

This is an issue to design a particular performance improvement that applies to many cross-commit operations. On large repositories that perform many lakeFS operations, it should help speed up all of these:

  • Computing expired addresses in GC
  • Path-specific lakectl log and "blame"
  • Finding merge bases (requires only a smaller part)

I have seen the first two as user and customer issues, and the third in testing.

Packs

[Image: European Gray Wolf pack, linked to image page and license]

Periodically create "pack files" for lakeFS immutable data. Packs are smaller and held in fewer objects than the corresponding commits. We will need two kinds of pack files:

  • Graph packs: A Parquet file that holds part of the commits DAG. Having the commits DAG allows rapid graph traversal.

  • Path packs: A Parquet file (so actually many objects on the backing store) that holds the contents of a large number of Graveler metaranges. Keys are TBD: either <lakeFS key, physical path> or <lakeFS key, some commit where this physical path appeared>, so every physical path that a given key ever held can appear. Values include all object metadata as well as identification of all commits where that key holds that physical path. Sketches of some ways to do this appear below.
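To make the path-pack idea concrete, here is a sketch of what one row might carry. All field names and types here are hypothetical illustrations, not a proposed schema:

```python
from dataclasses import dataclass, field

# Hypothetical path-pack row; column names are illustrative only.
@dataclass
class PathPackRow:
    lakefs_key: str      # logical key in the repository
    physical_path: str   # address of the object on the backing store
    size_bytes: int      # object metadata travels with the row...
    checksum: str        # ...so consumers need not fetch ranges
    commits: list[str] = field(default_factory=list)  # commits where key -> physical_path

row = PathPackRow("data/part-0001", "s3://bucket/abc123", 4096, "deadbeef", ["c7", "c9"])
```

In an actual Parquet file each of these fields would be a column, so a scan can project only the columns a given operation needs.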

The power of packs

Packs are strictly optional. Code can never assume it has a pack for every commit that it needs to traverse. Packs can speed things up!

GC

Finding expired objects in GC is data intensive: every physical path in every range in every metarange in every commit must be annotated with whether or not it is expired, and then physical addresses must be reduced. So all ranges are scanned!

A path pack allows much faster computation: fewer objects need to be fetched from the backing store, and we can use algorithms on the commits graph to speed up listing the objects in the path pack which appear only in expired commits.
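One way to phrase that reduction (a sketch over in-memory rows; the field names are hypothetical): a physical address is deletable only if every commit referencing it is expired.

```python
def expired_addresses(pack_rows, expired_commits):
    """Physical addresses referenced *only* by expired commits are safe to delete.
    pack_rows: dicts with hypothetical fields "physical_path" and "commits"."""
    expired = set(expired_commits)
    live, candidates = set(), set()
    for row in pack_rows:
        if set(row["commits"]) <= expired:
            candidates.add(row["physical_path"])
        else:
            live.add(row["physical_path"])
    # an address may appear in several rows; any live reference rescues it
    return candidates - live

rows = [
    {"physical_path": "p1", "commits": ["c1"]},
    {"physical_path": "p2", "commits": ["c1", "c3"]},
    {"physical_path": "p3", "commits": ["c2"]},
]
print(sorted(expired_addresses(rows, {"c1", "c2"})))  # -> ['p1', 'p3']; p2 survives via c3
```

This is exactly the set-difference that today requires scanning every range in every metarange; with a path pack it becomes a single pass over pack rows.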

Blame

Path packs provide obvious opportunities to speed up logging by path.
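For instance (a sketch, not lakeFS code; field names are hypothetical), log-by-path reduces to a filtered scan of the pack rather than a walk of every commit's metarange:

```python
def log_by_path(pack_rows, key):
    """All commits where `key` resolved to each physical path --
    the raw material for path-specific log and blame."""
    history = {}
    for row in pack_rows:
        if row["lakefs_key"] == key:
            history[row["physical_path"]] = list(row["commits"])
    return history

rows = [
    {"lakefs_key": "a/b", "physical_path": "p1", "commits": ["c1", "c2"]},
    {"lakefs_key": "a/b", "physical_path": "p2", "commits": ["c3"]},
    {"lakefs_key": "x/y", "physical_path": "p9", "commits": ["c1"]},
]
print(log_by_path(rows, "a/b"))  # -> {'p1': ['c1', 'c2'], 'p2': ['c3']}
```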

Find merge base

When traversing the commits graph, as soon as a graph pack is hit it can be traversed more rapidly than DynamoDB: just read the "parents" column from the pack.
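A naive sketch of that traversal, assuming the "parents" column has been loaded into a dict (this is the shape of the walk, not the production algorithm):

```python
def merge_base(parents, a, b):
    """Naive merge base over a parents mapping (commit -> list of parents),
    i.e. the pack's "parents" column loaded into memory."""
    def ancestors(c):
        seen, stack = set(), [c]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(parents.get(n, []))
        return seen

    common = ancestors(a) & ancestors(b)
    # heuristic: the deepest common ancestor has the most ancestors of its own
    return max(common, key=lambda c: len(ancestors(c))) if common else None

dag = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c2"]}
print(merge_base(dag, "c3", "c4"))  # -> c2
```

The point is that each step is a column read from a local pack instead of a round trip to the KV store.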

How to hold commits

This is an interesting part of discovery: what's a good way to hold the commits at which a given <lakeFS path, physical path> appeared?

A good first phase might be to hold a bitmap of whether each keypair appears in each commit.
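In the first phase, a presence check is just a bit lookup. A tiny sketch (the layout and names are hypothetical; a real implementation would use compressed bitmaps):

```python
# One bit per (keypair, commit) cell; index i corresponds to commits[i].
commits = ["c1", "c2", "c3"]
presence = {
    ("a/b", "p1"): [True, True, False],   # key a/b -> p1 exists in c1 and c2
    ("a/b", "p2"): [False, False, True],  # replaced by p2 at c3
}

def appears_in(keypair, commit):
    return presence.get(keypair, [False] * len(commits))[commits.index(commit)]

print(appears_in(("a/b", "p2"), "c3"))  # -> True
```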

But we can probably do much better as a second phase! Hold, for each keypair, the commits at which it appeared and the commits at which it vanished. Typically a keypair appears at a single commit (and the new changes to lakeFS uncommitted data will guarantee this in future), and vanishes at few commits. So this list can be short. Now we can perform algorithms on ranges in graphs to determine when the object on a path changed :-)
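Under the simplifying assumption of a linear history (so commits can be numbered), membership against such appear/vanish lists is a short interval check. A sketch; the general DAG case needs the range-in-graph algorithms mentioned above:

```python
import math

def live_at(appeared, vanished, i):
    """True if the keypair is present at commit index i.
    appeared/vanished: sorted commit indices; each interval is [appear, vanish)."""
    # pair each appearance with the next vanish; open-ended if never vanished
    ends = list(vanished) + [math.inf] * (len(appeared) - len(vanished))
    return any(a <= i < v for a, v in zip(appeared, ends))

# appeared at commit 2, vanished at 5, reappeared at 7 and never vanished
print([live_at([2, 7], [5], i) for i in (1, 3, 5, 8)])  # -> [False, True, False, True]
```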
