gernest and others added 29 commits December 14, 2023 09:34
This commit removes `(*Table).Write` in favor of `(*GenericTable[T]).Write`

`GenericTable` offers a lot of improvements over the previous API. One of the
biggest advantages is that there is no longer a need to define the schema
beforehand; instead, the schema is part of the struct that is written, in the
form of struct tags.

Example code has been refactored to use `map[string]T` for dynamic columns;
slice-of-struct dynamic columns still work. This is intentional, to keep
the PR from blowing out of proportion.

`GenericTable` embeds `Table`, so from the user's perspective only a few changes
are needed to work with the new API.
No special tag is needed: any field of type `map[string]T`, where T is
`int64|bool|float64|string`, is a dynamic column.

Example

```go
type Sample struct {
	Labels      map[string]string `frostdb:"labels,rle_dict,asc(1),null_first"`
}
```
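The tag-driven schema derivation can be sketched with plain reflection. This is a hypothetical helper, not the frostdb schema builder; only the `Sample` type and its tag come from the example above:

```go
package main

import (
	"fmt"
	"reflect"
)

// Sample mirrors the example above: Labels is a dynamic column and
// carries the frostdb struct tag describing its encoding and sorting.
type Sample struct {
	Labels map[string]string `frostdb:"labels,rle_dict,asc(1),null_first"`
}

// dynamicColumns returns the names of all fields of type map[string]T,
// where T is one of the supported base kinds. Illustrative only.
func dynamicColumns(v any) []string {
	t := reflect.TypeOf(v)
	var cols []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.Type.Kind() != reflect.Map || f.Type.Key().Kind() != reflect.String {
			continue
		}
		switch f.Type.Elem().Kind() {
		case reflect.Int64, reflect.Bool, reflect.Float64, reflect.String:
			cols = append(cols, f.Name)
		}
	}
	return cols
}

func main() {
	fmt.Println(dynamicColumns(Sample{})) // the Labels field qualifies
}
```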
* pqarrow/arrowutils: Add SortRecord and ReorderRecord

This is extracted from a previous PR, #461.

* pqarrow/arrowutils: Update SortRecord to allow for multiple sort columns

This isn't implemented yet; only the function signature is future-proof.

* pqarrow/arrowutils: Use compute.Take for ReorderRecord

* pqarrow/arrowutils: Add support for sorting NULL

NULL always gets sorted to the back. This seems to be the default for other language implementations. It can be made configurable in the future.
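The nulls-last behavior can be sketched on a plain Go slice of nullable values. This is a minimal stand-in; the real code sorts arrow arrays:

```go
package main

import (
	"fmt"
	"sort"
)

// sortNullsLast sorts nullable int64 values (nil = NULL), always placing
// NULLs at the back, mirroring the default described above.
func sortNullsLast(vals []*int64) {
	sort.SliceStable(vals, func(i, j int) bool {
		switch {
		case vals[i] == nil:
			return false // NULL never sorts before anything
		case vals[j] == nil:
			return true // non-NULL sorts before NULL
		default:
			return *vals[i] < *vals[j]
		}
	})
}

func main() {
	n := func(v int64) *int64 { return &v }
	vals := []*int64{nil, n(3), n(1), nil, n(2)}
	sortNullsLast(vals)
	for _, v := range vals {
		if v == nil {
			fmt.Print("NULL ")
		} else {
			fmt.Print(*v, " ")
		}
	}
	fmt.Println()
}
```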

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* pqarrow/arrowutils: Remove sorting *array.Binary

This isn't properly unit tested and was more of an experiment.

* pqarrow/arrowutils: Add context and reserve indices length

---------

Co-authored-by: Geofrey Ernest <[email protected]>
The API has changed; this updates the README to point to the new API.
Before, we had a `null` tag that set a schema column to nullable. However, the base
types allowed were `int64|float64|bool|string`.

This begged the question: what is `null`? For a `nullable` field of type
`int64`, do we interpret `0` as `null`? Probably not. We were offering an option
for something that was impossible to represent.

Fortunately, we can safely represent null values with our type system. For
base types, `*int64|*float64|*bool|*string` defines the nullable base types.

This commit adds support for both `int64|float64|bool|string` and
`*int64|*float64|*bool|*string` as base types. This works with dynamic
columns too.

With this change the `null` tag is redundant, so it was removed. Fields of type
`*int64|*float64|*bool|*string` will automatically generate a nullable schema
column.
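A minimal sketch of the pointer-based nullability convention; the `Row` type and `describe` helper are hypothetical, for illustration only:

```go
package main

import "fmt"

// Row uses pointer base types for nullable columns: a nil pointer is
// NULL, which is distinguishable from a present zero value.
type Row struct {
	Value    int64  // non-nullable: 0 is a real value
	Optional *int64 // nullable: nil means NULL, &zero means zero
}

func describe(r Row) string {
	if r.Optional == nil {
		return "NULL"
	}
	return fmt.Sprintf("%d", *r.Optional)
}

func main() {
	zero := int64(0)
	fmt.Println(describe(Row{Value: 0}))                  // NULL
	fmt.Println(describe(Row{Value: 0, Optional: &zero})) // 0
}
```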

NOTE: I also discovered a bug: when `T` has a dynamic column and you append
multiple `T` values without any dynamic column, followed by a `T` with a dynamic
column, it caused a panic. The issue was how we adjusted dynamic columns to
match the current `.Append` row state. The fix is included in this commit
because I needed this behavior in tests.
…646)

* parquet/converter: New Record call ResetFull on dictionary builders
We have already removed `(*Table).Write`. There is no longer any use for this
function.
This function is called on a hot path. I couldn't resist the temptation to
squeeze some performance gains from it.

Benchmark results:

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/pqarrow
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                    │    old.txt    │             new.txt             │
                    │    sec/op     │    sec/op     vs base           │
RecordDynamicCols-8   1470.0n ± ∞ ¹   792.6n ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                    │   old.txt   │            new.txt             │
                    │    B/op     │    B/op      vs base           │
RecordDynamicCols-8   936.0 ± ∞ ¹   536.0 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                    │   old.txt    │            new.txt             │
                    │  allocs/op   │  allocs/op   vs base           │
RecordDynamicCols-8   14.000 ± ∞ ¹   6.000 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
It is no longer used.
Benchmark result:

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/dynparquet
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                  │   old.txt    │             new.txt             │
                  │    sec/op    │    sec/op     vs base           │
Deserialization-8   715.9n ± ∞ ¹   443.1n ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                  │   old.txt   │            new.txt             │
                  │    B/op     │    B/op      vs base           │
Deserialization-8   592.0 ± ∞ ¹   432.0 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                  │   old.txt   │            new.txt             │
                  │  allocs/op  │  allocs/op   vs base           │
Deserialization-8   9.000 ± ∞ ¹   4.000 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
This function is used in a hot path.

Benchmark results

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/dynparquet
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
            │   old.txt    │             new.txt             │
            │    sec/op    │    sec/op     vs base           │
Serialize-8   480.5n ± ∞ ¹   314.6n ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

            │   old.txt   │            new.txt             │
            │    B/op     │    B/op      vs base           │
Serialize-8   248.0 ± ∞ ¹   120.0 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

            │   old.txt   │            new.txt             │
            │  allocs/op  │  allocs/op   vs base           │
Serialize-8   6.000 ± ∞ ¹   3.000 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
Comments should start with the exported symbol name.
Replace `golang.org/x/exp/slices` with the standard library `slices` package.
`(true && boolean) == boolean`, so there is no need for the extra work. I also
removed a comment that might be out of place: `right` is on the stack, so there
is no heap allocation.
#599 indicated to me that any gains here will be good for `parca`. This is
my first pass on it; there are still many opportunities to squeeze more
performance out of it.

Benchmark results

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/pqarrow
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                │   old.txt    │             new.txt             │
                │    sec/op    │    sec/op     vs base           │
RecordsToFile-8   497.1µ ± ∞ ¹   364.1µ ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                │    old.txt    │             new.txt              │
                │     B/op      │     B/op       vs base           │
RecordsToFile-8   56.57Ki ± ∞ ¹   32.84Ki ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                │   old.txt    │             new.txt             │
                │  allocs/op   │  allocs/op    vs base           │
RecordsToFile-8   3.079k ± ∞ ¹   2.062k ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
Previously we were operating under the assumption that repeated fields were
dictionaries of `*array.Binary`.

This commit implements conversion for all arrow arrays of the repeated base
types `int64|float64|bool|string`.

I have also added `*array.Binary` to the mix, although it is not categorized as
a base type. (Based on its use, it would make sense to introduce a binary type
in our schema proto.)
Reasons:

- It is useful in tests, and tying it to dynparquet introduces a cyclic dependency.
- It is not directly related to dynparquet.
- It discourages direct use and promotes `GenericTable`.

I got tired of constructing arrow records in tests; it's messy and error-prone.
This way we can dogfood the implementation, fostering evolution and more
use-case coverage.
This commit adds `*array.String` handling inside `copyArrToBuilder`; cases for
both plain `*array.String` and dictionary of `*array.String` are handled.

Fixes #661
* Optimize dynparquet.MergeDynamicColumnSets

This commit:

- Adds tests (there were no tests covering this function).
- Introduces a new, reusable implementation for merging column sets.
- Uses a `*sync.Pool` to reuse the new dynamic column set merger.

Benchmark result

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/dynparquet
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                         │   old.txt   │               new.txt               │
                         │   sec/op    │   sec/op     vs base                │
MergeDynamicColumnSets-8   2.220µ ± 1%   1.330µ ± 1%  -40.09% (p=0.000 n=10)

                         │   old.txt   │              new.txt               │
                         │    B/op     │    B/op     vs base                │
MergeDynamicColumnSets-8   1104.0 ± 0%   544.0 ± 0%  -50.72% (p=0.000 n=10)

                         │   old.txt   │              new.txt               │
                         │  allocs/op  │ allocs/op   vs base                │
MergeDynamicColumnSets-8   16.000 ± 0%   6.000 ± 0%  -62.50% (p=0.000 n=10)
```
* Upgrade arrowutils.SortRecord

This commit enhances `arrowutils.SortRecord` to support

- Multi column sort
- Column direction (ascending, descending)
- NullFirst (for both ascending and descending)

Also optimizes `arrowutils.ReorderRecord` to support dictionary columns.

* fix lint: gofumpt

* fix lint

* define SortingColumn and pass it to arrowutils.SortRecord

* use *builder.OptInt32Builder to avoid allocations

* add constructor for multiColSorter

* use DataType().ID() == arrow.DICTIONARY to check for dictionary

* fix lint

* fix lint: remove unused param

* fix lint: gofumpt -w -extra

* fix lint: godoc comment

* fix lint: godoc

* fix lint: godoc

* panic on invalid direction

* move *builder.OptInt32Builder inside SortRecord

* remove extra space in comment

* return error for take* functions

* rename (*OptInt32Builder).ReserveToLength to Reserve
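The sort options introduced above can be sketched over plain Go slices. `Direction`, `SortingColumn`, and `sortRows` here are hypothetical stand-ins, not the actual arrowutils API, which sorts arrow records:

```go
package main

import (
	"fmt"
	"sort"
)

type Direction int

const (
	Ascending Direction = iota
	Descending
)

// SortingColumn describes one sort key: which column, which direction,
// and whether NULLs sort before or after non-NULL values.
type SortingColumn struct {
	Index      int
	Direction  Direction
	NullsFirst bool
}

// sortRows sorts rows of nullable strings (nil = NULL) by multiple
// columns, honoring per-column direction and null placement.
func sortRows(rows [][]*string, cols []SortingColumn) {
	sort.SliceStable(rows, func(i, j int) bool {
		for _, c := range cols {
			a, b := rows[i][c.Index], rows[j][c.Index]
			if a == nil || b == nil {
				if a == nil && b == nil {
					continue // both NULL: try the next column
				}
				return (a == nil) == c.NullsFirst
			}
			if *a == *b {
				continue // tie: try the next column
			}
			if c.Direction == Ascending {
				return *a < *b
			}
			return *a > *b
		}
		return false
	})
}

func main() {
	s := func(v string) *string { return &v }
	rows := [][]*string{{s("b")}, {nil}, {s("a")}}
	sortRows(rows, []SortingColumn{{Index: 0, Direction: Ascending, NullsFirst: true}})
	fmt.Println(rows[0][0] == nil) // the NULL row leads
}
```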
asubiotto and others added 30 commits June 26, 2024 19:23
This would otherwise leak goroutines if an error was encountered closing the
WAL.
Currently, when converting a parquet page to an arrow record, all the writers
would repeat the slow path of allocating a parquet.Values slice, reading all
values, and writing them to their underlying builder. However, this code
already existed one level above and is more efficient, since it reuses a
parquet.Values slice.

This commit removes the repetition from the writers and leaves only the
concrete implementation of writing existing values to an arrow builder. Callers
can also check if the ValueWriter implements the PageWriter interface, which
can also offer a fast path for writing a parquet page directly.

The improvement is especially noticeable in Query/Values since the slow path
would previously fall back to write all the page values, rather than just the
dictionary values.
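The fast-path check can be sketched with simplified stand-ins for the two interfaces; the real `ValueWriter`/`PageWriter` operate on parquet pages and arrow builders, not these toy types:

```go
package main

import "fmt"

type Page struct{ values []int }

// ValueWriter is the generic interface; PageWriter is the optional
// fast path a caller probes for with a type assertion.
type ValueWriter interface {
	Write(values []int)
}

type PageWriter interface {
	WritePage(p Page) error
}

// fastWriter implements both interfaces and counts values written.
type fastWriter struct{ n int }

func (w *fastWriter) Write(values []int)     { w.n += len(values) }
func (w *fastWriter) WritePage(p Page) error { w.n += len(p.values); return nil }

// writePage prefers the page-level fast path when the writer supports
// it, and falls back to the generic value-by-value slow path otherwise.
func writePage(w ValueWriter, p Page) {
	if pw, ok := w.(PageWriter); ok {
		if err := pw.WritePage(p); err == nil {
			return
		}
	}
	w.Write(p.values) // slow path: write the decoded values
}

func main() {
	w := &fastWriter{}
	writePage(w, Page{values: []int{1, 2, 3}})
	fmt.Println(w.n) // 3
}
```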

```
                │  benchmain   │               benchpw                │
                │    sec/op    │    sec/op     vs base                │
Query/Types-12     109.9m ± 1%    109.7m ± 2%        ~ (p=0.353 n=10)
Query/Labels-12    219.5µ ± 1%    214.1µ ± 2%   -2.46% (p=0.011 n=10)
Query/Values-12   7716.3µ ± 3%    207.7µ ± 4%  -97.31% (p=0.000 n=10)
Query/Merge-12     223.1m ± 2%    220.6m ± 1%   -1.08% (p=0.035 n=10)
Query/Range-12     117.5m ± 1%    116.1m ± 2%        ~ (p=0.218 n=10)
Query/Filter-12    9.888m ± 3%   10.025m ± 4%        ~ (p=0.684 n=10)
geomean            19.08m         10.38m       -45.58%

                │   benchmain    │               benchpw                │
                │      B/op      │     B/op      vs base                │
Query/Types-12      254.3Mi ± 1%   252.1Mi ± 3%        ~ (p=0.353 n=10)
Query/Labels-12     400.6Ki ± 0%   400.7Ki ± 0%        ~ (p=0.796 n=10)
Query/Values-12   12644.7Ki ± 0%   853.5Ki ± 0%  -93.25% (p=0.000 n=10)
Query/Merge-12      574.7Mi ± 1%   576.6Mi ± 1%        ~ (p=0.247 n=10)
Query/Range-12      212.0Mi ± 0%   212.0Mi ± 0%        ~ (p=0.190 n=10)
Query/Filter-12     13.52Mi ± 0%   13.52Mi ± 0%        ~ (p=0.739 n=10)
geomean             35.56Mi        22.67Mi       -36.25%

                │  benchmain  │               benchpw               │
                │  allocs/op  │  allocs/op   vs base                │
Query/Types-12    64.32k ± 6%   64.30k ± 4%        ~ (p=0.631 n=10)
Query/Labels-12   1.802k ± 0%   1.802k ± 0%        ~ (p=0.840 n=10)
Query/Values-12   3.677k ± 0%   2.192k ± 0%  -40.37% (p=0.000 n=10)
Query/Merge-12    1.435M ± 0%   1.435M ± 0%        ~ (p=0.424 n=10)
Query/Range-12    174.3k ± 0%   174.2k ± 0%   -0.00% (p=0.044 n=10)
Query/Filter-12   4.255k ± 0%   4.255k ± 0%        ~ (p=0.487 n=10)
geomean           27.72k        25.43k        -8.25%

                │  benchmain   │            benchpw             │
                │    B/msec    │    B/msec     vs base          │
Query/Filter-12   3.238Mi ± 3%   3.194Mi ± 4%  ~ (p=0.724 n=10)
```
This adds some more context to what exactly is failing to open. Especially helpful for knowing which blocks may be broken.
The total byte size recorded in the row group represents the
uncompressed size, so to calculate the percentage correctly we also need
to use the uncompressed pages of the column.
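The corrected calculation can be sketched as follows; `columnPercent` is a hypothetical helper that, per the fix above, uses uncompressed sizes for both the numerator and the denominator:

```go
package main

import "fmt"

// columnPercent reports what share of the row group's uncompressed
// total a column occupies. Both arguments must be uncompressed sizes:
// mixing a compressed column size with the row group's uncompressed
// total is exactly the bug this commit fixes.
func columnPercent(columnUncompressed, rowGroupUncompressed int64) float64 {
	if rowGroupUncompressed == 0 {
		return 0
	}
	return 100 * float64(columnUncompressed) / float64(rowGroupUncompressed)
}

func main() {
	fmt.Printf("%.1f%%\n", columnPercent(250, 1000)) // 25.0%
}
```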
cmd/parquet-tool: Fix percentage calculation
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
….0 (#890)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Matthias Loibl <[email protected]>
…912)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…y] (#923)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Renaming unused parameters to `_`

* Update golangci-lint action

* Renaming unused parameters to `_`
…xedSizeBinary (#936)

* pqarrow/arrowutils: Add FixedSizeBinaryDictionary support

* pqarrow/arrowutils: Add support for Struct and RunEndEncoded

* pqarrow/arrowutils: Handle empty structs correctly

* pqarrow/arrowutils: Retain record if unmodified

* query/physicalplan: Sampler requires LessOrEqual 1024 bytes allocations

* remove trailing newline

* pqarrow/arrowutils: Limit by using the arrowutils Take helper

This will also allow us to limit all the newly supported column types: timestamp, struct, and run-end-encoded.

* query/physicalplan: Guard against s.size==0 panic

Please take a look at this fix, @thorfour. I'm not sure why this started panicking now.

* pqarrow/arrowutils: Release List ValueBuilder
This fixes compilation issues using go1.23+.
….1 (#931)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…20.5 (#928)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…952)

I'm having a hard time reproducing this panic. The if statement is exactly copied from the underlying code, but we are able to give more context here.
* *: Upgrade arrow to v18

* Upgrade custom go version for DST

* .github: Upgrade golangci-lint

* Fix linter issues

* Use latest golangci-lint
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>