forked from polarsignals/frostdb
[pull] main from polarsignals:main #2
Open · pull wants to merge 463 commits into rizalgowandy:main from polarsignals:main (base: main)
Conversation
This commit removes `(*Table).Write` in favor of `(*GenericTable[T]).Write`. `GenericTable` offers a number of improvements over the previous API. One of the biggest advantages is that there is no longer a need to define a schema beforehand; instead, the schema is part of the struct being written, in the form of struct tags. Example code has been refactored to use `map[string]T` for dynamic columns; slice-of-struct dynamic columns still work, which is intentional to keep this PR from blowing out of proportion. `GenericTable` embeds `Table`, so from the user's perspective only a few changes are needed to work with the new API.
No special tag is needed: any field of type `map[string]T`, where T is
`int64|bool|float64|string`, is a dynamic column.
Example:
```go
type Sample struct {
Labels map[string]string `frostdb:"labels,rle_dict,asc(1),null_first"`
}
```
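For context, here is a minimal sketch of how writes look through the generic API. It assumes `GenericTable` lives in the top-level frostdb package and that `Write` accepts a context plus a variadic list of `T`, as the commit message describes; the exact signature and return values are assumptions, not the verified frostdb API.

```go
package example

import (
	"context"

	"github.com/polarsignals/frostdb"
)

type Sample struct {
	Labels    map[string]string `frostdb:"labels,rle_dict,asc(1),null_first"`
	Timestamp int64             `frostdb:"timestamp,asc(2)"`
	Value     int64             `frostdb:"value"`
}

// writeSamples writes a batch of Sample structs; the schema is derived from
// the struct tags above, so no separate schema definition is needed.
func writeSamples(ctx context.Context, table *frostdb.GenericTable[Sample]) error {
	samples := []Sample{
		{Labels: map[string]string{"job": "api"}, Timestamp: 1, Value: 10},
		// A later sample may introduce new dynamic label columns on the fly.
		{Labels: map[string]string{"job": "api", "region": "eu"}, Timestamp: 2, Value: 7},
	}
	// Write is assumed to take a variadic list of T (see commit message above).
	_, err := table.Write(ctx, samples...)
	return err
}
```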
* pqarrow/arrowutils: Add SortRecord and ReorderRecord. This is extracted from a previous PR #461.
* pqarrow/arrowutils: Update SortRecord to allow for multiple sort columns. This isn't implemented yet; just the function signature is future proof.
* pqarrow/arrowutils: Use compute.Take for ReorderRecord
* pqarrow/arrowutils: Add support for sorting NULL. NULL always gets sorted to the back. This seems to be the default for other language implementations. It can be made configurable in the future.
* Update pqarrow/arrowutils/sort.go (Co-authored-by: Geofrey Ernest <[email protected]>)
* Update pqarrow/arrowutils/sort.go (Co-authored-by: Geofrey Ernest <[email protected]>)
* Update pqarrow/arrowutils/sort.go (Co-authored-by: Geofrey Ernest <[email protected]>)
* Update pqarrow/arrowutils/sort.go (Co-authored-by: Geofrey Ernest <[email protected]>)
* Update pqarrow/arrowutils/sort.go (Co-authored-by: Geofrey Ernest <[email protected]>)
* pqarrow/arrowutils: Remove sorting *array.Binary. This isn't properly unit tested and was more of an experiment.
* pqarrow/arrowutils: Add context and reserve indices length

---------

Co-authored-by: Geofrey Ernest <[email protected]>
The API has changed; this updates the README to point to the new API.
Before, we had a `null` tag that set a schema column to nullable. However, the base types allowed were `int64|float64|bool|string`. This begged the question: what is `null`? For a nullable field of type `int64`, do we interpret `0` as `null`? Probably not. We were offering an option for something that was impossible to represent.

Fortunately, we can safely represent null values with our type system: for base types, `*int64|*float64|*bool|*string` define the nullable variants. This commit adds support for both `int64|float64|bool|string` and `*int64|*float64|*bool|*string` as base types. This works with dynamic columns too. With this change the `null` tag is redundant, so it was removed. Fields of type `*int64|*float64|*bool|*string` will automatically generate a nullable schema column.

NOTE: I also discovered a bug where, when `T` has a dynamic column and you append multiple `T` values without any dynamic column followed by a `T` with a dynamic column, a panic occurred. The issue was how we adjusted dynamic columns to match the current `.Append` row state. The fix is included in this commit because I needed this behavior in tests.
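As a rough illustration of the new rule (field names here are made up, not from the frostdb codebase): pointer base types are the nullable variants, so nothing needs a `null` tag anymore.

```go
// Sketch only; the struct and its fields are illustrative.
type Event struct {
	// Non-pointer base type: a non-nullable column; 0 is a real value.
	Timestamp int64 `frostdb:"timestamp"`

	// Pointer base type: assumed to generate a nullable column; a nil pointer
	// is stored as NULL rather than being conflated with 0.
	Duration *float64 `frostdb:"duration"`

	// Dynamic columns are described as working the same way: map values of
	// pointer type are assumed to produce nullable dynamic columns.
	Labels map[string]*string `frostdb:"labels"`
}
```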
…646) * parquet/converter: New Record call ResetFull on dictionary builders
We have already removed `(*Table).Write`. There is no longer any use for this function.
This function is called on a hot path. I couldn't resist the temptation to
squeeze some performance gains from it.
Benchmark results:
```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/pqarrow
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
RecordDynamicCols-8 1470.0n ± ∞ ¹ 792.6n ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ old.txt │ new.txt │
│ B/op │ B/op vs base │
RecordDynamicCols-8 936.0 ± ∞ ¹ 536.0 ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
RecordDynamicCols-8 14.000 ± ∞ ¹ 6.000 ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
It is no longer used.
Benchmark result:
```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/dynparquet
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
Deserialization-8 715.9n ± ∞ ¹ 443.1n ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ old.txt │ new.txt │
│ B/op │ B/op vs base │
Deserialization-8 592.0 ± ∞ ¹ 432.0 ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
Deserialization-8 9.000 ± ∞ ¹ 4.000 ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
This function is used in a hot path.
Benchmark results:
```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/dynparquet
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
Serialize-8 480.5n ± ∞ ¹ 314.6n ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ old.txt │ new.txt │
│ B/op │ B/op vs base │
Serialize-8 248.0 ± ∞ ¹ 120.0 ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
Serialize-8 6.000 ± ∞ ¹ 3.000 ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
Comments should start with the exported symbol name.
Replace `golang.org/x/exp/slices` with the standard library `slices` package.
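An illustrative before/after of the swap; `sortNames` is just a stand-in call site, not code from this repository:

```go
import (
	// before: "golang.org/x/exp/slices"
	"slices" // in the standard library since Go 1.21
)

// Call sites are unchanged; only the import path moves.
func sortNames(names []string) {
	slices.Sort(names)
}
```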
`(true && boolean) == boolean`, so there is no need for the extra work. I also removed a comment that was out of place: `right` is on the stack, so there is no heap allocation.
From #599, it was clear to me that any gains here would be good for `parca`. This is my first pass on it; there are still many opportunities to squeeze more performance from it.

Benchmark results:
```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/pqarrow
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
RecordsToFile-8 497.1µ ± ∞ ¹ 364.1µ ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ old.txt │ new.txt │
│ B/op │ B/op vs base │
RecordsToFile-8 56.57Ki ± ∞ ¹ 32.84Ki ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
RecordsToFile-8 3.079k ± ∞ ¹ 2.062k ± ∞ ¹ ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
Previously, we were operating under the assumption that repeated fields were dictionaries of `*array.Binary`. This commit implements conversion for all arrow arrays of the repeated base types `int64|float64|bool|string`. I have also added `*array.Binary` to the mix, although it is not categorized as a base type (based on its use, it will make sense to introduce a binary type in our schema proto).
Reasons:
- It is useful in tests. Tying it to dynparquet introduces a cyclic dependency.
- It is not directly related to dynparquet.
- Discourage direct use and promote `GenericTable`.

I got tired of constructing arrow records in tests; it's messy and error prone. This way we can dogfood the implementation, fostering evolution and more use-case coverage.
This commit adds `*array.String` handling inside `copyArrToBuilder`; both plain `*array.String` and dictionaries of `*array.String` are handled. Fixes #661
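A rough sketch of the kind of handling this adds. The function name `copyStringValues`, the builder wiring, and the arrow import path/version are assumptions for illustration, not the actual frostdb code.

```go
package example

import (
	"fmt"

	"github.com/apache/arrow-go/v18/arrow"       // path/version assumed
	"github.com/apache/arrow-go/v18/arrow/array" // path/version assumed
)

// copyStringValues appends the values of arr into b, handling both a plain
// *array.String and a dictionary-encoded string column.
func copyStringValues(arr arrow.Array, b *array.StringBuilder) error {
	switch a := arr.(type) {
	case *array.String:
		for i := 0; i < a.Len(); i++ {
			if a.IsNull(i) {
				b.AppendNull()
				continue
			}
			b.Append(a.Value(i))
		}
	case *array.Dictionary:
		dict, ok := a.Dictionary().(*array.String)
		if !ok {
			return fmt.Errorf("unsupported dictionary value type %T", a.Dictionary())
		}
		for i := 0; i < a.Len(); i++ {
			if a.IsNull(i) {
				b.AppendNull()
				continue
			}
			// Resolve the dictionary index to its string value.
			b.Append(dict.Value(a.GetValueIndex(i)))
		}
	default:
		return fmt.Errorf("unsupported array type %T", arr)
	}
	return nil
}
```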
* Optimize dynparquet.MergeDynamicColumnSets
This commit
- Adds tests (there were no tests covering this function)
- Introduces a new implementation for merging column sets that can be reused
- Uses a `*sync.Pool` to reuse the new dynamic column set merger (a sketch of this pattern follows the benchmark below)

Benchmark result:
```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/dynparquet
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
MergeDynamicColumnSets-8 2.220µ ± 1% 1.330µ ± 1% -40.09% (p=0.000 n=10)
│ old.txt │ new.txt │
│ B/op │ B/op vs base │
MergeDynamicColumnSets-8 1104.0 ± 0% 544.0 ± 0% -50.72% (p=0.000 n=10)
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
MergeDynamicColumnSets-8 16.000 ± 0% 6.000 ± 0% -62.50% (p=0.000 n=10)
```
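To make the pooling idea above concrete, here is a minimal sketch of reusing a merger through a `*sync.Pool`. The `columnSetMerger` type and its merge behavior are invented for illustration; the real dynparquet merger differs.

```go
package example

import (
	"sort"
	"sync"
)

// columnSetMerger is a hypothetical reusable helper: it accumulates the
// union of dynamic column values per column name across several sets.
type columnSetMerger struct {
	seen map[string]map[string]struct{}
}

func (m *columnSetMerger) reset() {
	for k := range m.seen {
		delete(m.seen, k)
	}
}

func (m *columnSetMerger) merge(sets []map[string][]string) map[string][]string {
	for _, set := range sets {
		for col, vals := range set {
			if m.seen[col] == nil {
				m.seen[col] = map[string]struct{}{}
			}
			for _, v := range vals {
				m.seen[col][v] = struct{}{}
			}
		}
	}
	out := make(map[string][]string, len(m.seen))
	for col, vals := range m.seen {
		merged := make([]string, 0, len(vals))
		for v := range vals {
			merged = append(merged, v)
		}
		sort.Strings(merged)
		out[col] = merged
	}
	return out
}

var mergerPool = sync.Pool{
	New: func() any { return &columnSetMerger{seen: map[string]map[string]struct{}{}} },
}

// mergeDynamicColumnSets gets a merger from the pool, uses it, and returns it,
// so the outer map can be reused across calls and allocations are reduced.
func mergeDynamicColumnSets(sets []map[string][]string) map[string][]string {
	m := mergerPool.Get().(*columnSetMerger)
	defer func() {
		m.reset()
		mergerPool.Put(m)
	}()
	return m.merge(sets)
}
```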
Signed-off-by: Nicolas Takashi <[email protected]>
* Upgrade arrowutils.SortRecord

This commit enhances `arrowutils.SortRecord` to support:
- Multi-column sort
- Column direction (ascending, descending)
- NullFirst (for both ascending and descending)

It also optimizes `arrowutils.ReorderRecord` to support dictionary columns.

* fix lint: gofumpt
* fix lint
* define SortingColumn and pass it to arrowutils.SortRecord
* use *builder.OptInt32Builder to avoid allocations
* add constructor for multiColSorter
* use DataType().ID() == arrow.DICTIONARY to check for dictionary
* fix lint
* fix lint: remove unused param
* fix lint: gofumpt -w -extra
* fix lint: godoc comment
* fix lint: godoc
* fix lint: godoc
* panic on invalid direction
* move *builder.OptInt32Builder inside SortRecord
* remove extra space in comment
* return error for take* functions
* rename (*OptInt32Builder).ReserveToLength to Reserve
This would otherwise leak goroutines if an error was encountered closing the WAL.
Currently, when converting a parquet page to an arrow record, all the writers
would repeat the slow path of allocating a parquet.Values slice, reading all
values, and writing them to their underlying builder. However, this code already
existed one level above and is more efficient since it reuses a parquet.Values
slice.

This commit removes the repetition from the writers and leaves only the
concrete implementation of writing existing values to an arrow builder. Callers
can also check whether the ValueWriter implements the PageWriter interface,
which offers a fast path for writing a parquet page directly.

The improvement is especially noticeable in Query/Values, since the slow path
would previously fall back to writing all the page values, rather than just the
dictionary values.
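A hedged sketch of that dispatch. `ValueWriter` and `PageWriter` are modeled loosely on the interfaces described above; the exact method sets and the parquet-go import path are assumptions rather than the actual frostdb definitions.

```go
package example

import (
	"errors"
	"io"

	"github.com/parquet-go/parquet-go" // import path assumed
)

// ValueWriter writes already-read parquet values into an arrow builder.
type ValueWriter interface {
	Write(values []parquet.Value)
}

// PageWriter is an optional fast path: a writer that can consume a parquet
// page directly (e.g. copying only the dictionary values).
type PageWriter interface {
	WritePage(p parquet.Page) error
}

// writePage prefers the fast path when the writer supports it and otherwise
// falls back to reading all values through a reusable buffer.
func writePage(w ValueWriter, p parquet.Page, buf []parquet.Value) error {
	if pw, ok := w.(PageWriter); ok {
		return pw.WritePage(p)
	}

	values := p.Values()
	for {
		n, err := values.ReadValues(buf)
		if n > 0 {
			w.Write(buf[:n])
		}
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```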
```
│ benchmain │ benchpw │
│ sec/op │ sec/op vs base │
Query/Types-12 109.9m ± 1% 109.7m ± 2% ~ (p=0.353 n=10)
Query/Labels-12 219.5µ ± 1% 214.1µ ± 2% -2.46% (p=0.011 n=10)
Query/Values-12 7716.3µ ± 3% 207.7µ ± 4% -97.31% (p=0.000 n=10)
Query/Merge-12 223.1m ± 2% 220.6m ± 1% -1.08% (p=0.035 n=10)
Query/Range-12 117.5m ± 1% 116.1m ± 2% ~ (p=0.218 n=10)
Query/Filter-12 9.888m ± 3% 10.025m ± 4% ~ (p=0.684 n=10)
geomean 19.08m 10.38m -45.58%
│ benchmain │ benchpw │
│ B/op │ B/op vs base │
Query/Types-12 254.3Mi ± 1% 252.1Mi ± 3% ~ (p=0.353 n=10)
Query/Labels-12 400.6Ki ± 0% 400.7Ki ± 0% ~ (p=0.796 n=10)
Query/Values-12 12644.7Ki ± 0% 853.5Ki ± 0% -93.25% (p=0.000 n=10)
Query/Merge-12 574.7Mi ± 1% 576.6Mi ± 1% ~ (p=0.247 n=10)
Query/Range-12 212.0Mi ± 0% 212.0Mi ± 0% ~ (p=0.190 n=10)
Query/Filter-12 13.52Mi ± 0% 13.52Mi ± 0% ~ (p=0.739 n=10)
geomean 35.56Mi 22.67Mi -36.25%
│ benchmain │ benchpw │
│ allocs/op │ allocs/op vs base │
Query/Types-12 64.32k ± 6% 64.30k ± 4% ~ (p=0.631 n=10)
Query/Labels-12 1.802k ± 0% 1.802k ± 0% ~ (p=0.840 n=10)
Query/Values-12 3.677k ± 0% 2.192k ± 0% -40.37% (p=0.000 n=10)
Query/Merge-12 1.435M ± 0% 1.435M ± 0% ~ (p=0.424 n=10)
Query/Range-12 174.3k ± 0% 174.2k ± 0% -0.00% (p=0.044 n=10)
Query/Filter-12 4.255k ± 0% 4.255k ± 0% ~ (p=0.487 n=10)
geomean 27.72k 25.43k -8.25%
│ benchmain │ benchpw │
│ B/msec │ B/msec vs base │
Query/Filter-12 3.238Mi ± 3% 3.194Mi ± 4% ~ (p=0.724 n=10)
```
This adds some more context to what exactly is failing to open. It is especially helpful for knowing which blocks may be broken.
The total byte size recorded in the row group represents the uncompressed size, so to calculate the percentage correctly we also need to use the uncompressed pages of the column.
cmd/parquet-tool: Fix percentage calculation
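For clarity, the corrected calculation is essentially the following, comparing uncompressed bytes to uncompressed bytes; the function and parameter names are illustrative, not the parquet-tool's actual variables.

```go
// columnPercentage reports the share of a row group taken by one column.
// Both the numerator and the denominator are uncompressed byte sizes.
func columnPercentage(columnUncompressedBytes, rowGroupUncompressedBytes int64) float64 {
	if rowGroupUncompressedBytes == 0 {
		return 0
	}
	return 100 * float64(columnUncompressedBytes) / float64(rowGroupUncompressedBytes)
}
```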
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
….0 (#890) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: Matthias Loibl <[email protected]>
…912) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…y] (#923) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…xedSizeBinary (#936) * pqarrow/arrowutils: Add FixedSizeBinaryDictionary support * pqarrow/arrowutils: Add support for Struct and RunEndEncoded * pqarrow/arrowutils: Handle empty structs correctly * pqarrow/arrowutils: Retain record if unmodified * query/physicalplan: Sampler requires LessOrEqual 1024 bytes allocations * remove trailing newline * pqarrow/arrowutils: Limit by using the arrowutils Take helper This will also allow us to limit all the newly supported column types, timestamp, struct and runendencoded * query/physicalplan: Guard against s.size==0 panic Please take a look at this fix, @thorfour. I'm not sure why this started panicking now. * pqarrow/arrowutils: Release List ValueBuilder
This fixes compilation issues using go1.23+.
….1 (#931) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…20.5 (#928) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…952) I'm having a hard time reproducing this panic. The if statement is exactly copied from the underlying code, but we are able to give more context here.
* *: Upgrade arrow to v18
* Upgrade custom go version for DST
* .github: Upgrade golangci-lint
* Fix linter issues
* Use latest golangci-lint
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )