gernest and others added 29 commits December 14, 2023 09:34
This commit removes `(*Table).Write` in favor of `(*GenericTable[T]).Write`

`GenericTable` offers a lot of improvements over the previous API. One of the
biggest advantages is that there is no longer a need to define the schema
beforehand; instead, the schema is part of the struct that is written, in the
form of struct tags.

Example code has been refactored to use `map[string]T` for dynamic columns;
slice-of-struct dynamic columns still work. This is intentional, to keep
the PR from blowing out of proportion.

`GenericTable` embeds `Table`, so from the user's perspective only a few changes
are needed to work with the new API.
No special tag is needed: any field of type `map[string]T`, where T is
`int64|bool|float64|string`, is a dynamic column.

Example

```go
type Sample struct {
	Labels      map[string]string `frostdb:"labels,rle_dict,asc(1),null_first"`
}
```
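The tag-driven schema derivation can be sketched with plain reflection. This is a hypothetical helper, not the frostdb schema builder; only the `Sample` type and its tag come from the example above:

```go
package main

import (
	"fmt"
	"reflect"
)

// Sample mirrors the example above: Labels is a dynamic column and
// carries the frostdb struct tag describing its encoding and sorting.
type Sample struct {
	Labels map[string]string `frostdb:"labels,rle_dict,asc(1),null_first"`
}

// dynamicColumns returns the names of all fields of type map[string]T,
// where T is one of the supported base kinds. Illustrative only.
func dynamicColumns(v any) []string {
	t := reflect.TypeOf(v)
	var cols []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.Type.Kind() != reflect.Map || f.Type.Key().Kind() != reflect.String {
			continue
		}
		switch f.Type.Elem().Kind() {
		case reflect.Int64, reflect.Bool, reflect.Float64, reflect.String:
			cols = append(cols, f.Name)
		}
	}
	return cols
}

func main() {
	fmt.Println(dynamicColumns(Sample{})) // the Labels field qualifies
}
```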
* pqarrow/arrowutils: Add SortRecord and ReorderRecord

This is extracted from a previous PR, #461.

* pqarrow/arrowutils: Update SortRecord to allow for multiple sort columns

This isn't implemented yet; only the function signature is future-proof.

* pqarrow/arrowutils: Use compute.Take for ReorderRecord

* pqarrow/arrowutils: Add support for sorting NULL

NULL always gets sorted to the back. This seems to be the default for other language implementations. It can be made configurable in the future.
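The nulls-last behavior can be sketched on a plain Go slice of nullable values. This is a minimal stand-in; the real code sorts arrow arrays:

```go
package main

import (
	"fmt"
	"sort"
)

// sortNullsLast sorts nullable int64 values (nil = NULL), always placing
// NULLs at the back, mirroring the default described above.
func sortNullsLast(vals []*int64) {
	sort.SliceStable(vals, func(i, j int) bool {
		switch {
		case vals[i] == nil:
			return false // NULL never sorts before anything
		case vals[j] == nil:
			return true // non-NULL sorts before NULL
		default:
			return *vals[i] < *vals[j]
		}
	})
}

func main() {
	n := func(v int64) *int64 { return &v }
	vals := []*int64{nil, n(3), n(1), nil, n(2)}
	sortNullsLast(vals)
	for _, v := range vals {
		if v == nil {
			fmt.Print("NULL ")
		} else {
			fmt.Print(*v, " ")
		}
	}
	fmt.Println()
}
```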

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* Update pqarrow/arrowutils/sort.go

Co-authored-by: Geofrey Ernest <[email protected]>

* pqarrow/arrowutils: Remove sorting *array.Binary

This isn't properly unit tested and was more of an experiment.

* pqarrow/arrowutils: Add context and reserve indices length

---------

Co-authored-by: Geofrey Ernest <[email protected]>
The API has changed; this updates the README to point to the new API.
Before, we had a `null` tag that set a schema column to nullable. However, the base
types allowed were `int64|float64|bool|string`.

This begged the question: what is `null`? For a `nullable` field of type
`int64`, do we interpret `0` as `null`? Probably not. We were offering an option
for something that was impossible to represent.

Fortunately, we can safely represent null values with our type system. For
base types, `*int64|*float64|*bool|*string` defines the nullable base types.

This commit adds support for both `int64|float64|bool|string` and
`*int64|*float64|*bool|*string` as base types. This works with dynamic
columns too.

With this change the `null` tag is redundant, so it was removed. Fields of type
`*int64|*float64|*bool|*string` will automatically generate a nullable schema
column.
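A minimal sketch of the pointer-based nullability convention; the `Row` type and `describe` helper are hypothetical, for illustration only:

```go
package main

import "fmt"

// Row uses pointer base types for nullable columns: a nil pointer is
// NULL, which is distinguishable from a present zero value.
type Row struct {
	Value    int64  // non-nullable: 0 is a real value
	Optional *int64 // nullable: nil means NULL, &zero means zero
}

func describe(r Row) string {
	if r.Optional == nil {
		return "NULL"
	}
	return fmt.Sprintf("%d", *r.Optional)
}

func main() {
	zero := int64(0)
	fmt.Println(describe(Row{Value: 0}))                  // NULL
	fmt.Println(describe(Row{Value: 0, Optional: &zero})) // 0
}
```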

NOTE: I also discovered a bug: when `T` has a dynamic column and you append
multiple `T` values without any dynamic column, followed by a `T` with a dynamic
column, it caused a panic. The issue was how we adjusted dynamic columns to
match the current `.Append` row state. The fix is included in this commit
because I needed this behavior in tests.
…646)

* parquet/converter: New Record call ResetFull on dictionary builders
We have already removed `(*Table).Write`. There is no longer any use for this
function.
This function is called on a hot path. I couldn't resist the temptation to
squeeze some performance gains from it.

Benchmark results:

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/pqarrow
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                    │    old.txt    │             new.txt             │
                    │    sec/op     │    sec/op     vs base           │
RecordDynamicCols-8   1470.0n ± ∞ ¹   792.6n ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                    │   old.txt   │            new.txt             │
                    │    B/op     │    B/op      vs base           │
RecordDynamicCols-8   936.0 ± ∞ ¹   536.0 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                    │   old.txt    │            new.txt             │
                    │  allocs/op   │  allocs/op   vs base           │
RecordDynamicCols-8   14.000 ± ∞ ¹   6.000 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
It is no longer used.
Benchmark result:

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/dynparquet
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                  │   old.txt    │             new.txt             │
                  │    sec/op    │    sec/op     vs base           │
Deserialization-8   715.9n ± ∞ ¹   443.1n ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                  │   old.txt   │            new.txt             │
                  │    B/op     │    B/op      vs base           │
Deserialization-8   592.0 ± ∞ ¹   432.0 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                  │   old.txt   │            new.txt             │
                  │  allocs/op  │  allocs/op   vs base           │
Deserialization-8   9.000 ± ∞ ¹   4.000 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
This function is used in a hot path.

Benchmark results

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/dynparquet
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
            │   old.txt    │             new.txt             │
            │    sec/op    │    sec/op     vs base           │
Serialize-8   480.5n ± ∞ ¹   314.6n ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

            │   old.txt   │            new.txt             │
            │    B/op     │    B/op      vs base           │
Serialize-8   248.0 ± ∞ ¹   120.0 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

            │   old.txt   │            new.txt             │
            │  allocs/op  │  allocs/op   vs base           │
Serialize-8   6.000 ± ∞ ¹   3.000 ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
Comments should start with the exported symbol name.
Replace `golang.org/x/exp/slices` with the standard library `slices` package.
`(true && boolean) == boolean`, so there is no need for the extra work. I also
removed a comment that might be out of place: `right` is on the stack, so there
is no heap allocation.
#599 indicated to me that any gains here will be good for `parca`. This is
my first pass on it; there are still many opportunities to squeeze more
performance out of it.

Benchmark results

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/pqarrow
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                │   old.txt    │             new.txt             │
                │    sec/op    │    sec/op     vs base           │
RecordsToFile-8   497.1µ ± ∞ ¹   364.1µ ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                │    old.txt    │             new.txt              │
                │     B/op      │     B/op       vs base           │
RecordsToFile-8   56.57Ki ± ∞ ¹   32.84Ki ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                │   old.txt    │             new.txt             │
                │  allocs/op   │  allocs/op    vs base           │
RecordsToFile-8   3.079k ± ∞ ¹   2.062k ± ∞ ¹  ~ (p=1.000 n=1) ²
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
```
Previously we were operating under the assumption that repeated fields were
dictionaries of `*array.Binary`.

This commit implements conversion for all arrow arrays of the repeated base
types `int64|float64|bool|string`.

I have also added `*array.Binary` to the mix, although it is not categorized as
a base type. (Based on its use, it would make sense to introduce a binary type
in our schema proto.)
Reasons:

- It is useful in tests, and tying it to dynparquet introduces a cyclic dependency.
- It is not directly related to dynparquet.
- It discourages direct use and promotes `GenericTable`.

I got tired of constructing arrow records in tests; it's messy and error-prone.
This way we can dogfood the implementation, fostering evolution and more
use-case coverage.
This commit adds `*array.String` handling inside `copyArrToBuilder`; cases for
both plain `*array.String` and dictionary of `*array.String` are handled.

Fixes #661
* Optimize dynparquet.MergeDynamicColumnSets

This commit:

- Adds tests (there were no tests covering this function).
- Introduces a new, reusable implementation for merging column sets.
- Uses a `*sync.Pool` to reuse the new dynamic column set merger.

Benchmark result

```
goos: darwin
goarch: amd64
pkg: github.com/polarsignals/frostdb/dynparquet
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                         │   old.txt   │               new.txt               │
                         │   sec/op    │   sec/op     vs base                │
MergeDynamicColumnSets-8   2.220µ ± 1%   1.330µ ± 1%  -40.09% (p=0.000 n=10)

                         │   old.txt   │              new.txt               │
                         │    B/op     │    B/op     vs base                │
MergeDynamicColumnSets-8   1104.0 ± 0%   544.0 ± 0%  -50.72% (p=0.000 n=10)

                         │   old.txt   │              new.txt               │
                         │  allocs/op  │ allocs/op   vs base                │
MergeDynamicColumnSets-8   16.000 ± 0%   6.000 ± 0%  -62.50% (p=0.000 n=10)
```
* Upgrade arrowutils.SortRecord

This commit enhances `arrowutils.SortRecord` to support

- Multi column sort
- Column direction (ascending, descending)
- NullFirst (for both ascending and descending)

Also optimizes `arrowutils.ReorderRecord` to support dictionary columns.

* fix lint: gofumpt

* fix lint

* define SortingColumn and pass it to arrowutils.SortRecord

* use *builder.OptInt32Builder to avoid allocations

* add constructor for multiColSorter

* use DataType().ID() == arrow.DICTIONARY to check for dictionary

* fix lint

* fix lint: remove unused param

* fix lint: gofumpt -w -extra

* fix lint: godoc comment

* fix lint: godoc

* fix lint: godoc

* panic on invalid direction

* move *builder.OptInt32Builder inside SortRecord

* remove extra space in comment

* return error for take* functions

* rename (*OptInt32Builder).ReserveToLength to Reserve
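The sort options introduced above can be sketched over plain Go slices. `Direction`, `SortingColumn`, and `sortRows` here are hypothetical stand-ins, not the actual arrowutils API, which sorts arrow records:

```go
package main

import (
	"fmt"
	"sort"
)

type Direction int

const (
	Ascending Direction = iota
	Descending
)

// SortingColumn describes one sort key: which column, which direction,
// and whether NULLs sort before or after non-NULL values.
type SortingColumn struct {
	Index      int
	Direction  Direction
	NullsFirst bool
}

// sortRows sorts rows of nullable strings (nil = NULL) by multiple
// columns, honoring per-column direction and null placement.
func sortRows(rows [][]*string, cols []SortingColumn) {
	sort.SliceStable(rows, func(i, j int) bool {
		for _, c := range cols {
			a, b := rows[i][c.Index], rows[j][c.Index]
			if a == nil || b == nil {
				if a == nil && b == nil {
					continue // both NULL: try the next column
				}
				return (a == nil) == c.NullsFirst
			}
			if *a == *b {
				continue // tie: try the next column
			}
			if c.Direction == Ascending {
				return *a < *b
			}
			return *a > *b
		}
		return false
	})
}

func main() {
	s := func(v string) *string { return &v }
	rows := [][]*string{{s("b")}, {nil}, {s("a")}}
	sortRows(rows, []SortingColumn{{Index: 0, Direction: Ascending, NullsFirst: true}})
	fmt.Println(rows[0][0] == nil) // the NULL row leads
}
```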
asubiotto and others added 30 commits June 26, 2024 19:23
This would otherwise leak goroutines if an error was encountered closing the
WAL.
Currently, when converting a parquet page to an arrow record, all the writers
would repeat the slow path of allocating a parquet.Values slice, reading all
values, and writing them to their underlying builder. However, this code
already existed one level above and is more efficient, since it reuses a
parquet.Values slice.

This commit removes the repetition from the writers and leaves only the
concrete implementation of writing existing values to an arrow builder. Callers
can also check if the ValueWriter implements the PageWriter interface, which
can also offer a fast path for writing a parquet page directly.

The improvement is especially noticeable in Query/Values since the slow path
would previously fall back to write all the page values, rather than just the
dictionary values.
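The fast-path check can be sketched with simplified stand-ins for the two interfaces; the real `ValueWriter`/`PageWriter` operate on parquet pages and arrow builders, not these toy types:

```go
package main

import "fmt"

type Page struct{ values []int }

// ValueWriter is the generic interface; PageWriter is the optional
// fast path a caller probes for with a type assertion.
type ValueWriter interface {
	Write(values []int)
}

type PageWriter interface {
	WritePage(p Page) error
}

// fastWriter implements both interfaces and counts values written.
type fastWriter struct{ n int }

func (w *fastWriter) Write(values []int)     { w.n += len(values) }
func (w *fastWriter) WritePage(p Page) error { w.n += len(p.values); return nil }

// writePage prefers the page-level fast path when the writer supports
// it, and falls back to the generic value-by-value slow path otherwise.
func writePage(w ValueWriter, p Page) {
	if pw, ok := w.(PageWriter); ok {
		if err := pw.WritePage(p); err == nil {
			return
		}
	}
	w.Write(p.values) // slow path: write the decoded values
}

func main() {
	w := &fastWriter{}
	writePage(w, Page{values: []int{1, 2, 3}})
	fmt.Println(w.n) // 3
}
```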

```
                │  benchmain   │               benchpw                │
                │    sec/op    │    sec/op     vs base                │
Query/Types-12     109.9m ± 1%    109.7m ± 2%        ~ (p=0.353 n=10)
Query/Labels-12    219.5µ ± 1%    214.1µ ± 2%   -2.46% (p=0.011 n=10)
Query/Values-12   7716.3µ ± 3%    207.7µ ± 4%  -97.31% (p=0.000 n=10)
Query/Merge-12     223.1m ± 2%    220.6m ± 1%   -1.08% (p=0.035 n=10)
Query/Range-12     117.5m ± 1%    116.1m ± 2%        ~ (p=0.218 n=10)
Query/Filter-12    9.888m ± 3%   10.025m ± 4%        ~ (p=0.684 n=10)
geomean            19.08m         10.38m       -45.58%

                │   benchmain    │               benchpw                │
                │      B/op      │     B/op      vs base                │
Query/Types-12      254.3Mi ± 1%   252.1Mi ± 3%        ~ (p=0.353 n=10)
Query/Labels-12     400.6Ki ± 0%   400.7Ki ± 0%        ~ (p=0.796 n=10)
Query/Values-12   12644.7Ki ± 0%   853.5Ki ± 0%  -93.25% (p=0.000 n=10)
Query/Merge-12      574.7Mi ± 1%   576.6Mi ± 1%        ~ (p=0.247 n=10)
Query/Range-12      212.0Mi ± 0%   212.0Mi ± 0%        ~ (p=0.190 n=10)
Query/Filter-12     13.52Mi ± 0%   13.52Mi ± 0%        ~ (p=0.739 n=10)
geomean             35.56Mi        22.67Mi       -36.25%

                │  benchmain  │               benchpw               │
                │  allocs/op  │  allocs/op   vs base                │
Query/Types-12    64.32k ± 6%   64.30k ± 4%        ~ (p=0.631 n=10)
Query/Labels-12   1.802k ± 0%   1.802k ± 0%        ~ (p=0.840 n=10)
Query/Values-12   3.677k ± 0%   2.192k ± 0%  -40.37% (p=0.000 n=10)
Query/Merge-12    1.435M ± 0%   1.435M ± 0%        ~ (p=0.424 n=10)
Query/Range-12    174.3k ± 0%   174.2k ± 0%   -0.00% (p=0.044 n=10)
Query/Filter-12   4.255k ± 0%   4.255k ± 0%        ~ (p=0.487 n=10)
geomean           27.72k        25.43k        -8.25%

                │  benchmain   │            benchpw             │
                │    B/msec    │    B/msec     vs base          │
Query/Filter-12   3.238Mi ± 3%   3.194Mi ± 4%  ~ (p=0.724 n=10)
```
This adds some more context to what exactly is failing to open. Especially helpful for knowing which blocks may be broken.
The total byte size recorded in the row group represents the
uncompressed size, so to calculate the percentage correctly we also need
to use the uncompressed pages of the column.
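The corrected calculation can be sketched as follows; `columnPercent` is a hypothetical helper that, per the fix above, uses uncompressed sizes for both the numerator and the denominator:

```go
package main

import "fmt"

// columnPercent reports what share of the row group's uncompressed
// total a column occupies. Both arguments must be uncompressed sizes:
// mixing a compressed column size with the row group's uncompressed
// total is exactly the bug this commit fixes.
func columnPercent(columnUncompressed, rowGroupUncompressed int64) float64 {
	if rowGroupUncompressed == 0 {
		return 0
	}
	return 100 * float64(columnUncompressed) / float64(rowGroupUncompressed)
}

func main() {
	fmt.Printf("%.1f%%\n", columnPercent(250, 1000)) // 25.0%
}
```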
cmd/parquet-tool: Fix percentage calculation
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
….0 (#890)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Matthias Loibl <[email protected]>
…912)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…y] (#923)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Renaming unused parameters to `_`

* Update golangci-lint action

* Renaming unused parameters to `_`
…xedSizeBinary (#936)

* pqarrow/arrowutils: Add FixedSizeBinaryDictionary support

* pqarrow/arrowutils: Add support for Struct and RunEndEncoded

* pqarrow/arrowutils: Handle empty structs correctly

* pqarrow/arrowutils: Retain record if unmodified

* query/physicalplan: Sampler requires LessOrEqual 1024 bytes allocations

* remove trailing newline

* pqarrow/arrowutils: Limit by using the arrowutils Take helper

This will also allow us to limit all the newly supported column types: timestamp, struct, and run-end-encoded.

* query/physicalplan: Guard against s.size==0 panic

Please take a look at this fix, @thorfour. I'm not sure why this started panicking now.

* pqarrow/arrowutils: Release List ValueBuilder
This fixes compilation issues using go1.23+.
….1 (#931)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…20.5 (#928)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…952)

I'm having a hard time reproducing this panic. The if statement is exactly copied from the underlying code, but we are able to give more context here.
* *: Upgrade arrow to v18

* Upgrade custom go version for DST

* .github: Upgrade golangci-lint

* Fix linter issues

* Use latest golangci-lint
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>