PARQUET-306: Add row group alignment #211
Conversation
@isnotinvain could you look at this before the ParquetWriter builder? If we can get this in, then I'll have to update the builder class to set the padding threshold.
should this just throw if !hasNonNullValue?
We already provide hasNonNullValue, shouldn't the caller inspect that before calling getMaxBytes?
This gets called from within Jackson when converting to JSON, so we don't have much of a choice.
ok SGTM
How do we make sure that row groups are still aligning with HDFS blocks? Everything gets written to an output stream, so it seems like it would be really easy to accidentally shift everything slightly and ruin the alignment. For example, we write a magic header at the beginning of the file, right? Does that push everything slightly out of alignment? It only has to be slightly off for any wins to be lost, right? I'm just wondering if we're relying on being very careful about that, or if the way this works is not sensitive to that (maybe it's not, because it asks the output stream for its position, which is all that matters?).
This is based on the position that the FS OutputStream reports, which is always going to be correct. If some other thread somehow had a reference to it and wrote data, then this would see that the position had changed. I'm also going to update this so that we don't use a percentage of the row group size to configure it. I'm going to add a max padding setting in bytes instead.
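The position-based approach described above can be sketched as follows. This is an illustrative sketch, not parquet-mr's actual code; `remainingInBlock` and the variable names are assumptions. Because padding is derived only from the position the output stream reports, a fixed-size header (like the 4-byte file magic) cannot silently break alignment; it only changes how much padding the first alignment decision sees.

```java
public class PaddingSketch {
    /** Bytes left before the next HDFS block boundary, given the stream position. */
    static long remainingInBlock(long pos, long blockSize) {
        long used = pos % blockSize;
        // at an exact boundary, a whole fresh block is available
        return used == 0 ? blockSize : blockSize - used;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block
        long posAfterMagic = 4;               // the 4-byte file magic was written
        // the magic shifts the first decision by 4 bytes, nothing more
        System.out.println(remainingInBlock(posAfterMagic, blockSize));
    }
}
```

Later blocks start exactly on a boundary regardless of the header, since every decision re-reads the current position.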
Ok, changed padding thresh to maxPaddingSize. I think that's much better.
I've fixed the tests and added webhdfs and viewfs to the supported block schemes. Should be good to go?
nit, but how about setMaxRowGroupPaddingBytes
What do you think about adding one more test that writes a parquet file (with padding) and then reads it back? (There are already lots of tests that do this, but without padding.) I remember seeing discussion around a different issue where parquet stores an offset for the dictionary header but never actually reads it, just assuming that the dictionary is the first page, so we never noticed that the offset was wrong. I guess I'm saying it'd be good to sanity-check that parquet still reads files with padding in them without issue.
@isnotinvain, good idea. I originally set the default padding to non-zero so all the tests tested that, which is why you don't see one. I still think we should change the default to include a reasonable amount of padding, but probably not in this PR.
I added the test, but I didn't change the method names.
This adds two strategies: NoAlignment and PaddingAlignment. NoAlignment matches the current behavior and is used by default for all file systems other than HDFS, because there is no need to align with blocks. PaddingAlignment will add zero-padding if less than half of the target row group size remains in the current HDFS block. If more than half remains, it will return the number of remaining bytes as the target size of the next row group, or the row group size if that is smaller.
This uses the getNextRowGroupSize in InternalParquetRecordWriter to set the target size of the next row group when a row group is flushed. The actual target size is either this value (the remaining bytes in the block) or the row group size set by the memory manager, whichever is smaller.
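A sketch of the decision the two paragraphs above describe; the method and variable names here are illustrative, not the actual parquet-mr signatures:

```java
public class AlignmentSketch {
    /** Padding to write before the next row group, or -1 when no padding is needed. */
    static long paddingNeeded(long remainingInBlock, long rowGroupSize) {
        // pad to the block boundary when less than half a row group fits
        return remainingInBlock < rowGroupSize / 2 ? remainingInBlock : -1;
    }

    /** Target size for the next row group when no padding was written. */
    static long nextRowGroupSize(long remainingInBlock, long rowGroupSize) {
        // aim at the space left in the block, or a full row group if smaller
        return Math.min(remainingInBlock, rowGroupSize);
    }

    public static void main(String[] args) {
        long rowGroup = 128L * 1024 * 1024;
        // 40 MB left, less than half of 128 MB: pad all 40 MB
        System.out.println(paddingNeeded(40L * 1024 * 1024, rowGroup));
        // 100 MB left, more than half: the next row group targets 100 MB
        System.out.println(nextRowGroupSize(100L * 1024 * 1024, rowGroup));
    }
}
```

In the writer, this target is further capped by the size the memory manager chose, whichever is smaller.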
This setting is the maximum amount of padding, in bytes, that will be used to align row groups with HDFS blocks. It is also the minimum target size for a row group.
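The later revision of this PR (maxPaddingSize) replaces the half-row-group rule with this byte limit. A minimal sketch under that reading; the names are assumptions, not the final API:

```java
public class MaxPaddingSketch {
    /** Pad only when the leftover space is within the configured byte budget. */
    static boolean shouldPad(long remainingInBlock, long maxPaddingSize) {
        return remainingInBlock <= maxPaddingSize;
    }

    public static void main(String[] args) {
        long maxPadding = 8L * 1024 * 1024;  // assume an 8 MB padding budget
        // 5 MB left in the block: within budget, pad it
        System.out.println(shouldPad(5L * 1024 * 1024, maxPadding));
        // 64 MB left: too much to pad, so 64 MB becomes the next target;
        // this is why maxPaddingSize is also the minimum row group target
        System.out.println(shouldPad(64L * 1024 * 1024, maxPadding));
    }
}
```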
Rebased on top of #221 and tests are passing.
probably want to getBytes("UTF-8") right?
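The nit here is about platform-dependent charsets: `String.getBytes()` with no argument uses the JVM's default charset, which varies across environments, so a test comparing raw bytes can pass on one machine and fail on another. A small illustration (using `StandardCharsets.UTF_8`, which unlike `getBytes("UTF-8")` cannot throw `UnsupportedEncodingException`):

```java
import java.nio.charset.StandardCharsets;

public class CharsetExample {
    public static void main(String[] args) {
        String s = "parquet";
        byte[] dflt = s.getBytes();                        // JVM default charset
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);  // deterministic everywhere
        System.out.println(utf8.length); // 7: one byte per ASCII character
    }
}
```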
+1, just one comment about character encodings in the tests
This adds an `AlignmentStrategy` to the `ParquetFileWriter` that can alter the position of row groups and recommend a target size for the next row group. There are two strategies: `NoAlignment` and `PaddingAlignment`. Padding alignment is used for HDFS, and no alignment is used for all other file systems. When HDFS-3689 is available, we can add a strategy that uses it.

The amount of padding is controlled by a threshold between 0 and 1: the fraction of the row group size that can be padded, interpreted as the maximum amount of padding that is acceptable relative to the row group size. For example, setting this to 5% will write padding when the bytes left in an HDFS block are less than 5% of the row group size. This defaults to 0%, which prevents padding from being added and matches the current behavior. The threshold is controlled by a new OutputFormat configuration property, `parquet.writer.padding-thresh`.
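A worked example of the threshold described above; this is illustrative code, not the writer's actual implementation:

```java
public class PaddingThreshExample {
    /** Pad when the space left in the block is under thresh * rowGroupSize. */
    static boolean shouldPad(long remainingInBlock, long rowGroupSize, double thresh) {
        return remainingInBlock < rowGroupSize * thresh;
    }

    public static void main(String[] args) {
        long rowGroup = 128L * 1024 * 1024;  // 128 MB row groups
        double thresh = 0.05;                // 5% => pad at most ~6.4 MB
        // 4 MB left: under 6.4 MB, so pad to the block boundary
        System.out.println(shouldPad(4L * 1024 * 1024, rowGroup, thresh));
        // 32 MB left: over the threshold, no padding
        System.out.println(shouldPad(32L * 1024 * 1024, rowGroup, thresh));
        // the 0.0 default never pads, matching the pre-patch behavior
        System.out.println(shouldPad(4L * 1024 * 1024, rowGroup, 0.0));
    }
}
```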