Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Jun 9, 2015

This adds AlignmentStrategy to the ParquetFileWriter that can alter the position of row groups and recommend a target size for the next row group. There are two strategies: NoAlignment and PaddingAlignment. Padding alignment is used for HDFS and no alignment is used for all other file systems. When HDFS-3689 is available, we can add a strategy to use that.

The amount of padding is controlled by a threshold between 0 and 1 that controls the fraction of the row group size that can be padded. This is interpreted as the maximum amount of padding that is acceptable, in terms of the row group size. For example, setting this to 5% will write padding when the bytes left in a HDFS block are less than 5% of the row group size. This defaults to 0%, which prevents padding from being added and matches the current behavior. The threshold is controlled by a new OutputFormat configuration property, parquet.writer.padding-thresh.

@rdblue
Copy link
Contributor Author

rdblue commented Jun 16, 2015

@isnotinvain could you look at this before the ParquetWriter builder? If we can get this in, then I'll have to update the builder class to set the padding threshold.

@rdblue rdblue force-pushed the PARQUET-306-row-group-alignment branch from a65cb43 to 0e1c240 Compare June 16, 2015 23:05
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this just throw if !hasNonNullValue?
We already provide hasNonNullValue, shouldn't the caller inspect that before calling getMaxBytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets called from within Jackson when converting to JSON, so we don't have much of a choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok SGTM

@isnotinvain
Copy link
Contributor

How do we make sure that row groups are still aligning with HDFS blocks? Everything gets written to an output stream, so it seems like it would be really easy to accidentally shift everything slightly and ruin the alignment -- eg we write a magic header at the beginning of the file right? does that push everything slightly out of alignment? It only has to be slightly off for any wins to be lost right?

I'm just wondering if we're relying on being very careful about that, or if the way this works is not sensitive to that (maybe it's not because it asks the output stream for it's position, which is all that matters maybe?)

@rdblue
Copy link
Contributor Author

rdblue commented Jun 17, 2015

How do we make sure that row groups are still aligning with HDFS blocks?

This is based off of the position that the FS OutputStream reports, which is always going to be correct. If some other thread somehow had a reference to it and wrote data, then this would see that the position had changed.

I'm also going to update this so that we don't use a percentage of the row group size to configure it. I'm going to add a max padding setting in bytes instead.

@rdblue rdblue force-pushed the PARQUET-306-row-group-alignment branch 2 times, most recently from 7c092e2 to 46f5953 Compare June 17, 2015 01:12
@rdblue
Copy link
Contributor Author

rdblue commented Jun 17, 2015

Ok, changed padding thresh to maxPaddingSize. I think that's much better.

@rdblue rdblue force-pushed the PARQUET-306-row-group-alignment branch from 46f5953 to 027c108 Compare June 17, 2015 23:47
@rdblue
Copy link
Contributor Author

rdblue commented Jun 17, 2015

I've fixed the tests and added webhdfs and viewfs to the supported block schemes. Should be good to go?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, but how about setMaxRowGroupPaddingBytes

@isnotinvain
Copy link
Contributor

What do you think about adding one more test, that writes a parquet file (with padding) and then reads it back (there are already lots of tests that do this, but w/o padding). I remember seeing discussion around a different issue where parquet stores an offset for the dictionary header, but never actually reads it, just assuming that the dictionary is the first page, and so we never noticed that the offset was wrong, etc. I guess I'm saying it'd be good to sanity check that parquet still reads files with padding in them w/o issue.

@rdblue
Copy link
Contributor Author

rdblue commented Jun 18, 2015

@isnotinvain, good idea. I originally set the default padding to non-zero so all the tests tested that, which is why you don't see one. I still think we should change the default to include a reasonable amount of padding, but probably not in this PR.

@rdblue
Copy link
Contributor Author

rdblue commented Jun 18, 2015

I added the test, but I didn't change the method names.

@rdblue rdblue force-pushed the PARQUET-306-row-group-alignment branch 2 times, most recently from 04144ad to 329be30 Compare June 22, 2015 18:25
rdblue added 3 commits June 22, 2015 14:30
This adds two strategies: NoAlignment, and PaddingAlignment.

NoAlignment matches the current behavior and is used by default for all
file systems other than HDFS because there is no need to align with
blocks.

PaddingAlignment will add zero-padding if less than half of the target
row group size remains in the current HDFS block. If there is more than
half, it will return the number of remaining bytes for the target size
of the next row group, or the row group size if it is smaller.
This uses the getNextRowGroupSize in InternalParquetRecordWriter to set
the target size of the next row group when a row group is flushed. The
actual target size is either this value (the remaining bytes in the
block) or the row group size set by the memory manager, whichever is
smaller.
This is a size, in bytes, that is the maximum amount of padding that
will be used to align row groups with HDFS blocks. This is also the
minimum target size for a row group.
@rdblue rdblue force-pushed the PARQUET-306-row-group-alignment branch from 329be30 to 9b4e22a Compare June 22, 2015 21:30
@rdblue
Copy link
Contributor Author

rdblue commented Jun 22, 2015

Rebased on top of #221 and tests are passing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably want to getBytes("UTF-8") right?

@isnotinvain
Copy link
Contributor

+1, just one comment about character encodings in the tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants