PARQUET-306: Add row group alignment #211
Conversation
@isnotinvain could you look at this before the ParquetWriter builder? If we can get this in, then I'll have to update the builder class to set the padding threshold.
should this just throw if !hasNonNullValue?
We already provide hasNonNullValue, shouldn't the caller inspect that before calling getMaxBytes?
This gets called from within Jackson when converting to JSON, so we don't have much of a choice.
ok SGTM
How do we make sure that row groups are still aligning with HDFS blocks? Everything gets written to an output stream, so it seems like it would be really easy to accidentally shift everything slightly and ruin the alignment. For example, we write a magic header at the beginning of the file, right? Does that push everything slightly out of alignment? It only has to be slightly off for any wins to be lost, right? I'm just wondering if we're relying on being very careful about that, or if the way this works is not sensitive to that (maybe it's not, because it asks the output stream for its position, which is all that matters?).
This is based on the position that the FS OutputStream reports, which is always going to be correct. If some other thread somehow had a reference to it and wrote data, then this would see that the position had changed. I'm also going to update this so that we don't use a percentage of the row group size to configure it. I'm going to add a max padding setting in bytes instead.
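The position-based approach described above can be sketched as follows. This is an illustrative sketch, not parquet-mr's actual code; `remainingInBlock` and the variable names are assumptions. Because padding is derived only from the position the output stream reports, a fixed-size header (like the 4-byte file magic) cannot silently break alignment; it only changes how much padding the first alignment decision sees.

```java
public class PaddingSketch {
    /** Bytes left before the next HDFS block boundary, given the stream position. */
    static long remainingInBlock(long pos, long blockSize) {
        long used = pos % blockSize;
        // at an exact boundary, a whole fresh block is available
        return used == 0 ? blockSize : blockSize - used;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block
        long posAfterMagic = 4;               // the 4-byte file magic was written
        // the magic shifts the first decision by 4 bytes, nothing more
        System.out.println(remainingInBlock(posAfterMagic, blockSize));
    }
}
```

Later blocks start exactly on a boundary regardless of the header, since every decision re-reads the current position.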
Ok, changed padding thresh to maxPaddingSize. I think that's much better.
I've fixed the tests and added webhdfs and viewfs to the supported block schemes. Should be good to go?
nit, but how about setMaxRowGroupPaddingBytes
What do you think about adding one more test that writes a parquet file (with padding) and then reads it back? (There are already lots of tests that do this, but without padding.) I remember seeing discussion around a different issue where parquet stores an offset for the dictionary header but never actually reads it, just assuming that the dictionary is the first page, so we never noticed that the offset was wrong. I guess I'm saying it'd be good to sanity-check that parquet still reads files with padding in them without issue.
@isnotinvain, good idea. I originally set the default padding to non-zero so all the tests tested that, which is why you don't see one. I still think we should change the default to include a reasonable amount of padding, but probably not in this PR.
I added the test, but I didn't change the method names.
This adds two strategies: NoAlignment and PaddingAlignment. NoAlignment matches the current behavior and is used by default for all file systems other than HDFS, because there is no need to align with blocks. PaddingAlignment will add zero-padding if less than half of the target row group size remains in the current HDFS block. If more than half remains, it will return the number of remaining bytes as the target size of the next row group, or the row group size if that is smaller.
This uses the getNextRowGroupSize in InternalParquetRecordWriter to set the target size of the next row group when a row group is flushed. The actual target size is either this value (the remaining bytes in the block) or the row group size set by the memory manager, whichever is smaller.
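A sketch of the decision the two paragraphs above describe; the method and variable names here are illustrative, not the actual parquet-mr signatures:

```java
public class AlignmentSketch {
    /** Padding to write before the next row group, or -1 when no padding is needed. */
    static long paddingNeeded(long remainingInBlock, long rowGroupSize) {
        // pad to the block boundary when less than half a row group fits
        return remainingInBlock < rowGroupSize / 2 ? remainingInBlock : -1;
    }

    /** Target size for the next row group when no padding was written. */
    static long nextRowGroupSize(long remainingInBlock, long rowGroupSize) {
        // aim at the space left in the block, or a full row group if smaller
        return Math.min(remainingInBlock, rowGroupSize);
    }

    public static void main(String[] args) {
        long rowGroup = 128L * 1024 * 1024;
        // 40 MB left, less than half of 128 MB: pad all 40 MB
        System.out.println(paddingNeeded(40L * 1024 * 1024, rowGroup));
        // 100 MB left, more than half: the next row group targets 100 MB
        System.out.println(nextRowGroupSize(100L * 1024 * 1024, rowGroup));
    }
}
```

In the writer, this target is further capped by the size the memory manager chose, whichever is smaller.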
This setting is the maximum amount of padding, in bytes, that will be used to align row groups with HDFS blocks. It is also the minimum target size for a row group.
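The later revision of this PR (maxPaddingSize) replaces the half-row-group rule with this byte limit. A minimal sketch under that reading; the names are assumptions, not the final API:

```java
public class MaxPaddingSketch {
    /** Pad only when the leftover space is within the configured byte budget. */
    static boolean shouldPad(long remainingInBlock, long maxPaddingSize) {
        return remainingInBlock <= maxPaddingSize;
    }

    public static void main(String[] args) {
        long maxPadding = 8L * 1024 * 1024;  // assume an 8 MB padding budget
        // 5 MB left in the block: within budget, pad it
        System.out.println(shouldPad(5L * 1024 * 1024, maxPadding));
        // 64 MB left: too much to pad, so 64 MB becomes the next target;
        // this is why maxPaddingSize is also the minimum row group target
        System.out.println(shouldPad(64L * 1024 * 1024, maxPadding));
    }
}
```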
Rebased on top of #221 and tests are passing.
probably want to getBytes("UTF-8") right?
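The nit here is about platform-dependent charsets: `String.getBytes()` with no argument uses the JVM's default charset, which varies across environments, so a test comparing raw bytes can pass on one machine and fail on another. A small illustration (using `StandardCharsets.UTF_8`, which unlike `getBytes("UTF-8")` cannot throw `UnsupportedEncodingException`):

```java
import java.nio.charset.StandardCharsets;

public class CharsetExample {
    public static void main(String[] args) {
        String s = "parquet";
        byte[] dflt = s.getBytes();                        // JVM default charset
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);  // deterministic everywhere
        System.out.println(utf8.length); // 7: one byte per ASCII character
    }
}
```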
+1, just one comment about character encodings in the tests
This adds an `AlignmentStrategy` to the `ParquetFileWriter` that can alter the position of row groups and recommend a target size for the next row group. There are two strategies: `NoAlignment` and `PaddingAlignment`. Padding alignment is used for HDFS, and no alignment is used for all other file systems. When HDFS-3689 is available, we can add a strategy that uses it.

The amount of padding is controlled by a threshold between 0 and 1: the fraction of the row group size that can be padded, interpreted as the maximum amount of padding that is acceptable relative to the row group size. For example, setting this to 5% will write padding when the bytes left in an HDFS block are less than 5% of the row group size. This defaults to 0%, which prevents padding from being added and matches the current behavior. The threshold is controlled by a new OutputFormat configuration property, `parquet.writer.padding-thresh`.
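A worked example of the threshold described above; this is illustrative code, not the writer's actual implementation:

```java
public class PaddingThreshExample {
    /** Pad when the space left in the block is under thresh * rowGroupSize. */
    static boolean shouldPad(long remainingInBlock, long rowGroupSize, double thresh) {
        return remainingInBlock < rowGroupSize * thresh;
    }

    public static void main(String[] args) {
        long rowGroup = 128L * 1024 * 1024;  // 128 MB row groups
        double thresh = 0.05;                // 5% => pad at most ~6.4 MB
        // 4 MB left: under 6.4 MB, so pad to the block boundary
        System.out.println(shouldPad(4L * 1024 * 1024, rowGroup, thresh));
        // 32 MB left: over the threshold, no padding
        System.out.println(shouldPad(32L * 1024 * 1024, rowGroup, thresh));
        // the 0.0 default never pads, matching the pre-patch behavior
        System.out.println(shouldPad(4L * 1024 * 1024, rowGroup, 0.0));
    }
}
```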