Add ParquetMetaDataBuilder
#6466
Changes from all commits
a2818bf
572ce5c
5e324f6
b6c63a8
997c3f9
c0432e6
21e8dc2
@@ -191,11 +191,13 @@ impl&lt;R: 'static + ChunkReader&gt; SerializedFileReader&lt;R&gt; {
     /// Creates file reader from a Parquet file with read options.
     /// Returns error if Parquet file does not exist or is corrupt.
     pub fn new_with_options(chunk_reader: R, options: ReadOptions) -> Result<Self> {
-        let metadata = ParquetMetaDataReader::new().parse_and_finish(&chunk_reader)?;
+        let mut metadata_builder = ParquetMetaDataReader::new()
+            .parse_and_finish(&chunk_reader)?
+            .into_builder();
         let mut predicates = options.predicates;
-        let row_groups = metadata.row_groups().to_vec();
-        let mut filtered_row_groups = Vec::<RowGroupMetaData>::new();
-        for (i, rg_meta) in row_groups.into_iter().enumerate() {
+
+        // Filter row groups based on the predicates
+        for (i, rg_meta) in metadata_builder.take_row_groups().into_iter().enumerate() {
             let mut keep = true;
             for predicate in &mut predicates {
                 if !predicate(&rg_meta, i) {
@@ -204,41 +206,30 @@ impl&lt;R: 'static + ChunkReader&gt; SerializedFileReader&lt;R&gt; {
                 }
             }
             if keep {
-                filtered_row_groups.push(rg_meta);
+                metadata_builder = metadata_builder.add_row_group(rg_meta);
             }
         }

         if options.enable_page_index {
             let mut columns_indexes = vec![];
             let mut offset_indexes = vec![];

-            for rg in &mut filtered_row_groups {
+            for rg in metadata_builder.row_groups().iter() {
Contributor
I think we can build the metadata here (with the filtered row groups), pass it into
Contributor
if options.enable_page_index {
    let mut reader = ParquetMetaDataReader::new_with_metadata(metadata_builder.build())
        .with_page_indexes(options.enable_page_index);
    reader.read_page_indexes(&chunk_reader)?;
    metadata_builder = ParquetMetaDataBuilder::new_from_metadata(reader.finish()?);
}

I forgot to do this in #6450.
Contributor
Author
I don't quite follow why this is needed. What scenario does it help? (I can write a test to cover it.)
Contributor
I mean replace

if options.enable_page_index {
    let mut columns_indexes = vec![];
    let mut offset_indexes = vec![];
    for rg in metadata_builder.row_groups().iter() {
        let column_index = index_reader::read_columns_indexes(&chunk_reader, rg.columns())?;
        let offset_index = index_reader::read_offset_indexes(&chunk_reader, rg.columns())?;
        columns_indexes.push(column_index);
        offset_indexes.push(offset_index);
    }
    metadata_builder = metadata_builder
        .set_column_index(Some(columns_indexes))
        .set_offset_index(Some(offset_indexes));
}

with the above code snippet from my earlier comment. This should be a bit more efficient since
Contributor
Author
I implemented something slightly different: since there is already a

One thing that might be different is that it looks like the current code may only read the column index/page index for row groups that passed the "predicates", but the ParquetMetadataReader reads the index for all the row groups. That being said, no tests fail, so I am not sure if it is a real problem or not.
Contributor
Hmm, this worries me a bit, since the column and offset indexes will have more row groups represented than are in the

We could prune the page indexes at the same time we're pruning the row groups.
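A hedged sketch of that pruning idea (not code from this PR): filter the per-row-group column/offset index entries in lockstep with the row groups, so the surviving vectors stay aligned. The function name and generic parameters are made up for illustration.

/// Illustrative only: prune per-row-group index entries together with the row
/// groups, so position i refers to the same row group in every vector.
/// `T` stands in for RowGroupMetaData; `C` and `O` for the per-row-group
/// column index and offset index entries.
fn prune_in_lockstep<T, C, O>(
    row_groups: Vec<T>,
    column_index: Vec<C>,
    offset_index: Vec<O>,
    mut keep: impl FnMut(&T, usize) -> bool,
) -> (Vec<T>, Vec<C>, Vec<O>) {
    let mut kept_rg = Vec::new();
    let mut kept_ci = Vec::new();
    let mut kept_oi = Vec::new();
    for (i, ((rg, ci), oi)) in row_groups
        .into_iter()
        .zip(column_index)
        .zip(offset_index)
        .enumerate()
    {
        // Apply the same predicate to all three collections at once.
        if keep(&rg, i) {
            kept_rg.push(rg);
            kept_ci.push(ci);
            kept_oi.push(oi);
        }
    }
    (kept_rg, kept_ci, kept_oi)
}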
Contributor
Author
Yeah, that probably mirrors the intent most closely. How about I back out c0432e6 and we address improving this code in a follow-on PR (along with tests)?
                 let column_index = index_reader::read_columns_indexes(&chunk_reader, rg.columns())?;
                 let offset_index = index_reader::read_offset_indexes(&chunk_reader, rg.columns())?;
                 columns_indexes.push(column_index);
                 offset_indexes.push(offset_index);
             }

-            Ok(Self {
-                chunk_reader: Arc::new(chunk_reader),
-                metadata: Arc::new(ParquetMetaData::new_with_page_index(
-                    metadata.file_metadata().clone(),
-                    filtered_row_groups,
-                    Some(columns_indexes),
-                    Some(offset_indexes),
-                )),
-                props: Arc::new(options.props),
-            })
-        } else {
-            Ok(Self {
-                chunk_reader: Arc::new(chunk_reader),
-                metadata: Arc::new(ParquetMetaData::new(
-                    metadata.file_metadata().clone(),
-                    filtered_row_groups,
-                )),
-                props: Arc::new(options.props),
-            })
+            metadata_builder = metadata_builder
+                .set_column_index(Some(columns_indexes))
+                .set_offset_index(Some(offset_indexes));
         }
+
+        Ok(Self {
+            chunk_reader: Arc::new(chunk_reader),
+            metadata: Arc::new(metadata_builder.build()),
+            props: Arc::new(options.props),
+        })
     }
 }
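For context, a sketch of how the rewritten constructor is exercised from the caller's side, assuming the crate's existing ReadOptionsBuilder API (with_predicate, with_page_index); the file name and the predicate are made up for illustration.

use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::file::serialized_reader::ReadOptionsBuilder;

fn open_with_filter() -> parquet::errors::Result<()> {
    let file = File::open("data.parquet")?; // hypothetical path
    let options = ReadOptionsBuilder::new()
        // Keep only non-empty row groups (illustrative predicate)
        .with_predicate(Box::new(|rg, _i| rg.num_rows() > 0))
        // Exercise the page-index branch of new_with_options
        .with_page_index()
        .build();
    let reader = SerializedFileReader::new_with_options(file, options)?;
    // The resulting metadata contains only the row groups that passed the predicate
    println!("row groups kept: {}", reader.metadata().num_row_groups());
    Ok(())
}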
These methods follow the existing convention of add and set as the other Builders in this crate, such as
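As a rough illustration of that add/set builder style, a hedged sketch that uses only methods visible in this diff (into_builder, take_row_groups, add_row_group, build); the row-count threshold is arbitrary.

use parquet::file::metadata::ParquetMetaData;

// Round-trip ParquetMetaData through the new builder: take the row groups out,
// keep only those above an (arbitrary) row-count threshold, then rebuild.
fn drop_small_row_groups(metadata: ParquetMetaData) -> ParquetMetaData {
    let mut builder = metadata.into_builder();
    for rg in builder.take_row_groups() {
        if rg.num_rows() > 1_000 {
            builder = builder.add_row_group(rg);
        }
    }
    builder.build()
}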