Skip to content

Conversation

@gszadovszky
Copy link
Contributor

In parquet-format every value in Statistics is optional while parquet-mr does not properly handle these scenarios:

  • null_count is set but min/max or min_value/max_value are not: filtering may fail with NPE or incorrect filtering occurs
    fix: check if min/max is set before comparing to the related values
  • null_count is not set: filtering handles null_count as if it would be 0 -> incorrect filtering may occur
    fix: introduce new method in Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering

Author: Gabor Szadovszky [email protected]

Closes #458 from gszadovszky/PARQUET-1217 and squashes the following commits:

9d14090 [Gabor Szadovszky] Updates according to rdblue's comments
116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments
c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount
2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing values in Statistics

This change is based on b82d962 but is not a clean cherry-pick.

In parquet-format every value in Statistics is optional while parquet-mr does not properly handle these scenarios:
- null_count is set but min/max or min_value/max_value are not: filtering may fail with NPE or incorrect filtering occurs
  fix: check if min/max is set before comparing to the related values
- null_count is not set: filtering handles null_count as if it would be 0 -> incorrect filtering may occur
  fix: introduce new method in Statistics object to check if num_nulls is set; check if num_nulls is set by the new method before using its value for filtering

Author: Gabor Szadovszky <[email protected]>

Closes apache#458 from gszadovszky/PARQUET-1217 and squashes the following commits:

9d14090 [Gabor Szadovszky] Updates according to rdblue's comments
116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments
c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount
2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing values in Statistics

This change is based on b82d962 but is not a clean cherry-pick.
Copy link
Contributor

@zivanfi zivanfi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants