fix: patched-base encoding producing empty patch list #74
+31
−1
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
Overview
Upon encoding a random integer sequence of relatively constant bit width, function
determine_variable_run_encodingwrongly chooses patched-base encoding when it should have chosen direct encoding, resulting in an ill-formed patched-base encoded sequence with an empty patch list.I noticed the issue while using the
orcmodule ofpyarrowto read a file I wrote with this crate, and got the following error:Details
For instance, with the following input:
We should compute the following bit widths:
0to17): 5 bits31): also 5 bits!According to the ORCv2 Specification Draft (sections on Direct encoding and Patched based encoding), since there is no difference in bit width between the lowest 95% values and the highest 5% values, direct encoding should be used instead of patched base encoding.
But the
determine_variable_run_encodingfunction actually aligns the bit width of the max value to the closest power of 2:https://github.com/datafusion-contrib/orc-rust/blob/d460b779f33b96f6392c75fb2c9621b174ef35bb/src/encoding/integer/rle_v2/mod.rs#L513C1-L513C90
Meaning that, for the input given above, the computed bit widths will be:
0to17): 5 bits31): 8 bits!Since the two computed bit widths are different, the function chooses patched-base encoding over direct encoding.
Then, the patched-base encoding procedes and since there is no real different in bit width, no patch is needed and we end up with an empty patch list.
This issue will happen for sequences resembling the above example, i.e. sequences which have a lot of "low" values and some "high" values in such a way that the bit width needed to encode the 95% lowest values is equal to the bit width needed to encode the highest value.
Impact on other types of random sequences
For other types of random sequences, I also noticed that this alignment makes the patched-base encoding overestimate the patch width (PW). Using the example data from the section on Patched based encoding of the ORCv2 Specification Draft, I noticed that the function computes a patch width of
16bits, instead of the expected12bits of the specification:Fix
From my understanding of the ORCv2 Specification Draft, I propose the simple fix below, removing the alignment of the bit width of the max value:
From what I've tested, this seems to fix the issue detailed above. I still do not understand why this alignment was here in the first place. Maybe a change of spec at some point ? I'm still quite new to ORC so I might have missed certain aspects of the spec on this matter.
To test this fix, I also added 2 tests for the
determine_variable_run_encodingfunction: