@baptistemehat

Problem

Overview

When encoding a random integer sequence of relatively constant bit width, the determine_variable_run_encoding function wrongly chooses patched-base encoding where it should choose direct encoding, resulting in an ill-formed patched-base encoded sequence with an empty patch list.

I noticed the issue while using the orc module of pyarrow to read a file I wrote with this crate, and got the following error:

Corrupt PATCHED_BASE encoded data (pl==0)!

Details

For instance, with the following input:

[0, 7, 6, 4, 5, 7, 0, 5, 6, 1, 4, 6, 5, 5, 3, 6, 7, 31, 17, 3]

We should compute the following bit widths:

  • Bit width needed for the 95% lowest values (values ranging from 0 to 17): 5 bits
  • Bit width needed for max value (31): also 5 bits!
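As a rough check of these two numbers, here is a minimal Rust sketch. The helpers `bits_used` and `percentile_value` are hypothetical, written only for this illustration (they are not the crate's API), and the percentile is taken by sorting, which is likely simpler than what the encoder does internally.

```rust
/// Number of bits needed to represent `v` (0 for `v == 0`).
/// Hypothetical helper, not the crate's own `bits_used` method.
fn bits_used(v: u64) -> u32 {
    u64::BITS - v.leading_zeros()
}

/// Largest value among the lowest `pct` percent of `values`, found by sorting.
/// Assumes `values` is non-empty and `1 <= pct <= 100`.
fn percentile_value(values: &[u64], pct: usize) -> u64 {
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    // ceil(len * pct / 100): how many of the lowest values are covered.
    let count = (sorted.len() * pct + 99) / 100;
    sorted[count - 1]
}

/// The input sequence from the example above.
const EXAMPLE: [u64; 20] = [
    0, 7, 6, 4, 5, 7, 0, 5, 6, 1, 4, 6, 5, 5, 3, 6, 7, 31, 17, 3,
];
```

Under these assumptions, `percentile_value(&EXAMPLE, 95)` is 17, and both `bits_used(17)` and `bits_used(31)` evaluate to 5.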

According to the ORCv2 Specification Draft (sections on Direct encoding and Patched Base encoding), since there is no difference in bit width between the lowest 95% of values and the highest 5% of values, direct encoding should be used instead of patched-base encoding.

But the determine_variable_run_encoding function actually aligns the bit width of the max value to the closest power of 2:
https://github.com/datafusion-contrib/orc-rust/blob/d460b779f33b96f6392c75fb2c9621b174ef35bb/src/encoding/integer/rle_v2/mod.rs#L513C1-L513C90

Meaning that, for the input given above, the computed bit widths will be:

  • Bit width needed for the 95% lowest values (values ranging from 0 to 17): 5 bits
  • Aligned bit width needed for max value (31): 8 bits!

Since the two computed bit widths are different, the function chooses patched-base encoding over direct encoding.
Then the patched-base encoding proceeds, and since there is no real difference in bit width, no patch is needed and we end up with an empty patch list.
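The effect of the alignment on this decision can be sketched as follows. This is a simplified illustration, not the crate's code: `Encoding`, `choose_encoding`, and this version of `closest_aligned_bit_width` are hypothetical, the table of aligned widths is an assumption, and the real function also considers other encodings and conditions.

```rust
#[derive(Debug, PartialEq)]
enum Encoding {
    Direct,
    PatchedBase,
}

/// Rounds a bit width up to the next "aligned" width.
/// The table is an assumption made for this sketch; the crate may differ.
fn closest_aligned_bit_width(bits: u32) -> u32 {
    const ALIGNED: [u32; 11] = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64];
    ALIGNED.iter().copied().find(|&w| w >= bits).unwrap_or(64)
}

/// Simplified decision: patching only pays off when the maximum value
/// needs more bits than the lowest 95% of values.
fn choose_encoding(p95_bits: u32, max_bits: u32) -> Encoding {
    if p95_bits == max_bits {
        Encoding::Direct
    } else {
        Encoding::PatchedBase
    }
}
```

With the widths from the example, `choose_encoding(5, closest_aligned_bit_width(5))` compares 5 against 8 and selects patched base, while `choose_encoding(5, 5)` selects direct.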

This issue will happen for sequences resembling the above example, i.e. sequences which have a lot of "low" values and some "high" values in such a way that the bit width needed to encode the 95% lowest values is equal to the bit width needed to encode the highest value.

Impact on other types of random sequences

For other types of random sequences, I also noticed that this alignment makes the patched-base encoding overestimate the patch width (PW). Using the example data from the Patched Base section of the ORCv2 Specification Draft, the function computes a patch width of 16 bits instead of the 12 bits given in the specification:

[...] It has an encoding of patched base (2), a bit width of 8 (7), a length of 20 (19), a base value width of 2 bytes (1), a patch width of 12 bits (11), patch gap width of 2 bits (1), and a patch list length of 1 (1). [...]
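The 12-bit versus 16-bit difference can be reproduced with a little arithmetic. The sketch below rests on several assumptions: the base-reduced maximum of 998000 is recalled from the specification's example (maximum 1000000, minimum 2000) and is not stated in this PR, the patch width is taken as the maximum width minus the 95th-percentile width, and the helpers are hypothetical stand-ins rather than the crate's functions.

```rust
/// Number of bits needed to represent `v` (hypothetical helper).
fn bits_used(v: u64) -> u32 {
    u64::BITS - v.leading_zeros()
}

/// Rounds a bit width up to the next "aligned" width.
/// The table is an assumption made for this sketch; the crate may differ.
fn closest_aligned_bit_width(bits: u32) -> u32 {
    const ALIGNED: [u32; 11] = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64];
    ALIGNED.iter().copied().find(|&w| w >= bits).unwrap_or(64)
}

/// Patch width: the extra bits the maximum value needs beyond the
/// width that covers the lowest 95% of values.
fn patch_width(max_bits: u32, p95_bits: u32) -> u32 {
    max_bits - p95_bits
}
```

With a 95th-percentile width of 8 bits, `patch_width(bits_used(998_000), 8)` gives 20 - 8 = 12, whereas aligning the maximum width first gives 24 - 8 = 16.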

Fix

From my understanding of the ORCv2 Specification Draft, I propose the simple fix below, removing the alignment of the bit width of the max value:

- let base_reduced_literals_max_bit_width = max_data_value.closest_aligned_bit_width();
+ let base_reduced_literals_max_bit_width = max_data_value.bits_used();

From what I've tested, this seems to fix the issue detailed above. I still do not understand why this alignment was there in the first place. Maybe the spec changed at some point? I'm still quite new to ORC, so I might have missed certain aspects of the spec on this matter.

To test this fix, I also added 2 tests for the determine_variable_run_encoding function:

  • one using the example sequence from my above example
  • one using the specification's example data

@WenyXu WenyXu requested review from Jefffrey and Copilot and removed request for Jefffrey December 15, 2025 13:29
@WenyXu
Collaborator

WenyXu commented Dec 15, 2025

Hi @baptistemehat, the CI check failed. Please run make fmt to fix the formatting.

Copilot AI left a comment

Pull request overview

This pull request fixes a bug in the RLE v2 patched-base encoding that incorrectly chose patched-base encoding when direct encoding should have been used, resulting in ill-formed sequences with empty patch lists. The fix changes the bit width calculation from using an aligned bit width to using the exact bit width when deciding between encoding strategies.

  • Changed bit width calculation from closest_aligned_bit_width() to bits_used() for encoding decision logic
  • Added two tests: one validating the ORC v2 spec example, and one verifying the specific bug fix

@WenyXu WenyXu requested a review from Jefffrey December 16, 2025 02:52