fix: patched-base encoding producing empty patch list #74

baptistemehat · 2025-12-15T13:26:52Z

Problem

Overview

When encoding a random integer sequence of relatively constant bit width, the determine_variable_run_encoding function wrongly chooses patched-base encoding instead of direct encoding, producing an ill-formed patched-base encoded sequence with an empty patch list.

I noticed the issue while using the orc module of pyarrow to read a file I wrote with this crate, and got the following error:

Corrupt PATCHED_BASE encoded data (pl==0)!

Details

For instance, with the following input:

[0, 7, 6, 4, 5, 7, 0, 5, 6, 1, 4, 6, 5, 5, 3, 6, 7, 31, 17, 3]

We should compute the following bit widths:

Bit width needed for the 95% lowest values (values ranging from 0 to 17): 5 bits
Bit width needed for max value (31): also 5 bits!

According to the ORCv2 Specification Draft (sections on Direct encoding and Patched Base encoding), since there is no difference in bit width between the lowest 95% of values and the highest 5% of values, direct encoding should be used instead of patched-base encoding.
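The width comparison above can be checked with a minimal sketch. It assumes a simplified percentile (sort the values, drop the top 5%) and uses stand-in free functions `bits_used` and `p95_bit_width`; these are hypothetical helpers for illustration, not the crate's actual API or percentile logic.

```rust
/// Number of bits needed to represent `v` (stand-in helper, not the crate's method).
fn bits_used(v: u64) -> u32 {
    u64::BITS - v.leading_zeros()
}

/// Bit width covering the lowest 95% of `values` (hypothetical, simplified percentile).
fn p95_bit_width(values: &[u64]) -> u32 {
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    // Drop the top 5% of values, using integer arithmetic rounded down.
    let keep = sorted.len() - sorted.len() * 5 / 100;
    bits_used(sorted[keep - 1])
}

fn main() {
    let input: [u64; 20] = [0, 7, 6, 4, 5, 7, 0, 5, 6, 1, 4, 6, 5, 5, 3, 6, 7, 31, 17, 3];
    let p95 = p95_bit_width(&input);
    let max = bits_used(*input.iter().max().unwrap());
    println!("p95 width = {p95}, max width = {max}");
    // Equal widths mean no value needs a patch, so direct encoding applies.
    assert_eq!(p95, max);
}
```

The minimum of this input is 0, so the base-reduced values are the same as the raw values.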

But the determine_variable_run_encoding function actually aligns the bit width of the max value to the closest power of 2:
https://github.com/datafusion-contrib/orc-rust/blob/d460b779f33b96f6392c75fb2c9621b174ef35bb/src/encoding/integer/rle_v2/mod.rs#L513C1-L513C90

Meaning that, for the input given above, the computed bit widths will be:

Bit width needed for the 95% lowest values (values ranging from 0 to 17): 5 bits
Aligned Bit width needed for max value (31): 8 bits!

Since the two computed bit widths are different, the function chooses patched-base encoding over direct encoding.
Then, the patched-base encoding proceeds, and since there is no real difference in bit width, no patch is needed and we end up with an empty patch list.
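The effect of the alignment on the decision can be sketched as follows. This is an illustration under assumptions: `aligned_width` is a hypothetical stand-in that rounds up to 1, 2, 4, 8 and then to the next multiple of 8 (the crate's `closest_aligned_bit_width` may differ in detail), and `choose` reduces the decision to the single width comparison discussed here, ignoring every other condition the real function checks.

```rust
/// Number of bits needed to represent `v` (stand-in helper).
fn bits_used(v: u64) -> u32 {
    u64::BITS - v.leading_zeros()
}

/// Hypothetical stand-in for the alignment: 1, 2, 4, 8, then multiples of 8.
fn aligned_width(bits: u32) -> u32 {
    match bits {
        0 | 1 => 1,
        2 => 2,
        3 | 4 => 4,
        _ => (bits + 7) / 8 * 8,
    }
}

#[derive(Debug, PartialEq)]
enum Choice {
    Direct,
    PatchedBase,
}

/// Simplified decision: patch only if the max needs more bits than the 95th percentile.
fn choose(p95_bits: u32, max_bits: u32) -> Choice {
    if p95_bits == max_bits {
        Choice::Direct
    } else {
        Choice::PatchedBase
    }
}

fn main() {
    let p95_bits = bits_used(17); // 5 bits
    let max_bits = bits_used(31); // 5 bits

    // Exact widths: no difference, so direct encoding is chosen.
    assert_eq!(choose(p95_bits, max_bits), Choice::Direct);

    // Aligned max width: 5 is rounded up to 8, so patched base is chosen
    // even though no value actually exceeds 5 bits.
    assert_eq!(aligned_width(max_bits), 8);
    assert_eq!(choose(p95_bits, aligned_width(max_bits)), Choice::PatchedBase);
}
```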

This issue will happen for sequences resembling the above example, i.e. sequences which have a lot of "low" values and some "high" values in such a way that the bit width needed to encode the 95% lowest values is equal to the bit width needed to encode the highest value.

Impact on other types of random sequences

For other types of random sequences, I also noticed that this alignment makes the patched-base encoding overestimate the patch width (PW). Using the example data from the section on Patched Base encoding of the ORCv2 Specification Draft, I noticed that the function computes a patch width of 16 bits, instead of the 12 bits the specification expects:

[...] It has an encoding of patched base (2), a bit width of 8 (7), a length of 20 (19), a base value width of 2 bytes (1), a patch width of 12 bits (11), patch gap width of 2 bits (1), and a patch list length of 1 (1). [...]
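The 12-bit versus 16-bit difference can be reproduced with a short sketch. It assumes the spec's example is the 20-value sequence below (reproduced here as an assumption, check it against the spec), takes the 8-bit width from the quoted text, and uses a hypothetical `aligned_width` stand-in whose rounding to multiples of 8 was chosen to match the 16-bit figure reported above; it is not the crate's code.

```rust
/// Number of bits needed to represent `v` (stand-in helper).
fn bits_used(v: u64) -> u32 {
    u64::BITS - v.leading_zeros()
}

/// Hypothetical stand-in for the alignment: 1, 2, 4, 8, then multiples of 8.
fn aligned_width(bits: u32) -> u32 {
    match bits {
        0 | 1 => 1,
        2 => 2,
        3 | 4 => 4,
        _ => (bits + 7) / 8 * 8,
    }
}

/// Largest value after subtracting the minimum (the base) from every element.
fn base_reduced_max(data: &[u64]) -> u64 {
    let min = *data.iter().min().unwrap();
    data.iter().map(|v| v - min).max().unwrap()
}

fn main() {
    // Assumed to be the spec's patched-base example sequence.
    let data: [u64; 20] = [
        2030, 2000, 2020, 1_000_000, 2040, 2050, 2060, 2070, 2080, 2090,
        2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190,
    ];
    let p95_bits = 8; // bit width stated in the quoted spec text
    let max_bits = bits_used(base_reduced_max(&data)); // 998000 needs 20 bits

    // Exact max width gives the patch width the spec expects.
    assert_eq!(max_bits - p95_bits, 12);
    // Aligned max width (20 rounded up to 24) inflates the patch width.
    assert_eq!(aligned_width(max_bits) - p95_bits, 16);
}
```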

Fix

From my understanding of the ORCv2 Specification Draft, I propose the simple fix below, removing the alignment of the bit width of the max value:

- let base_reduced_literals_max_bit_width = max_data_value.closest_aligned_bit_width();
+ let base_reduced_literals_max_bit_width = max_data_value.bits_used();

From what I've tested, this seems to fix the issue detailed above. I still do not understand why this alignment was there in the first place. Maybe the spec changed at some point? I'm still quite new to ORC, so I might have missed certain aspects of the spec on this matter.

To test this fix, I also added 2 tests for the determine_variable_run_encoding function:

one using the example sequence given above
one using the specification's example data

WenyXu · 2025-12-15T13:31:55Z

Hi @baptistemehat, the CI check failed. Please run make fmt to fix the formatting.

Copilot

Pull request overview

This pull request fixes a bug in the RLE v2 patched-base encoding that incorrectly chose patched-base encoding when direct encoding should have been used, resulting in ill-formed sequences with empty patch lists. The fix changes the bit width calculation from using an aligned bit width to using the exact bit width when deciding between encoding strategies.

Changed bit width calculation from closest_aligned_bit_width() to bits_used() for encoding decision logic
Added two tests: one validating the ORC v2 spec example, and one verifying the specific bug fix

src/encoding/integer/rle_v2/mod.rs

Baptiste Mehat added 3 commits December 11, 2025 13:59

added simple test for patched base encoding based on ORC v2 spec

06159e6

fixed computation of max value bit width in patched base encoding

3974efa

added test to check that patched base encoding is not used if bit width is relatively constant

7e8642d

WenyXu requested review from Jefffrey and Copilot and removed request for Jefffrey December 15, 2025 13:29

Copilot started reviewing on behalf of WenyXu December 15, 2025 13:29

fix formatting

0096a12

Copilot AI reviewed Dec 15, 2025

WenyXu requested a review from Jefffrey December 16, 2025 02:52
