Fix: eliminate unnecessary repartitioning for small datasets #19073

shashidhar-bm · 2025-12-03T18:47:18Z

Which issue does this PR close?

Closes Eliminate Repartitioning for Small Datasets #18595

Rationale for this change

Aggregate planning was repartitioning tiny datasets, adding unnecessary cost to lightweight queries.

The planner now uses statistics to keep small inputs in single-partition mode when they have too few rows to justify the overhead of repartitioning.

What changes are included in this PR?

Added the function has_sufficient_rows_for_repartition(...) in datafusion/core/src/physical_planner.rs.
Wired this check into the aggregate planning branch to only build AggregateMode::FinalPartitioned when num_rows >= batch_size.
Extended planner coverage with tests (e.g., hash_agg_small_dataset_single_mode) and dataframe insta snapshots to assert the new single-mode plan and its explain output.
Regenerated all affected SQL logic suites (e.g., aggregate, group_by, tpch) to reflect the planning change.
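The row-count check described above can be sketched roughly as follows. This is a simplified, standalone illustration and not the code from the PR: the real function lives in datafusion/core/src/physical_planner.rs and works on plan statistics, whereas the signature below (an `Option<usize>` row count plus a `batch_size`) is an assumption made for the example. Treating a missing row count as "sufficient" is also an assumption, loosely based on the `Err(_) => return Ok(true)` branch visible in the reviewed diff.

```rust
/// Hypothetical, simplified stand-in for the PR's row-count check.
/// `num_rows` is `None` when the input has no row-count statistic.
fn has_sufficient_rows_for_repartition(num_rows: Option<usize>, batch_size: usize) -> bool {
    match num_rows {
        // Known row count: repartition only if there is at least one full batch of rows.
        Some(rows) => rows >= batch_size,
        // Unknown row count (assumption): keep the previous behavior and repartition.
        None => true,
    }
}
```

With a batch size of 8192 (an example value), a 6-row input would stay unpartitioned while a 10,000-row input would still be repartitioned.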

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.


shashidhar-bm · 2025-12-03T19:06:59Z

Sorry for opening a new PR. My goal was to keep the diff clean and reviewable, so I rebuilt the changes from scratch in this PR.

Copilot

Pull request overview

This PR optimizes aggregate query planning by eliminating unnecessary repartitioning for small datasets that don't have enough rows to benefit from parallel processing overhead.

Key Changes:

Added has_sufficient_rows_for_repartition() function to check if input has enough rows (≥ batch_size) to warrant repartitioning
Modified aggregate planning to skip FinalPartitioned mode for small datasets, using Single or Final mode instead
Updated existing test with larger dataset (10,000 rows) to ensure repartitioning still occurs for large data
Added new test case to verify single-mode execution for small datasets (6 rows)
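One way to picture the mode selection summarized above is the sketch below. It is illustrative only: `choose_aggregate_mode` and its three parameters are hypothetical names, and the rule used to pick Single versus Final (one input partition versus several) is an assumption, since the thread does not spell out that condition.

```rust
/// The aggregate modes named in this PR. The real `AggregateMode` enum in
/// DataFusion has additional variants (for example `Partial`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AggregateMode {
    Single,
    Final,
    FinalPartitioned,
}

/// Hypothetical helper: pick the mode of the last aggregation stage.
fn choose_aggregate_mode(
    input_partitions: usize,
    can_repartition: bool,
    has_sufficient_rows: bool,
) -> AggregateMode {
    if can_repartition && has_sufficient_rows {
        // Large enough input: hash-repartition on the group keys.
        AggregateMode::FinalPartitioned
    } else if input_partitions > 1 {
        // Small input spread over several partitions (assumption): merge, then finalize.
        AggregateMode::Final
    } else {
        // Small input in one partition (assumption): aggregate in a single step.
        AggregateMode::Single
    }
}
```

Under these assumptions, a small single-partition input plans as Single, which matches the `mode: Single` output the updated tests look for.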

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| datafusion/core/src/physical_planner.rs | Core implementation: adds row count check before repartitioning aggregates; updates test dataset size and adds new small dataset test |
| datafusion/core/tests/dataframe/mod.rs | Updates dataframe test snapshots to reflect new single-mode aggregation plans |
| datafusion/sqllogictest/test_files/*.slt | Regenerates expected query plans across multiple test files to show Single/Final modes instead of FinalPartitioned for small datasets |
| datafusion/sqllogictest/test_files/encrypted_parquet.slt | Changes test from expecting error to expecting results with rowsort |
| datafusion/sqllogictest/test_files/clickbench_extended.slt | Updates result ordering for non-deterministic query results |


Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.


alamb

Thanks @ShashidharM0118 -- this is looking good. The basic idea certainly looks right to me.

I have a suggestion for exactly which setting to use. Let me know if that makes sense.

alamb · 2025-12-15T21:10:25Z

datafusion/core/src/physical_planner.rs

+        Err(_) => return Ok(true),
+    };
+
+    if let Some(num_rows) = stats.num_rows.get_value().copied() {


I think there is already a config setting that would be appropriate here (rather than using the batch size): https://datafusion.apache.org/user-guide/configs.html

datafusion.optimizer.repartition_file_min_size

What do you think?

shashidhar-bm · 2025-12-17

Hi @alamb, thanks for the suggestion to use datafusion.optimizer.repartition_file_min_size. It makes more sense to me, so I've switched the heuristic over to be size-based.

I have two concerns:

I've added a follow-up num_rows check that is used only when total_byte_size is missing. In other words, the code first applies the size-based threshold, and only if no size statistics are available does it fall back to num_rows >= batch_size. Does this fallback approach look reasonable?

The hash_agg_group_by_partitioned_on_dicts test in physical_planner.rs (lines 3275–3304) was originally written to assert that a partitioned aggregate is produced on dictionary keys. With the new size-based heuristic, the small in-memory dict dataset no longer triggers repartitioning, so the test now asserts mode: Single, which no longer matches the test name or its original intent.
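The size-first check with a row-count fallback, as described in the comment above, could look roughly like this. It is a sketch under stated assumptions, not the PR's code: `InputStats` and `should_repartition_aggregate` are hypothetical names, the `>=` comparison against the size threshold is assumed, and repartitioning when both statistics are missing is also an assumption.

```rust
/// Hypothetical view of the input statistics; `None` means "not available".
#[derive(Debug, Clone, Copy, Default)]
struct InputStats {
    total_byte_size: Option<usize>,
    num_rows: Option<usize>,
}

/// Hypothetical helper: size threshold first, row-count fallback second.
fn should_repartition_aggregate(
    stats: InputStats,
    repartition_file_min_size: usize,
    batch_size: usize,
) -> bool {
    if let Some(bytes) = stats.total_byte_size {
        // A size statistic is available, so it alone decides.
        return bytes >= repartition_file_min_size;
    }
    match stats.num_rows {
        // No size statistic: fall back to the row-count threshold.
        Some(rows) => rows >= batch_size,
        // Neither statistic is available (assumption): repartition as before.
        None => true,
    }
}
```

Note that in this sketch the row count is never consulted when a size statistic exists, so a small-in-bytes input is kept unpartitioned regardless of how many rows it reports.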

gene-bordegaray · 2025-12-20

Throwing my two cents in here. I think this configuration would be great, since letting users "turn knobs" is a good way to keep DataFusion extensible, and I have experimented with it myself.

I see a use for this configuration in my work, and I think the fallback behavior should not exist alongside the min_size configuration. As a user, if I turn a knob to declare a min_size, I would prefer the planner to stick to that setting without any fallback.

Let me know your thoughts on this.

shashidhar-bm · 2025-12-24

Understood, I removed the fallback.


shashidhar-bm added 5 commits December 3, 2025 18:59

Skip aggregate repartitioning for small datasets

8ef1c92

Update aggregate/pushdown sqllogictest snapshots for small‑dataset re…

b898f0f

…partition

Fix unit test to look for mode: Single in plan output

ef1f69e

Refresh sqllogictest snapshots for small-aggregate single-mode plans

30840df

Stabilize encrypted parquet SQ logic test output

2fe6490

github-actions bot added the core and sqllogictest labels Dec 3, 2025

shashidhar-bm changed the title ~~Fix/eliminate repartition small datasets revised~~ Fix: eliminate unnecessary repartitioning for small datasets Dec 3, 2025

shashidhar-bm marked this pull request as ready for review December 3, 2025 19:07

Copilot AI review requested due to automatic review settings December 3, 2025 19:07

Copilot started reviewing on behalf of shashidhar-bm December 3, 2025 19:07

Copilot finished reviewing on behalf of shashidhar-bm December 3, 2025 19:09

Copilot AI reviewed Dec 3, 2025

datafusion/sqllogictest/test_files/clickbench_extended.slt (resolved)

datafusion/sqllogictest/test_files/encrypted_parquet.slt (outdated, resolved)

shashidhar-bm added 3 commits December 5, 2025 22:51

Merge remote-tracking branch 'upstream/main' into fix/eliminate-repar…

5682df3

…tition-small-datasets-revised

Add sqllogictest coverage for encrypted Parquet read/write

2ab3b8f

Merge remote-tracking branch 'upstream/main' into fix/eliminate-repar…

03d3ef5

…tition-small-datasets-revised

shashidhar-bm mentioned this pull request Dec 7, 2025

Eliminate Repartitioning for Small Datasets #18595

Open

shashidhar-bm requested a review from Copilot December 15, 2025 18:07

Copilot AI reviewed Dec 15, 2025

datafusion/core/src/physical_planner.rs (resolved)

datafusion/core/src/physical_planner.rs (outdated, resolved)

alamb reviewed Dec 15, 2025


shashidhar-bm added 5 commits December 16, 2025 11:40

Merge branch 'main' into fix/eliminate-repartition-small-datasets-rev…

6cfbce3

…ised

refresh dataframe snapshot after aggregate repartition change

dab6b4f

Merge remote-tracking branch 'origin/main' into fix/eliminate-reparti…

830ecbf

…tion-small-datasets-revised

Update aggregate plans and snapshots

3881a7f

Experimentally use repartition_file_min_size for aggregate repartitio…

2bfc4c4

…ning and update plans

github-actions bot added the execution label Dec 17, 2025

Rename helper to reflect size-based aggregate repartition heuristic a…

e5695cc

…nd fix formatting

shashidhar-bm requested a review from alamb December 17, 2025 10:54

Merge branch 'main' of https://github.com/apache/datafusion into fix/…

dfe0f43

…eliminate-repartition-small-datasets-revised

shashidhar-bm added 2 commits December 24, 2025 22:53

remove fallback condition

da26a52

clippy

c64fe55

shashidhar-bm requested a review from gene-bordegaray December 30, 2025 03:34
