
Conversation

@ctsk (Contributor) commented Mar 28, 2025

Relates to Issue: #15478

Rationale for this change

Blocking operators (the HashJoin build side, aggregations) are often planned on top of a RepartitionExec with a CoalesceBatchesExec in between. However, one of the first things these operators do is concatenate the freshly coalesced batches.
This PR tests whether the overhead of the two-step coalesce + concat outweighs the gains from fewer dispatches of the consuming operators.
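For illustration, the plan shape in question typically looks like the following abbreviated fragment (a hypothetical, simplified EXPLAIN output with options elided, not taken from this PR):

```text
AggregateExec: mode=FinalPartitioned, gby=[...], aggr=[...]
  CoalesceBatchesExec: target_batch_size=8192
    RepartitionExec: partitioning=Hash([...], 16)
      AggregateExec: mode=Partial, gby=[...], aggr=[...]
        ...
```

The final aggregation buffers (and effectively concatenates) its input anyway, so the CoalesceBatchesExec in the middle may only add copies without saving any work.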

What changes are included in this PR?

This PR adds a physical optimizer rule, UncoalesceBatches. It runs after the CoalesceBatches rule and removes CoalesceBatchesExec nodes that sit on the build side of HashJoins and in front of non-partial aggregations.
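For context, a minimal sketch of what such a rule can look like follows. This is not the code from this PR: it only covers the aggregation case, ignores the HashJoin build side, and assumes recent DataFusion trait signatures (children() returning references, with_new_children() taking an Arc receiver), which vary between versions.

```rust
use std::sync::Arc;

use datafusion::common::Result;
use datafusion::config::ConfigOptions;
use datafusion::physical_optimizer::PhysicalOptimizerRule;
use datafusion::physical_plan::{
    aggregates::{AggregateExec, AggregateMode},
    coalesce_batches::CoalesceBatchesExec,
    ExecutionPlan,
};

/// Sketch of a rule that removes a `CoalesceBatchesExec` sitting directly
/// under a non-partial `AggregateExec`.
#[derive(Default, Debug)]
pub struct UncoalesceBatches {}

impl PhysicalOptimizerRule for UncoalesceBatches {
    fn optimize(
        &self,
        plan: Arc<dyn ExecutionPlan>,
        _config: &ConfigOptions,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        rewrite(plan)
    }

    fn name(&self) -> &str {
        "UncoalesceBatches"
    }

    fn schema_check(&self) -> bool {
        // Dropping a CoalesceBatchesExec does not change the schema.
        true
    }
}

fn rewrite(plan: Arc<dyn ExecutionPlan>) -> Result<Arc<dyn ExecutionPlan>> {
    // Rewrite the children first (bottom-up traversal).
    let children = plan
        .children()
        .into_iter()
        .map(|child| rewrite(Arc::clone(child)))
        .collect::<Result<Vec<_>>>()?;
    let plan = if children.is_empty() {
        plan
    } else {
        plan.with_new_children(children)?
    };

    // If a non-partial aggregation is fed by a CoalesceBatchesExec, bypass the
    // coalesce and wire the aggregation directly to the coalesce's input.
    if let Some(agg) = plan.as_any().downcast_ref::<AggregateExec>() {
        if !matches!(agg.mode(), AggregateMode::Partial) {
            if let Some(coalesce) =
                agg.input().as_any().downcast_ref::<CoalesceBatchesExec>()
            {
                let new_input = Arc::clone(coalesce.input());
                return plan.clone().with_new_children(vec![new_input]);
            }
        }
    }
    Ok(plan)
}
```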

Are these changes tested?

Not yet!

Are there any user-facing changes?

Yes.

ctsk added 2 commits March 28, 2025 15:06
The partial aggregation can switch to a pass-through mode. In this case, coalescing might make sense
@Dandandan (Contributor)

That makes a lot of sense!


/// Remove CoalesceBatchesExec that are in front of an AggregateExec
#[derive(Default, Debug)]
pub struct UnCoalesceBatches {}
Contributor:

I am wondering if we instead can avoid adding them in the CoalesceBatches optimizer

Contributor Author:

Certainly something to attempt. I haven't done it (yet) because it isn't necessary for evaluating the impact of this change.

@berkaysynnada (Contributor)

I think you can generalize this logic by tracking the ExecutionPlanProperties::pipeline_behavior() of operators in the plan.

@alamb (Contributor) commented Mar 30, 2025

BTW thank you very much @ctsk -- it is really cool to see the joins get some careful love and attention ❤️

@Rachelint (Contributor)

Here are my results after removing CoalesceBatchesExec for Aggregate:

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ remove-coalesc-test ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.58ms │              0.59ms │     no change │
│ QQuery 1     │    69.10ms │             67.52ms │     no change │
│ QQuery 2     │   164.32ms │            162.57ms │     no change │
│ QQuery 3     │   174.68ms │            173.55ms │     no change │
│ QQuery 4     │  1530.10ms │           1486.31ms │     no change │
│ QQuery 5     │  1438.23ms │           1390.79ms │     no change │
│ QQuery 6     │    67.68ms │             68.17ms │     no change │
│ QQuery 7     │    78.59ms │             80.99ms │     no change │
│ QQuery 8     │  1646.02ms │           1594.09ms │     no change │
│ QQuery 9     │  1823.36ms │           1806.96ms │     no change │
│ QQuery 10    │   464.34ms │            443.72ms │     no change │
│ QQuery 11    │   521.13ms │            510.05ms │     no change │
│ QQuery 12    │  1606.39ms │           1559.59ms │     no change │
│ QQuery 13    │  2578.92ms │           2425.36ms │ +1.06x faster │
│ QQuery 14    │  1650.71ms │           1584.42ms │     no change │
│ QQuery 15    │  1807.01ms │           1756.57ms │     no change │
│ QQuery 16    │  3430.08ms │           3226.74ms │ +1.06x faster │
│ QQuery 17    │  3177.38ms │           2923.88ms │ +1.09x faster │
│ QQuery 18    │  7348.28ms │           6759.03ms │ +1.09x faster │
│ QQuery 19    │   145.29ms │            142.43ms │     no change │
│ QQuery 20    │  2650.47ms │           2652.96ms │     no change │
│ QQuery 21    │  3416.49ms │           3382.16ms │     no change │
│ QQuery 22    │  8195.70ms │           8383.83ms │     no change │
│ QQuery 23    │ 21618.48ms │          21754.72ms │     no change │
│ QQuery 24    │   997.94ms │            999.24ms │     no change │
│ QQuery 25    │   908.12ms │            885.02ms │     no change │
│ QQuery 26    │  1168.94ms │           1169.06ms │     no change │
│ QQuery 27    │  3827.54ms │           3838.34ms │     no change │
│ QQuery 28    │ 22386.67ms │          22554.63ms │     no change │
│ QQuery 29    │   910.11ms │            910.88ms │     no change │
│ QQuery 30    │  1633.75ms │           1606.75ms │     no change │
│ QQuery 31    │  1876.00ms │           1838.23ms │     no change │
│ QQuery 32    │  7765.63ms │           7456.08ms │     no change │
│ QQuery 33    │  7439.70ms │           7022.39ms │ +1.06x faster │
│ QQuery 34    │  7414.28ms │           7033.60ms │ +1.05x faster │
│ QQuery 35    │  2212.33ms │           2106.79ms │     no change │
│ QQuery 36    │   220.97ms │            204.14ms │ +1.08x faster │
│ QQuery 37    │   142.39ms │            141.67ms │     no change │
│ QQuery 38    │   136.95ms │            134.41ms │     no change │
│ QQuery 39    │   388.01ms │            373.10ms │     no change │
│ QQuery 40    │    56.90ms │             57.45ms │     no change │
│ QQuery 41    │    54.00ms │             53.49ms │     no change │
│ QQuery 42    │    61.03ms │             60.48ms │     no change │
└──────────────┴────────────┴─────────────────────┴───────────────┘

@berkaysynnada (Contributor)

> I think you can generalize this logic by tracking the ExecutionPlanProperties::pipeline_behavior() of operators in the plan.

I can give some guidance on that, by the way, if you'd like.

ctsk marked this pull request as draft April 7, 2025 10:37
@ctsk (Contributor Author) commented Apr 7, 2025

Marking as draft to signify that the implementation is not finished yet.

Thanks for the benchmark @Rachelint. Did you use the implementation in this PR or write your own? Overall the impact seems small (still good for how little it takes to implement!). I think it's worth investigating if a Coalesce with a different threshold makes sense.

Thanks for the offer @berkaysynnada. I think I know what to do, but haven't found the time the past few days. I'll give it a try later today.

@alamb (Contributor) commented Apr 7, 2025

> Thanks for the benchmark @Rachelint. Did you use the implementation in this PR or write your own? Overall the impact seems small (still good for how little it takes to implement!). I think it's worth investigating if a Coalesce with a different threshold makes sense.

These look to me like ClickBench results from bench.sh: https://github.com/apache/datafusion/blob/main/benchmarks/README.md

@ctsk (Contributor Author) commented Apr 7, 2025

@berkaysynnada I had a look at ExecutionPlanProperties::pipeline_behavior(). I think it is not quite what I want here: For the HashJoin, I want to remove the coalesce on the build side, but keep it on the probe side. The pipeline behaviour doesn't tell me which child is processed batch-wise, and which child is processed incrementally.

I could add a blanket rule for other plans - potentially outside of the datafusion repo - that removes the coalesce for each child of a plan that does not have EmissionType::Incremental (sketched below). Unfortunately this does not cover the rule for the aggregation: here I purposefully kept the CoalesceBatchesExec underneath partial aggregations, because those can switch to passing batches through without aggregating.
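A minimal sketch of that blanket variant, under the assumption that ExecutionPlanProperties is implemented for Arc<dyn ExecutionPlan> and that EmissionType lives in the execution_plan module (paths and signatures may differ between DataFusion versions):

```rust
use std::sync::Arc;

use datafusion::common::Result;
use datafusion::physical_plan::{
    coalesce_batches::CoalesceBatchesExec, execution_plan::EmissionType, ExecutionPlan,
    ExecutionPlanProperties,
};

/// Bypass any CoalesceBatchesExec directly below an operator that does not
/// emit results incrementally (i.e. one that buffers its input anyway).
fn uncoalesce_children(plan: Arc<dyn ExecutionPlan>) -> Result<Arc<dyn ExecutionPlan>> {
    // Incremental operators benefit from larger input batches; leave them alone.
    if matches!(plan.pipeline_behavior(), EmissionType::Incremental) {
        return Ok(plan);
    }

    let new_children: Vec<Arc<dyn ExecutionPlan>> = plan
        .children()
        .into_iter()
        .map(|child| match child.as_any().downcast_ref::<CoalesceBatchesExec>() {
            // Skip the coalesce and feed its input to the operator directly.
            Some(coalesce) => Arc::clone(coalesce.input()),
            None => Arc::clone(child),
        })
        .collect();

    if new_children.is_empty() {
        Ok(plan)
    } else {
        plan.with_new_children(new_children)
    }
}
```

As noted above, this treats all children alike, so it cannot keep the coalesce on the probe side of a join, nor handle the partial-aggregation exception.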

So far, this PR is a lot of reasoning about what might make sense, but in the end it comes down to measuring the impact for each operator. I plan on trying different coalesce thresholds tomorrow and seeing what works best.

@berkaysynnada (Contributor)

> @berkaysynnada I had a look at ExecutionPlanProperties::pipeline_behavior(). I think it is not quite what I want here: For the HashJoin, I want to remove the coalesce on the build side, but keep it on the probe side. The pipeline behaviour doesn't tell me which child is processed batch-wise, and which child is processed incrementally.

I understand why it doesn't fit in this use case.

Maybe we should have another API for operators, like pipeline_behavior: accumulate_input_batches(&self) -> Vec<bool>? HashJoin would implement it as vec![true, false], SortExec as [true], AggregateExec as [true] if its input is not ordered on the group-by keys, FilterExec as [false], etc. WDYT? Would that be over-engineering, or would it reflect the behaviors better? Maybe we can utilize it in other places as well, where we currently downcast operators and check the type (a rough sketch follows).
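A rough sketch of what that could look like. The trait name, method, and per-operator return values are hypothetical (they mirror the proposal above) and are not an existing DataFusion API; the real method would live on ExecutionPlan or ExecutionPlanProperties:

```rust
/// Hypothetical extension trait: for each child, report whether the operator
/// accumulates (buffers) that child's batches before producing output.
pub trait AccumulateInputBatches {
    fn accumulate_input_batches(&self) -> Vec<bool>;
}

// Stand-in operator types for illustration only; the real implementations
// would be on HashJoinExec, SortExec, AggregateExec, FilterExec, etc.
// (AggregateExec would return [true] only when its input is not ordered on
// the group-by keys.)
struct HashJoinLike;
struct SortLike;
struct FilterLike;

impl AccumulateInputBatches for HashJoinLike {
    fn accumulate_input_batches(&self) -> Vec<bool> {
        // Build (left) side is fully buffered, probe (right) side is streamed.
        vec![true, false]
    }
}

impl AccumulateInputBatches for SortLike {
    fn accumulate_input_batches(&self) -> Vec<bool> {
        // Sorting buffers its entire input.
        vec![true]
    }
}

impl AccumulateInputBatches for FilterLike {
    fn accumulate_input_batches(&self) -> Vec<bool> {
        // Filtering is purely streaming.
        vec![false]
    }
}
```

An optimizer rule could then decide per child whether a CoalesceBatchesExec in front of it pays off, instead of downcasting to concrete operator types.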

@Dandandan (Contributor) commented Apr 18, 2025

I think this is a nice experiment.
That said, I think a better approach would be to change the build side of the join to use Vec<RecordBatch> instead of concatenating into a single RecordBatch.
I remember we (I) changed it to concatenate all build batches into one (to improve performance back then), but it would be preferable not to concatenate everything into one batch.
One downside of doing that (concatenating everything) is that we can't load more than 4 GiB of Utf8 columns on the left side; it will fail with overflowing offsets.
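A purely illustrative sketch of the difference; the type and field names here are made up and do not match DataFusion's actual build-side state:

```rust
use arrow::record_batch::RecordBatch;

/// Today (simplified): the whole build side is concatenated into one batch, so
/// a row can be addressed by a single row index, but string offsets can
/// overflow once the concatenated Utf8 data gets too large.
struct ConcatenatedBuildSide {
    batch: RecordBatch,
}

/// Suggested alternative: keep the original batches and address build-side
/// rows by (batch_index, row_index) in the hash table, avoiding the giant
/// concatenation.
struct ChunkedBuildSide {
    batches: Vec<RecordBatch>,
}
```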

@ctsk (Contributor Author) commented Apr 20, 2025

Sorry for the silence. I've been extensively benchmarking this PR and the results have been fairly mixed. I've also tried different thresholds for coalescing. I plan on generating tables and pushing the results later today.

@ctsk (Contributor Author) commented Apr 20, 2025

benchmarks/hashagg-results.md
benchmarks/join-results.md
benchmarks/sort-results.md

I've checked in the results because I think they would be too large to include as a comment.

Each file contains the results of reducing the coalesce threshold for a single operator - joins, hash aggregations, and sorts - while coalescing before all other operators remains unchanged. The number in each configuration name gives the coalesce threshold: SORT0 means the CoalesceBatches operators in front of sorts were removed entirely, whereas SORT256 means the CoalesceBatchesExec in front of a sort was configured to emit a batch once it had 256 rows buffered. The same applies to joins and hash aggregations.
The CHANGE column shows the relative change of the column to its right compared to the base column (the baseline when this PR branched off main).

The benchmarks were run with 16 target partitions. I suspect that the more target partitions there are, the smaller the batches produced by RepartitionExec become. Therefore, removing coalesce might work better with smaller target partition counts (for hash aggregation and joins).

ctsk closed this May 5, 2025
Labels: optimizer (Optimizer rules)