[SPARK-48949][SQL] SPJ: Runtime partition filtering #47426

szehon-ho · 2024-07-19T21:45:41Z

What changes were proposed in this pull request?

Introduce runtime partition filtering for storage-partitioned join (SPJ). During planning, we already have the list of partition values on both sides in order to plan the tasks, so we can filter out partition values based on the join type.

Currently, the LEFT OUTER, RIGHT OUTER, and INNER join types are supported because they are the most common; other join types can be optimized in a subsequent PR.
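
As an illustration of the idea only, here is a minimal, Spark-free sketch of how the set of partition values to plan could depend on the join type. The names `JoinKind`, `PartitionFilterSketch`, and `partitionsToScan` are hypothetical and are not taken from this PR; the fallback to a union for unsupported join types is an assumption based on the `leftPartitionSet.union(rightPartitionSet)` line visible in the diff.

```scala
// Hypothetical model of join-type-based partition filtering.
// This is NOT the code from the PR; it only mirrors the description.
sealed trait JoinKind
case object Inner extends JoinKind
case object LeftOuter extends JoinKind
case object RightOuter extends JoinKind
case object FullOuter extends JoinKind

object PartitionFilterSketch {
  // Given the partition values present on each side, return the partition
  // values for which tasks still need to be planned.
  def partitionsToScan[K](left: Set[K], right: Set[K], kind: JoinKind): Set[K] =
    kind match {
      // Inner join: a partition value missing on either side yields no rows.
      case Inner => left.intersect(right)
      // Left outer join: every left partition is kept; right-only ones are dropped.
      case LeftOuter => left
      // Right outer join: the mirror image of the left outer case.
      case RightOuter => right
      // Assumed fallback for join types that are not optimized: keep everything.
      case FullOuter => left.union(right)
    }
}
```

With a small left side holding partitions `Set(1, 2)` and a large right side holding `(1 to 100).toSet`, the inner-join case keeps only `Set(1, 2)`, so the remaining 98 partitions of the large side need not be scanned.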

Why are the changes needed?

For some common join types (INNER, LEFT, RIGHT), we have an opportunity to greatly reduce the data scanned in SPJ. For example, when a small table is joined to a larger table by partition key, most of the partitions of the large table can be pruned.

This is similar in concept to dynamic partition pruning (DPP), but DPP relies on heuristics, whereas this approach is exact because SPJ planning already requires listing the partition values of both sides.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests in KeyGroupedPartitioningSuite.

szehon-ho · 2024-07-20T02:13:55Z

@sunchao can you take a look when you get a chance? Thanks

sunchao

LGTM - sorry for the late review @szehon-ho

sunchao · 2024-08-02T23:10:17Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/util/InternalRowComparableWrapper.scala

+      leftPartitionSet.union(rightPartitionSet)
    }
-    partitionsSet.map(_.row).toSeq.sorted(partitionOrdering)
+    result.toSeq


do we still need to sort the result partitions?

ah I see it is sorted later in the other method now
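
To make the point of this thread concrete, here is a small sketch, under the assumption that merging and ordering are two separate steps: the merge returns distinct keys in no particular order, and a later step sorts them once. The names `MergeThenSortSketch`, `mergeKeys`, and `planOrder` are hypothetical and do not correspond to methods in Spark.

```scala
// Hypothetical illustration: deduplicate first, sort once later.
object MergeThenSortSketch {
  // Merge step: returns the distinct keys of both sides; the order of the
  // result is unspecified because it goes through a Set.
  def mergeKeys[K](left: Seq[K], right: Seq[K]): Seq[K] = {
    val merged: Set[K] = left.toSet ++ right
    merged.toSeq
  }

  // Later step: impose the ordering once, just before it is needed.
  def planOrder[K](keys: Seq[K])(implicit ord: Ordering[K]): Seq[K] =
    keys.sorted
}
```

Sorting only at the end avoids ordering the same keys twice, as long as no caller in between relies on the order.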

dongjoon-hyun

+1, LGTM. Thank you, @szehon-ho and @cloud-fan .
Merged to master.

dongjoon-hyun · 2024-08-04T20:12:33Z

Also, cc @cloud-fan , @viirya , too.

szehon-ho · 2024-08-05T18:01:10Z

Thank you @sunchao and @dongjoon-hyun for quick review!

viirya · 2024-08-05T19:57:11Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

+          leftReducers)
+        val rightReducers = rightSpec.reducers(leftSpec)
+        val rightParts = reducePartValues(rightSpec.partitioning.partitionValues,
+          partitionExprs,


`partitionExprs` comes from the left spec, but here it is used to reduce the right spec's partition values. The two specs are compatible, but does that guarantee that the right spec's partition expressions have the same data types as the left spec's?

By definition, compatible partition expressions satisfy r(t1(x)) = t2(x) or r(t2(x)) = t1(x). But t1 and t2 can still have different data types, can't they?

Compatibility only requires that the output of r have the same data type as the other side, i.e., r(t1(x)) matches t2(x), or r(t2(x)) matches t1(x).

Yes, you may be right. Let me double-check this with a test and get back to you.
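
The concern raised in this thread can be sketched with a toy example. The transforms `t1` and `t2` and the reducer `r` below are hypothetical stand-ins chosen for illustration (a 4-bucket and a 2-bucket function), not Spark's bucket transforms: they satisfy r(t1(x)) = t2(x), yet `t1` produces a `Long` while `t2` produces an `Int`.

```scala
// Hypothetical transforms showing that "compatible" does not imply that
// both sides produce partition values of the same data type.
object ReducerTypesSketch {
  // Left transform: 4 buckets, emitted as Long.
  def t1(x: Int): Long = Math.floorMod(x, 4).toLong

  // Right transform: 2 buckets, emitted as Int.
  def t2(x: Int): Int = Math.floorMod(x, 2)

  // Reducer from the left's value space into the right's value space.
  // Only its OUTPUT type has to match t2; its input type follows t1.
  def r(v: Long): Int = (v % 2).toInt

  // Checks the compatibility condition r(t1(x)) == t2(x) on sample inputs.
  def compatibleOn(xs: Seq[Int]): Boolean =
    xs.forall(x => r(t1(x)) == t2(x))
}
```

Here `compatibleOn(-10 to 10)` holds, but a value produced by `t1` is a `Long` and a value produced by `t2` is an `Int`, so code that interprets one side's partition values using the other side's expression types would be relying on an extra assumption.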

### What changes were proposed in this pull request?

This PR aims to regenerate benchmark results (except `ExternalAppendOnlyUnsafeRowArrayBenchmark`) as a preparation for Apache Spark 4.0.0-preview2.

- During the testing, it's observed that `ExternalAppendOnlyUnsafeRowArrayBenchmark` hangs in both CI and local environment. SPARK-49228 is filed for its investigation.
- In addition, `Storage Partition Join`-related benchmark are generated for the following commits.
  - #46265
  - #47426

### Why are the changes needed?

To check the performance regression.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is generated by

- https://github.com/dongjoon-hyun/spark/actions/runs/10364365815 (Java 17)
- https://github.com/dongjoon-hyun/spark/actions/runs/10364368441 (Java 21)

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47743 from dongjoon-hyun/SPARK-49224.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Closes apache#47426 from szehon-ho/spj_partition_filter.

Authored-by: Szehon Ho <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

github-actions bot added the SQL label Jul 19, 2024

szehon-ho force-pushed the spj_partition_filter branch from 8ebd413 to b9d8855 Compare July 19, 2024 21:55

szehon-ho force-pushed the spj_partition_filter branch from b9d8855 to 7fd0d08 Compare July 19, 2024 23:40

sunchao approved these changes Aug 2, 2024

dongjoon-hyun approved these changes Aug 4, 2024

dongjoon-hyun closed this in dbba92a Aug 4, 2024

viirya reviewed Aug 5, 2024

dongjoon-hyun mentioned this pull request Aug 13, 2024

[SPARK-49224][TESTS] Regenerate benchmark results #47743

Closed
