Commit 5b97165
committed
[SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning
### What changes were proposed in this pull request?
Remove `OptimizeSubqueries` from batch of `PartitionPruning` to make DPP support more cases. For example:
```sql
SELECT date_id, product_id FROM fact_sk f
JOIN (select store_id + 3 as new_store_id from dim_store where country = 'US') s
ON f.store_id = s.new_store_id
```
Before this PR:
```
== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
:- *(2) ColumnarToRow
: +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(true)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#274]
+- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
+- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
+- *(1) ColumnarToRow
+- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
```
After this PR:
```
== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
:- *(2) ColumnarToRow
: +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN dynamicpruning#4007)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
: +- SubqueryBroadcast dynamicpruning#4007, 0, [new_store_id#3997], [id=#263]
: +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262]
: +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
: +- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
: +- *(1) ColumnarToRow
: +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
+- ReusedExchange [new_store_id#3997], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#262]
```
This is because `OptimizeSubqueries` will infer more filters, so we cannot reuse broadcasts. The following is the plan if disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`:
```
== Physical Plan ==
*(2) Project [date_id#3998, product_id#3999]
+- *(2) BroadcastHashJoin [store_id#4001], [new_store_id#3997], Inner, BuildRight, false
:- *(2) ColumnarToRow
: +- FileScan parquet default.fact_sk[date_id#3998,product_id#3999,store_id#4001] Batched: true, DataFilters: [], Format: Parquet, PartitionFilters: [isnotnull(store_id#4001), dynamicpruningexpression(store_id#4001 IN subquery#4009)], PushedFilters: [], ReadSchema: struct<date_id:int,product_id:int>
: +- Subquery subquery#4009, [id=#284]
: +- *(2) HashAggregate(keys=[new_store_id#3997#4008], functions=[])
: +- Exchange hashpartitioning(new_store_id#3997#4008, 5), ENSURE_REQUIREMENTS, [id=#280]
: +- *(1) HashAggregate(keys=[new_store_id#3997 AS new_store_id#3997#4008], functions=[])
: +- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
: +- *(1) Filter (((isnotnull(store_id#4002) AND isnotnull(country#4004)) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
: +- *(1) ColumnarToRow
: +- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(store_id#4002), isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002..., Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(store_id), IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#305]
+- *(1) Project [(store_id#4002 + 3) AS new_store_id#3997]
+- *(1) Filter ((isnotnull(country#4004) AND (country#4004 = US)) AND isnotnull((store_id#4002 + 3)))
+- *(1) ColumnarToRow
+- FileScan parquet default.dim_store[store_id#4002,country#4004] Batched: true, DataFilters: [isnotnull(country#4004), (country#4004 = US), isnotnull((store_id#4002 + 3))], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(country), EqualTo(country,US)], ReadSchema: struct<store_id:int,country:string>
```
### Why are the changes needed?
Improve DPP to support more cases.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test and benchmark test:
SQL | Before this PR(Seconds) | After this PR(Seconds)
-- | -- | --
TPC-DS q58 | 40 | 20
TPC-DS q83 | 18 | 14
Closes #33664 from wangyum/SPARK-36444.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(cherry picked from commit 2310b99)
Signed-off-by: Yuming Wang <yumwang@ebay.com>1 parent 9544c24 commit 5b97165
File tree
10 files changed
+1235
-1156
lines changed- sql/core/src
- main/scala/org/apache/spark/sql/execution
- test
- resources/tpcds-plan-stability/approved-plans-v1_4
- q58.sf100
- q58
- q83.sf100
- q83
- scala/org/apache/spark/sql
10 files changed
+1235
-1156
lines changedLines changed: 1 addition & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
42 | 42 | | |
43 | 43 | | |
44 | 44 | | |
45 | | - | |
46 | | - | |
| 45 | + | |
47 | 46 | | |
48 | 47 | | |
49 | 48 | | |
| |||
Lines changed: 289 additions & 278 deletions
Large diffs are not rendered by default.
Lines changed: 24 additions & 20 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
18 | 18 | | |
19 | 19 | | |
20 | 20 | | |
21 | | - | |
22 | | - | |
23 | | - | |
24 | | - | |
25 | | - | |
26 | | - | |
27 | | - | |
28 | | - | |
29 | | - | |
30 | | - | |
31 | | - | |
32 | | - | |
33 | | - | |
34 | | - | |
35 | | - | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
36 | 32 | | |
37 | | - | |
38 | | - | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
39 | 42 | | |
40 | 43 | | |
41 | 44 | | |
42 | | - | |
43 | | - | |
44 | | - | |
| 45 | + | |
| 46 | + | |
45 | 47 | | |
46 | 48 | | |
47 | 49 | | |
| |||
66 | 68 | | |
67 | 69 | | |
68 | 70 | | |
| 71 | + | |
69 | 72 | | |
70 | 73 | | |
71 | 74 | | |
| |||
87 | 90 | | |
88 | 91 | | |
89 | 92 | | |
| 93 | + | |
90 | 94 | | |
91 | 95 | | |
92 | 96 | | |
| |||
0 commit comments