[pull] master from apache:master #19
…d query stages

### What changes were proposed in this pull request?
AQE uses statistics from completed query stages and feeds them back into the logical optimizer. AQE currently only uses `dataSize` and `numOutputRows` and ignores any available `attributeMap` (column statistics). This PR updates AQE to also populate `attributeMap` in the statistics that it uses for re-optimizing the plan.

### Why are the changes needed?
These changes are needed so that Spark plugins that provide custom implementations of the `ShuffleExchangeLike` trait can leverage column statistics for better plan optimization during AQE execution.

### Does this PR introduce _any_ user-facing change?
No. The current Spark implementation of `ShuffleExchangeLike` (`ShuffleExchangeExec`) does not populate `attributeMap`, so this PR is a no-op for regular Spark.

### How was this patch tested?
New unit test added.

Closes #37424 from andygrove/aqe-column-stats.

Authored-by: Andy Grove <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
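For context, a minimal sketch of what a plugin's `ShuffleExchangeLike` implementation could now return from `runtimeStatistics`. This is a fragment under stated assumptions, not code from the PR: the metric names mirror `ShuffleExchangeExec`, and the per-column values are hypothetical placeholders a plugin would compute from its own shuffle.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeMap
import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Statistics}

// Fragment inside a custom ShuffleExchangeLike implementation (not a
// complete operator). With this PR, AQE propagates attributeStats in
// addition to sizeInBytes and rowCount when re-optimizing the plan.
override def runtimeStatistics: Statistics = {
  val rowCount = metrics("numOutputRows").value
  // Hypothetical column statistics gathered by the plugin's shuffle;
  // a real plugin would compute these from the shuffled data.
  val colStats = AttributeMap(output.map { attr =>
    attr -> ColumnStat(distinctCount = Some(BigInt(rowCount)))
  })
  Statistics(
    sizeInBytes = metrics("dataSize").value,
    rowCount = Some(BigInt(rowCount)),
    attributeStats = colStats)
}
```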
…le DSv1 is not available

### What changes were proposed in this pull request?
Improve the error message when DSv2 is disabled while its fallback DSv1 is not available.

### Why are the changes needed?
Improve the user experience. When users get an unsupported-operation error for the disabled DSv2 source, they can tell which config to modify to enable the V2 source.

### Does this PR introduce _any_ user-facing change?
Yes, the error message changes.

### How was this patch tested?
N/A, just a message change.

Closes #37917 from huanliwang-db/SPARK-40466.

Authored-by: Huanli Wang <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
### What changes were proposed in this pull request?
This PR proposes to document `datetime.timedelta` support in PySpark on the SQL DataType reference page. This support was added in SPARK-37275.

### Why are the changes needed?
To show the support of `datetime.timedelta`.

### Does this PR introduce _any_ user-facing change?
Yes, this fixes the documentation.

### How was this patch tested?
CI in this PR should validate the build.

Closes #37939 from HyukjinKwon/minor-daytimeinterval.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…rrwith`
### What changes were proposed in this pull request?
1. Extract the computation of `DataFrame.corr` into `correlation.py`, so it can be reused in `DataFrame.corrwith`/`DataFrameGroupBy.corrwith`/`DataFrameGroupBy.corr`/etc.;
2. Implement `spearman` and `kendall` in `DataFrame.corrwith`;
3. Add parameter `axis` in `DataFrame.corrwith`.
### Why are the changes needed?
For API coverage
```
In [1]: import pyspark.pandas as ps
In [2]: df1 = ps.DataFrame({ "A":[1, 5, 7, 8], "X":[5, 8, 4, 3], "C":[10, 4, 9, 3]})
In [3]: df2 = ps.DataFrame({ "A":[5, 3, 6, 4], "B":[11, 2, 4, 3], "C":[4, 3, 8, 5]})
In [4]: ps.set_option("compute.ops_on_diff_frames", True)
In [5]: df1.corrwith(df2, method="kendall").sort_index()
A 0.0
B NaN
C 0.0
X NaN
dtype: float64
In [6]: df1.to_pandas().corrwith(df2.to_pandas(), method="kendall").sort_index()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
warnings.warn(message, PandasAPIOnSparkAdviceWarning)
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[6]:
A 0.0
B NaN
C 0.0
X NaN
dtype: float64
In [7]: df1.corrwith(df2.B, method="spearman").sort_index()
Out[7]:
A -0.4
C 0.8
X -0.2
dtype: float64
In [8]: df1.to_pandas().corrwith(df2.B.to_pandas(), method="spearman").sort_index()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
warnings.warn(message, PandasAPIOnSparkAdviceWarning)
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas Series is expected to be small.
warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[8]:
A -0.4
C 0.8
X -0.2
dtype: float64
```
### Does this PR introduce _any_ user-facing change?
Yes, new correlation methods are supported.
### How was this patch tested?
Added unit tests.
Closes #37929 from zhengruifeng/ps_corrwith_spearman_kendall.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
… unavailable

### What changes were proposed in this pull request?
This PR is a follow-up for #37873. The UDAF test should be skipped with a proper message when pyspark, pandas, and/or pyarrow are unavailable.

### Why are the changes needed?
To skip the test properly when its dependencies are unavailable.

### Does this PR introduce _any_ user-facing change?
No, it's test-only.

### How was this patch tested?
Manually tested that the test is skipped when a package is missing.

Closes #37946 from itholic/SPARK-40419-followup.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
pull bot pushed a commit that referenced this pull request on Aug 13, 2024:
…eption

### What changes were proposed in this pull request?
This PR reworks group-by on map type to fix two issues:
- A "can not bind reference" exception at runtime, since the attribute was wrapped by `MapSort` and we did not transform the plan with the new output.
- The rule that adds `MapSort` should run before `PullOutGroupingExpressions`, to avoid complex expressions remaining in grouping keys.

### Why are the changes needed?
To fix these issues. For example:

```
select map(1, id) from range(10) group by map(1, id);

[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)
```

### Does this PR introduce _any_ user-facing change?
No, the bug is not in a released version.

### How was this patch tested?
Improved the tests to cover more cases.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47545 from ulysses-you/maptype.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: youxiduo <[email protected]>
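As a quick sanity check, a hedged spark-shell sketch of the failing query from the description; with the fix applied it should group correctly instead of raising the internal bind-reference error:

```scala
// The query from the PR description: grouping by a map-typed key.
// Before this fix it failed with "Couldn't find _groupingexpression...";
// with MapSort added before PullOutGroupingExpressions it runs cleanly.
spark.sql("select map(1, id) from range(10) group by map(1, id)").show()
```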
pull bot pushed a commit that referenced this pull request on Jan 23, 2025:
…IN-subquery

### What changes were proposed in this pull request?
This PR adds code to `RewritePredicateSubquery#apply` to explicitly handle the case where an `Aggregate` node contains an aggregate expression in the left-hand operand of an IN-subquery expression. The explicit handler moves the IN-subquery expressions out of the `Aggregate` and into a parent `Project` node. The `Aggregate` will continue to perform the aggregations that were used as an operand to the IN-subquery expression, but will not include the IN-subquery expression itself.

After pulling up IN-subquery expressions into a `Project` node, `RewritePredicateSubquery#apply` is called again to handle the `Project` as a `UnaryNode`. The `Join` will now be inserted between the `Project` and the `Aggregate` node, and the join condition will use an attribute rather than an aggregate expression, e.g.:

```
Project [col1#32, exists#42 AS (sum(col2) IN (listquery()))#40]
+- Join ExistenceJoin(exists#42), (sum(col2)#41L = c2#39L)
   :- Aggregate [col1#32], [col1#32, sum(col2#33) AS sum(col2)#41L]
   :  +- LocalRelation [col1#32, col2#33]
   +- LocalRelation [c2#39L]
```

`sum(col2)#41L` in the above join condition, despite how it looks, is the name of an attribute, not an aggregate expression.

### Why are the changes needed?
The following query fails:

```
create or replace temp view v1(c1, c2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1);
create or replace temp view v2(col1, col2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1);

select col1, sum(col2) in (select c2 from v1)
from v2
group by col1;
```

It fails with this error:

```
[INTERNAL_ERROR] Cannot generate code for expression: sum(input[1, int, false]) SQLSTATE: XX000
```

With SPARK_TESTING=1, it fails with this error:

```
[PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery in batch RewriteSubquery generated an invalid plan: Special expressions are placed in the wrong plan:
Aggregate [col1#11], [col1#11, first(exists#20, false) AS (sum(col2) IN (listquery()))#19]
+- Join ExistenceJoin(exists#20), (sum(col2#12) = c2#18L)
   :- LocalRelation [col1#11, col2#12]
   +- LocalRelation [c2#18L]
```

The issue is that `RewritePredicateSubquery` builds a `Join` operator whose join condition contains an aggregate expression. The bug is in the handler for `UnaryNode` in `RewritePredicateSubquery#apply`, which adds a `Join` below the `Aggregate` and assumes that the left-hand operand of the IN-subquery can be used in the join condition. This works fine for most cases, but not when the left-hand operand is an aggregate expression. This PR moves the offending IN-subqueries to a `Project` node, with the aggregates replaced by attributes referring to the aggregate expressions. The resulting join condition now uses those attributes rather than the actual aggregate expressions.

### Does this PR introduce _any_ user-facing change?
No, other than allowing this type of query to succeed.

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48627 from bersprockets/aggregate_in_set_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
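A hedged spark-shell sketch of the repro from the description; with the fix, the query should plan the `ExistenceJoin` between the `Project` and the `Aggregate` and succeed:

```scala
// Repro from the PR description, runnable in spark-shell. Before the
// fix this failed with "Cannot generate code for expression: sum(...)".
spark.sql("create or replace temp view v1(c1, c2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1)")
spark.sql("create or replace temp view v2(col1, col2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1)")

val df = spark.sql(
  "select col1, sum(col2) in (select c2 from v1) from v2 group by col1")
df.explain()  // the ExistenceJoin should now sit below the Project, above the Aggregate
df.show()
```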