[pull] master from apache:master #19
…d query stages

### What changes were proposed in this pull request?
AQE uses statistics from completed query stages and feeds them back into the logical optimizer. AQE currently only uses `dataSize` and `numOutputRows` and ignores any available `attributeMap` (column statistics). This PR updates AQE to also populate `attributeMap` in the statistics that it uses for re-optimizing the plan.

### Why are the changes needed?
These changes are needed so that Spark plugins that provide custom implementations of the `ShuffleExchangeLike` trait can leverage column statistics for better plan optimization during AQE execution.

### Does this PR introduce _any_ user-facing change?
No. The current Spark implementation of `ShuffleExchangeLike` (`ShuffleExchangeExec`) does not populate `attributeMap`, so this PR is a no-op for regular Spark.

### How was this patch tested?
New unit test added.

Closes #37424 from andygrove/aqe-column-stats.

Authored-by: Andy Grove <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
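For context, a minimal sketch of what a plugin's `ShuffleExchangeLike` implementation could now return from `runtimeStatistics`. This is a fragment under stated assumptions, not code from the PR: the metric names mirror `ShuffleExchangeExec`, and the per-column values are hypothetical placeholders a plugin would compute from its own shuffle.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeMap
import org.apache.spark.sql.catalyst.plans.logical.{ColumnStat, Statistics}

// Fragment inside a custom ShuffleExchangeLike implementation (not a
// complete operator). With this PR, AQE propagates attributeStats in
// addition to sizeInBytes and rowCount when re-optimizing the plan.
override def runtimeStatistics: Statistics = {
  val rowCount = metrics("numOutputRows").value
  // Hypothetical column statistics gathered by the plugin's shuffle;
  // a real plugin would compute these from the shuffled data.
  val colStats = AttributeMap(output.map { attr =>
    attr -> ColumnStat(distinctCount = Some(BigInt(rowCount)))
  })
  Statistics(
    sizeInBytes = metrics("dataSize").value,
    rowCount = Some(BigInt(rowCount)),
    attributeStats = colStats)
}
```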
…le DSv1 is not available

### What changes were proposed in this pull request?
Improve the error message when DSv2 is disabled while its fallback DSv1 is not available.

### Why are the changes needed?
Improve the user experience. When users get an unsupported-operation error for the disabled DSv2 source, they can tell which config to modify to enable the V2 source.

### Does this PR introduce _any_ user-facing change?
Yes, the error message changes.

### How was this patch tested?
N/A, just a message change.

Closes #37917 from huanliwang-db/SPARK-40466.

Authored-by: Huanli Wang <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
### What changes were proposed in this pull request?
This PR proposes to document `datetime.timedelta` support in PySpark on the SQL DataType reference page. This support was added in SPARK-37275.

### Why are the changes needed?
To show the support of `datetime.timedelta`.

### Does this PR introduce _any_ user-facing change?
Yes, this fixes the documentation.

### How was this patch tested?
CI in this PR should validate the build.

Closes #37939 from HyukjinKwon/minor-daytimeinterval.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…rrwith`
### What changes were proposed in this pull request?
1. Extract the computation of `DataFrame.corr` into `correlation.py`, so it can be reused in `DataFrame.corrwith`/`DataFrameGroupBy.corrwith`/`DataFrameGroupBy.corr`/etc.;
2. Implement `spearman` and `kendall` in `DataFrame.corrwith`;
3. Add parameter `axis` in `DataFrame.corrwith`.
### Why are the changes needed?
For API coverage
```
In [1]: import pyspark.pandas as ps
In [2]: df1 = ps.DataFrame({ "A":[1, 5, 7, 8], "X":[5, 8, 4, 3], "C":[10, 4, 9, 3]})
In [3]: df2 = ps.DataFrame({ "A":[5, 3, 6, 4], "B":[11, 2, 4, 3], "C":[4, 3, 8, 5]})
In [4]: ps.set_option("compute.ops_on_diff_frames", True)
In [5]: df1.corrwith(df2, method="kendall").sort_index()
A 0.0
B NaN
C 0.0
X NaN
dtype: float64
In [6]: df1.to_pandas().corrwith(df2.to_pandas(), method="kendall").sort_index()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
warnings.warn(message, PandasAPIOnSparkAdviceWarning)
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[6]:
A 0.0
B NaN
C 0.0
X NaN
dtype: float64
In [7]: df1.corrwith(df2.B, method="spearman").sort_index()
Out[7]:
A -0.4
C 0.8
X -0.2
dtype: float64
In [8]: df1.to_pandas().corrwith(df2.B.to_pandas(), method="spearman").sort_index()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
warnings.warn(message, PandasAPIOnSparkAdviceWarning)
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas Series is expected to be small.
warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[8]:
A -0.4
C 0.8
X -0.2
dtype: float64
```
### Does this PR introduce _any_ user-facing change?
Yes, new correlation methods are supported.
### How was this patch tested?
Added unit tests.
Closes #37929 from zhengruifeng/ps_corrwith_spearman_kendall.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
… unavailable

### What changes were proposed in this pull request?
This PR is a follow-up for #37873. The UDAF test should be skipped with a proper message when pyspark, pandas, and/or pyarrow are unavailable.

### Why are the changes needed?
To skip the test properly when its dependencies are unavailable.

### Does this PR introduce _any_ user-facing change?
No, it's test-only.

### How was this patch tested?
Manually tested that the test is skipped when a package is missing.

Closes #37946 from itholic/SPARK-40419-followup.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
pull bot pushed a commit that referenced this pull request on Aug 13, 2024:
…eption

### What changes were proposed in this pull request?
This PR reworks group-by on map type to fix two issues:
- A "can not bind reference" exception at runtime, since the attribute was wrapped by `MapSort` and we did not transform the plan with the new output.
- The rule that adds `MapSort` should run before `PullOutGroupingExpressions`, to avoid complex expressions remaining in grouping keys.

### Why are the changes needed?
To fix these issues. For example:

```
select map(1, id) from range(10) group by map(1, id);

[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)
```

### Does this PR introduce _any_ user-facing change?
No, the bug is not in a released version.

### How was this patch tested?
Improved the tests to cover more cases.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47545 from ulysses-you/maptype.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: youxiduo <[email protected]>
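As a quick sanity check, a hedged spark-shell sketch of the failing query from the description; with the fix applied it should group correctly instead of raising the internal bind-reference error:

```scala
// The query from the PR description: grouping by a map-typed key.
// Before this fix it failed with "Couldn't find _groupingexpression...";
// with MapSort added before PullOutGroupingExpressions it runs cleanly.
spark.sql("select map(1, id) from range(10) group by map(1, id)").show()
```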
pull bot pushed a commit that referenced this pull request on Jan 23, 2025:
…IN-subquery

### What changes were proposed in this pull request?
This PR adds code to `RewritePredicateSubquery#apply` to explicitly handle the case where an `Aggregate` node contains an aggregate expression in the left-hand operand of an IN-subquery expression. The explicit handler moves the IN-subquery expressions out of the `Aggregate` and into a parent `Project` node. The `Aggregate` will continue to perform the aggregations that were used as an operand to the IN-subquery expression, but will not include the IN-subquery expression itself.

After pulling up IN-subquery expressions into a `Project` node, `RewritePredicateSubquery#apply` is called again to handle the `Project` as a `UnaryNode`. The `Join` will now be inserted between the `Project` and the `Aggregate` node, and the join condition will use an attribute rather than an aggregate expression, e.g.:

```
Project [col1#32, exists#42 AS (sum(col2) IN (listquery()))#40]
+- Join ExistenceJoin(exists#42), (sum(col2)#41L = c2#39L)
   :- Aggregate [col1#32], [col1#32, sum(col2#33) AS sum(col2)#41L]
   :  +- LocalRelation [col1#32, col2#33]
   +- LocalRelation [c2#39L]
```

`sum(col2)#41L` in the above join condition, despite how it looks, is the name of an attribute, not an aggregate expression.

### Why are the changes needed?
The following query fails:

```
create or replace temp view v1(c1, c2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1);
create or replace temp view v2(col1, col2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1);

select col1, sum(col2) in (select c2 from v1)
from v2
group by col1;
```

It fails with this error:

```
[INTERNAL_ERROR] Cannot generate code for expression: sum(input[1, int, false]) SQLSTATE: XX000
```

With SPARK_TESTING=1, it fails with this error:

```
[PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery in batch RewriteSubquery generated an invalid plan: Special expressions are placed in the wrong plan:
Aggregate [col1#11], [col1#11, first(exists#20, false) AS (sum(col2) IN (listquery()))#19]
+- Join ExistenceJoin(exists#20), (sum(col2#12) = c2#18L)
   :- LocalRelation [col1#11, col2#12]
   +- LocalRelation [c2#18L]
```

The issue is that `RewritePredicateSubquery` builds a `Join` operator whose join condition contains an aggregate expression. The bug is in the handler for `UnaryNode` in `RewritePredicateSubquery#apply`, which adds a `Join` below the `Aggregate` and assumes that the left-hand operand of the IN-subquery can be used in the join condition. This works fine for most cases, but not when the left-hand operand is an aggregate expression. This PR moves the offending IN-subqueries to a `Project` node, with the aggregates replaced by attributes referring to the aggregate expressions. The resulting join condition now uses those attributes rather than the actual aggregate expressions.

### Does this PR introduce _any_ user-facing change?
No, other than allowing this type of query to succeed.

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48627 from bersprockets/aggregate_in_set_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
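A hedged spark-shell sketch of the repro from the description; with the fix, the query should plan the `ExistenceJoin` between the `Project` and the `Aggregate` and succeed:

```scala
// Repro from the PR description, runnable in spark-shell. Before the
// fix this failed with "Cannot generate code for expression: sum(...)".
spark.sql("create or replace temp view v1(c1, c2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1)")
spark.sql("create or replace temp view v2(col1, col2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1)")

val df = spark.sql(
  "select col1, sum(col2) in (select c2 from v1) from v2 group by col1")
df.explain()  // the ExistenceJoin should now sit below the Project, above the Aggregate
df.show()
```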