[SPARK-25908][SQL][FOLLOW-UP] Add back unionAll #23131
  }

  /**
   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
say that this is an alias of union.
Done.
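To illustrate the alias relationship requested above, here is a minimal sketch that runs both methods on the same inputs and compares the results. The object name, the sample data, and the local session settings are assumptions made for illustration; none of them come from the PR. On Spark 2.x the `unionAll` call should still compile, but with the deprecation warning discussed in this thread.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of the PR.
object UnionAllAliasSketch {

  // Returns the sorted ids produced by union and by unionAll on the same inputs.
  def run(): (Seq[Int], Seq[Int]) = {
    // Local single-threaded session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("unionAll-alias-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with an overlapping value.
    val df1 = Seq(1, 2, 2).toDF("id")
    val df2 = Seq(2, 3).toDF("id")

    // Both calls resolve columns by position and keep duplicate rows.
    val viaUnion    = df1.union(df2).as[Int].collect().sorted.toSeq
    val viaUnionAll = df1.unionAll(df2).as[Int].collect().sorted.toSeq

    spark.stop()
    (viaUnion, viaUnionAll)
  }

  def main(args: Array[String]): Unit = {
    val (viaUnion, viaUnionAll) = run()
    println(s"union:    ${viaUnion.mkString(", ")}")
    println(s"unionAll: ${viaUnionAll.mkString(", ")}")
  }
}
```

The results are sorted before comparison only so that the check does not depend on row order.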
srowen
left a comment
Sounds fine, so this un-deprecates it effectively. The migration docs need to be updated too, in sparkr.md and sql-migration-guide-upgrade.md, to remove reference to unionAll. I updated the JIRA release notes.
It needs to be restored to R too; see 41e1416#diff-508641a8bd6c6b59f3e77c80cdcfa6a9
Test build #99229 has finished for PR 23131 at commit
Test build #99230 has finished for PR 23131 at commit
Test build #99228 has finished for PR 23131 at commit
APIs. Instead, `DataFrame` remains the primary programming abstraction, which is analogous to the
single-node data frame notion in these languages.

 - Dataset and DataFrame API `unionAll` has been deprecated and replaced by `union`
Ur, we cannot change the history. Until Spark 2.4.0, we are showing the deprecation warning.
scala> spark.version
res2: String = 2.4.0
scala> df.unionAll(df2)
<console>:28: warning: method unionAll in class Dataset is deprecated: use union()
       df.unionAll(df2)
          ^
Shall we keep the history in this specific migration doc, Upgrading From Spark SQL 1.6 to 2.0, and add some comment about 3.0.0 instead?
That's my fault for making this suggestion. Yeah, maybe best to leave this statement, and add a note here or in the 3.0 migration guide that it has been subsequently un-deprecated.
Test build #99231 has finished for PR 23131 at commit
Test build #99234 has finished for PR 23131 at commit
retest this please
dongjoon-hyun
left a comment
+1, LGTM. Thanks!
Test build #99242 has finished for PR 23131 at commit
shall we say
Thanks! Merged to master. Yes. Adding Distinct over Union is super expensive especially when the underlying data set is huge.
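As a rough illustration of the point about `Distinct` over `Union`, the sketch below contrasts a plain `unionAll` with an explicit `distinct()` applied on top of it. The object name and sample data are hypothetical. The expectation that de-duplication shows up as an extra aggregation step in the printed plan is an assumption about typical Spark behavior, not something stated in this thread.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of the PR.
object UnionDistinctSketch {

  // Returns (row count with duplicates kept, row count after explicit de-duplication).
  def counts(): (Long, Long) = {
    // Local single-threaded session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-distinct-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; the value 2 appears in both inputs.
    val a = Seq(1, 2, 2).toDF("id")
    val b = Seq(2, 3).toDF("id")

    // Keeps every row from both inputs (SQL UNION ALL semantics).
    val keepDuplicates = a.unionAll(b)

    // De-duplication is opt-in and is requested explicitly with distinct().
    val deduplicated = a.unionAll(b).distinct()

    // Printing the physical plan is one way to inspect the extra work
    // that distinct() adds on top of the union.
    deduplicated.explain()

    val result = (keepDuplicates.count(), deduplicated.count())
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit = {
    val (withDuplicates, withoutDuplicates) = counts()
    println(s"unionAll rows: $withDuplicates, after distinct: $withoutDuplicates")
  }
}
```

Keeping de-duplication as a separate, explicit call means callers pay for it only when they need set semantics.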
#' Return a new SparkDataFrame containing the union of rows
#'
#' This is an alias for `union`.
If the goal is for this to be like other *All, this should go into a separate doc page, plus seealso, example etc.
The way this was written, as it was a deprecated function, this doc page merged with union - as it is committed now, none of the text above will show up and also unionAll will not be listed in method index list.
also backtick doesn't format with roxygen2. this should be
This is an alias for \code{union}.
I see. Instead of directly copying the comments back, we should follow intersectAll. Opened a ticket: https://issues.apache.org/jira/browse/SPARK-26189
This PR is to add back `unionAll`, which is widely used. The name is also consistent with our ANSI SQL. We also have the corresponding `intersectAll` and `exceptAll`, which were introduced in Spark 2.4.

Added a test case in DataFrameSuite

Closes apache#23131 from gatorsmile/addBackUnionAll.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?
This PR is to add back `unionAll`, which is widely used. The name is also consistent with our ANSI SQL. We also have the corresponding `intersectAll` and `exceptAll`, which were introduced in Spark 2.4.
How was this patch tested?
Added a test case in DataFrameSuite