Skip to content

Conversation

@gatorsmile
Copy link
Member

@gatorsmile gatorsmile commented Nov 24, 2018

What changes were proposed in this pull request?

This PR is to add back unionAll, which is widely used. The name is also consistent with our ANSI SQL. We also have the corresponding intersectAll and exceptAll, which were introduced in Spark 2.4.

How was this patch tested?

Added a test case in DataFrameSuite

@gatorsmile
Copy link
Member Author

cc @rxin @srowen @cloud-fan

}

/**
* Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say that this is an alias of union.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds fine, so this un-deprecates it effectively. The migration docs need to be updated too, in sparkr.md and sql-migration-guide-upgrade.md, to remove reference to unionAll. I updated the JIRA release notes.

It needs to be restored to R too; see 41e1416#diff-508641a8bd6c6b59f3e77c80cdcfa6a9

@SparkQA
Copy link

SparkQA commented Nov 24, 2018

Test build #99229 has finished for PR 23131 at commit 12dfd77.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 24, 2018

Test build #99230 has finished for PR 23131 at commit f0dfe7b.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 24, 2018

Test build #99228 has finished for PR 23131 at commit 133246d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

APIs. Instead, `DataFrame` remains the primary programming abstraction, which is analogous to the
single-node data frame notion in these languages.

- Dataset and DataFrame API `unionAll` has been deprecated and replaced by `union`
Copy link
Member

@dongjoon-hyun dongjoon-hyun Nov 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, we cannot change the history. Until Spark 2.4.0, we are showing the deprecation warning.

scala> spark.version
res2: String = 2.4.0

scala> df.unionAll(df2)
<console>:28: warning: method unionAll in class Dataset is deprecated: use union()
       df.unionAll(df2)
          ^

Shall we keep the history in this specific migration doc, Upgrading From Spark SQL 1.6 to 2.0, and add some comment about 3.0.0 instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my fault for making this suggestion. Yeah maybe best to leave this statement, and add a note here or the the 3.0 migration guide that it has been subsequently un-deprecated

@SparkQA
Copy link

SparkQA commented Nov 25, 2018

Test build #99231 has finished for PR 23131 at commit 170262b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 25, 2018

Test build #99234 has finished for PR 23131 at commit 515c04c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member Author

retest this please

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thanks!

@SparkQA
Copy link

SparkQA commented Nov 25, 2018

Test build #99242 has finished for PR 23131 at commit 515c04c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 9414578 Nov 25, 2018
@cloud-fan
Copy link
Contributor

shall we say union is an alias of unionAll instead of unionAll is an alias of Union? According to the SQL spec, unionAll is implemented correctly that it keeps duplicated rows, while union does not follow SQL spec, as it's too widely used and it's too late to change behavior.

@gatorsmile
Copy link
Member Author

Thanks! Merged to master.

Yes. Adding Distinct over Union is super expensive especially when the underlying data set is huge.


#' Return a new SparkDataFrame containing the union of rows
#'
#' This is an alias for `union`.
Copy link
Member

@felixcheung felixcheung Nov 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is for this to be like other *All, this should go into a separate doc page, plus seealso, example etc.

The way this was written, as it was a deprecated function, this doc page merged with union - as it is committed now, none of the text above will show up and also unionAll will not be listed in method index list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also backtick doesn't format with roxygen2. this should be

This is an alias for \code{union}.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Instead of directly copying the comments back, we should follow intersectAll. Opened a ticket: https://issues.apache.org/jira/browse/SPARK-26189

robert3005 pushed a commit to palantir/spark that referenced this pull request Jan 9, 2019
This PR is to add back `unionAll`, which is widely used. The name is also consistent with our ANSI SQL. We also have the corresponding `intersectAll` and `exceptAll`, which were introduced in Spark 2.4.

Added a test case in DataFrameSuite

Closes apache#23131 from gatorsmile/addBackUnionAll.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?
This PR is to add back `unionAll`, which is widely used. The name is also consistent with our ANSI SQL. We also have the corresponding `intersectAll` and `exceptAll`, which were introduced in Spark 2.4.

## How was this patch tested?
Added a test case in DataFrameSuite

Closes apache#23131 from gatorsmile/addBackUnionAll.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants