[SPARK-25908][SQL][FOLLOW-UP] Add back unionAll
## What changes were proposed in this pull request?
This PR adds back `unionAll`, which is widely used. The name is also consistent with ANSI SQL, and the corresponding `intersectAll` and `exceptAll` were introduced in Spark 2.4.

## How was this patch tested?
Added a test case in `DataFrameSuite`.

Closes apache#23131 from gatorsmile/addBackUnionAll.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
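
As a rough illustration of the semantics described above, the following plain-Python sketch models `unionAll` as a thin alias for `union`. The `Frame` class is a hypothetical stand-in written for this example, not Spark's `DataFrame`; it only assumes that `union` keeps duplicates (as `UNION ALL` does) and that `distinct` removes them.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame: an ordered list of row tuples."""

    def __init__(self, rows):
        self.rows = [tuple(r) for r in rows]

    def union(self, other):
        # UNION ALL semantics: concatenate rows and keep duplicates.
        return Frame(self.rows + other.rows)

    def unionAll(self, other):
        # Alias: delegates to union, so both names behave identically.
        return self.union(other)

    def distinct(self):
        # Drop duplicate rows while preserving first-seen order.
        return Frame(dict.fromkeys(self.rows))


a = Frame([(1, "x"), (2, "y")])
b = Frame([(2, "y"), (3, "z")])

all_rows = a.unionAll(b).rows             # duplicates kept
set_rows = a.unionAll(b).distinct().rows  # SQL-style set union
print(all_rows)
print(set_rows)
```

Because the alias simply delegates, any later change to `union` in this sketch is picked up by `unionAll` automatically, which is the same design choice the patch makes in each language binding.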
gatorsmile committed Nov 25, 2018
commit 94145786a5b91a7f0bca44f27599a61c72f3a18f
1 change: 1 addition & 0 deletions R/pkg/NAMESPACE
@@ -169,6 +169,7 @@ exportMethods("arrange",
              "toJSON",
              "transform",
              "union",
              "unionAll",
              "unionByName",
              "unique",
              "unpersist",
14 changes: 14 additions & 0 deletions R/pkg/R/DataFrame.R
@@ -2732,6 +2732,20 @@ setMethod("union",
            dataFrame(unioned)
          })

#' Return a new SparkDataFrame containing the union of rows
#'
#' This is an alias for `union`.
#'
#' @rdname union
#' @name unionAll
#' @aliases unionAll,SparkDataFrame,SparkDataFrame-method
#' @note unionAll since 1.4.0
setMethod("unionAll",
          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
          function(x, y) {
            union(x, y)
          })

#' Return a new SparkDataFrame containing the union of rows, matched by column names
#'
#' Return a new SparkDataFrame containing the union of rows in this SparkDataFrame
3 changes: 3 additions & 0 deletions R/pkg/R/generics.R
@@ -631,6 +631,9 @@ setGeneric("toRDD", function(x) { standardGeneric("toRDD") })
#' @rdname union
setGeneric("union", function(x, y) { standardGeneric("union") })

#' @rdname union
setGeneric("unionAll", function(x, y) { standardGeneric("unionAll") })

#' @rdname unionByName
setGeneric("unionByName", function(x, y) { standardGeneric("unionByName") })

1 change: 1 addition & 0 deletions R/pkg/tests/fulltests/test_sparkSQL.R
@@ -2458,6 +2458,7 @@ test_that("union(), unionByName(), rbind(), except(), and intersect() on a DataF
  expect_equal(count(unioned), 6)
  expect_equal(first(unioned)$name, "Michael")
  expect_equal(count(arrange(suppressWarnings(union(df, df2)), df$age)), 6)
  expect_equal(count(arrange(suppressWarnings(unionAll(df, df2)), df$age)), 6)

  df1 <- select(df2, "age", "name")
  unioned1 <- arrange(unionByName(df1, df), df1$age)
2 changes: 1 addition & 1 deletion docs/sparkr.md
@@ -718,4 +718,4 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
## Upgrading to SparkR 3.0.0

- The deprecated methods `sparkR.init`, `sparkRSQL.init`, `sparkRHive.init` have been removed. Use `sparkR.session` instead.
- The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `registerTempTable`, `createExternalTable`, `dropTempTable`, `unionAll` have been removed. Use `read.parquet`, `write.parquet`, `read.json`, `createOrReplaceTempView`, `createTable`, `dropTempView`, `union` instead.
- The deprecated methods `parquetFile`, `saveAsParquetFile`, `jsonFile`, `registerTempTable`, `createExternalTable`, and `dropTempTable` have been removed. Use `read.parquet`, `write.parquet`, `read.json`, `createOrReplaceTempView`, `createTable`, `dropTempView` instead.
2 changes: 2 additions & 0 deletions docs/sql-migration-guide-upgrade.md
@@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide

## Upgrading From Spark SQL 2.4 to 3.0

- Since Spark 3.0, the Dataset and DataFrame API `unionAll` is no longer deprecated. It is an alias for `union`.

- In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder used to try to update the `SparkConf` of the existing `SparkContext` with the configurations specified to the builder. Because the `SparkContext` is shared by all `SparkSession`s, it should not be updated this way. Since 3.0, the builder no longer updates these configurations. This is the same behavior as the Java/Scala API in 2.3 and above. If you want to update them, you need to do so before creating a `SparkSession`.

- In Spark version 2.4 and earlier, the parser of the JSON data source treats empty strings as null for some data types such as `IntegerType`. For `FloatType` and `DoubleType`, it fails on empty strings and throws exceptions. Since Spark 3.0, empty strings are disallowed and exceptions are thrown for all data types except `StringType` and `BinaryType`.
11 changes: 11 additions & 0 deletions python/pyspark/sql/dataframe.py
@@ -1448,6 +1448,17 @@ def union(self, other):
        """
        return DataFrame(self._jdf.union(other._jdf), self.sql_ctx)

    @since(1.3)
    def unionAll(self, other):
        """ Return a new :class:`DataFrame` containing union of rows in this and another frame.

        This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union
        (that does deduplication of elements), use this function followed by :func:`distinct`.

        Also as standard in SQL, this function resolves columns by position (not by name).
        """
        return self.union(other)

    @since(2.3)
    def unionByName(self, other):
        """ Returns a new :class:`DataFrame` containing union of rows in this and another frame.
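
The docstring's point about resolving columns by position rather than by name can be sketched in plain Python. The helpers `union_by_position` and `union_by_name` below are hypothetical names made up for this illustration; they are not Spark functions, and they skip the schema checks a real engine would perform.

```python
def union_by_position(cols_a, rows_a, rows_b):
    # Column names of the second input are ignored: values line up by position.
    return cols_a, rows_a + rows_b


def union_by_name(cols_a, rows_a, cols_b, rows_b):
    # Reorder each row of the second input to follow the first input's columns.
    order = [cols_b.index(c) for c in cols_a]
    return cols_a, rows_a + [tuple(r[i] for i in order) for r in rows_b]


cols_a, rows_a = ["name", "age"], [("Ann", 30)]
cols_b, rows_b = ["age", "name"], [(25, "Bob")]

by_position = union_by_position(cols_a, rows_a, rows_b)[1]
by_name = union_by_name(cols_a, rows_a, cols_b, rows_b)[1]
print(by_position)  # second row keeps its original value order
print(by_name)      # second row is reordered to (name, age)
```

When the two inputs list the same columns in a different order, the positional variant silently puts values under the wrong column names, which is why the name-based variant exists as a separate method.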
14 changes: 14 additions & 0 deletions sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1852,6 +1852,20 @@ class Dataset[T] private[sql](
    CombineUnions(Union(logicalPlan, other.logicalPlan))
  }

  /**
   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
   * This is an alias for `union`.
   *
   * This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union (that does
   * deduplication of elements), use this function followed by a [[distinct]].
   *
   * Also as standard in SQL, this function resolves columns by position (not by name).
   *
   * @group typedrel
   * @since 2.0.0
   */
  def unionAll(other: Dataset[T]): Dataset[T] = union(other)

  /**
   * Returns a new Dataset containing union of rows in this Dataset and another Dataset.
   *
@@ -97,6 +97,12 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
      unionDF.agg(avg('key), max('key), min('key), sum('key)),
      Row(50.5, 100, 1, 25250) :: Nil
    )

    // unionAll is an alias of union
    val unionAllDF = testData.unionAll(testData).unionAll(testData)
      .unionAll(testData).unionAll(testData)

    checkAnswer(unionDF, unionAllDF)
  }

  test("union should union DataFrames with UDTs (SPARK-13410)") {
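
The expected aggregate row in the test above can be checked by hand. The sketch below assumes, as the expected values imply, that `testData` holds integer keys 1 through 100; that assumption is not stated in this diff. Five chained unions keep every duplicate, so the result has 500 rows.

```python
# Assumption for illustration: testData keys are the integers 1..100.
keys = list(range(1, 101))

# Five copies of the data, duplicates kept (UNION ALL semantics).
unioned = keys * 5

# Same aggregates the test computes: avg, max, min, sum.
summary = (sum(unioned) / len(unioned), max(unioned), min(unioned), sum(unioned))
print(summary)  # (50.5, 100, 1, 25250)
```

Under that assumption the tuple matches `Row(50.5, 100, 1, 25250)`, and since `unionAll` only delegates to `union`, the same values hold for `unionAllDF`.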