Conversation

@skillgapfinder

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

holdenk and others added 30 commits June 29, 2016 01:52
…doc is incorrect for toJavaRDD, …

## What changes were proposed in this pull request?

Change the return type mentioned in the JavaDoc for `toJavaRDD` / `javaRDD` to match the actual return type & be consistent with the scala rdd return type.

## How was this patch tested?

Docs only change.

Author: Holden Karau <[email protected]>

Closes #13954 from holdenk/trivial-streaming-tojavardd-doc-fix.

(cherry picked from commit 757dc2c)
Signed-off-by: Tathagata Das <[email protected]>
…tions that reference no input attributes

## What changes were proposed in this pull request?

`MAX(COUNT(*))` is invalid since an aggregate expression can't be nested within another aggregate expression. This case should be caught at the analysis phase, but somehow sneaks through to runtime.

The reason is that, when checking aggregate expressions in `CheckAnalysis`, one checking branch treats all expressions that reference no input attributes as valid. However, `MAX(COUNT(*))` is translated into `MAX(COUNT(1))` at the analysis phase and also references no input attribute.

This PR fixes this issue by removing the aforementioned branch.
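
For illustration, a minimal spark-shell snippet (assuming the shell-provided `spark` session) showing the kind of query that is now rejected at analysis time rather than failing at runtime:

```scala
spark.range(10).createOrReplaceTempView("t")

// Nested aggregate: with this fix it fails fast with an AnalysisException
// instead of sneaking through to runtime.
spark.sql("SELECT MAX(COUNT(*)) FROM t")
```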

## How was this patch tested?

New test case added in `AnalysisErrorSuite`.

Author: Cheng Lian <[email protected]>

Closes #13968 from liancheng/spark-16291-nested-agg-functions.

(cherry picked from commit d1e8108)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

Some appNames in the ML examples are incorrect, mostly in PySpark but one in Scala. This PR corrects the names.

## How was this patch tested?
Style, local tests

Author: Bryan Cutler <[email protected]>

Closes #13949 from BryanCutler/pyspark-example-appNames-fix-SPARK-16261.

(cherry picked from commit 21385d0)
Signed-off-by: Nick Pentreath <[email protected]>
## What changes were proposed in this pull request?
Fix the wrong argument descriptions of `survreg` in SparkR.

## How was this patch tested?
The `Arguments` section of the `survreg` doc before this PR (with a wrong description for `path` and missing `overwrite`):
![image](https://cloud.githubusercontent.com/assets/1962026/16447548/fe7a5ed4-3da1-11e6-8b96-b5bf2083b07e.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/16447617/368e0b18-3da2-11e6-8277-45640fb11859.png)

Author: Yanbo Liang <[email protected]>

Closes #13970 from yanboliang/spark-16143-followup.

(cherry picked from commit c6a220d)
Signed-off-by: Xiangrui Meng <[email protected]>
…R doc

https://issues.apache.org/jira/browse/SPARK-16140

## What changes were proposed in this pull request?

Group the R doc of spark.kmeans, predict(KM), summary(KM), read/write.ml(KM) under Rd spark.kmeans. The example code was updated.

## How was this patch tested?

Tested on my local machine

On my laptop `jekyll build` is failing to build the API docs, so here I can only show the HTML I manually generated from the Rd files, with no CSS applied; the doc content should still be there.

![screenshotkmeans](https://cloud.githubusercontent.com/assets/3925641/16403203/c2c9ca1e-3ca7-11e6-9e29-f2164aee75fc.png)

Author: Xin Ren <[email protected]>

Closes #13921 from keypointt/SPARK-16140.

(cherry picked from commit 8c9cd0a)
Signed-off-by: Xiangrui Meng <[email protected]>
…FrameReader

#### What changes were proposed in this pull request?
In the Python API, we have the same issue. Thanks for identifying this issue, zsxwing! Below is an example:
```Python
spark.read.format('json').load('python/test_support/sql/people.json')
```
#### How was this patch tested?
Existing test cases cover the changes by this PR

Author: gatorsmile <[email protected]>

Closes #13965 from gatorsmile/optionPaths.

(cherry picked from commit 39f2eb1)
Signed-off-by: Shixiong Zhu <[email protected]>
…Guide

The title says it all.

Author: Tathagata Das <[email protected]>

Closes #13945 from tdas/SPARK-16256.

(cherry picked from commit 64132a1)
Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?

This PR corrects ORC compression option for PySpark as well. I think this was missed mistakenly in #13948.

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes #13963 from HyukjinKwon/minor-orc-compress.

(cherry picked from commit d8a87a3)
Signed-off-by: Davies Liu <[email protected]>
…d respect the case sensitivity setting.

## What changes were proposed in this pull request?
The analyzer rule for resolving USING joins should respect the case sensitivity setting.
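
As an illustration (not taken from the PR's test code), a join that exercises the using-column resolution across differing cases, assuming the shell-provided `spark`:

```scala
val left = spark.range(3).toDF("ID")
val right = spark.range(3).toDF("id")

// With spark.sql.caseSensitive=false (the default) the using column "id" should resolve
// against "ID" on the left side; with case sensitivity enabled it should not.
left.join(right, Seq("id")).show()
```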

## How was this patch tested?
New tests in ResolveNaturalJoinSuite

Author: Yin Huai <[email protected]>

Closes #13977 from yhuai/SPARK-16301.

(cherry picked from commit 8b5a8b2)
Signed-off-by: Davies Liu <[email protected]>
…throws non-intuitive exception

## What changes were proposed in this pull request?

This PR allows `emptyDataFrame.write` since the user didn't specify any partition columns.

**Before**
```scala
scala> spark.emptyDataFrame.write.parquet("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
scala> spark.emptyDataFrame.write.csv("/tmp/t1")
org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns;
```

After this PR, there occurs no exceptions and the created directory has only one file, `_SUCCESS`, as expected.

## How was this patch tested?

Pass the Jenkins tests including updated test cases.

Author: Dongjoon Hyun <[email protected]>

Closes #13730 from dongjoon-hyun/SPARK-16006.

(cherry picked from commit 9b1b3ae)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

This extends SPARK-15860 to include metrics for the actual bytecode size of janino-generated methods. They can be accessed in the same way as any other codahale metric, e.g.

```
scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.getSnapshot().getValues()
res7: Array[Long] = Array(532, 532, 532, 542, 1479, 2670, 3585, 3585)

scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.getSnapshot().getValues()
res8: Array[Long] = Array(5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 38, 63, 79, 88, 94, 94, 94, 132, 132, 165, 165, 220, 220)
```

## How was this patch tested?

Small unit test, also verified manually that the performance impact is minimal (<10%). hvanhovell

Author: Eric Liang <[email protected]>

Closes #13934 from ericl/spark-16238.

(cherry picked from commit 23c5865)
Signed-off-by: Reynold Xin <[email protected]>
…nctions for decimal param lookups

## What changes were proposed in this pull request?

This PR supports a fallback lookup by casting `DecimalType` into `DoubleType` for the external functions with `double`-type parameter.

**Reported Error Scenarios**
```scala
scala> sql("select percentile(value, 0.5) from values 1,2,3 T(value)")
org.apache.spark.sql.AnalysisException: ... No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (int, decimal(38,18)). Possible choices: _FUNC_(bigint, array<double>)  _FUNC_(bigint, double)  ; line 1 pos 7

scala> sql("select percentile_approx(value, 0.5) from values 1.0,2.0,3.0 T(value)")
org.apache.spark.sql.AnalysisException: ... Only a float/double or float/double array argument is accepted as parameter 2, but decimal(38,18) was passed instead.; line 1 pos 7
```

## How was this patch tested?

Pass the Jenkins tests (including a new testcase).

Author: Dongjoon Hyun <[email protected]>

Closes #13930 from dongjoon-hyun/SPARK-16228.

(cherry picked from commit 2eaabfa)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

This PR adds 3 optimizer rules for typed filter:

1. push typed filter down through `SerializeFromObject` and eliminate the deserialization in filter condition.
2. pull typed filter up through `SerializeFromObject` and eliminate the deserialization in filter condition.
3. combine adjacent typed filters and share the deserialized object among all the condition expressions.

This PR also adds `TypedFilter` logical plan, to separate it from normal filter, so that the concept is more clear and it's easier to write optimizer rules.
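
A small sketch of the kind of Dataset code these rules target (assuming the shell-provided `spark`); the two adjacent typed filters below are what rule 3 would combine so the object is deserialized only once:

```scala
import spark.implicits._

case class Person(name: String, age: Int)

val ds = Seq(Person("alice", 17), Person("bob", 42)).toDS()

// Two adjacent typed filters over the same deserialized object.
ds.filter(_.age > 18).filter(_.name.startsWith("b")).show()
```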

## How was this patch tested?

`TypedFilterOptimizationSuite`

Author: Wenchen Fan <[email protected]>

Closes #13846 from cloud-fan/filter.

(cherry picked from commit d063898)
Signed-off-by: Cheng Lian <[email protected]>
…ING` from testsuites.

## What changes were proposed in this pull request?

After SPARK-15674, `DDLStrategy` prints out the following deprecation messages in the testsuites.

```
12:10:53.284 WARN org.apache.spark.sql.execution.SparkStrategies$DDLStrategy:
CREATE TEMPORARY TABLE normal_orc_source USING... is deprecated,
please use CREATE TEMPORARY VIEW viewName USING... instead
```

Total: 40
- JDBCWriteSuite: 14
- DDLSuite: 6
- TableScanSuite: 6
- ParquetSourceSuite: 5
- OrcSourceSuite: 2
- SQLQuerySuite: 2
- HiveCommandSuite: 2
- JsonSuite: 1
- PrunedScanSuite: 1
- FilteredScanSuite: 1

This PR replaces `CREATE TEMPORARY TABLE` with `CREATE TEMPORARY VIEW` in order to remove the deprecation messages in the above testsuites except `DDLSuite`, `SQLQuerySuite`, `HiveCommandSuite`.

The Jenkins results show only the 10 remaining messages.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61422/consoleFull

## How was this patch tested?

This is a testsuite-only change.

Author: Dongjoon Hyun <[email protected]>

Closes #13956 from dongjoon-hyun/SPARK-16267.

(cherry picked from commit 831a04f)
Signed-off-by: Reynold Xin <[email protected]>
…lugin

## What changes were proposed in this pull request?

This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.
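
A rough sketch of what a labelled example region might look like in a Scala example source file; the exact marker syntax is an assumption here, not taken from this PR:

```scala
// Hypothetical markers; the real label syntax used by the plugin may differ.
// $example on:init_session$
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DocExample").getOrCreate()
// $example off:init_session$
```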

## How was this patch tested?

Manually tested.

<img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">

Author: Cheng Lian <[email protected]>

Closes #13972 from liancheng/include-example-with-labels.

(cherry picked from commit bde1d6a)
Signed-off-by: Xiangrui Meng <[email protected]>
…0 Consumer API

## What changes were proposed in this pull request?

New Kafka consumer API for the released 0.10 version of Kafka.

## How was this patch tested?

Unit tests, manual tests

Author: cody koeninger <[email protected]>

Closes #11863 from koeninger/kafka-0.9.

(cherry picked from commit dedbcee)
Signed-off-by: Tathagata Das <[email protected]>
…ng Guide

Author: Tathagata Das <[email protected]>

Closes #13978 from tdas/SPARK-16256-1.

(cherry picked from commit 2c3d961)
Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?

model loading backward compatibility for ml NaiveBayes

## How was this patch tested?

Existing unit tests and a manual test for loading models saved by Spark 1.6.

Author: zlpmichelle <[email protected]>

Closes #13940 from zlpmichelle/naivebayes.

(cherry picked from commit b30a2dc)
Signed-off-by: Yanbo Liang <[email protected]>
…2.10

## What changes were proposed in this pull request?

The commented-out lines failed the Scala 2.10 build. This is because of a change in the behavior of case classes between 2.10 and 2.11. In Scala 2.10, if the companion object of a case class has an explicitly defined `apply()`, then the synthetic apply method is not generated. In Scala 2.11 it is generated. Hence, the lines compile fine in 2.11 but not in 2.10.

This simply comments out the tests to fix the broken build; a proper solution is pending.
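
A minimal illustration of the 2.10 vs 2.11 difference described above (not the actual commented-out test code):

```scala
case class Point(x: Int, y: Int)

object Point {
  // Explicitly defined apply in the companion object.
  def apply(x: Int): Point = new Point(x, 0)
}

// Scala 2.11 still generates the synthetic two-argument apply, so this compiles;
// under Scala 2.10 the synthetic apply is suppressed and this line fails to compile.
val p = Point(1, 2)
```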

Author: Tathagata Das <[email protected]>

Closes #13992 from tdas/SPARK-12177.

(cherry picked from commit de8ab31)
Signed-off-by: Cheng Lian <[email protected]>
…BufferHolder

## What changes were proposed in this pull request?

This PR checks the size limit when doubling the array size in `BufferHolder`, to avoid integer overflow.
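
A minimal sketch of the guarded-doubling idea (illustrative names, not the actual `BufferHolder` code):

```scala
// Compute the new capacity in Long arithmetic so the doubling itself cannot overflow Int,
// and fail with a clear error instead of silently wrapping around.
def grownCapacity(current: Int, required: Int): Int = {
  val target = math.max(current.toLong * 2L, required.toLong)
  require(target <= Int.MaxValue,
    s"Cannot grow buffer to $target bytes: exceeds the maximum array size")
  target.toInt
}
```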

## How was this patch tested?

Manual test.

Author: Sean Zhong <[email protected]>

Closes #13829 from clockfly/SPARK-16071_2.

(cherry picked from commit 5320adc)
Signed-off-by: Wenchen Fan <[email protected]>
self explanatory

Author: Tathagata Das <[email protected]>

Closes #13994 from tdas/SPARK-12177-1.
Force the sorter to spill when the number of elements in the pointer array reaches a certain size. This works around the issue of TimSort failing on large buffer sizes.

Tested by running a job that was failing without this change due to the TimSort bug.

Author: Sital Kedia <[email protected]>

Closes #13107 from sitalkedia/fix_TimSort.

(cherry picked from commit 07f46af)
Signed-off-by: Davies Liu <[email protected]>
Author: Tathagata Das <[email protected]>

Closes #14001 from tdas/SPARK-16256-2.

(cherry picked from commit 5d00a7b)
Signed-off-by: Tathagata Das <[email protected]>
…tion

## What changes were proposed in this pull request?
This patch appends a message suggesting that users run REFRESH TABLE or reload their DataFrames when Spark sees a FileNotFoundException caused by stale, cached metadata.
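
The remediation the new message points at looks roughly like this (assuming the shell-provided `spark` and a hypothetical table name):

```scala
// Re-read the file metadata for a cached table after its underlying files changed.
spark.catalog.refreshTable("my_table")

// Or, equivalently, via SQL:
spark.sql("REFRESH TABLE my_table")
```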

## How was this patch tested?
Added a unit test for this in MetadataCacheSuite.

Author: petermaxlee <[email protected]>

Closes #14003 from petermaxlee/SPARK-16336.

(cherry picked from commit fb41670)
Signed-off-by: Reynold Xin <[email protected]>
…listing

## What changes were proposed in this pull request?
Spark silently drops exceptions during file listing. This is a very bad behavior because it can mask legitimate errors and the resulting plan will silently have 0 rows. This patch changes it to not silently drop the errors.

## How was this patch tested?
Manually verified.

Author: Reynold Xin <[email protected]>

Closes #13987 from rxin/SPARK-16313.

(cherry picked from commit 3d75a5b)
Signed-off-by: Reynold Xin <[email protected]>
…methods to PySpark linalg

The move to `ml.linalg` created `asML`/`fromML` utility methods in Scala/Java for converting between representations. These are missing in Python; this PR adds them.
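
For reference, the Scala counterparts that the new Python methods mirror (a usage sketch, not code from this PR):

```scala
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vector => NewVector}

// mllib -> ml
val newVec: NewVector = OldVectors.dense(1.0, 2.0, 3.0).asML
// ml -> mllib
val oldVec = org.apache.spark.mllib.linalg.Vectors.fromML(newVec)
```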

## How was this patch tested?

New doctests.

Author: Nick Pentreath <[email protected]>

Closes #13997 from MLnick/SPARK-16328-python-linalg-convert.

(cherry picked from commit dab1051)
Signed-off-by: Joseph K. Bradley <[email protected]>
This PR adds the breaking changes from [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) to the migration guide.

## How was this patch tested?

Built docs locally.

Author: Nick Pentreath <[email protected]>

Closes #13924 from MLnick/SPARK-15643-migration-guide.

(cherry picked from commit 4a981dc)
Signed-off-by: Joseph K. Bradley <[email protected]>
## What changes were proposed in this pull request?
This patch introduces a flag to disable loading test tables in TestHiveSparkSession and disables that in Python. This fixes an issue in which python/run-tests would fail due to failure to load test tables.

Note that these test tables are not used outside of HiveCompatibilitySuite. In the long run we should probably decouple the loading of test tables from the test Hive setup.

## How was this patch tested?
This is a test only change.

Author: Reynold Xin <[email protected]>

Closes #14005 from rxin/SPARK-15954.

(cherry picked from commit 38f4d6f)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

Add the `Catalog.refreshTable` API to the Python interface for Spark SQL.

## How was this patch tested?

Existing test.

Author: WeichenXu <[email protected]>

Closes #13558 from WeichenXu123/update_python_sql_interface_refreshTable.

(cherry picked from commit 5344bad)
Signed-off-by: Cheng Lian <[email protected]>
holdenk and others added 13 commits August 1, 2016 06:55
## What changes were proposed in this pull request?

Change to non-deprecated constructor for SQLContext.

## How was this patch tested?

Existing tests

Author: Holden Karau <[email protected]>

Closes #14406 from holdenk/SPARK-16778-fix-use-of-deprecated-SQLContext-constructor.

(cherry picked from commit 1e9b59b)
Signed-off-by: Sean Owen <[email protected]>
… 0.10.0.

## What changes were proposed in this pull request?

This PR replaces the old Kafka API with the 0.10.0 one in `KafkaTestUtils`.

The changes include:

 - `Producer` to `KafkaProducer` (a short sketch of the new producer API appears after the warning output below)
 - Change configurations to equivalent ones (I referred to [here](http://kafka.apache.org/documentation.html#producerconfigs) for 0.10.0 and [here](http://kafka.apache.org/082/documentation.html#producerconfigs) for the old 0.8.2).

This PR removes the build warnings shown below:

```scala
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:71: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]   private var producer: Producer[String, String] = _
[WARNING]                         ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:181: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]     producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]                    ^
[WARNING] .../spark/streaming/kafka010/KafkaTestUtils.scala:181: class ProducerConfig in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerConfig instead.
[WARNING]     producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]                                                 ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:182: class KeyedMessage in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerRecord instead.
[WARNING]     producer.send(messages.map { new KeyedMessage[String, String](topic, _ ) }: _*)
[WARNING]                                      ^
[WARNING] four warnings found
[WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
[WARNING] 1 warning
```
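
For context, a small sketch of the new-style producer API that replaces `kafka.producer.Producer`; the broker address and topic here are placeholders, not values from this PR:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("test-topic", "key", "value"))
producer.close()
```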

## How was this patch tested?

Existing tests that use `KafkaTestUtils` should cover this.

Author: hyukjinkwon <[email protected]>

Closes #14416 from HyukjinKwon/SPARK-16776.

(cherry picked from commit f93ad4f)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
a failing test case + fix to SPARK-16791 (https://issues.apache.org/jira/browse/SPARK-16791)

## How was this patch tested?
Added a failing test case to `CastSuite`, then fixed the `Cast` code and reran the entire `CastSuite`.

Author: eyal farago <eyal farago>
Author: Eyal Farago <[email protected]>

Closes #14400 from eyalfa/SPARK-16791_cast_struct_with_timestamp_field_fails.

(cherry picked from commit 338a98d)
Signed-off-by: Wenchen Fan <[email protected]>
…ove timezone handling

## What changes were proposed in this pull request?

Removes the deprecated timestamp constructor and incidentally fixes a use that relied on the system timezone rather than the one specified, when working near DST.

This change also causes the roundtrip tests to fail since it now actually uses all the timezones near DST boundaries where it didn't before.

Note: this is only a partial solution; longer term we should follow up with https://issues.apache.org/jira/browse/SPARK-16788 to avoid this problem and simplify our timezone handling code.

## How was this patch tested?

New tests were added for two timezones, so even if the user's timezone happens to coincide with one, the other tests should still fail. Important note: this (temporarily) disables the round-trip tests until we can fix the issue more thoroughly.

Author: Holden Karau <[email protected]>

Closes #14398 from holdenk/SPARK-16774-fix-use-of-deprecated-timestamp-constructor.

(cherry picked from commit ab1e761)
Signed-off-by: Sean Owen <[email protected]>
…istener.getBatchUIData

## What changes were proposed in this pull request?

Moved `asScala` to a `map` to avoid NPE.
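
A minimal sketch of the pattern (illustrative types, not the actual listener code): wrap the possibly-null Java value in `Option` before converting, rather than calling `asScala` on a null reference:

```scala
import scala.collection.JavaConverters._

def toScalaSeq(javaList: java.util.List[String]): Seq[String] =
  Option(javaList).map(_.asScala.toSeq).getOrElse(Seq.empty)
```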

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <[email protected]>

Closes #14443 from zsxwing/SPARK-15869.

(cherry picked from commit 03d46aa)
Signed-off-by: Shixiong Zhu <[email protected]>
…sets of partitions

#14425 rebased for branch-2.0

Author: Eric Liang <[email protected]>

Closes #14427 from ericl/spark-16818-br-2.
## What changes were proposed in this pull request?

This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.

## How was this patch tested?

Manually tested.

Author: Cheng Lian <[email protected]>

Closes #14368 from liancheng/revise-examples.

(cherry picked from commit 10e1c0e)
Signed-off-by: Wenchen Fan <[email protected]>
…LVector instead of MLlib Vector

## What changes were proposed in this pull request?

mllib.LDAExample uses the ML pipeline and the MLlib LDA algorithm. The former transforms the original data into the ML Vector format, while the latter expects the MLlib Vector format.

## How was this patch tested?

Test manually.

Author: Xusen Yin <[email protected]>

Closes #14212 from yinxusen/SPARK-16558.

(cherry picked from commit dd8514f)
Signed-off-by: Yanbo Liang <[email protected]>
## What changes were proposed in this pull request?

Casting ConcurrentHashMap to ConcurrentMap allows code compiled with Java 8 to run on Java 7.
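
A short illustration of the pattern (not the exact Spark code): keeping the static type as the `ConcurrentMap` interface makes the generated bytecode bind to interface methods present on Java 7, rather than to Java 8-only overrides on `ConcurrentHashMap` (such as `keySet()` returning `KeySetView`):

```scala
import java.util.concurrent.{ConcurrentHashMap, ConcurrentMap}

// Typed as the interface, so calls compile against methods that also exist on Java 7.
val cache: ConcurrentMap[String, String] = new ConcurrentHashMap[String, String]()
cache.putIfAbsent("key", "value")
```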

## How was this patch tested?

Compilation. Existing automatic tests

Author: Maciej Brynski <[email protected]>

Closes #14459 from maver1ck/spark-15541-master.

(cherry picked from commit 511dede)
Signed-off-by: Sean Owen <[email protected]>
…tructors

## What changes were proposed in this pull request?

Fix incorrect arguments (dropping slideDuration and using windowDuration instead) in the constructors for TimeWindow.

The JIRA this addresses is here: https://issues.apache.org/jira/browse/SPARK-16837

## How was this patch tested?

Added a test to TimeWindowSuite to check that the results of the TimeWindow object's apply method and the TimeWindow class constructor are equivalent.

Author: Tom Magrino <[email protected]>

Closes #14441 from tmagrino/windowing-fix.

(cherry picked from commit 1dab63d)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

There are two related bugs with Python-only UDTs. Because the test case for the second one needs the first fix too, I put them into one PR. If that is not appropriate, please let me know.

### First bug: When MapObjects works on Python-only UDTs

`RowEncoder` will use `PythonUserDefinedType.sqlType` for its deserializer expression. If the SQL type is `ArrayType`, we will have `MapObjects` working on it. But `MapObjects` doesn't consider `PythonUserDefinedType` as its input data type. It causes an error like:

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
    df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema)
    df.show()

    File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
    : java.lang.RuntimeException: Error while decoding: scala.MatchError: org.apache.spark.sql.types.PythonUserDefinedType@f4ceede8 (of class org.apache.spark.sql.types.PythonUserDefinedType)
    ...

### Second bug: When Python-only UDTs is the element type of ArrayType

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
    df = spark.createDataFrame([(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)], schema=schema)
    df.show()

## How was this patch tested?
PySpark's sql tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #13778 from viirya/fix-pyudt.

(cherry picked from commit 146001a)
Signed-off-by: Davies Liu <[email protected]>
…erals

## What changes were proposed in this pull request?
In Spark 1.6 (with Hive support) we could use the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions as literals (without parentheses), for example:
```SQL
select /* Spark 1.6: */ current_date, /* Spark 1.6  & Spark 2.0: */ current_date()
```
This was accidentally dropped in Spark 2.0. This PR reinstates this functionality.

## How was this patch tested?
Added a case to ExpressionParserSuite.

Author: Herman van Hovell <[email protected]>

Closes #14442 from hvanhovell/SPARK-16836.

(cherry picked from commit 2330f3e)
Signed-off-by: Reynold Xin <[email protected]>
…east

The greatest/least functions do not have the friendliest error message for data type mismatches. This patch improves the error message to not show the Seq type, and to use more human-readable data type names.

Before:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST (ArrayBuffer(DecimalType(2,1), StringType)).; line 1 pos 7
```

After:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST(decimal(2,1), string).; line 1 pos 7
```

Manually verified the output and also added unit tests to ConditionalExpressionSuite.

Author: petermaxlee <[email protected]>

Closes #14453 from petermaxlee/SPARK-16850.

(cherry picked from commit a1ff72e)
Signed-off-by: Reynold Xin <[email protected]>
@AmplabJenkins

Can one of the admins verify this patch?

@srowen

srowen commented Aug 2, 2016

Close this please

JoshRosen and others added 7 commits August 2, 2016 12:03
… with the same file

## What changes were proposed in this pull request?

The behavior of `SparkContext.addFile()` changed slightly with the introduction of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default / only file server in Spark 2.0.0.

Prior to 2.0, calling `SparkContext.addFile()` with files that have the same name and identical contents would succeed. This behavior was never explicitly documented but Spark has behaved this way since very early 1.x versions.

In 2.0 (or 1.6 with the Netty file server enabled), the second `addFile()` call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration.

This problem also affects `addJar()` in a more subtle way: the `fileServer.addJar()` call will also fail with an exception but that exception is logged and ignored; I believe that the problematic exception-catching path was mistakenly copied from some old code which was only relevant to very old versions of Spark and YARN mode.

I believe that this change of behavior was unintentional, so this patch weakens the `require` check so that adding the same filename at the same path will succeed.

At file download time, Spark tasks will fail with exceptions if an executor already has a local copy of a file and that file's contents do not match the contents of the file being downloaded / added. As a result, it's important that we prevent files with the same name and different contents from being served because allowing that can effectively brick an executor by preventing it from successfully launching any new tasks. Before this patch's change, this was prevented by forbidding `addFile()` from being called twice on files with the same name. Because Spark does not defensively copy local files that are passed to `addFile` it is vulnerable to files' contents changing, so I think it's okay to rely on an implicit assumption that these files are intended to be immutable (since if they _are_ mutable then this can lead to either explicit task failures or implicit incorrectness (in case new executors silently get newer copies of the file while old executors continue to use an older version)). To guard against this, I have decided to only update the file addition timestamps on the first call to `addFile()`; duplicate calls will succeed but will not update the timestamp. This behavior is fine as long as we assume files are immutable, which seems reasonable given the behaviors described above.

As part of this change, I also improved the thread-safety of the `addedJars` and `addedFiles` maps; this is important because these maps may be concurrently read by a task launching thread and written by a driver thread in case the user's driver code is multi-threaded.
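
A minimal illustration of the restored behavior (the path here is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("addFile-demo"))

sc.addFile("/tmp/config.properties")
// A second call with the same path and contents now succeeds (and does not update the
// recorded timestamp), instead of failing the requirement check.
sc.addFile("/tmp/config.properties")
```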

## How was this patch tested?

I added regression tests in `SparkContextSuite`.

Author: Josh Rosen <[email protected]>

Closes #14396 from JoshRosen/SPARK-16787.

(cherry picked from commit e9fc0b6)
Signed-off-by: Josh Rosen <[email protected]>
## What changes were proposed in this pull request?

avgMetrics was summed, not averaged, across folds

Author: =^_^= <[email protected]>

Closes #14456 from pkch/pkch-patch-1.

(cherry picked from commit 639df04)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

Mask `spark.ssl.keyPassword`, `spark.ssl.keyStorePassword`, and `spark.ssl.trustStorePassword` in the Web UI environment page (their values are changed to `*****` on the page).
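
A minimal sketch of the masking idea (illustrative code, not the actual UI change):

```scala
// Replace values of password-like keys with ***** before rendering the environment page.
def maskSensitive(props: Seq[(String, String)]): Seq[(String, String)] =
  props.map { case (k, v) =>
    if (k.toLowerCase.contains("password")) (k, "*****") else (k, v)
  }
```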

## How was this patch tested?

I built Spark, ran the Spark shell, and checked that these values are masked with `*****`.

Also run tests:
./dev/run-tests

[info] ScalaTest
[info] Run completed in 1 hour, 9 minutes, 5 seconds.
[info] Total number of tests run: 2166
[info] Suites: completed 65, aborted 0
[info] Tests: succeeded 2166, failed 0, canceled 0, ignored 590, pending 0
[info] All tests passed.

![mask](https://cloud.githubusercontent.com/assets/15244468/17262154/7641e132-55e2-11e6-8a6c-30ead77c7372.png)

Author: Artur Sukhenko <[email protected]>

Closes #14409 from Devian-ua/maskpass.

(cherry picked from commit 3861273)
Signed-off-by: Sean Owen <[email protected]>
… type coercion should handle decimal type

## What changes were proposed in this pull request?

Here is a table about the behaviours of `array`/`map` and `greatest`/`least` in Hive, MySQL and Postgres:

|    |Hive|MySQL|Postgres|
|---|---|---|---|
|`array`/`map`|can find a wider type with decimal type arguments, and will truncate the wider decimal type if necessary|can find a wider type with decimal type arguments, no truncation problem|can find a wider type with decimal type arguments, no truncation problem|
|`greatest`/`least`|can find a wider type with decimal type arguments, and truncate if necessary, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|

I think these behaviours make sense and Spark SQL should follow them.

This PR fixes `array` and `map` by using `findWiderCommonType` to get the wider type.
This PR fixes `greatest` and `least` by adding a `findWiderTypeWithoutStringPromotion`, which provides similar semantics to `findWiderCommonType` but without string promotion.
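
A couple of illustrative queries (assuming the shell-provided `spark`) that exercise the widened coercion described above; the exact result types depend on the inputs:

```scala
// array finds a wider common type for mixed decimal/double arguments.
spark.sql("SELECT array(CAST(1 AS DECIMAL(5, 2)), CAST(2.5 AS DOUBLE))").show()

// greatest widens its decimal arguments without promoting to string.
spark.sql("SELECT greatest(CAST(1.5 AS DECIMAL(10, 2)), 2)").show()
```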

## How was this patch tested?

New tests in `TypeCoercionSuite`.

Author: Wenchen Fan <[email protected]>
Author: Yin Huai <[email protected]>

Closes #14439 from cloud-fan/bug.

(cherry picked from commit b55f343)
Signed-off-by: Yin Huai <[email protected]>
This is a pull request that was originally merged against branch-1.6 as #12000, now being merged into master as well.  srowen zzcclp JoshRosen

This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an IllegalStateException. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204

My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user.

This patch was tested manually. I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.

Author: Kevin McHale <[email protected]>

Closes #14420 from mchalek/mchalek-jdbc_driver_registration.

(cherry picked from commit 685b08e)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
As of Scala 2.11.x there is no longer an org.scala-lang:jline version aligned to the Scala version itself. The Scala console now uses the plain jline:jline module. Spark's dependency management did not reflect this change properly, causing Maven to pull in jline via a transitive dependency. Unfortunately jline 2.12 contained a minor but very annoying bug that rendered the shell almost useless for developers with a German keyboard layout. This request contains the following changes:
- Exclude transitive dependency 'jline:jline' from hive-exec module
- Remove global properties 'jline.version' and 'jline.groupId'
- Add both properties and dependency to 'scala-2.11' profile
- Add explicit dependency on 'jline:jline' to  module 'spark-repl'

## How was this patch tested?
- Running mvn dependency:tree and checking for correct Jline version 2.12.1
- Running full builds with assembly and checking for jline-2.12.1.jar in 'lib' folder of generated tarball

Author: Stefan Schulze <[email protected]>

Closes #14429 from stsc-pentasys/SPARK-16770.

(cherry picked from commit 4775eb4)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

SpillReader NPE when the spill file has no data. See the following logs:

16/07/31 20:54:04 INFO collection.ExternalSorter: spill memory to file:/data4/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-db5f46c3-d7a4-4f93-8b77-565e469696fb/09/temp_shuffle_ec3ece08-4569-4197-893a-4a5dfcbbf9fa, fileSize:0.0 B
16/07/31 20:54:04 WARN memory.TaskMemoryManager: leak 164.3 MB memory from org.apache.spark.util.collection.ExternalSorter@3db4b52d
16/07/31 20:54:04 ERROR executor.Executor: Managed memory leak detected; size = 190458101 bytes, TID = 23585
16/07/31 20:54:04 ERROR executor.Executor: Exception in task 1013.0 in stage 18.0 (TID 23585)
java.lang.NullPointerException
	at org.apache.spark.util.collection.ExternalSorter$SpillReader.cleanup(ExternalSorter.scala:624)
	at org.apache.spark.util.collection.ExternalSorter$SpillReader.nextBatchStream(ExternalSorter.scala:539)
	at org.apache.spark.util.collection.ExternalSorter$SpillReader.<init>(ExternalSorter.scala:507)
	at org.apache.spark.util.collection.ExternalSorter$SpillableIterator.spill(ExternalSorter.scala:816)
	at org.apache.spark.util.collection.ExternalSorter.forceSpill(ExternalSorter.scala:251)
	at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:109)
	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:154)
	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
	at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
16/07/31 20:54:30 INFO executor.Executor: Executor is trying to kill task 1090.1 in stage 18.0 (TID 23793)
16/07/31 20:54:30 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown

## How was this patch tested?

Manual test.

Author: sharkd <[email protected]>
Author: sharkdtu <[email protected]>

Closes #14479 from sharkdtu/master.

(cherry picked from commit 583d91a)
Signed-off-by: Reynold Xin <[email protected]>
vanzin pushed a commit to vanzin/spark that referenced this pull request Aug 4, 2016
Closing the following PRs due to requests or unresponsive users.

Closes apache#13923
Closes apache#14462
Closes apache#13123
Closes apache#14423 (requested by srowen)
Closes apache#14424 (requested by srowen)
Closes apache#14101 (requested by jkbradley)
Closes apache#10676 (requested by srowen)
Closes apache#10943 (requested by yhuai)
Closes apache#9936
Closes apache#10701
Davies Liu and others added 5 commits August 4, 2016 11:20
## What changes were proposed in this pull request?

This patch fixes the overflow in LongToUnsafeRowMap when the range of keys is very wide (the key is much, much smaller than minKey; for example, the key is Long.MinValue and minKey is > 0).
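
A tiny illustration of the arithmetic that can go wrong when the key range is that wide:

```scala
val minKey = 1L
val key = Long.MinValue

// The offset computation wraps around instead of producing a (huge) negative value:
val offset = key - minKey
println(offset)  // 9223372036854775807 (Long.MaxValue), not a negative number
```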

## How was this patch tested?

Added regression test (also for SPARK-16740)

Author: Davies Liu <[email protected]>

Closes #14464 from davies/fix_overflow.

(cherry picked from commit 9d4e621)
Signed-off-by: Davies Liu <[email protected]>
## What changes were proposed in this pull request?

Add the missing argument checking for randomSplit and sample.
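
Usage sketch (assuming the shell-provided `spark`); with the added checks, invalid arguments are rejected up front instead of producing confusing results later:

```scala
val df = spark.range(100).toDF("x")

// Valid usage.
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)

// e.g. negative weights or an out-of-range fraction would now fail fast.
```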

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <[email protected]>

Closes #14478 from zhengruifeng/fix_randomSplit.

(cherry picked from commit be8ea4b)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

Make sure the ANN layer's input training data is persisted, so that it avoids the overhead of recomputing the RDD from its lineage.

## How was this patch tested?

Existing Tests.

Author: WeichenXu <[email protected]>

Closes #14483 from WeichenXu123/add_ann_persist_training_data.

(cherry picked from commit 462784f)
Signed-off-by: Sean Owen <[email protected]>
… (Deprecated and Override)

## What changes were proposed in this pull request?

This PR adds rules for preventing the use of both Java's `@Deprecated` and `@Override` annotations.

- Java's `@Override`
  It seems the Scala compiler just ignores this annotation. Apparently, the `override` modifier is only mandatory for members "that override some other **concrete member definition** in a parent class" but not for an **incomplete member definition** (such as one from a trait or an abstract class); see (http://www.scala-lang.org/files/archive/spec/2.11/05-classes-and-objects.html#override).

  For a simple example,

  - Normal class - needs `override` modifier

  ```bash
  scala> class A { def say = {}}
  defined class A

  scala> class B extends A { def say = {}}
  <console>:8: error: overriding method say in class A of type => Unit;
   method say needs `override' modifier
         class B extends A { def say = {}}
                                 ^
  ```

  - Trait - does not need `override` modifier

  ```bash
  scala> trait A { def say }
  defined trait A

  scala> class B extends A { def say = {}}
  defined class B
  ```

  To cut this short, this case below is possible,

  ```bash
  scala> class B extends A {
       |    @Override
       |    def say = {}
       | }
  defined class B
  ```
  we can write the `@Override` annotation (which means nothing) and engineers might be confused into thinking that Java's annotation works here. It would be great if we prevent this potential confusion.

- Java's `@Deprecated`
  When `@Deprecated` is used, the Scala compiler recognises it correctly, but we use Scala's own `@deprecated` across the codebase.

## How was this patch tested?

Manually tested by inserting both `@Override` and `@Deprecated`. This shows the error messages below:

```bash
Scalastyle checks failed at following occurrences:
[error] ... : deprecated should be used instead of java.lang.Deprecated.
```

```bash
Scalastyle checks failed at following occurrences:
[error] ... : override modifier should be used instead of java.lang.Override.
```

Author: hyukjinkwon <[email protected]>

Closes #14490 from HyukjinKwon/SPARK-16877.

(cherry picked from commit 1d78157)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

Add thresholds length checking for classifiers that extend ProbabilisticClassifier.
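
A minimal sketch of the kind of check described (illustrative names, not the actual ProbabilisticClassifier code):

```scala
def validateThresholds(thresholds: Array[Double], numClasses: Int): Unit = {
  require(thresholds.length == numClasses,
    s"thresholds has length ${thresholds.length}, but the model expects $numClasses classes")
}
```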

## How was this patch tested?

unit tests and manual tests

Author: Zheng RuiFeng <[email protected]>

Closes #14470 from zhengruifeng/classifier_check_setThreshoulds_length.

(cherry picked from commit 0e2e5d7)
Signed-off-by: Sean Owen <[email protected]>
@asfgit asfgit closed this in 53e766c Aug 4, 2016