Pin tag 210 #21681
Closed
Conversation
## What changes were proposed in this pull request?
Add missing ParamValidations for ML algos
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <[email protected]>
Closes #15881 from zhengruifeng/arg_checking.
(cherry picked from commit c68f1a3)
Signed-off-by: Yanbo Liang <[email protected]>
## What changes were proposed in this pull request?
Add links to API docs for ML algos
## How was this patch tested?
Manual checking for the API links
Author: Zheng RuiFeng <[email protected]>
Closes #15890 from zhengruifeng/algo_link.
(cherry picked from commit a75e3fe)
Signed-off-by: Sean Owen <[email protected]>
Small fix: fix the errors caused by the lint check in Java
- Clear unused objects and `UnusedImports`.
- Add comments around the `finalize` method of `NioBufferedFileInputStream` to turn off checkstyle.
- Split the line that is longer than 100 characters into two lines.
Tested with Travis CI.
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
```
Before:
```
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] (imports) UnusedImports: Unused import - org.apache.commons.crypto.cipher.CryptoCipherFactory.
[ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] (modifier) RedundantModifier: Redundant 'public' modifier.
[ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] (coding) NoFinalizer: Avoid using finalizer method.
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] (sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[31,17] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] (regexp) RegexpSingleline: No trailing whitespace allowed.
```
After:
```
$ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
$ dev/lint-java
Using `mvn` from path: /home/travis/build/ConeyLiu/spark/build/apache-maven-3.3.9/bin/mvn
Checkstyle checks passed.
```
Author: Xianyang Liu <[email protected]>
Closes #15865 from ConeyLiu/master.
(cherry picked from commit 7569cf6)
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?
Currently, when CTE is used in RunnableCommand, the Analyzer does not replace the logical node `With`. The child plan of RunnableCommand is not resolved. Thus, the output of the `With` plan node looks very confusing.
For example,
```
sql(
"""
|CREATE VIEW cte_view AS
|WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
|SELECT n FROM w
""".stripMargin).explain()
```
The output is like
```
ExecutedCommand
+- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
SELECT n FROM w, false, false, PersistedView
+- 'With [(w,SubqueryAlias w
+- Project [1 AS n#16]
+- OneRowRelation$
), (cte1,'SubqueryAlias cte1
+- 'Project [unresolvedalias(2, None)]
+- OneRowRelation$
), (cte2,'SubqueryAlias cte2
+- 'Project [unresolvedalias(3, None)]
+- OneRowRelation$
)]
+- 'Project ['n]
+- 'UnresolvedRelation `w`
```
After the fix, the output is as shown below.
```
ExecutedCommand
+- CreateViewCommand `cte_view`, WITH w AS (SELECT 1 AS n), cte1 (select 2), cte2 as (select 3)
SELECT n FROM w, false, false, PersistedView
+- CTE [w, cte1, cte2]
: :- SubqueryAlias w
: : +- Project [1 AS n#16]
: : +- OneRowRelation$
: :- 'SubqueryAlias cte1
: : +- 'Project [unresolvedalias(2, None)]
: : +- OneRowRelation$
: +- 'SubqueryAlias cte2
: +- 'Project [unresolvedalias(3, None)]
: +- OneRowRelation$
+- 'Project ['n]
+- 'UnresolvedRelation `w`
```
BTW, this PR also fixes the output of the view type.
### How was this patch tested?
Manual
Author: gatorsmile <[email protected]>
Closes #15854 from gatorsmile/cteName.
(cherry picked from commit 608ecc5)
Signed-off-by: Herman van Hovell <[email protected]>
…atchId and add triggerDetails to json in StreamingQueryStatus
## What changes were proposed in this pull request?
SPARK-18459: triggerId seems like a number that should increase with each trigger, whether or not there is data in it. However, triggerId actually increases only when there is a batch of data in a trigger, so it's better to rename it to batchId.
SPARK-18460: triggerDetails was missing from the json representation. Fixed it.
## How was this patch tested?
Updated existing unit tests.
Author: Tathagata Das <[email protected]>
Closes #15895 from tdas/SPARK-18459.
(cherry picked from commit 0048ce7)
Signed-off-by: Shixiong Zhu <[email protected]>
… monitoring streaming queries
## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">
Author: Tathagata Das <[email protected]>
Closes #15897 from tdas/SPARK-18461.
(cherry picked from commit bb6cdfd)
Signed-off-by: Michael Armbrust <[email protected]>
…Service
## What changes were proposed in this pull request?
Suggest that users increase the `NodeManager's` heap size if the `External Shuffle Service` is enabled, as the `NM` can spend a lot of time doing GC, making shuffle operations a bottleneck because of increased `Shuffle Read blocked time`. Because of GC the `NodeManager` can also use an enormous amount of CPU, and cluster performance will suffer. I have seen a NodeManager using 5-13G of RAM and up to 2700% CPU with the `spark_shuffle` service on.
## How was this patch tested?
#### Added step 5:
Author: Artur Sukhenko <[email protected]>
Closes #15906 from Devian-ua/nmHeapSize.
(cherry picked from commit 5558998)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request? The nullability of `WrapOption` should be `false`. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <[email protected]> Closes #15887 from ueshin/issues/SPARK-18442. (cherry picked from commit 170eeb3) Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
This PR aims to provide a pip installable PySpark package. It does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JAR). It does not currently publish to PyPI, but that is the natural follow up (SPARK-18129).
Done:
- pip installable on conda [manual tested]
- setup.py installed on a non-pip managed system (RHEL) with YARN [manual tested]
- Automated testing of this (virtualenv)
- packaging and signing with release-build*
Possible follow up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone within the project, or should we ask PyPI, or should we choose a different name to publish with, like ApachePySpark?)
- Windows support and/or testing (SPARK-18136)
- investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test
- consider how we want to number our dev/snapshot versions
Explicitly out of scope:
- Using pip installed PySpark to start a standalone cluster
- Using pip installed PySpark for non-Python Spark programs
*I've done some work to test release-build locally, but as a non-committer I've just done local testing.
## How was this patch tested?
Automated testing with virtualenv, manual testing with conda, a system wide install, and YARN integration. release-build changes tested locally as a non-committer (no testing of uploading artifacts to Apache staging websites)
Author: Holden Karau <[email protected]>
Author: Juliet Hougland <[email protected]>
Author: Juliet Hougland <[email protected]>
Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.
…tastore
## What changes were proposed in this pull request?
Before Spark 2.1, users could create an external data source table without a schema, and we would infer the table schema at runtime. In Spark 2.1, we decided to infer the schema when the table is created, so that we don't need to infer it again and again at runtime. This is a good improvement, but we should still respect and support old tables which don't store the table schema in the metastore.
## How was this patch tested?
regression test.
Author: Wenchen Fan <[email protected]>
Closes #15900 from cloud-fan/hive-catalog.
(cherry picked from commit 07b3f04)
Signed-off-by: Reynold Xin <[email protected]>
…arn.md
## What changes were proposed in this pull request?
Remove `spark.driver.memory`, `spark.executor.memory`, `spark.driver.cores`, and `spark.executor.cores` from `running-on-yarn.md` as they are not Yarn-specific and are also defined in `configuration.md`.
## How was this patch tested?
Build passed & manually checked.
Author: Weiqing Yang <[email protected]>
Closes #15869 from weiqingy/yarnDoc.
(cherry picked from commit a3cac7b)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
I found the documentation for the sample method to be confusing; this adds more clarification across all languages.
- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python
## How was this patch tested?
NA
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
Author: anabranch <[email protected]>
Author: Bill Chambers <[email protected]>
Closes #15815 from anabranch/SPARK-18365.
(cherry picked from commit 49b6f45)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? Several places in MLlib use custom regexes or other approaches to parse Spark versions. Those should be fixed to use the VersionUtils. This PR replaces custom regexes with VersionUtils to get Spark version numbers. ## How was this patch tested? Existing tests. Signed-off-by: VinceShieh vincent.xieintel.com Author: VinceShieh <[email protected]> Closes #15055 from VinceShieh/SPARK-17462. (cherry picked from commit de77c67) Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
1. There are two `[Graph.partitionBy]` links in `graphx-programming-guide.md`; the first one had no effect.
2. `DataFrame`, `Transformer`, `Pipeline` and `Parameter` in `ml-pipeline.md` were linked to `ml-guide.html` by mistake.
3. `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessible, because class `PythonMLLibAPI` is private.
4. Other link updates.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <[email protected]>
Closes #15912 from zhengruifeng/md_fix.
(cherry picked from commit cdaf4ce)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
In ShuffleExchange, the node name's extraInfo is the same whether exchangeCoordinator.isEstimated is true or false. This PR merges the two cases.
Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
Closes #15920 from windpiger/DupNodeNameShuffleExchange.
(cherry picked from commit b0aa1aa)
Signed-off-by: Sean Owen <[email protected]>
…hould depend on the location of default database
## What changes were proposed in this pull request?
The current semantics of the warehouse config:
1. It's a static config, which means you can't change it once your Spark application is launched.
2. Once a database is created, its location won't change even if the warehouse path config is changed.
3. The default database is a special case: although its location is fixed, the locations of tables created in it are not. If a Spark app starts with warehouse path B (while the location of the default database is A), and users create a table `tbl` in the default database, its location will be `B/tbl` instead of `A/tbl`. If users change the warehouse path config to C and create another table `tbl2`, its location will still be `B/tbl2` instead of `C/tbl2`.
Rule 3 doesn't make sense and I think we made it by mistake, not intentionally. Data source tables don't follow rule 3 and treat the default database like normal ones. This PR fixes Hive serde tables to make them consistent with data source tables.
## How was this patch tested?
HiveSparkSubmitSuite
Author: Wenchen Fan <[email protected]>
Closes #15812 from cloud-fan/default-db.
(cherry picked from commit ce13c26)
Signed-off-by: Yin Huai <[email protected]>
…es event
## What changes were proposed in this pull request?
This patch fixes a `ClassCastException: java.lang.Integer cannot be cast to java.lang.Long` error which could occur in the HistoryServer while trying to process a deserialized `SparkListenerDriverAccumUpdates` event. The problem stems from how `jackson-module-scala` handles primitive type parameters (see https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges for more details). This was causing a problem where our code expected a field to be deserialized as a `(Long, Long)` tuple but we got an `(Int, Int)` tuple instead. This patch hacks around this issue by registering a custom `Converter` with Jackson in order to deserialize the tuples as `(Object, Object)` and perform the appropriate casting.
## How was this patch tested?
New regression tests in `SQLListenerSuite`.
Author: Josh Rosen <[email protected]>
Closes #15922 from JoshRosen/SPARK-18462.
(cherry picked from commit d9dd979)
Signed-off-by: Reynold Xin <[email protected]>
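As an aside, the unboxing failure described above can be reproduced without Jackson at all; the following Scala sketch (names and values are illustrative, not taken from the patch) shows why the boxed `Integer` only blows up when it is finally treated as a `Long`:
```scala
// Hedged sketch: simulate what a deserializer hands back for a Seq[(Long, Long)] field
// once the primitive type parameters have been erased to Object.
object ErasureSymptom {
  def main(args: Array[String]): Unit = {
    // The JSON numbers arrive as boxed java.lang.Integer, not java.lang.Long.
    val deserialized: Seq[(Any, Any)] = Seq((Integer.valueOf(1), Integer.valueOf(2)))

    // The declared field type claims (Long, Long); the cast itself is unchecked...
    val pairs = deserialized.asInstanceOf[Seq[(Long, Long)]]

    // ...so the failure only surfaces later, at unboxing:
    // java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    val total = pairs.map { case (a, b) => a + b }.sum
    println(total)
  }
}
```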
…terval" directly with user setting.
## What changes were proposed in this pull request?
CompactibleFileStreamLog relies on "compactInterval" to detect a compaction batch. If the "compactInterval" is reset by the user, CompactibleFileStreamLog will return a wrong answer, resulting in data loss. This PR provides a way to check the validity of 'compactInterval' and calculate an appropriate value.
## How was this patch tested?
When restarting a stream, we change 'spark.sql.streaming.fileSource.log.compactInterval' to a value different from the former one.
The primary solution to this issue was given by uncleGen. Added extensions include an additional metadata field in OffsetSeq and CompactibleFileStreamLog APIs. zsxwing
Author: Tyson Condie <[email protected]>
Author: genmao.ygm <[email protected]>
Closes #15852 from tcondie/spark-18187.
(cherry picked from commit 51baca2)
Signed-off-by: Shixiong Zhu <[email protected]>
… all columns when doing a simple count
## What changes were proposed in this pull request?
When reading zero columns (e.g., count(*)) from ORC or any other format that uses HiveShim, actually set the read column list to empty for Hive to use.
## How was this patch tested?
Query correctness is handled by existing unit tests. I'm happy to add more if anyone can point out some case that is not covered.
Reduction in data read can be verified in the UI when built with a recent version of Hadoop, say:
```
build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
```
However, the default Hadoop 2.2 that is used for unit tests does not report actual bytes read, just full file sizes (see FileScanRDD.scala line 80). Therefore I don't think there is a good way to add a unit test for this.
I tested with the following setup, using the above build options:
```
case class OrcData(intField: Long, stringField: String)
spark.range(1,1000000).map(i => OrcData(i, s"part-$i")).toDF().write.format("orc").save("orc_test")
sql(
s"""CREATE EXTERNAL TABLE orc_test(
| intField LONG,
| stringField STRING
|)
|STORED AS ORC
|LOCATION '${System.getProperty("user.dir") + "/orc_test"}'
""".stripMargin)
```
## Results
query | Spark 2.0.2 | this PR
---|---|---
`sql("select count(*) from orc_test").collect`|4.4 MB|199.4 KB
`sql("select intField from orc_test").collect`|743.4 KB|743.4 KB
`sql("select * from orc_test").collect`|4.4 MB|4.4 MB
Author: Andrew Ray <[email protected]>
Closes #15898 from aray/sql-orc-no-col.
(cherry picked from commit 795e9fc)
Signed-off-by: Reynold Xin <[email protected]>
…aAPISuite
## What changes were proposed in this pull request?
This PR fixes the test `wholeTextFiles` in `JavaAPISuite.java`. It failed due to the different path format on Windows. For example, the path in `container` was
```
C:\projects\spark\target\tmp\1478967560189-0/part-00000
```
whereas `new URI(res._1()).getPath()` was as below:
```
/C:/projects/spark/target/tmp/1478967560189-0/part-00000
```
## How was this patch tested?
Tests in `JavaAPISuite.java`. Tested via AppVeyor.
**Before**
Build: https://ci.appveyor.com/project/spark-test/spark/build/63-JavaAPISuite-1
Diff: master...spark-test:JavaAPISuite-1
```
[info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
[error] Test org.apache.spark.JavaAPISuite.wholeTextFiles failed: java.lang.AssertionError: expected:<spark is easy to use.
[error] > but was:<null>, took 0.578 sec
[error] at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
...
```
**After**
Build started: [CORE] `org.apache.spark.JavaAPISuite` [](https://ci.appveyor.com/project/spark-test/spark/branch/198DDA52-F201-4D2B-BE2F-244E0C1725B2)
Diff: master...spark-test:198DDA52-F201-4D2B-BE2F-244E0C1725B2
```
[info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
...
```
Author: hyukjinkwon <[email protected]>
Closes #15866 from HyukjinKwon/SPARK-18422.
(cherry picked from commit 40d59ff)
Signed-off-by: Sean Owen <[email protected]>
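For illustration, the mismatch boils down to comparing the two path strings quoted above; a small Scala sketch reusing those literal paths makes the difference obvious:
```scala
// The key stored in the container is a Windows-style path, while URI.getPath
// returns a forward-slash path with a leading "/", so a direct comparison fails.
val containerPath = """C:\projects\spark\target\tmp\1478967560189-0/part-00000"""
val uriPath = new java.net.URI(
  "file:/C:/projects/spark/target/tmp/1478967560189-0/part-00000").getPath
println(containerPath)  // C:\projects\spark\target\tmp\1478967560189-0/part-00000
println(uriPath)        // /C:/projects/spark/target/tmp/1478967560189-0/part-00000
assert(containerPath != uriPath)
```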
## What changes were proposed in this pull request? HDFS `write` may just hang until timeout if some network error happens. It's better to enable interrupts to allow stopping the query fast on HDFS. This PR just changes the logic to only disable interrupts for local file system, as HADOOP-10622 only happens for local file system. ## How was this patch tested? Jenkins Author: Shixiong Zhu <[email protected]> Closes #15911 from zsxwing/interrupt-on-dfs. (cherry picked from commit e5f5c29) Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?
I'm spending more time at the design & code level for the cost-based optimizer now, and have found a number of issues related to maintainability and compatibility that I would like to address. This is a small pull request to clean up AnalyzeColumnCommand:
1. Removed warning on duplicated columns. Warnings in log messages are useless since most users that run SQL don't see them.
2. Removed the nested updateStats function, by just inlining the function.
3. Renamed a few functions to better reflect what they do.
4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern to use an apply method that returns an instantiation of a class that is not of the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
6. Added more documentation explaining some of the non-obvious return types and code blocks.
In follow-up pull requests, I'd like to address the following:
1. Get rid of the Map[String, ColumnStat] map, since internally we should be using Attribute to reference columns, rather than strings.
2. Decouple the fields exposed by ColumnStat and the internals of Spark SQL's execution path. Currently the two are coupled because ColumnStat takes in an InternalRow.
3. Correctness: Remove the code path that stores statistics in the catalog using the base64 encoding of the UnsafeRow format, which is not stable across Spark versions.
4. Clearly document the data representation stored in the catalog for statistics.
## How was this patch tested?
Affected test cases have been updated.
Author: Reynold Xin <[email protected]>
Closes #15933 from rxin/SPARK-18505.
(cherry picked from commit 6f7ff75)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
The issue in ForeachSink is that the newly created Dataset still uses the old QueryExecution. When `foreachPartition` is called, `QueryExecution.toString` will be called and then fail because it doesn't know how to plan EventTimeWatermark.
This PR just replaces the QueryExecution with IncrementalExecution to fix the issue.
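For context, here is a hedged usage sketch of the combination that used to fail: a watermarked aggregation consumed through the foreach sink. It assumes a Spark build that ships the built-in `rate` test source; all names and options here are illustrative, not taken from the patch.
```scala
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object ForeachWithWatermark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreach-watermark").getOrCreate()

    // Windowed aggregation with a watermark...
    val counts = spark.readStream
      .format("rate")                      // test source emitting (timestamp, value) rows
      .option("rowsPerSecond", 5)
      .load()
      .withWatermark("timestamp", "10 seconds")
      .groupBy(window(col("timestamp"), "10 seconds"))
      .count()

    // ...consumed through the foreach sink, which is where planning used to fail.
    val query = counts.writeStream
      .outputMode("update")
      .foreach(new ForeachWriter[Row] {
        override def open(partitionId: Long, version: Long): Boolean = true
        override def process(row: Row): Unit = println(row)
        override def close(errorOrNull: Throwable): Unit = ()
      })
      .start()

    query.awaitTermination()
  }
}
```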
## How was this patch tested?
`test("foreach with watermark")`.
Author: Shixiong Zhu <[email protected]>
Closes #15934 from zsxwing/SPARK-18497.
(cherry picked from commit 2a40de4)
Signed-off-by: Tathagata Das <[email protected]>
…able like JavaSparkContext ## What changes were proposed in this pull request? Just adds `close()` + `Closeable` as a synonym for `stop()`. This makes it usable in Java in try-with-resources, as suggested by ash211 (`Closeable` extends `AutoCloseable` BTW) ## How was this patch tested? Existing tests Author: Sean Owen <[email protected]> Closes #15932 from srowen/SPARK-18448. (cherry picked from commit db9fb9b) Signed-off-by: Sean Owen <[email protected]>
… that`/`'''Note:'''` across Scala/Java API documentation
It seems that in Scala/Java the following forms are used:
- `Note:`
- `NOTE:`
- `Note that`
- `'''Note:'''`
- `note`
This PR proposes to fix those to `note` to be consistent.
**Before**
- Scala
- Java
**After**
- Scala
- Java
The notes were found via
```bash
grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// NOTE: " | \ # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
-e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
-e 'org.apache.spark.api.r' \
...
```
```bash
grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note that " | \ # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```
```bash
grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note: " | \ # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```
```bash
grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// '''Note:''' " | \ # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```
And then fixed one by one, comparing with the API documentation/access modifiers. After that, manually tested via `jekyll build`.
Author: hyukjinkwon <[email protected]>
Closes #15889 from HyukjinKwon/SPARK-18437.
(cherry picked from commit d5b1d5f)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? Avoid hard-coding spark.rpc.askTimeout to non-default in Client; fix doc about spark.rpc.askTimeout default ## How was this patch tested? Existing tests Author: Sean Owen <[email protected]> Closes #15833 from srowen/SPARK-18353. (cherry picked from commit 8b1e108) Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? Fix since 2.1.0 on new SparkSession.close() method. I goofed in #15932 because it was back-ported to 2.1 instead of just master as originally planned. Author: Sean Owen <[email protected]> Closes #15938 from srowen/SPARK-18448.2. (cherry picked from commit ded5fef) Signed-off-by: Sean Owen <[email protected]>
…n LogisticRegression training
## What changes were proposed in this pull request?
This is a follow up to some of the discussion [here](#15593). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use a matrix where possible, which makes the code more readable and greatly simplifies the indexing.
Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.
## How was this patch tested?
This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.
Author: sethah <[email protected]>
Closes #15893 from sethah/logreg_refactor.
(cherry picked from commit 856e004)
Signed-off-by: DB Tsai <[email protected]>
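To see why the matrix form simplifies things, here is a small hedged sketch (the dimensions and row-major layout are illustrative, not the exact internals): the flat vector forces every call site to recompute an offset by hand, while the matrix carries its own shape.
```scala
import org.apache.spark.ml.linalg.{DenseMatrix, Vectors}

// Illustrative dimensions: 3 classes, 4 features plus 1 intercept column.
val numClasses = 3
val numFeaturesPlusIntercept = 5

val flat = Vectors.dense(Array.tabulate(numClasses * numFeaturesPlusIntercept)(_.toDouble))

// Flat-vector access: the offset arithmetic is repeated at every call site.
def flatCoef(classIdx: Int, featureIdx: Int): Double =
  flat(classIdx * numFeaturesPlusIntercept + featureIdx)

// Matrix access: the same values, but the indexing is explicit and harder to get wrong.
val coefMatrix =
  new DenseMatrix(numClasses, numFeaturesPlusIntercept, flat.toArray, isTransposed = true)
assert(flatCoef(2, 3) == coefMatrix(2, 3))
```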
…ion in RadixSort.java
## What changes were proposed in this pull request?
This PR prevents the result of an expression from becoming negative due to signed integer overflow (e.g. 0x10?????? * 8 < 0). This PR casts each operand to `long` before executing the calculation. Since the result is interpreted as a long, the result of the expression is positive.
## How was this patch tested?
Manually executed query82 of TPC-DS with 100TB
Author: Kazuaki Ishizaki <[email protected]>
Closes #15907 from kiszk/SPARK-18458.
(cherry picked from commit d93b655)
Signed-off-by: Reynold Xin <[email protected]>
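The overflow itself is easy to demonstrate in a few lines of Scala (the constant is just an example in the spirit of the one above):
```scala
val offset = 0x10000000               // 268,435,456
val wrapped: Int  = offset * 8        // 2^31 wraps around to Int.MinValue: -2147483648
val widened: Long = offset.toLong * 8 // widening one operand first keeps it positive: 2147483648
```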
## What changes were proposed in this pull request? The previous documentation and example for DateDiff was wrong. ## How was this patch tested? Doc only change. Author: Reynold Xin <[email protected]> Closes #15937 from rxin/datediff-doc. (cherry picked from commit bce9a03) Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
Move the checking of the GROUP BY column in a correlated scalar subquery from CheckAnalysis to Analysis to fix a regression caused by SPARK-18504.
This problem can be reproduced with a simple script now.
Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show
The requirements are:
1. We need to reference the same table twice in both the parent and the subquery. Here, that is the table c.
2. We need to have a correlated predicate, but to a different table. Here, that is from c (as c1) in the subquery to p in the parent.
3. We will then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at the `Project` above the `Aggregate` of `avg`. Then when we compare `ck#<n1>#<n2>` and the original group by column `ck#<n1>` by their canonicalized form, #<n2> != #<n1>. That's how we trigger the exception added in SPARK-18504.
## How was this patch tested?
SubquerySuite and a simplified version of TPCDS-Q32
Author: Nattavut Sutyanyong <[email protected]>
Closes #16246 from nsyca/18814.
(cherry picked from commit cccd643)
Signed-off-by: Herman van Hovell <[email protected]>
…le output page to GitHub
## What changes were proposed in this pull request?
Currently, the full console output page of a Spark Jenkins PR build can be as large as several megabytes. It takes a relatively long time to load and may even freeze the browser for quite a while. This PR makes the build script post the test report page link to GitHub instead. The test report page is way more concise and is usually the first page I'd like to check when investigating a Jenkins build failure. Note that for builds where a test report is not available (ongoing builds and builds that fail before test execution), the test report link automatically redirects to the build page.
## How was this patch tested?
N/A.
Author: Cheng Lian <[email protected]>
Closes #16163 from liancheng/jenkins-test-report.
(cherry picked from commit ba4aab9)
Signed-off-by: Reynold Xin <[email protected]>
…-side post-filter for FileFormat datasources
## What changes were proposed in this pull request?
Currently, `FileSourceStrategy` does not handle the case when the pushed-down filter is `Literal(null)` and removes it in the Spark-side post-filter.
For example, the codes below:
```scala
val df = Seq(Tuple1(Some(true)), Tuple1(None), Tuple1(Some(false))).toDF()
df.filter($"_1" === "true").explain(true)
```
shows it keeps `null` properly.
```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- LocalRelation [_1#17]
== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#17 as double) = cast(true as double))
+- LocalRelation [_1#17]
== Optimized Logical Plan ==
Filter (isnotnull(_1#17) && null)
+- LocalRelation [_1#17]
== Physical Plan ==
*Filter (isnotnull(_1#17) && null) << Here `null` is there
+- LocalTableScan [_1#17]
```
However, when we read it back from Parquet,
```scala
val path = "/tmp/testfile"
df.write.parquet(path)
spark.read.parquet(path).filter($"_1" === "true").explain(true)
```
`null` is removed at the post-filter.
```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- Relation[_1#11] parquet
== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#11 as double) = cast(true as double))
+- Relation[_1#11] parquet
== Optimized Logical Plan ==
Filter (isnotnull(_1#11) && null)
+- Relation[_1#11] parquet
== Physical Plan ==
*Project [_1#11]
+- *Filter isnotnull(_1#11) << Here `null` is missing
+- *FileScan parquet [_1#11] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/tmp/testfile], PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
```
This PR fixes this so the `null` filter is kept properly. In more detail,
```scala
val partitionKeyFilters =
ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
```
This keeps this `null` in `partitionKeyFilters` because a `Literal` never has `children`, so its `references` set is empty, which is always a subset of `partitionSet`.
And then in
```scala
val afterScanFilters = filterSet -- partitionKeyFilters
```
`null` is always removed from the post-filter. So, if the referenced fields are empty, the filter should be applied to the data columns too.
After this PR, it becomes as below:
```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- Relation[_1#276] parquet
== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#276 as double) = cast(true as double))
+- Relation[_1#276] parquet
== Optimized Logical Plan ==
Filter (isnotnull(_1#276) && null)
+- Relation[_1#276] parquet
== Physical Plan ==
*Project [_1#276]
+- *Filter (isnotnull(_1#276) && null)
+- *FileScan parquet [_1#276] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-a5d59bdb-5b..., PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
```
## How was this patch tested?
Unit test in `FileSourceStrategySuite`
Author: hyukjinkwon <[email protected]>
Closes #16184 from HyukjinKwon/SPARK-18753.
(cherry picked from commit 89ae26d)
Signed-off-by: Cheng Lian <[email protected]>
…ating statistics
## What changes were proposed in this pull request?
This patch reduces the default element-count estimation for arrays and maps from 100 to 1. The issue with the number 100 is that when nested (e.g. an array of maps), 100 * 100 would be used as the default size. This sounds like just an overestimation, which doesn't seem that bad (since it is usually better to overestimate than underestimate). However, due to the way we compute the output size for Project (new estimated column size / old estimated column size), this overestimation can become an underestimation. In this case it is actually safer in general to assume 1 default element.
## How was this patch tested?
This should be covered by existing tests.
Author: Reynold Xin <[email protected]>
Closes #16274 from rxin/SPARK-18853.
(cherry picked from commit 5d79947)
Signed-off-by: Herman van Hovell <[email protected]>
…entProgress is empty
## What changes were proposed in this pull request?
Right now `StreamingQuery.lastProgress` throws NoSuchElementException and it's hard to use from Python, since Python users will just see a Py4JError.
This PR just makes it return null instead.
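With that change, a caller can simply null-check instead of catching the exception; a hedged sketch, assuming `query` is an already-started `StreamingQuery`:
```scala
// query: org.apache.spark.sql.streaming.StreamingQuery, started elsewhere.
val progress = query.lastProgress
if (progress != null) {
  println(progress.prettyJson)          // latest StreamingQueryProgress as JSON
} else {
  println("no progress reported yet")   // instead of NoSuchElementException / Py4JError
}
```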
## How was this patch tested?
`test("lastProgress should be null when recentProgress is empty")`
Author: Shixiong Zhu <[email protected]>
Closes #16273 from zsxwing/SPARK-18852.
(cherry picked from commit 1ac6567)
Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request? Added short section for KSTest. Also added logreg model to list of ML models in vignette. (This will be reorganized under SPARK-18849)  ## How was this patch tested? Manually tested example locally. Built vignettes locally. Author: Joseph K. Bradley <[email protected]> Closes #16283 from jkbradley/ksTest-vignette. (cherry picked from commit 7862742) Signed-off-by: Joseph K. Bradley <[email protected]>
…ubqueries ## What changes were proposed in this pull request? This is a bug introduced by subquery handling. numberedTreeString (which uses generateTreeString under the hood) numbers trees including innerChildren (used to print subqueries), but apply (which uses getNodeNumbered) ignores innerChildren. As a result, apply(i) would return the wrong plan node if there are subqueries. This patch fixes the bug. ## How was this patch tested? Added a test case in SubquerySuite.scala to test both the depth-first traversal of numbering as well as making sure the two methods are consistent. Author: Reynold Xin <[email protected]> Closes #16277 from rxin/SPARK-18854. (cherry picked from commit ffdd1fc) Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
While doing the QA work, I found the following issues:
1) `spark.mlp` doesn't include an example;
2) `spark.mlp` and `spark.lda` have redundant parameter explanations;
3) `spark.lda` documentation misses default values for some parameters.
I also changed the `spark.logit` regParam in the examples, as we discussed in #16222.
## How was this patch tested?
Manual test
Author: [email protected] <[email protected]>
Closes #16284 from wangmiao1981/ks.
(cherry picked from commit 3243885)
Signed-off-by: Felix Cheung <[email protected]>
… size
## What changes were proposed in this pull request?
In `DataSource`, if the table is not analyzed, we will use 0 as the default value for the table size. This is dangerous: we may broadcast a large table and cause an OOM. We should use `defaultSizeInBytes` instead.
## How was this patch tested?
new regression test
Author: Wenchen Fan <[email protected]>
Closes #16280 from cloud-fan/bug.
(cherry picked from commit d6f11a1)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request? After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_] rather than a more specific type. It would be easier for interactive debugging to introduce a function that returns the BaseType. ## How was this patch tested? N/A - this is a developer only feature used for interactive debugging. As long as it compiles, it should be good to go. I tested this in spark-shell. Author: Reynold Xin <[email protected]> Closes #16288 from rxin/SPARK-18869. (cherry picked from commit 5d510c6) Signed-off-by: Reynold Xin <[email protected]>
…IPTION` file ## What changes were proposed in this pull request? Since Apache Spark 1.4.0, R API document page has a broken link on `DESCRIPTION file` because Jekyll plugin script doesn't copy the file. This PR aims to fix that. - Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html - Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html ## How was this patch tested? Manual. ```bash cd docs SKIP_SCALADOC=1 jekyll build ``` Author: Dongjoon Hyun <[email protected]> Closes #16292 from dongjoon-hyun/SPARK-18875. (cherry picked from commit ec0eae4) Signed-off-by: Shivaram Venkataraman <[email protected]>
## What changes were proposed in this pull request? doc cleanup ## How was this patch tested? ~~vignettes is not building for me. I'm going to kick off a full clean build and try again and attach output here for review.~~ Output html here: https://felixcheung.github.io/sparkr-vignettes.html Author: Felix Cheung <[email protected]> Closes #16286 from felixcheung/rvignettespass. (cherry picked from commit 7d858bc) Signed-off-by: Shivaram Venkataraman <[email protected]>
## What changes were proposed in this pull request? Check whether Aggregation operators on a streaming subplan have aggregate expressions with isDistinct = true. ## How was this patch tested? Added unit test Author: Tathagata Das <[email protected]> Closes #16289 from tdas/SPARK-18870. (cherry picked from commit 4f7292c) Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?
When starting a stream with a lot of backfill and maxFilesPerTrigger, the user may often want to start with the most recent files first. This would let you keep low latency for recent data and slowly backfill historical data. This PR adds a new option `latestFirst` to control this behavior. When it's true, `FileStreamSource` will sort the files by modified time from latest to oldest, and take the first `maxFilesPerTrigger` files as a new batch.
## How was this patch tested?
The added test.
Author: Shixiong Zhu <[email protected]>
Closes #16251 from zsxwing/newest-first.
(cherry picked from commit 68a6dc9)
Signed-off-by: Tathagata Das <[email protected]>
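A hedged usage sketch of the new option (the schema, path, and numbers are illustrative):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("latest-first").getOrCreate()

val eventSchema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)

// Process the newest files first while older files are slowly backfilled.
val events = spark.readStream
  .format("json")
  .schema(eventSchema)
  .option("maxFilesPerTrigger", "100")  // cap on files per micro-batch
  .option("latestFirst", "true")        // the new option described above
  .load("/data/events")                 // hypothetical input directory
```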
…q not defined ## What changes were proposed in this pull request? `_to_seq` wasn't imported. ## How was this patch tested? Added partitionBy to existing write path unit test Author: Burak Yavuz <[email protected]> Closes #16297 from brkyvz/SPARK-18888.
… listener, check trigger... ## What changes were proposed in this pull request? Use `recentProgress` instead of `lastProgress` and filter out last non-zero value. Also add eventually to the latest assertQuery similar to first `assertQuery` ## How was this patch tested? Ran test 1000 times Author: Burak Yavuz <[email protected]> Closes #16287 from brkyvz/SPARK-18868. (cherry picked from commit 9c7f83b) Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request?
For release builds the R_PACKAGE_VERSION and VERSION are the same (e.g., 2.1.0). Thus `cp` throws an error, which causes the build to fail.
## How was this patch tested?
Manually, by executing the following script
```
set -o pipefail
set -e
set -x
touch a
R_PACKAGE_VERSION=2.1.0
VERSION=2.1.0
if [ "$R_PACKAGE_VERSION" != "$VERSION" ]; then
  cp a a
fi
```
Author: Shivaram Venkataraman <[email protected]>
Closes #16299 from shivaram/sparkr-cp-fix.
(cherry picked from commit 9634018)
Signed-off-by: Reynold Xin <[email protected]>
Follow up to ae853e8 as `mv` throws an error on the Jenkins machines if source and destinations are the same. Author: Shivaram Venkataraman <[email protected]> Closes #16302 from shivaram/sparkr-no-mv-fix. (cherry picked from commit 5a44f18) Signed-off-by: Shivaram Venkataraman <[email protected]>
Can one of the admins verify this patch?
Member
@zhangchj1990 Looks mistakenly open. Mind closing this please?
Member
Close this @zhangchj1990
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request on Nov 22, 2025:
Closes apache#20932 Closes apache#17843 Closes apache#13477 Closes apache#14291 Closes apache#20919 Closes apache#17907 Closes apache#18766 Closes apache#20809 Closes apache#8849 Closes apache#21076 Closes apache#21507 Closes apache#21336 Closes apache#21681 Closes apache#21691 Author: Sean Owen <[email protected]> Closes apache#21708 from srowen/CloseStalePRs.
What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.