forked from apache/spark

sync #11

Merged
### What changes were proposed in this pull request?

This PR disables week-based date fields for parsing.

Closes #28674

### Why are the changes needed?

1. It's an un-fixable behavior change to fill the gap between SimpleDateFormat and DateTimeFormatter and keep backward compatibility across different JDKs. A lot of effort has been made to prove it at #28674.
2. The existing behavior itself in 2.4 is confusing, e.g.
```sql
spark-sql> select to_timestamp('1', 'w');
1969-12-28 00:00:00
spark-sql> select to_timestamp('1', 'u');
1970-01-05 00:00:00
```
The 'u' here does not go to the Monday of the first week in the week-based form, or to the first day of the year in the non-week-based form, but to the Monday of the second week in the week-based form. And, e.g.
```sql
spark-sql> select to_timestamp('2020 2020', 'YYYY yyyy');
2020-01-01 00:00:00
spark-sql> select to_timestamp('2020 2020', 'yyyy YYYY');
2019-12-29 00:00:00
spark-sql> select to_timestamp('2020 2020 1', 'YYYY yyyy w');
NULL
spark-sql> select to_timestamp('2020 2020 1', 'yyyy YYYY w');
2019-12-29 00:00:00
```
I think we don't need to introduce all this weird behavior from Java.
3. The current test coverage for week-based date fields is almost 0%, which indicates that we've never imagined using them.
4. It avoids JDK bugs: https://issues.apache.org/jira/browse/SPARK-31880

### Does this PR introduce _any_ user-facing change?

Yes, the 'Y/W/w/u/F/E' pattern letters cannot be used in datetime parsing functions.

### How was this patch tested?

More tests added.

Closes #28706 from yaooqinn/SPARK-31892.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
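To illustrate the kind of guard this introduces, here is a hypothetical sketch (not Spark's actual code) of rejecting week-based pattern letters before a pattern is used for parsing; the letter set mirrors the one listed in the user-facing change note above, and quoted literals are skipped.

```scala
object WeekBasedPatternCheck {
  private val weekBasedLetters = Set('Y', 'W', 'w', 'u', 'F', 'E')

  /** Throws if the pattern contains a week-based letter outside quoted text. */
  def validate(pattern: String): Unit = {
    var inQuoted = false
    pattern.foreach { c =>
      if (c == '\'') inQuoted = !inQuoted
      else if (!inQuoted && weekBasedLetters.contains(c)) {
        throw new IllegalArgumentException(
          s"week-based pattern letter '$c' is not allowed for parsing: $pattern")
      }
    }
  }
}

// WeekBasedPatternCheck.validate("yyyy-MM-dd")  // ok
// WeekBasedPatternCheck.validate("YYYY-ww")     // throws
```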
### What changes were proposed in this pull request?

This PR fixes a wrong coloring issue in the DAG-viz.

In the Job Page and Stage Page, nodes which are associated with "barrier mode" in the DAG-viz are colored pale green. But with some types of jobs, nodes which are not associated with the mode are also colored. You can reproduce it with the following operation.

```
sc.parallelize(1 to 10).barrier.mapPartitions(identity).repartition(1).collect()
```

<img width="376" alt="wrong-coloring" src="https://user-images.githubusercontent.com/4736016/83403670-1711df00-a444-11ea-9457-c683f75bc566.png">

In the screenshot above, `repartition` in `Stage 1` is not associated with barrier mode, so the corresponding node should not be colored pale green.

The cause of this issue is that the logic which chooses the HTML elements to be colored is wrong. The logic chooses such elements based on whether each element is associated with a style class (`clusterId` in the code). But when an operation crosses a shuffle (like `repartition` above), a `clusterId` can be duplicated, and a non-barrier-mode node gets associated with the same `clusterId`.

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Newly added test case with the following command.

```
build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -- -z SPARK-31886"
```

Closes #28694 from sarutak/fix-wrong-barrier-color.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
… K8S IT

### What changes were proposed in this pull request?

This PR aims to activate the `hadoop-2.7` profile by default in the Kubernetes IT module.

### Why are the changes needed?

While SPARK-31881 added Hadoop 3.2 support, one default test dependency was moved to the `hadoop-2.7` profile. It works when we give one of `hadoop-2.7` and `hadoop-3.2`, but it fails when we don't give any profile.

**BEFORE**
```
$ mvn test-compile -pl resource-managers/kubernetes/integration-tests -Pkubernetes-integration-tests
...
[ERROR] [Error] /APACHE/spark-merge/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:23: object amazonaws is not a member of package com
```

**AFTER**
```
$ mvn test-compile -pl resource-managers/kubernetes/integration-tests -Pkubernetes-integration-tests
..
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
```

The default activated profile is overridden when we give `hadoop-3.2`.

```
$ mvn help:active-profiles -Pkubernetes-integration-tests
...
Active Profiles for Project 'org.apache.spark:spark-kubernetes-integration-tests_2.12:jar:3.1.0-SNAPSHOT':

The following profiles are active:

 - hadoop-2.7 (source: org.apache.spark:spark-kubernetes-integration-tests_2.12:3.1.0-SNAPSHOT)
 - kubernetes-integration-tests (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
 - test-java-home (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
```

```
$ mvn help:active-profiles -Pkubernetes-integration-tests -Phadoop-3.2
...
Active Profiles for Project 'org.apache.spark:spark-kubernetes-integration-tests_2.12:jar:3.1.0-SNAPSHOT':

The following profiles are active:

 - hadoop-3.2 (source: org.apache.spark:spark-kubernetes-integration-tests_2.12:3.1.0-SNAPSHOT)
 - hadoop-3.2 (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
 - kubernetes-integration-tests (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
 - test-java-home (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins UT and IT. Currently, all Jenkins builds and tests (UT & IT) pass without this patch, so this should be tested manually with the above command. The `hadoop-3.2` K8s IT also passed like the following.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
Run completed in 8 minutes, 33 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #28716 from dongjoon-hyun/SPARK-31881-2.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

1. Replace `def dateFormatter` with `val dateFormatter`.
2. Modify the `date formatting in hive result` test in `HiveResultSuite` to check the modified code on various time zones.

### Why are the changes needed?

To avoid the creation of a `DateFormatter` for every incoming date in `HiveResult.toHiveString`. This should eliminate unnecessary creation of `SimpleDateFormat` instances and compilation of the default pattern `yyyy-MM-dd`. The change can speed up processing of legacy date values of the `java.sql.Date` type, which is collected by default.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified a test in `HiveResultSuite`.

Closes #28687 from MaxGekk/HiveResult-val-dateFormatter.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
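For the change above, a hedged sketch of why `val` (or `lazy val`) helps: with `def`, every call constructs and compiles a new formatter, while a `val` is built once and reused. `SimpleDateFormat` stands in for Spark's internal `DateFormatter` here; names are illustrative.

```scala
import java.sql.Date
import java.text.SimpleDateFormat

object FormatterSketch {
  // "Before": a new formatter (and pattern compilation) per call.
  def dateFormatterDef: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")

  // "After": one formatter instance, created lazily and reused for every value.
  lazy val dateFormatterVal: SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")

  def main(args: Array[String]): Unit = {
    val d = new Date(0L)
    println(dateFormatterDef.format(d)) // builds a formatter on each call
    println(dateFormatterVal.format(d)) // reuses the single cached instance
  }
}
```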
### What changes were proposed in this pull request?
This PR sets the hour to 12/0 when the AMPM_OF_DAY field exists.
### Why are the changes needed?
When the hour is absent but the am-pm marker is present, the parsed time is incorrect for PM.
### Does this PR introduce _any_ user-facing change?
Yes, the change is user-facing, but it restores the 2.4 behavior to keep backward compatibility,
e.g.
```sql
spark-sql> select to_timestamp('33:33 PM', 'mm:ss a');
1970-01-01 12:33:33
spark-sql> select to_timestamp('33:33 AM', 'mm:ss a');
1970-01-01 00:33:33
```
Otherwise, the results above would both be `1970-01-01 00:33:33`.
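To make the defaulting concrete, here is a hedged `java.time` sketch of deriving the hour from the am-pm marker when no hour field was parsed; it mirrors the described fix conceptually, but it is not Spark's actual helper.

```scala
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField
import java.util.Locale

object AmPmDefaultSketch {
  def main(args: Array[String]): Unit = {
    val formatter = DateTimeFormatter.ofPattern("mm:ss a", Locale.US)
    val parsed = formatter.parse("33:33 PM")

    // If no hour was parsed, fall back to 12 for PM and 0 for AM.
    val hour =
      if (parsed.isSupported(ChronoField.HOUR_OF_DAY)) parsed.get(ChronoField.HOUR_OF_DAY)
      else if (parsed.isSupported(ChronoField.AMPM_OF_DAY)) parsed.get(ChronoField.AMPM_OF_DAY) * 12
      else 0

    println(hour) // 12 for "PM", 0 for "AM"
  }
}
```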
### How was this patch tested?
Added unit tests.
Closes #28713 from yaooqinn/SPARK-31896.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…rmatters" This reverts commit c59f51b.
… with fresh attribute IDs

### What changes were proposed in this pull request?

This is a followup of #26589, which caches the table relations to speed up table lookup. However, it brings a side effect: the rule `ResolveRelations` may return exactly the same relations, while before it always returned relations with fresh attribute IDs. This PR eliminates that side effect.

### Why are the changes needed?

There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return fresh relations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #28717 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
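As a concrete case where fresh attribute IDs matter, here is a spark-shell style sketch of a self-join (the table name `events` and its `id` column are hypothetical): if both branches resolve to the very same relation instance, the two sides share attribute IDs and the column references in the join condition can become ambiguous.

```scala
import org.apache.spark.sql.functions.col

// spark-shell sketch; "events" is a hypothetical table with an "id" column.
val df = spark.table("events")

// Both branches of this self-join resolve to the same cached relation; without
// fresh attribute IDs per resolution, "a.id" and "b.id" refer to identical attributes.
val joined = df.as("a").join(df.as("b"), col("a.id") === col("b.id"))
joined.explain()
```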
…ion guide

### What changes were proposed in this pull request?

The PySpark Migration Guide needs to mention a breaking change to the PySpark ML API.

### Why are the changes needed?

In SPARK-29093, all setters were removed from the `Params` mixins in `pyspark.ml.param.shared`. Those setters had been part of the public PySpark ML API, hence this is a breaking change.

### Does this PR introduce _any_ user-facing change?

Only documentation.

### How was this patch tested?

Visually.

Closes #28663 from EnricoMi/branch-pyspark-migration-guide-setters.

Authored-by: Enrico Minack <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?

Enable `date.sql` and run it via Thrift Server in `ThriftServerQueryTestSuite`.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By running the enabled tests via:

```
$ build/sbt -Phive-thriftserver "hive-thriftserver/test-only *ThriftServerQueryTestSuite -- -z date.sql"
```

Closes #28721 from MaxGekk/enable-date.sql-for-thrift.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…w metrics in Query UI

### What changes were proposed in this pull request?

In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics. We should call `serveToStream` first, then `withAction` inside it.

### Why are the changes needed?

When calling toPandas, the Query UI usually shows each plan node's metrics and the corresponding Stage ID and Task ID:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

But if Arrow execution is enabled, it shows only the plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

Closes #28730 from ueshin/issues/SPARK-31903/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
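The underlying pitfall can be shown with a small self-contained Scala sketch, using plain threads and a hypothetical `timed` helper rather than Spark's `withAction`/`serveToStream`: wrapping instrumentation around code that merely starts a thread measures almost nothing, so the measurement has to wrap the work that actually runs inside the thread.

```scala
object TimingSketch {
  // Hypothetical helper standing in for an instrumented action wrapper.
  def timed[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    try body
    finally println(f"$label took ${(System.nanoTime() - start) / 1e6}%.1f ms")
  }

  def main(args: Array[String]): Unit = {
    // "Before": the timed block returns as soon as the thread is started (~0 ms).
    val t1 = new Thread(new Runnable { def run(): Unit = Thread.sleep(200) })
    timed("around thread start")(t1.start())
    t1.join()

    // "After": the measurement wraps the actual work inside the thread (~200 ms).
    val t2 = new Thread(new Runnable {
      def run(): Unit = timed("inside thread")(Thread.sleep(200))
    })
    t2.start()
    t2.join()
  }
}
```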
### What changes were proposed in this pull request?

This PR proposes to add one more newline to clearly separate JVM and Python tracebacks.

Before:

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
JVM stacktrace:
org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
  ...
```

After:

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;

JVM stacktrace:
org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
  ...
```

This is kind of a followup of e694660 (SPARK-31849).

### Why are the changes needed?

To make it easier to read.

### Does this PR introduce _any_ user-facing change?

It's in the unreleased branches.

### How was this patch tested?

Manually tested.

Closes #28732 from HyukjinKwon/python-minor.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
This reverts commit a4195d2.
…ormatting too

### What changes were proposed in this pull request?

After all these attempts (#28692, #28719, and #28727), they all have limitations, as mentioned in their discussions. Maybe the only way is to forbid them all.

### Why are the changes needed?

These week-based fields need a Locale to express their semantics; the first day of the week varies from country to country. From the Javadoc of WeekFields:

```java
/**
 * Gets the first day-of-week.
 * <p>
 * The first day-of-week varies by culture.
 * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
 * This method returns the first day using the standard {@code DayOfWeek} enum.
 *
 * @return the first day-of-week, not null
 */
public DayOfWeek getFirstDayOfWeek() {
    return firstDayOfWeek;
}
```

But for SimpleDateFormat, the day-of-week is not localized:

```
u    Day number of week (1 = Monday, ..., 7 = Sunday)    Number    1
```

Currently, the default locale we use is the US, so the result can be moved back by a day, a week, or even a year. For example, for the date `2019-12-29` (a Sunday): in a Sunday-start system (e.g. en-US) it belongs to week-based year 2020, while in a Monday-start system (en-GB) it belongs to 2019. The week-of-week-based-year ('w') is affected too:

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671).

### Does this PR introduce _any_ user-facing change?

With this change, users can no longer use 'Y', 'w', 'u', and 'W'; 'e' can be used instead of 'u'. At least this turns the problem into an explicit error rather than a silent data change.

### How was this patch tested?

Added unit tests.

Closes #28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
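The locale dependence can also be reproduced directly with `java.time` (illustrative only; the values match the en-US/en-GB cases shown above):

```scala
import java.time.LocalDate
import java.time.temporal.WeekFields
import java.util.Locale

object WeekFieldsSketch {
  def main(args: Array[String]): Unit = {
    val d = LocalDate.of(2019, 12, 29) // a Sunday

    val us = WeekFields.of(Locale.US) // week starts on Sunday
    val gb = WeekFields.of(Locale.UK) // week starts on Monday (ISO-like)

    println(d.get(us.weekBasedYear()))       // 2020
    println(d.get(gb.weekBasedYear()))       // 2019
    println(d.get(us.weekOfWeekBasedYear())) // 1
    println(d.get(gb.weekOfWeekBasedYear())) // 52
  }
}
```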
### What changes were proposed in this pull request?

It appears I unintentionally used nested JDBC statements in the two tests I added.

### Why are the changes needed?

Cleans up a typo. Please merge to master/branch-3.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #28735 from juliuszsompolski/SPARK-31859-fixup.

Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>