@pull pull bot commented Sep 19, 2022

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

…dd UTs for RocksDB

### What changes were proposed in this pull request?
`ChromeUIHistoryServerSuite` only tests the LevelDB backend now; this PR refactors the UTs of `ChromeUIHistoryServerSuite` to add UTs for RocksDB.

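A minimal sketch of the refactoring pattern, assuming the backend is selected through a protected hook and the `spark.history.store.hybridStore.diskBackend` config; apart from `RocksBackendChromeUIHistoryServerSuite`, which appears in the test log below, the names are hypothetical placeholders:

```scala
// Hedged sketch: parameterize the suite over the disk store backend.
abstract class ChromeUIHistoryServerSuiteBase {
  // Each concrete suite picks the disk store implementation under test.
  protected def diskBackend: String

  // Assumed config key used when starting the history server for the tests.
  protected def storeConf: Map[String, String] =
    Map("spark.history.store.hybridStore.diskBackend" -> diskBackend)
}

// LevelDB keeps the original coverage; RocksDB is the newly added backend.
class LevelDBBackendChromeUIHistoryServerSuite extends ChromeUIHistoryServerSuiteBase {
  override protected def diskBackend: String = "LEVELDB"
}

class RocksBackendChromeUIHistoryServerSuite extends ChromeUIHistoryServerSuiteBase {
  override protected def diskBackend: String = "ROCKSDB"
}
```
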
### Why are the changes needed?
Add UTs related to RocksDB.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- Pass GitHub Actions
- Manual test on Apple Silicon environment:

```
build/sbt -Dguava.version=31.1-jre -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest.default.exclude.tags="" "core/testOnly org.apache.spark.deploy.history.RocksBackendChromeUIHistoryServerSuite"
```

```
[info] RocksBackendChromeUIHistoryServerSuite:
Starting ChromeDriver 105.0.5195.52 (412c95e518836d8a7d97250d62b29c2ae6a26a85-refs/branch-heads/5195{#853}) on port 54402
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully.
[info] - ajax rendered relative links are prefixed with uiRoot (spark.ui.proxyBase) (5 seconds, 387 milliseconds)
[info] Run completed in 20 seconds, 838 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 118 s (01:58), completed 2022-9-15 10:30:53
```

Closes #37878 from LuciferYang/SPARK-40424.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@pull pull bot added the ⤵️ pull label Sep 19, 2022
@github-actions github-actions bot added the CORE label Sep 19, 2022
…ervice.db.backend` in `running-on-yarn.md`

### What changes were proposed in this pull request?
From the context of [pr](#19032) for [SPARK-17321](https://issues.apache.org/jira/browse/SPARK-17321), `YarnShuffleService` persists data into `LevelDB`/`RocksDB` only when Yarn NM recovery is enabled. So this PR adds the precondition description related to `Yarn NM recovery is enabled` for `spark.shuffle.service.db.backend` in `running-on-yarn.md`.
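
For context, a sketch of the two settings involved; the keys are the standard YARN/Spark names, the values are illustrative:

```
# yarn-site.xml -- the precondition being documented:
#   yarn.nodemanager.recovery.enabled = true
# Shuffle service setting; only takes effect once NM recovery is on:
#   spark.shuffle.service.db.backend = ROCKSDB   (or LEVELDB)
```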

### Why are the changes needed?
Add the precondition description for `spark.shuffle.service.db.backend` in `running-on-yarn.md`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #37853 from LuciferYang/SPARK-40404.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@github-actions github-actions bot added the DOCS label Sep 19, 2022
…ples self-contained (part 7, ~30 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797).

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37850 from khalidmammadov/docstrings_funcs_part_7.

Lead-authored-by: Khalid Mammadov <[email protected]>
Co-authored-by: khalidmammadov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
cloud-fan and others added 2 commits September 19, 2022 13:10
…n look up function failed`

### What changes were proposed in this pull request?

This reverts #21790 because it's no longer needed. That change kept the original error from Hive when Spark loaded builtin functions from Hive, which no longer happens as Spark has implemented all builtin functions natively.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #37896 from cloud-fan/error.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Streaming metrics report all zeros (`processedRowsPerSecond`, etc.) when selecting the `_metadata` column, because the logical plan from the batch and the actual planned logical plan are mismatched. As a result, [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348) we cannot find the plan and collect metrics correctly.

This PR fixes this by replacing the initial `LogicalPlan` with the `LogicalPlan` containing the metadata column.
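
A hedged sketch of the affected scenario (paths and schema are placeholders): selecting the hidden `_metadata` column of a file source injects a metadata attribute into the analyzed plan, and before this fix `ProgressReporter` could not match that plan back to the source, so `processedRowsPerSecond` and friends reported 0.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("metadata-metrics").getOrCreate()

// Streaming read that selects the file-source _metadata column.
val query = spark.readStream
  .format("json")
  .schema("value STRING")                    // placeholder schema
  .load("/tmp/in")                           // placeholder input path
  .select("value", "_metadata.file_path")    // hidden file metadata column
  .writeStream
  .format("json")
  .option("checkpointLocation", "/tmp/chk")  // placeholder checkpoint path
  .start("/tmp/out")                         // placeholder output path
```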

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing + New UTs

Closes #37905 from Yaohua628/spark-40460.

Authored-by: yaohua <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
cloud-fan and others added 2 commits September 19, 2022 16:47
### What changes were proposed in this pull request?

This PR updates `DropTable`/`DropView` to use `UnresolvedIdentifier` instead of `UnresolvedTableOrView`/`UnresolvedView`. This has several benefits:
1. Simplify the `ifExists` handling. No need to handle `DropTable` in `ResolveCommandsWithIfExists` anymore.
2. Avoid one table lookup if we eventually fall back to the v1 command (v1 `DropTableCommand` will look up the table again).
3. v2 catalogs can avoid the table lookup entirely if possible.

This PR also improves table uncaching to match by table name directly, so that we don't need to look up the table and resolve it to a table relation.
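
A hedged sketch of the plan-shape change (simplified; not Spark's exact class signatures):

```scala
// Before: the parser produced a node the analyzer had to resolve via a lookup.
//   DropTable(UnresolvedTableOrView(Seq("db", "t"), ...), ifExists, purge)
//   -> resolving UnresolvedTableOrView forces a catalog lookup of db.t.
//
// After: the identifier is carried as-is, with no lookup during analysis.
//   DropTable(UnresolvedIdentifier(Seq("db", "t")), ifExists, purge)
//   -> the catalog's drop handles (non-)existence itself, honoring ifExists,
//      and uncaching matches by table name instead of a resolved relation.
```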

### Why are the changes needed?

Save table lookup.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #37879 from cloud-fan/drop-table.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to add `CONNECT` component to our labeler so the PRs related to Spark Connect project have this label.
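
A sketch of the kind of entry this adds; the glob is an assumption, following the format of existing entries such as CORE:

```
CONNECT:
  - "connector/connect/**/*"
```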

### Why are the changes needed?

To make the PRs easier to track and search.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Will monitor other PRs once this gets merged. The notation is consistent with others.

Closes #37925 from HyukjinKwon/SPARK-40483.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@github-actions github-actions bot added the INFRA label Sep 19, 2022
### What changes were proposed in this pull request?
Fix the doc of `DataFrame.corr`: it should say that the `kendall` implementation in pandas API on Spark has O(#rows * #rows) complexity, since it applies a cross join (within each partition) to compute the statistics.
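
For reference, Kendall's tau is defined over all row pairs (C and D count concordant and discordant pairs), which is where the quadratic cost comes from:

```
\tau = \frac{C - D}{\binom{n}{2}},
\qquad \binom{n}{2} = \frac{n(n-1)}{2} = O(n^2)
```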

### Why are the changes needed?
Fix doc of `DataFrame.corr`

### Does this PR introduce _any_ user-facing change?
yes, doc fixed

### How was this patch tested?
Manually checked.

Closes #37927 from zhengruifeng/ps_df_kendall_doc.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
itholic and others added 3 commits September 19, 2022 19:53
…o *.sql test cases

### What changes were proposed in this pull request?

This PR proposes to integrate Grouped Aggregate Pandas UDF tests into *.sql test cases.

This PR includes the fixes below:
- Add `UDAFTestCase` into `SQLQueryTestSuite.scala` to test the UDAF related functions in sql.
- Add `udaf` directory and create related sql test cases into this directory.
- Generate golden files for new added sql test files.
- Skip from `ThriftServerQueryTestSuite.scala` for now.
- Fix minor typos.

### Why are the changes needed?

To improve test coverage and prevent potential bugs in the future.

### Does this PR introduce _any_ user-facing change?

No, it's test-only.

### How was this patch tested?

Added sql test files and corresponding golden files.

Closes #37873 from itholic/SPARK-40419.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…repeatedly

### What changes were proposed in this pull request?

This PR caches the result of `PartitionReader.next` in `PartitionIterator`, so that its `hasNext` method is cheap to be called repeatedly.
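
A minimal sketch of the caching pattern, with simplified stand-ins rather than Spark's actual internals; the trait mirrors the v2 connector `PartitionReader` shape, where `next()` advances and reports whether a row is available and `get()` returns it:

```scala
// Simplified stand-in for org.apache.spark.sql.connector.read.PartitionReader.
trait PartitionReader[T] {
  def next(): Boolean
  def get(): T
}

class PartitionIterator[T](reader: PartitionReader[T]) extends Iterator[T] {
  // Caches the last result of reader.next() so repeated hasNext calls are cheap.
  private var nextResult: Option[Boolean] = None

  override def hasNext: Boolean = nextResult match {
    case Some(available) => available   // cached: no extra reader.next() call
    case None =>
      val available = reader.next()     // potentially expensive source call
      nextResult = Some(available)
      available
  }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("end of reader")
    nextResult = None                   // invalidate cache before the next row
    reader.get()
  }
}
```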

### Why are the changes needed?

Potential perf improvement. `PartitionReader.next` can be expensive in some v2 sources, and it's legal to call `Iterator.hasNext` repeatedly.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #37900 from cloud-fan/minor.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This PR aims to upgrade Log4j2 from 2.18.0 to 2.19.0.

### Why are the changes needed?
Log4j 2.19.0 contains new features and fixes, and this version begins to support SLF4J 2:

- https://issues.apache.org/jira/browse/LOG4J2-3583
- https://issues.apache.org/jira/browse/LOG4J2-2975

All changes can be found in the latest [changes report](https://logging.apache.org/log4j/2.x/changes-report.html#a2.19.0).

### Does this PR introduce _any_ user-facing change?
No, log4j 2.19.0 maintains binary compatibility with previous releases.

### How was this patch tested?
Pass GitHub Actions

Closes #37926 from LuciferYang/SPARK-40484.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
@github-actions github-actions bot added the BUILD label Sep 19, 2022
@pull pull bot merged commit 2d6d5e2 into wangyum:master Sep 19, 2022
wangyum pushed a commit that referenced this pull request Jan 9, 2023
### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support project with alias.

Refer to https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes it work well with alias.

**The first example:**
The original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If we can completely push down the aggregate, the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If we can partially push down the aggregate, the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
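
For reference, a query of roughly this shape produces the first example's original plan; this is a hedged reconstruction from the plan above, not code taken from the PR's tests:

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = spark.read.table("test.employee")
  .select($"DEPT", $"SALARY".as("mySalary"))  // Project with an alias
  .groupBy($"DEPT")
  .agg(sum($"mySalary").as("total"))          // Aggregate over the aliased column
```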

**The second example:**
The original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If we can completely push down the aggregate, the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If we can partially push down the aggregate, the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```

### Why are the changes needed?
Aliases in projections are common, so supporting them makes aggregate push-down more useful.

### Does this PR introduce _any_ user-facing change?
Yes.
Users will see that DS V2 aggregate push-down supports project with alias.

### How was this patch tested?
New tests.

Closes apache#35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f327dad)
Signed-off-by: Wenchen Fan <[email protected]>
pull bot pushed a commit that referenced this pull request Oct 18, 2024
### What changes were proposed in this pull request?
Restore the `scipy` installation in the dockerfile.

### Why are the changes needed?
https://docs.scipy.org/doc/scipy-1.13.1/building/index.html#system-level-dependencies

> If you want to use the system Python and pip, you will need:
> C, C++, and Fortran compilers (typically gcc, g++, and gfortran).
> ...

`scipy` actually depends on `gfortran`, but `apt-get remove --purge -y 'gfortran-11'` broke this dependency.
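
A hedged sketch of the ordering constraint (not the repo's actual dockerfile): the Fortran toolchain has to stay around until `scipy` has been built and installed.

```
RUN apt-get update && apt-get install -y gfortran-11
RUN pip install scipy                          # builds from source; needs gfortran
RUN apt-get remove --purge -y 'gfortran-11'    # safe only after scipy is installed
```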

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Manually checked with the first commit apache@5be0dfa: moving `apt-get remove --purge -y 'gfortran-11'` ahead of the `scipy` installation makes the installation fail with
```
#18 394.3 Collecting scipy
#18 394.4   Downloading scipy-1.13.1.tar.gz (57.2 MB)
#18 395.2      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.2/57.2 MB 76.7 MB/s eta 0:00:00
#18 401.3   Installing build dependencies: started
#18 410.5   Installing build dependencies: finished with status 'done'
#18 410.5   Getting requirements to build wheel: started
#18 410.7   Getting requirements to build wheel: finished with status 'done'
#18 410.7   Installing backend dependencies: started
#18 411.8   Installing backend dependencies: finished with status 'done'
#18 411.8   Preparing metadata (pyproject.toml): started
#18 414.9   Preparing metadata (pyproject.toml): finished with status 'error'
#18 414.9   error: subprocess-exited-with-error
#18 414.9
#18 414.9   × Preparing metadata (pyproject.toml) did not run successfully.
#18 414.9   │ exit code: 1
#18 414.9   ╰─> [42 lines of output]
#18 414.9       + meson setup /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d/.mesonpy-xqfvs4ek -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md --native-file=/tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d/.mesonpy-xqfvs4ek/meson-python-native-file.ini
#18 414.9       The Meson build system
#18 414.9       Version: 1.5.2
#18 414.9       Source dir: /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d
#18 414.9       Build dir: /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d/.mesonpy-xqfvs4ek
#18 414.9       Build type: native build
#18 414.9       Project name: scipy
#18 414.9       Project version: 1.13.1
#18 414.9       C compiler for the host machine: cc (gcc 11.4.0 "cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0")
#18 414.9       C linker for the host machine: cc ld.bfd 2.38
#18 414.9       C++ compiler for the host machine: c++ (gcc 11.4.0 "c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0")
#18 414.9       C++ linker for the host machine: c++ ld.bfd 2.38
#18 414.9       Cython compiler for the host machine: cython (cython 3.0.11)
#18 414.9       Host machine cpu family: x86_64
#18 414.9       Host machine cpu: x86_64
#18 414.9       Program python found: YES (/usr/local/bin/pypy3)
#18 414.9       Run-time dependency python found: YES 3.9
#18 414.9       Program cython found: YES (/tmp/pip-build-env-v_vnvt3h/overlay/bin/cython)
#18 414.9       Compiler for C supports arguments -Wno-unused-but-set-variable: YES
#18 414.9       Compiler for C supports arguments -Wno-unused-function: YES
#18 414.9       Compiler for C supports arguments -Wno-conversion: YES
#18 414.9       Compiler for C supports arguments -Wno-misleading-indentation: YES
#18 414.9       Library m found: YES
#18 414.9
#18 414.9       ../meson.build:78:0: ERROR: Unknown compiler(s): [['gfortran'], ['flang'], ['nvfortran'], ['pgfortran'], ['ifort'], ['ifx'], ['g95']]
#18 414.9       The following exception(s) were encountered:
#18 414.9       Running `gfortran --version` gave "[Errno 2] No such file or directory: 'gfortran'"
#18 414.9       Running `gfortran -V` gave "[Errno 2] No such file or directory: 'gfortran'"
#18 414.9       Running `flang --version` gave "[Errno 2] No such file or directory: 'flang'"
#18 414.9       Running `flang -V` gave "[Errno 2] No such file or directory: 'flang'"
#18 414.9       Running `nvfortran --version` gave "[Errno 2] No such file or directory: 'nvfortran'"
#18 414.9       Running `nvfortran -V` gave "[Errno 2] No such file or directory: 'nvfortran'"
#18 414.9       Running `pgfortran --version` gave "[Errno 2] No such file or directory: 'pgfortran'"
#18 414.9       Running `pgfortran -V` gave "[Errno 2] No such file or directory: 'pgfortran'"
#18 414.9       Running `ifort --version` gave "[Errno 2] No such file or directory: 'ifort'"
#18 414.9       Running `ifort -V` gave "[Errno 2] No such file or directory: 'ifort'"
#18 414.9       Running `ifx --version` gave "[Errno 2] No such file or directory: 'ifx'"
#18 414.9       Running `ifx -V` gave "[Errno 2] No such file or directory: 'ifx'"
#18 414.9       Running `g95 --version` gave "[Errno 2] No such file or directory: 'g95'"
#18 414.9       Running `g95 -V` gave "[Errno 2] No such file or directory: 'g95'"
#18 414.9
#18 414.9       A full log can be found at /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d/.mesonpy-xqfvs4ek/meson-logs/meson-log.txt
#18 414.9       [end of output]
```

see https://github.com/zhengruifeng/spark/actions/runs/11357130578/job/31589506939

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48489 from zhengruifeng/infra_scipy.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>