@HyukjinKwon

No description provided.

@cloud-fan merged this pull request into cloud-fan:script on Mar 3, 2021
@HyukjinKwon deleted the script-pr branch on January 4, 2022 00:54
cloud-fan pushed a commit that referenced this pull request Mar 31, 2022
### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support a Project with aliases.

See https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes aggregate push-down work with aliases.

**The first example:**
The original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If the aggregate can be pushed down completely, the plan becomes:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If the aggregate can only be pushed down partially, the plan becomes:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
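
The alias handling in this example can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's implementation: the helper names `collect_aliases` and `resolve` are hypothetical, and expressions are treated as plain strings rather than Catalyst expression trees. It shows the core idea, that references such as `mySalary` are replaced by the underlying column `SALARY` before the aggregate is offered to the data source.

```python
import re


def collect_aliases(project_list):
    """Map alias name -> source column for entries shaped like 'SALARY AS mySalary'."""
    aliases = {}
    for item in project_list:
        match = re.fullmatch(r"(\w+) AS (\w+)", item.strip())
        if match:
            column, alias = match.groups()
            aliases[alias] = column
    return aliases


def resolve(expression, aliases):
    """Replace every aliased identifier in the expression with its source column."""
    return re.sub(
        r"\b\w+\b",
        lambda m: aliases.get(m.group(0), m.group(0)),
        expression,
    )


# Project list and aggregate from the first example, with expression IDs dropped.
project = ["DEPT", "SALARY AS mySalary"]
aliases = collect_aliases(project)

group_by = [resolve(e, aliases) for e in ["DEPT"]]
aggregates = [resolve(e, aliases) for e in ["sum(mySalary)"]]

print(group_by)    # ['DEPT']
print(aggregates)  # ['sum(SALARY)']
```

Once the aggregate refers only to columns of the relation, it has the same shape as an aggregate over an alias-free scan, which is the case push-down already handled.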

**The second example:**
The original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If the aggregate can be pushed down completely, the plan becomes:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If the aggregate can only be pushed down partially, the plan becomes:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```
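
The partial push-down plans keep an `Aggregate` on top of the relation because the source only returns partial sums, which Spark still has to combine. The sketch below models that in plain Python, assuming the source returns one partial result per partition. The salary figures and the helper names `partial_sums` and `merge_partials` are made up for illustration.

```python
from collections import defaultdict
from decimal import Decimal


def partial_sums(rows):
    """Model of the pushed-down part: SUM(SALARY) grouped by DEPT for one partition."""
    totals = defaultdict(Decimal)
    for dept, salary in rows:
        totals[dept] += salary
    return dict(totals)


def merge_partials(partials):
    """Model of the Aggregate kept in Spark: sum the partial sums per group."""
    totals = defaultdict(Decimal)
    for partial in partials:
        for dept, subtotal in partial.items():
            totals[dept] += subtotal
    return dict(totals)


# Hypothetical (DEPT, SALARY) rows split across two partitions.
partition_a = [(1, Decimal("10000.00")), (2, Decimal("12000.00"))]
partition_b = [(1, Decimal("9000.00")), (2, Decimal("10000.00"))]

pushed_down = [partial_sums(partition_a), partial_sums(partition_b)]
total_by_dept = merge_partials(pushed_down)

# Summing the partial sums gives the same result as aggregating all rows at once.
assert total_by_dept == partial_sums(partition_a + partition_b)
print(total_by_dept)
```

This works because SUM is decomposable: a sum of per-partition sums equals the sum over all rows, so the outer `sum(...)` in the plan yields the same `total` as the original query.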

### Why are the changes needed?
Aliases are common in real queries, so supporting them lets aggregate push-down apply in more cases.

### Does this PR introduce _any_ user-facing change?
Yes.
DS V2 aggregate push-down now supports a Project with aliases.

### How was this patch tested?
New tests.

Closes apache#35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jun 13, 2022
(cherry picked from commit f327dad)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Oct 25, 2024
### What changes were proposed in this pull request?
Restore the `scipy` installation in the Dockerfile.

### Why are the changes needed?
https://docs.scipy.org/doc/scipy-1.13.1/building/index.html#system-level-dependencies

> If you want to use the system Python and pip, you will need:
> C, C++, and Fortran compilers (typically gcc, g++, and gfortran).
> ...

`scipy` actually depends on `gfortran`, but `apt-get remove --purge -y 'gfortran-11'` broke this dependency.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Manually checked with the first commit apache@5be0dfa:
moving `apt-get remove --purge -y 'gfortran-11'` ahead of the `scipy` installation makes the installation fail with:
```
#18 394.3 Collecting scipy
#18 394.4   Downloading scipy-1.13.1.tar.gz (57.2 MB)
#18 395.2      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.2/57.2 MB 76.7 MB/s eta 0:00:00
#18 401.3   Installing build dependencies: started
#18 410.5   Installing build dependencies: finished with status 'done'
#18 410.5   Getting requirements to build wheel: started
#18 410.7   Getting requirements to build wheel: finished with status 'done'
#18 410.7   Installing backend dependencies: started
#18 411.8   Installing backend dependencies: finished with status 'done'
#18 411.8   Preparing metadata (pyproject.toml): started
#18 414.9   Preparing metadata (pyproject.toml): finished with status 'error'
#18 414.9   error: subprocess-exited-with-error
#18 414.9
#18 414.9   × Preparing metadata (pyproject.toml) did not run successfully.
#18 414.9   │ exit code: 1
#18 414.9   ╰─> [42 lines of output]
#18 414.9       + meson setup /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d/.mesonpy-xqfvs4ek -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md --native-file=/tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d/.mesonpy-xqfvs4ek/meson-python-native-file.ini
#18 414.9       The Meson build system
#18 414.9       Version: 1.5.2
#18 414.9       Source dir: /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d
#18 414.9       Build dir: /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d/.mesonpy-xqfvs4ek
#18 414.9       Build type: native build
#18 414.9       Project name: scipy
#18 414.9       Project version: 1.13.1
#18 414.9       C compiler for the host machine: cc (gcc 11.4.0 "cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0")
#18 414.9       C linker for the host machine: cc ld.bfd 2.38
#18 414.9       C++ compiler for the host machine: c++ (gcc 11.4.0 "c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0")
#18 414.9       C++ linker for the host machine: c++ ld.bfd 2.38
#18 414.9       Cython compiler for the host machine: cython (cython 3.0.11)
#18 414.9       Host machine cpu family: x86_64
#18 414.9       Host machine cpu: x86_64
#18 414.9       Program python found: YES (/usr/local/bin/pypy3)
#18 414.9       Run-time dependency python found: YES 3.9
#18 414.9       Program cython found: YES (/tmp/pip-build-env-v_vnvt3h/overlay/bin/cython)
#18 414.9       Compiler for C supports arguments -Wno-unused-but-set-variable: YES
#18 414.9       Compiler for C supports arguments -Wno-unused-function: YES
#18 414.9       Compiler for C supports arguments -Wno-conversion: YES
#18 414.9       Compiler for C supports arguments -Wno-misleading-indentation: YES
#18 414.9       Library m found: YES
#18 414.9
#18 414.9       ../meson.build:78:0: ERROR: Unknown compiler(s): [['gfortran'], ['flang'], ['nvfortran'], ['pgfortran'], ['ifort'], ['ifx'], ['g95']]
#18 414.9       The following exception(s) were encountered:
#18 414.9       Running `gfortran --version` gave "[Errno 2] No such file or directory: 'gfortran'"
#18 414.9       Running `gfortran -V` gave "[Errno 2] No such file or directory: 'gfortran'"
#18 414.9       Running `flang --version` gave "[Errno 2] No such file or directory: 'flang'"
#18 414.9       Running `flang -V` gave "[Errno 2] No such file or directory: 'flang'"
#18 414.9       Running `nvfortran --version` gave "[Errno 2] No such file or directory: 'nvfortran'"
#18 414.9       Running `nvfortran -V` gave "[Errno 2] No such file or directory: 'nvfortran'"
#18 414.9       Running `pgfortran --version` gave "[Errno 2] No such file or directory: 'pgfortran'"
#18 414.9       Running `pgfortran -V` gave "[Errno 2] No such file or directory: 'pgfortran'"
#18 414.9       Running `ifort --version` gave "[Errno 2] No such file or directory: 'ifort'"
#18 414.9       Running `ifort -V` gave "[Errno 2] No such file or directory: 'ifort'"
#18 414.9       Running `ifx --version` gave "[Errno 2] No such file or directory: 'ifx'"
#18 414.9       Running `ifx -V` gave "[Errno 2] No such file or directory: 'ifx'"
#18 414.9       Running `g95 --version` gave "[Errno 2] No such file or directory: 'g95'"
#18 414.9       Running `g95 -V` gave "[Errno 2] No such file or directory: 'g95'"
#18 414.9
#18 414.9       A full log can be found at /tmp/pip-install-y77ar9d0/scipy_1e543e0816ed4b26984415533ae9079d/.mesonpy-xqfvs4ek/meson-logs/meson-log.txt
#18 414.9       [end of output]
```

see https://github.com/zhengruifeng/spark/actions/runs/11357130578/job/31589506939
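
A failure like this can be caught earlier by checking for a Fortran compiler before attempting a source build. The following is a minimal sketch, not part of this patch: the compiler names are copied from the Meson error in the log, while the function name `find_fortran_compiler` and the messages are hypothetical.

```python
import shutil

# Compiler names that Meson probed, as listed in the error message in the log.
FORTRAN_COMPILERS = ["gfortran", "flang", "nvfortran", "pgfortran", "ifort", "ifx", "g95"]


def find_fortran_compiler(which=shutil.which):
    """Return the path of the first Fortran compiler found on PATH, or None."""
    for name in FORTRAN_COMPILERS:
        path = which(name)
        if path is not None:
            return path
    return None


compiler = find_fortran_compiler()
if compiler is None:
    print("No Fortran compiler found; building scipy from source would likely fail.")
else:
    print(f"Fortran compiler available: {compiler}")
```

The `which` parameter exists only so the lookup can be swapped out in tests; by default it uses `shutil.which`, which searches the current `PATH`.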

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48489 from zhengruifeng/infra_scipy.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>