
Conversation

@GulajavaMinistudio
Owner

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

gatorsmile and others added 8 commits May 6, 2017 22:21
### What changes were proposed in this pull request?

This PR adds support for the JDBC data type TIME WITH TIME ZONE; it can be converted to TIMESTAMP.

In addition, before this PR, for unsupported data types, we simply output the type number instead of the type name.

```
java.sql.SQLException: Unsupported type 2014
```
After this PR, the message looks like this:
```
java.sql.SQLException: Unsupported type TIMESTAMP_WITH_TIMEZONE
```

- Also upgrade the H2 version to `1.4.195`, which has the type fix for "TIMESTAMP WITH TIMEZONE". H2 still does not fully support it, so we capture the exception, but we keep the test because it partially exercises "TIMESTAMP WITH TIMEZONE" support and the Docker tests are not run regularly.
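
The readable name in the new message corresponds to the standard `java.sql.JDBCType` enum entry for the numeric code; a minimal sketch of that lookup (using only standard JDK APIs, not code from this PR):

```Scala
import java.sql.JDBCType

// Type code 2014 corresponds to java.sql.Types.TIMESTAMP_WITH_TIMEZONE.
val typeName = JDBCType.valueOf(2014).getName
println(s"java.sql.SQLException: Unsupported type $typeName")
// java.sql.SQLException: Unsupported type TIMESTAMP_WITH_TIMEZONE
```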

### How was this patch tested?
Added test cases.

Author: Xiao Li <[email protected]>

Closes #17835 from gatorsmile/h2.
## What changes were proposed in this pull request?

- Move udf wrapping code from `functions.udf` to `functions.UserDefinedFunction`.
- Return wrapped udf from `catalog.registerFunction` and dependent methods.
- Update docstrings in `catalog.registerFunction` and `SQLContext.registerFunction`.
- Unit tests.

## How was this patch tested?

- Existing unit tests and doctests.
- Additional tests covering new feature.

Author: zero323 <[email protected]>

Closes #17831 from zero323/SPARK-18777.
## What changes were proposed in this pull request?

This PR adds new unit tests covering `ShuffleDataBlockId`, `ShuffleIndexBlockId`, `TempShuffleBlockId`, and `TempLocalBlockId`.
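
A hedged sketch of the kind of round-trip check such a suite can make (the name formats below are assumed from the usual `BlockId` naming convention, not quoted from this PR):

```Scala
import org.apache.spark.storage.{ShuffleDataBlockId, ShuffleIndexBlockId}

// Assumed convention: shuffle_<shuffleId>_<mapId>_<reduceId>.data / .index
val dataId = ShuffleDataBlockId(0, 1, 2)
assert(dataId.name == "shuffle_0_1_2.data")

val indexId = ShuffleIndexBlockId(0, 1, 2)
assert(indexId.name == "shuffle_0_1_2.index")
```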

## How was this patch tested?

The new unit tests.

Author: caoxuewen <[email protected]>

Closes #17794 from heary-cao/blockidsuite.
## What changes were proposed in this pull request?

This PR adds documentation to the ALS code.

## How was this patch tested?

Existing tests were used.

cc: mengxr, srowen

This contribution is my original work, and I license the work to the Spark project under the project's open source license.

Author: Daniel Li <[email protected]>

Closes #17793 from danielyli/spark-20484.
…tore access.

## What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.

It restores `s3n://` access to S3, and adds its `s3a://` replacement, OpenStack `swift://`, and Azure `wasb://`.

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when using an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.

(this is the successor to #12004; I can't re-open it)

## How was this patch tested?

Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)

Those verify that the dependencies are sufficient to allow downstream applications to work with the s3a, Azure wasb, and swift storage connectors, and to perform basic IO & DataFrame operations on them. All seems well.
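
For reference, the kind of basic object-store IO those downstream tests exercise looks roughly like the sketch below (the bucket and paths are hypothetical, and `s3a://` only resolves once the hadoop-cloud JARs and credentials are on the classpath):

```Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("object-store-io").getOrCreate()
import spark.implicits._

// Hypothetical bucket and layout; requires fs.s3a credentials to be configured.
val events = spark.read.json("s3a://my-bucket/raw/events/")
events.filter($"status" === "ok")
  .write.mode("overwrite")
  .parquet("s3a://my-bucket/curated/events_ok/")
```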

Manual clean build and verification that the assembly contains the relevant aws-* and hadoop-* artifacts on Hadoop 2.6, and the Azure artifacts on a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
Maven build: `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.

Author: Steve Loughran <[email protected]>
Author: Steve Loughran <[email protected]>

Closes #17834 from steveloughran/cloud/SPARK-7481-current.
## What changes were proposed in this pull request?

add environment

## How was this patch tested?

Wait for the AppVeyor run.

Author: Felix Cheung <[email protected]>

Closes #17878 from felixcheung/appveyorrcran.
## What changes were proposed in this pull request?

* Docs are consistent (across different `unix_timestamp` variants and their internal expressions); a usage sketch of the variants follows after this list
* Typo hunting
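
For context, the variants whose docs are being aligned are the `unix_timestamp` overloads in `functions`; a minimal usage sketch (the data and pattern are illustrative only):

```Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.unix_timestamp

val spark = SparkSession.builder().master("local[1]").appName("unix-ts-demo").getOrCreate()
import spark.implicits._

val df = Seq("2017-05-08 10:30:00").toDF("time_string")

df.select(
  unix_timestamp(),                                      // no-arg: current time in seconds
  unix_timestamp($"time_string"),                        // default pattern "yyyy-MM-dd HH:mm:ss"
  unix_timestamp($"time_string", "yyyy-MM-dd HH:mm:ss")  // explicit pattern
).show()
```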

## How was this patch tested?

local build

Author: Jacek Laskowski <[email protected]>

Closes #17801 from jaceklaskowski/unix_timestamp.
## What changes were proposed in this pull request?

- Add SparkR wrapper for `Dataset.alias` (a Scala sketch of the underlying behavior follows this list).
- Adjust roxygen annotations for `functions.alias` (including example usage).
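
The SparkR `alias` wraps the Scala-side `Dataset.alias`; a minimal Scala sketch of what that underlying call enables, with illustrative data (the idiomatic use is disambiguating a self-join):

```Scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("alias-demo").getOrCreate()
import spark.implicits._

val people = Seq((1, 0), (2, 1)).toDF("id", "manager_id")

// alias names the Dataset so both sides of a self-join can be referenced unambiguously.
val emp = people.alias("emp")
val mgr = people.alias("mgr")
emp.join(mgr, col("emp.manager_id") === col("mgr.id")).show()
```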

## How was this patch tested?

Unit tests, `check_cran.sh`.

Author: zero323 <[email protected]>

Closes #17825 from zero323/SPARK-20550.
@GulajavaMinistudio GulajavaMinistudio merged commit ab5eaea into GulajavaMinistudio:master May 8, 2017
GulajavaMinistudio pushed a commit that referenced this pull request Jul 20, 2020
…or its output partitioning

### What changes were proposed in this pull request?

Currently, the `BroadcastHashJoinExec`'s `outputPartitioning` only uses the streamed side's `outputPartitioning`. However, if the join type of `BroadcastHashJoinExec` is an inner-like join, the build side's info (the join keys) can be added to `BroadcastHashJoinExec`'s `outputPartitioning`.

 For example,
```Scala
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2")
val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3")
val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4")

// join1 is a sort merge join.
val join1 = t1.join(t2, t1("i1") === t2("i2"))

// join2 is a broadcast join where t3 is broadcasted.
val join2 = join1.join(t3, join1("i1") === t3("i3"))

// Join on the column from the broadcasted side (i3).
val join3 = join2.join(t4, join2("i3") === t4("i4"))

join3.explain
```
You can see that `Exchange hashpartitioning(i3#29, 200)` is introduced because there is no output partitioning info from the build side.
```
== Physical Plan ==
*(6) SortMergeJoin [i3#29], [i4#40], Inner
:- *(4) Sort [i3#29 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i3#29, 200), true, [id=#55]
:     +- *(3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight
:        :- *(3) SortMergeJoin [i1#7], [i2#18], Inner
:        :  :- *(1) Sort [i1#7 ASC NULLS FIRST], false, 0
:        :  :  +- Exchange hashpartitioning(i1#7, 200), true, [id=#28]
:        :  :     +- LocalTableScan [i1#7, j1#8]
:        :  +- *(2) Sort [i2#18 ASC NULLS FIRST], false, 0
:        :     +- Exchange hashpartitioning(i2#18, 200), true, [id=#29]
:        :        +- LocalTableScan [i2#18, j2#19]
:        +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#34]
:           +- LocalTableScan [i3#29, j3#30]
+- *(5) Sort [i4#40 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i4#40, 200), true, [id=#39]
      +- LocalTableScan [i4#40, j4#41]
```
This PR proposes to introduce output partitioning for the build side for `BroadcastHashJoinExec` if the streamed side has a `HashPartitioning` or a collection of `HashPartitioning`s.

There is a new internal config, `spark.sql.execution.broadcastHashJoin.outputPartitioningExpandLimit`, which limits the number of partitionings a `HashPartitioning` can expand to. It can be set to "0" to disable this feature.
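
A minimal sketch of toggling the new config in a session (the value shown is only illustrative):

```Scala
// Cap how far the build side's join keys may expand the output partitioning,
// or set to "0" to fall back to the previous streamed-side-only behavior.
spark.conf.set(
  "spark.sql.execution.broadcastHashJoin.outputPartitioningExpandLimit", "0")
```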

### Why are the changes needed?

To remove unnecessary shuffle.

### Does this PR introduce _any_ user-facing change?

Yes, now the shuffle in the above example can be eliminated:
```
== Physical Plan ==
*(5) SortMergeJoin [i3#108], [i4#119], Inner
:- *(3) Sort [i3#108 ASC NULLS FIRST], false, 0
:  +- *(3) BroadcastHashJoin [i1#86], [i3#108], Inner, BuildRight
:     :- *(3) SortMergeJoin [i1#86], [i2#97], Inner
:     :  :- *(1) Sort [i1#86 ASC NULLS FIRST], false, 0
:     :  :  +- Exchange hashpartitioning(i1#86, 200), true, [id=#120]
:     :  :     +- LocalTableScan [i1#86, j1#87]
:     :  +- *(2) Sort [i2#97 ASC NULLS FIRST], false, 0
:     :     +- Exchange hashpartitioning(i2#97, 200), true, [id=#121]
:     :        +- LocalTableScan [i2#97, j2#98]
:     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#126]
:        +- LocalTableScan [i3#108, j3#109]
+- *(4) Sort [i4#119 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i4#119, 200), true, [id=#130]
      +- LocalTableScan [i4#119, j4#120]
```

### How was this patch tested?

Added new tests.

Closes apache#28676 from imback82/broadcast_join_output.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
GulajavaMinistudio pushed a commit that referenced this pull request Mar 25, 2022
### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support projects with aliases.

Refer to https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes it work correctly with aliases.

**The first example:**
The original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If we can completely push down the aggregate, the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If we can partially push down the aggregate, the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```

**The second example:**
The original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If we can completely push down the aggregate, the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If we can partially push down the aggregate, the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```
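
For reference, a DataFrame query of the shape that produces the first example's plan would look roughly like this (the catalog name `h2` and the session setup are assumed for illustration; the table and column names follow the plans above):

```Scala
import org.apache.spark.sql.functions.{col, sum}

// Assumes "test.employee" is exposed through a JDBC-backed DS V2 catalog named "h2" (illustrative).
val df = spark.table("h2.test.employee")

df.select(col("DEPT"), col("SALARY").as("mySalary"))
  .groupBy(col("DEPT"))
  .agg(sum(col("mySalary")).as("total"))
  .explain()
```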

### Why are the changes needed?
Supporting aliases makes aggregate push-down more broadly useful.

### Does this PR introduce _any_ user-facing change?
Yes.
Users will see that DS V2 aggregate push-down supports projects with aliases.

### How was this patch tested?
New tests.

Closes apache#35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
GulajavaMinistudio pushed a commit that referenced this pull request Apr 30, 2024
… spark docker image

### What changes were proposed in this pull request?
This PR updates the names of the packages removed when building the Spark Docker image.

### Why are the changes needed?
When the default base image was switched from `ubuntu 20.04` to `ubuntu 22.04`, the set of unused packages in the base image changed. To eliminate warnings during image builds and to free disk space more accurately, the list of removed packages needs to be corrected.

Before:
```
#35 [29/31] RUN apt-get remove --purge -y     '^aspnet.*' '^dotnet-.*' '^llvm-.*' 'php.*' '^mongodb-.*'     snapd google-chrome-stable microsoft-edge-stable firefox     azure-cli google-cloud-sdk mono-devel powershell libgl1-mesa-dri || true
#35 0.489 Reading package lists...
#35 0.505 Building dependency tree...
#35 0.507 Reading state information...
#35 0.511 E: Unable to locate package ^aspnet.*
#35 0.511 E: Couldn't find any package by glob '^aspnet.*'
#35 0.511 E: Couldn't find any package by regex '^aspnet.*'
#35 0.511 E: Unable to locate package ^dotnet-.*
#35 0.511 E: Couldn't find any package by glob '^dotnet-.*'
#35 0.511 E: Couldn't find any package by regex '^dotnet-.*'
#35 0.511 E: Unable to locate package ^llvm-.*
#35 0.511 E: Couldn't find any package by glob '^llvm-.*'
#35 0.511 E: Couldn't find any package by regex '^llvm-.*'
#35 0.511 E: Unable to locate package ^mongodb-.*
#35 0.511 E: Couldn't find any package by glob '^mongodb-.*'
#35 0.511 EPackage 'php-crypt-gpg' is not installed, so not removed
#35 0.511 Package 'php' is not installed, so not removed
#35 0.511 : Couldn't find any package by regex '^mongodb-.*'
#35 0.511 E: Unable to locate package snapd
#35 0.511 E: Unable to locate package google-chrome-stable
#35 0.511 E: Unable to locate package microsoft-edge-stable
#35 0.511 E: Unable to locate package firefox
#35 0.511 E: Unable to locate package azure-cli
#35 0.511 E: Unable to locate package google-cloud-sdk
#35 0.511 E: Unable to locate package mono-devel
#35 0.511 E: Unable to locate package powershell
#35 DONE 0.5s

#36 [30/31] RUN apt-get autoremove --purge -y
#36 0.063 Reading package lists...
#36 0.079 Building dependency tree...
#36 0.082 Reading state information...
#36 0.088 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
#36 DONE 0.4s
```

After:
```
#38 [32/36] RUN apt-get remove --purge -y     'gfortran-11' 'humanity-icon-theme' 'nodejs-doc' || true
#38 0.066 Reading package lists...
#38 0.087 Building dependency tree...
#38 0.089 Reading state information...
#38 0.094 The following packages were automatically installed and are no longer required:
#38 0.094   at-spi2-core bzip2-doc dbus-user-session dconf-gsettings-backend
#38 0.095   dconf-service gsettings-desktop-schemas gtk-update-icon-cache
#38 0.095   hicolor-icon-theme libatk-bridge2.0-0 libatk1.0-0 libatk1.0-data
#38 0.095   libatspi2.0-0 libbz2-dev libcairo-gobject2 libcolord2 libdconf1 libepoxy0
#38 0.095   libgfortran-11-dev libgtk-3-common libjs-highlight.js libllvm11
#38 0.095   libncurses-dev libncurses5-dev libphobos2-ldc-shared98 libreadline-dev
#38 0.095   librsvg2-2 librsvg2-common libvte-2.91-common libwayland-client0
#38 0.095   libwayland-cursor0 libwayland-egl1 libxdamage1 libxkbcommon0
#38 0.095   session-migration tilix-common xkb-data
#38 0.095 Use 'apt autoremove' to remove them.
#38 0.096 The following packages will be REMOVED:
#38 0.096   adwaita-icon-theme* gfortran* gfortran-11* humanity-icon-theme* libgtk-3-0*
#38 0.096   libgtk-3-bin* libgtkd-3-0* libvte-2.91-0* libvted-3-0* nodejs-doc*
#38 0.096   r-base-dev* tilix* ubuntu-mono*
#38 0.248 0 upgraded, 0 newly installed, 13 to remove and 0 not upgraded.
#38 0.248 After this operation, 99.6 MB disk space will be freed.
...
(Reading database ... 70597 files and directories currently installed.)
#38 0.304 Removing r-base-dev (4.1.2-1ubuntu2) ...
#38 0.319 Removing gfortran (4:11.2.0-1ubuntu1) ...
#38 0.340 Removing gfortran-11 (11.4.0-1ubuntu1~22.04) ...
#38 0.356 Removing tilix (1.9.4-2build1) ...
#38 0.377 Removing libvted-3-0:amd64 (3.10.0-1ubuntu1) ...
#38 0.392 Removing libvte-2.91-0:amd64 (0.68.0-1) ...
#38 0.407 Removing libgtk-3-bin (3.24.33-1ubuntu2) ...
#38 0.422 Removing libgtkd-3-0:amd64 (3.10.0-1ubuntu1) ...
#38 0.436 Removing nodejs-doc (12.22.9~dfsg-1ubuntu3.4) ...
#38 0.457 Removing libgtk-3-0:amd64 (3.24.33-1ubuntu2) ...
#38 0.488 Removing ubuntu-mono (20.10-0ubuntu2) ...
#38 0.754 Removing humanity-icon-theme (0.6.16) ...
#38 1.362 Removing adwaita-icon-theme (41.0-1ubuntu1) ...
#38 1.537 Processing triggers for libc-bin (2.35-0ubuntu3.6) ...
#38 1.566 Processing triggers for mailcap (3.70+nmu1ubuntu1) ...
#38 1.577 Processing triggers for libglib2.0-0:amd64 (2.72.4-0ubuntu2.2) ...
(Reading database ... 56946 files and directories currently installed.)
#38 1.645 Purging configuration files for libgtk-3-0:amd64 (3.24.33-1ubuntu2) ...
#38 1.657 Purging configuration files for ubuntu-mono (20.10-0ubuntu2) ...
#38 1.670 Purging configuration files for humanity-icon-theme (0.6.16) ...
#38 1.682 Purging configuration files for adwaita-icon-theme (41.0-1ubuntu1) ...
#38 DONE 1.7s

#39 [33/36] RUN apt-get autoremove --purge -y
#39 0.061 Reading package lists...
#39 0.075 Building dependency tree...
#39 0.077 Reading state information...
#39 0.083 The following packages will be REMOVED:
#39 0.083   at-spi2-core* bzip2-doc* dbus-user-session* dconf-gsettings-backend*
#39 0.083   dconf-service* gsettings-desktop-schemas* gtk-update-icon-cache*
#39 0.083   hicolor-icon-theme* libatk-bridge2.0-0* libatk1.0-0* libatk1.0-data*
#39 0.083   libatspi2.0-0* libbz2-dev* libcairo-gobject2* libcolord2* libdconf1*
#39 0.083   libepoxy0* libgfortran-11-dev* libgtk-3-common* libjs-highlight.js*
#39 0.083   libllvm11* libncurses-dev* libncurses5-dev* libphobos2-ldc-shared98*
#39 0.083   libreadline-dev* librsvg2-2* librsvg2-common* libvte-2.91-common*
#39 0.083   libwayland-client0* libwayland-cursor0* libwayland-egl1* libxdamage1*
#39 0.083   libxkbcommon0* session-migration* tilix-common* xkb-data*
#39 0.231 0 upgraded, 0 newly installed, 36 to remove and 0 not upgraded.
#39 0.231 After this operation, 124 MB disk space will be freed.
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Manually tested.
- Passed GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#46258 from panbingkun/remove_packages_on_ubuntu.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>