Conversation

@AngersZhuuuu (Owner)

### What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

### How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review https://spark.apache.org/contributing.html before opening a pull request.

AngersZhuuuu pushed a commit that referenced this pull request Jul 26, 2019
…comparison assertions

## What changes were proposed in this pull request?

This PR removes a few hardware-dependent assertions which can cause a failure in `aarch64`.

**x86_64**
```
root@donotdel-openlab-allinone-l00242678:/home/ubuntu# uname -a
Linux donotdel-openlab-allinone-l00242678 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC
2019 x86_64 x86_64 x86_64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res0: Int = -4194304
scala> floatToRawIntBits(Float.NaN)
res1: Int = 2143289344
```

**aarch64**
```
[root@arm-huangtianhua spark]# uname -a
Linux arm-huangtianhua 4.14.0-49.el7a.aarch64 #1 SMP Tue Apr 10 17:22:26 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res1: Int = 2143289344
scala> floatToRawIntBits(Float.NaN)
res2: Int = 2143289344
```
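For reference, here is a short Scala snippet (not from the PR; `raw` and `canonical` are illustrative names) showing why raw NaN bits are hardware-dependent while `floatToIntBits` is portable: the latter canonicalizes every NaN to `0x7fc00000` (2143289344).

```scala
import java.lang.Float.{floatToIntBits, floatToRawIntBits}

// floatToRawIntBits preserves whatever NaN payload the hardware produced for
// 0.0f/0.0f, which differs between x86_64 and aarch64; floatToIntBits
// collapses all NaNs to the canonical pattern and is safe to assert on.
val raw = floatToRawIntBits(0.0f / 0.0f)    // platform-dependent bits
val canonical = floatToIntBits(0.0f / 0.0f) // always 0x7fc00000
assert(canonical == floatToIntBits(Float.NaN))
println(f"raw=0x$raw%08x canonical=0x$canonical%08x")
```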

## How was this patch tested?

Pass the Jenkins build (note that this change removes the affected test coverage).

Closes apache#25186 from huangtianhua/special-test-case-for-aarch64.

Authored-by: huangtianhua <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
AngersZhuuuu pushed a commit that referenced this pull request Oct 27, 2019
### What changes were proposed in this pull request?
`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` has been failing lately. A look at the logs shows only the following fact, without any details:
```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```
Since the issue is intermittent and we are not able to reproduce it, we should add more debug information and wait for a reproduction with the extended logs.
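For context, the JDK's built-in Kerberos debug output (which produces traces like those quoted under "How was this patch tested?" below) can be enabled with standard system properties; this is a sketch of the mechanism, not necessarily how the PR wires it up:

```scala
// Standard JDK properties that turn on the krb5/JGSS debug traces shown
// below (sketch only; the PR may add its debug information differently).
System.setProperty("sun.security.krb5.debug", "true")
System.setProperty("sun.security.jgss.debug", "true")
```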

### Why are the changes needed?
The failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I've started the test manually and checked that such additional debug messages show up:
```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhost@EXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhost@EXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhost@EXAMPLE.COM to client/localhost@EXAMPLE.COM|kafka/localhost@EXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhost@EXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhost@EXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhost@EXAMPLE.COM
```

Closes apache#26252 from gaborgsomogyi/SPARK-29580.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@github-actions bot

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

github-actions bot added the Stale label Dec 18, 2019
github-actions bot closed this Dec 19, 2019
AngersZhuuuu pushed a commit that referenced this pull request Mar 11, 2020
### What changes were proposed in this pull request?
Currently the join operators are not well abstracted, since there is a lot of common logic. A trait can be created for easier pattern matching and other future convenience. This is a follow-up PR based on the comment at apache#27509 (comment).

This PR makes refinements in the following aspects (a standalone sketch follows the lists below):
1. Refined the structure of all physical join operators
2. Added the missing `joinType` field to the `CartesianProductExec` operator
3. Refined code related to EXPLAIN FORMATTED

The EXPLAIN FORMATTED changes are:
1. Converge all join operators' `verboseStringWithOperatorId` implementations into `BaseJoinExec`. The join condition is displayed, and the join keys are displayed if they are not empty.
2. `#1` adds the join condition to `BroadcastNestedLoopJoinExec`.
3. `#1` does **NOT** affect `CartesianProductExec`, `SortMergeJoin` and `HashJoin`s, since they already had their own override implementations.
4. Converge all join operators' `simpleStringWithNodeId` into `BaseJoinExec`, which enhances the one-line description of `CartesianProductExec` by adding its `JoinType`.
5. Override `simpleStringWithNodeId` in `BroadcastNestedLoopJoinExec` to show the `BuildSide`, which was previously done only for `HashJoin`s.
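As a rough illustration, here is a standalone Scala sketch of the trait pattern described above, with `String` standing in for `Expression` (a simplified model; the real `BaseJoinExec` in Spark extends `BinaryExecNode` and works with `SparkPlan`/`Expression` types):

```scala
sealed trait JoinType
case object Inner extends JoinType
case object Cross extends JoinType

trait BaseJoinExec {
  def joinType: JoinType
  def condition: Option[String]   // stands in for Option[Expression]
  def leftKeys: Seq[String]
  def rightKeys: Seq[String]

  // Converged implementation: every join operator shares one formatted output,
  // showing keys only when present, plus the join condition.
  def verboseStringWithOperatorId: String = {
    val keys =
      if (leftKeys.nonEmpty)
        s"Left keys: ${leftKeys.mkString(", ")}\nRight keys: ${rightKeys.mkString(", ")}\n"
      else ""
    s"${keys}Join condition: ${condition.getOrElse("None")}"
  }
}

// CartesianProductExec now carries an explicit joinType (item 2 above).
final case class CartesianProductExec(condition: Option[String]) extends BaseJoinExec {
  override def joinType: JoinType = Cross
  override def leftKeys: Seq[String] = Nil
  override def rightKeys: Seq[String] = Nil
}
```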

### Why are the changes needed?
Make the code consistent with other operators and ease future work on join operators.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes apache#27595 from Eric5553/RefineJoin.

Authored-by: Eric Wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
AngersZhuuuu pushed a commit that referenced this pull request Mar 22, 2021
… correctly

### What changes were proposed in this pull request?

This PR proposes:
  1. `CREATE OR REPLACE TEMP VIEW USING` should use `TemporaryViewRelation` to store temp views.
  2. By doing #1, it fixes the issue where the temp view being replaced was not uncached (see the sketch below).
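A minimal spark-shell sketch of the fixed behavior (hypothetical view name and paths; assumes a `SparkSession` named `spark`):

```scala
// Sketch of the user-visible behavior (hypothetical paths, not from the PR).
spark.sql("CREATE TEMP VIEW v USING parquet OPTIONS (path '/tmp/data1')")
spark.sql("CACHE TABLE v")
spark.sql("CREATE OR REPLACE TEMP VIEW v USING parquet OPTIONS (path '/tmp/data2')")
// Before this fix, the stale plan for the old `v` could linger in the cache;
// with the fix, replacing the view also uncaches it.
println(spark.catalog.isCached("v"))   // expected: false
```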

### Why are the changes needed?

This is a part of an ongoing work to wrap all the temporary views with `TemporaryViewRelation`: [SPARK-34698](https://issues.apache.org/jira/browse/SPARK-34698).

This also fixes a bug where the temp view being replaced is not uncached.

### Does this PR introduce _any_ user-facing change?

Yes, the temp view being replaced with `CREATE OR REPLACE TEMP VIEW USING` is correctly uncached if the temp view is cached.

### How was this patch tested?

Added new tests.

Closes apache#31825 from imback82/create_temp_view_using.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
AngersZhuuuu pushed a commit that referenced this pull request Oct 28, 2022
…ly equivalent children in `RewriteDistinctAggregates`

### What changes were proposed in this pull request?

In `RewriteDistinctAggregates`, when grouping aggregate expressions by function children, treat children that are semantically equivalent as the same.

### Why are the changes needed?

This PR will reduce the number of projections in the Expand operator when there are multiple distinct aggregations with superficially different children. In some cases, it will eliminate the need for an Expand operator.

Example: In the following query, the Expand operator creates 3*n rows (where n is the number of incoming rows) because it has a projection for each of the function children `b + 1`, `1 + b` and `c`.

```
create or replace temp view v1 as
select * from values
(1, 2, 3.0),
(1, 3, 4.0),
(2, 4, 2.5),
(2, 3, 1.0)
v1(a, b, c);

select
  a,
  count(distinct b + 1),
  avg(distinct 1 + b) filter (where c > 0),
  sum(c)
from
  v1
group by a;
```
The Expand operator has three projections (each producing a row for each incoming row):
```
[a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for regular aggregation)
[a#87, (b#88 + 1), null, 1, null, null],          <== projection #2 (for distinct aggregation of b + 1)
[a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for distinct aggregation of 1 + b)
```
In reality, the Expand only needs one projection for `1 + b` and `b + 1`, because they are semantically equivalent.

With the proposed change, the Expand operator's projections look like this:
```
[a#67, null, 0, null, UnscaledValue(c#69)],  <== projection #1 (for regular aggregations)
[a#67, (b#68 + 1), 1, (c#69 > 0.0), null]],  <== projection #2 (for distinct aggregation on b + 1 and 1 + b)
```
With one less projection, Expand produces 2*n rows instead of 3*n rows, but still produces the correct result.

In the case where all distinct aggregates have semantically equivalent children, the Expand operator is not needed at all.
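A standalone Scala sketch of the grouping idea (a toy canonical form; Spark itself groups on `Expression.canonicalized`, and these `Expr` classes are illustrative):

```scala
sealed trait Expr
final case class Col(name: String) extends Expr
final case class Lit(value: Int) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

// Toy canonical form: sort the operands of the commutative '+', so that
// Add(Col("b"), Lit(1)) and Add(Lit(1), Col("b")) collapse to the same key.
def canonical(e: Expr): Expr = e match {
  case Add(l, r) =>
    val (cl, cr) = (canonical(l), canonical(r))
    if (cl.toString <= cr.toString) Add(cl, cr) else Add(cr, cl)
  case other => other
}

val children: Seq[Expr] = Seq(Add(Col("b"), Lit(1)), Add(Lit(1), Col("b")), Col("c"))
val groups = children.groupBy(canonical)
println(groups.size) // 2 groups -> one Expand projection each, instead of 3
```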

Benchmark code in the JIRA (SPARK-40382).

Before the PR:
```
distinct aggregates:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
all semantically equivalent                       14721          14859         195          5.7         175.5       1.0X
some semantically equivalent                      14569          14572           5          5.8         173.7       1.0X
none semantically equivalent                      14408          14488         113          5.8         171.8       1.0X
```
After the PR:
```
distinct aggregates:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
all semantically equivalent                        3658           3692          49         22.9          43.6       1.0X
some semantically equivalent                       9124           9214         127          9.2         108.8       0.4X
none semantically equivalent                      14601          14777         250          5.7         174.1       0.3X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes apache#37825 from bersprockets/rewritedistinct_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
AngersZhuuuu pushed a commit that referenced this pull request Apr 7, 2023
…nto w/ and w/o `ansi` suffix to pass sql analyzer test in ansi mode

### What changes were proposed in this pull request?
After apache#40496, run

```
SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```

One test fails with `spark.sql.ansi.enabled = true`:

```
[info] - timestampNTZ/datetime-special.sql_analyzer_test *** FAILED *** (11 milliseconds)
[info]   timestampNTZ/datetime-special.sql_analyzer_test
[info]   Expected "...date(999999, 3, 18, [false) AS make_date(999999, 3, 18)#x, make_date(-1, 1, 28, fals]e) AS make_date(-1, ...", but got "...date(999999, 3, 18, [true) AS make_date(999999, 3, 18)#x, make_date(-1, 1, 28, tru]e) AS make_date(-1, ..." Result did not match for query #1
[info]   select make_date(999999, 3, 18), make_date(-1, 1, 28) (SQLQueryTestSuite.scala:777)
[info]   org.scalatest.exceptions.TestFailedException:
```

The failure reason is that the last parameter of the `MakeDate` function, `failOnError: Boolean = SQLConf.get.ansiEnabled`, defaults to the session's ANSI setting.
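A toy Scala model of why the same query analyzes differently per mode (illustrative classes, not Spark's): the default argument is read from a mutable conf when the expression is constructed, so the plan string embeds `true` or `false`.

```scala
// Toy model: default arguments are evaluated at each construction, reading
// the current conf value, so the rendered plan differs between ANSI modes.
object Conf { var ansiEnabled: Boolean = false }
final case class MakeDate(year: Int, month: Int, day: Int,
    failOnError: Boolean = Conf.ansiEnabled)

Conf.ansiEnabled = false
println(MakeDate(999999, 3, 18)) // MakeDate(999999,3,18,false)
Conf.ansiEnabled = true
println(MakeDate(999999, 3, 18)) // MakeDate(999999,3,18,true)
```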

So this PR splits `timestampNTZ/datetime-special.sql` into w/ and w/o `ansi` variants to mask this test difference.

### Why are the changes needed?
Make the `SQLQueryTestSuite` tests pass with `spark.sql.ansi.enabled = true`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manually checked with `SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"`

Closes apache#40552 from LuciferYang/SPARK-42921.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
AngersZhuuuu pushed a commit that referenced this pull request Apr 28, 2024
### What changes were proposed in this pull request?

This PR uses SMALLINT instead of BYTE to fix the ByteType mapping for the MsSQLServer JDBC dialect. SQL Server has no BYTE type, and its TINYINT is unsigned with range [0, 255], so it cannot represent Spark's signed ByteType:

```java
[info]   com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #1: Cannot find data type BYTE.
[info]   at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:265)
[info]   at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1662)
[info]   at com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:898)
[info]   at com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:793)
[info]   at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7417)
[info]   at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3488)
[info]   at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:262)
[info]   at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:237)
[info]   at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeUpdate(SQLServerStatement.java:733)
[info]   at org.apache.spark.sql.jdbc.JdbcDialect.createTable(JdbcDialects.scala:267)
```
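For illustration, here is the shape of such a fix sketched against Spark's public `JdbcDialect` extension API (the actual change lives in Spark's built-in `MsSqlServerDialect` and may differ; the object name here is hypothetical):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{ByteType, DataType}

// Sketch: map Spark's signed ByteType to SMALLINT, because SQL Server has no
// BYTE type and its TINYINT is unsigned [0, 255]. Requires Spark on the
// classpath; not the actual MsSqlServerDialect code.
object MsSqlServerByteTypeFix extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ByteType => Some(JdbcType("SMALLINT", java.sql.Types.SMALLINT))
    case _        => None
  }
}
```

A custom dialect of this shape would be registered via `JdbcDialects.registerDialect`.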

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46164 from yaooqinn/SPARK-47938.

Lead-authored-by: Kent Yao <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
AngersZhuuuu pushed a commit that referenced this pull request Nov 3, 2025
### What changes were proposed in this pull request?

This PR aims to disable `SparkBloomFilterSuite` due to the excessive running time.
- SPARK-53077 is filed to re-enable this with the reasonable running time.

### Why are the changes needed?

Previously, the `common/sketch` module took less than 10s.

```
$ mvn package --pl common/sketch
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  7.177 s
[INFO] Finished at: 2025-08-02T08:25:43-07:00
[INFO] ------------------------------------------------------------------------
```

After `SparkBloomFilterSuite` was newly added, it took over 12 minutes, which is too long for a unit test.
- apache#50933

```
[info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyRandomDistribution(long, double, int, org.junit.jupiter.api.TestInfo):#1 started
[info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyEvenOdd(long, double, int, org.junit.jupiter.api.TestInfo):#1 started
[info] Test run finished: 0 failed, 0 ignored, 2 total, 721.939s
```

### Does this PR introduce _any_ user-facing change?

No, this is a test change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51788 from dongjoon-hyun/SPARK-53076.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
AngersZhuuuu pushed a commit that referenced this pull request Nov 3, 2025
… `SparkBloomFilterSuite`

### What changes were proposed in this pull request?
This PR adds an environment variable named `SPARK_TEST_SPARK_BLOOM_FILTER_SUITE_ENABLED` to control whether the `SparkBloomFilterSuite` test case is executed, and ensures that it is only run for validation in the daily tests specified in `build_non_ansi.yml`.
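The "ignored" message in the sbt output below matches JUnit 5's environment-variable condition, so the gating plausibly looks like the following Scala sketch (illustrative class name; the real suite is the Java `SparkBloomFilterSuite`):

```scala
import org.junit.jupiter.api.Test
import org.junit.jupiter.api.condition.EnabledIfEnvironmentVariable

// Sketch: the whole class is skipped unless the env var is set to "true";
// when absent, JUnit reports "Environment variable [...] does not exist".
@EnabledIfEnvironmentVariable(
  named = "SPARK_TEST_SPARK_BLOOM_FILTER_SUITE_ENABLED",
  matches = "true")
class GatedSuiteSketch {
  @Test def smoke(): Unit = assert(1 + 1 == 2)
}
```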

### Why are the changes needed?
`SparkBloomFilterSuite` requires periodic validation, but due to its excessively long execution time (over 10 minutes), it is not suitable for running in the per-change CI pipeline.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual verification:

- maven

```
build/mvn package --pl common/sketch
[INFO] Running org.apache.spark.util.sketch.SparkBloomFilterSuite
[WARNING] Tests run: 2, Failures: 0, Errors: 0, Skipped: 2, Time elapsed: 0.001 s -- in org.apache.spark.util.sketch.SparkBloomFilterSuite
```

```
SPARK_TEST_SPARK_BLOOM_FILTER_SUITE_ENABLED=true build/mvn package --pl common/sketch
[INFO] Running org.apache.spark.util.sketch.SparkBloomFilterSuite
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 401.9 s -- in org.apache.spark.util.sketch.SparkBloomFilterSuite
```

- sbt

```
build/sbt clean "sketch/test"
[info] Test run started (JUnit Jupiter)
[info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite ignored: Environment variable [SPARK_TEST_SPARK_BLOOM_FILTER_SUITE_ENABLED] does not exist
[info] Test run finished: 0 failed, 0 ignored, 0 total, 0.016s
```

```
SPARK_TEST_SPARK_BLOOM_FILTER_SUITE_ENABLED=true build/sbt clean "sketch/test"
[info] Test run started (JUnit Jupiter)
[info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyRandomDistribution(long, double, int, org.junit.jupiter.api.TestInfo):#1 started
[info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyEvenOdd(long, double, int, org.junit.jupiter.api.TestInfo):#1 started
[info] Test run finished: 0 failed, 0 ignored, 2 total, 359.776s
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51806 from LuciferYang/SPARK-53077.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
AngersZhuuuu pushed a commit that referenced this pull request Nov 3, 2025
…e of the closed PRs

### What changes were proposed in this pull request?

This PR aims to improve `merge_spark_pr.py` to stop early in case of the closed PRs.

### Why are the changes needed?

To help committers by removing useless interactions.

**BEFORE**

```
$ dev/merge_spark_pr.py
git rev-parse --abbrev-ref HEAD
Which pull request would you like to merge? (e.g. 34): 1
I've re-written the title as follows to match the standard format:
Original: Removed reference to incubation in README.md.
Modified: Removed reference to incubation in README.md
Would you like to use the modified title? (y/N):
...
```

**AFTER**

```
$ dev/merge_spark_pr.py
git rev-parse --abbrev-ref HEAD
Which pull request would you like to merge? (e.g. 34): 1
#1 is closed already.
Restoring head pointer to SPARK-53277
git checkout SPARK-53277
Already on 'SPARK-53277'
git branch
```

### Does this PR introduce _any_ user-facing change?

No, this is used by committers only.

### How was this patch tested?

Manually checked.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52019 from dongjoon-hyun/SPARK-53277.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>