
Conversation


@ericm-db ericm-db commented Apr 1, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

dtenedor and others added 22 commits April 3, 2024 14:38
…nsure it

### What changes were proposed in this pull request?

This PR adds a unit test to ensure that the fields of the `LogKey` enumeration are sorted alphabetically, as specified by https://issues.apache.org/jira/browse/SPARK-47705.

### Why are the changes needed?

This will make sure that the fields of the enumeration remain easy to read in the future as we add more cases.
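
A minimal sketch of what such an ordering check could look like, assuming `LogKey` is a `scala.Enumeration` in `org.apache.spark.internal` and using a plain ScalaTest suite (the real test lives in `LogKeySuite` and may differ in detail):

```scala
import org.apache.spark.internal.LogKey  // assumed location of the enumeration
import org.scalatest.funsuite.AnyFunSuite

class LogKeySortedSuite extends AnyFunSuite {
  test("LogKey fields are sorted alphabetically") {
    // Compare the declared order (Enumeration ids follow declaration order)
    // with the alphabetical order of the field names.
    val names = LogKey.values.toSeq.map(_.toString)
    assert(names == names.sortBy(identity),
      "LogKey fields must be declared in alphabetical order")
  }
}
```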

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR adds testing coverage only.

### Was this patch authored or co-authored using generative AI tooling?

GitHub Copilot offered some suggestions, but I rejected them.

Closes apache#45857 from dtenedor/logs.

Authored-by: Daniel Tenedorio <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?
Fix formatting of error messages.

Example: In several error messages, the message text and the query plan are concatenated without any separator: `in this locationFilter (dept_id#652`. We should add a colon followed by a space or newline in between.

Before:
```
org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationFilter (dept_id#22 = outer(dept_id#16))
+- SubqueryAlias dept
   +- View (`DEPT`, [dept_id#22, dept_name#23, state#24])
      +- Project [cast(dept_id#25 as int) AS dept_id#22, cast(dept_name#26 as string) AS dept_name#23, cast(state#27 as string) AS state#24]
         +- Project [dept_id#25, dept_name#26, state#27]
            +- SubqueryAlias DEPT
               +- LocalRelation [dept_id#25, dept_name#26, state#27]
. SQLSTATE: 0A000; line 3 pos 19;
```
After:
```
org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this location:
Filter (dept_id#71 = outer(dept_id#65))
+- SubqueryAlias dept
   +- View (`DEPT`, [dept_id#71, dept_name#72, state#73])
      +- Project [cast(dept_id#74 as int) AS dept_id#71, cast(dept_name#75 as string) AS dept_name#72, cast(state#76 as string) AS state#73]
         +- Project [dept_id#74, dept_name#75, state#76]
            +- SubqueryAlias DEPT
               +- LocalRelation [dept_id#74, dept_name#75, state#76]
. SQLSTATE: 0A000; line 3 pos 19;
```

### Why are the changes needed?
Improves error message readability.

### Does this PR introduce _any_ user-facing change?
Yes, error messages become more readable.

### How was this patch tested?
Unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#45825 from jchen5/treenode-error-format.

Authored-by: Jack Chen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…osing

### What changes were proposed in this pull request?

When closing the RocksDB instance, we need to wait until all background work finishes. If not, the following error could be observed:

```
24/03/29 06:47:11 INFO RocksDB StateStoreId(opId=0,partId=0,name=default): [NativeRocksDB-2] [/error_handler.cc:396] Background IO error IO error: No such file or directory: While open a file for appending: /ephemeral/tmp/spark-efd53c17-2b8a-4f80-aca0-b767dc06be3d/StateStoreId(opId=0,partId=0,name=default)-732271d8-03b3-4046-a911-5797804df25c/workingDir-3a85e625-e4fb-4a78-b668-941ca16cc7a2/000008.sst: No such file or directory
24/03/29 06:47:11 ERROR RocksDB StateStoreId(opId=0,partId=0,name=default): [NativeRocksDB-3] [/db_impl/db_impl_compaction_flush.cc:3021] Waiting after background flush error: IO error: No such file or directory: While open a file for appending: /ephemeral/tmp/spark-efd53c17-2b8a-4f80-aca0-b767dc06be3d/StateStoreId(opId=0,partId=0,name=default)-732271d8-03b3-4046-a911-5797804df25c/workingDir-3a85e625-e4fb-4a78-b668-941ca16cc7a2/000008.sst: No such file or directoryAccumulated background error counts: 1
<TRUNCATED LOG>
24/03/29 11:54:09 INFO ShutdownHookManager: Deleting directory /ephemeral/tmp/spark-b5dac908-59cc-4276-80f7-34dab79716b7/StateStoreId(opId=0,partId=0,name=default)-702d3c8f-245e-4119-a763-b8e963d07e7b
24/03/29 06:47:12 INFO ShutdownHookManager: Deleting directory /ephemeral/tmp/spark-efd53c17-2b8a-4f80-aca0-b767dc06be3d/StateStoreId(opId=0,partId=4,name=default)-0eb30b1b-b92f-4744-aff6-85f9efd2bcf2
24/03/29 06:47:12 INFO ShutdownHookManager: Deleting directory /ephemeral/tmp/streaming.metadata-d281c16c-89c7-49b3-b65a-6eb2de6ddb6f
pthread lock: Invalid argument
```
In the RocksDB source code, after this error is thrown, the thread sleeps for 1 second and then re-locks the original mutex:

https://github.com/facebook/rocksdb/blob/e46ab9d4f0a0e63bfc668421e2994efa918d6570/db/db_impl/db_impl_compaction_flush.cc#L2613

From the RocksDB and ShutdownHookManager logs, we can see that `pthread lock: Invalid argument` appears exactly 1 second after RocksDB reports the background IO error, so it is likely this mutex re-lock that fails.
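
A minimal sketch of the general idea, assuming the RocksJava binding exposes `cancelAllBackgroundWork` (this is not the exact Spark patch):

```scala
import org.rocksdb.RocksDB

// Sketch only: stop and wait for background flush/compaction before close(),
// so no native background thread re-locks a mutex after the working directory
// has been deleted. cancelAllBackgroundWork is assumed to exist in RocksJava.
def closeAfterBackgroundWork(db: RocksDB): Unit = {
  try {
    db.cancelAllBackgroundWork(true) // true = block until background jobs finish
  } finally {
    db.close()
  }
}
```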

### Why are the changes needed?

Bug fix for a transient issue

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests should be enough.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#45863 from WweiL/SPARK-47722-rocksdb-cleanup.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

As suggested in https://github.com/apache/spark/pull/45834/files#r1549565157, I am creating initial guidelines for the structured logging framework.

### Why are the changes needed?

We need guidelines to align the logging migration works in the community.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

It's just a doc change.
### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: GitHub Copilot 1.2.17.2887

Closes apache#45862 from gengliangwang/logREADME.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… support

### What changes were proposed in this pull request?
This PR adds automatic casting and collation resolution as per `PGSQL` behaviour:

1. Collations set on the metadata level are implicit
2. Collations set using the `COLLATE` expression are explicit
3. When expressions with multiple collations are combined, the output collation will be:
- if there are explicit collations and all of them are equal, that collation is the output
- if there are multiple different explicit collations, `COLLATION_MISMATCH.EXPLICIT` is thrown
- if there are no explicit collations and only a single kind of non-default collation, that one is used
- if there are no explicit collations and multiple non-default implicit ones, `COLLATION_MISMATCH.IMPLICIT` is thrown

INDETERMINATE_COLLATION should only be thrown on comparison operations and when storing data, and we should be able to combine different implicit collations for certain operations like concat and possibly others in the future.
This is why we add another predefined collation id named INDETERMINATE_COLLATION_ID, which marks a result that combines conflicting non-default implicit collations. Right now it has an id of -1, so it fails if it ever reaches the CollatorFactory.
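
A hedged sketch of the precedence rules above; the method and constant names are illustrative and do not mirror the exact Spark implementation:

```scala
val DEFAULT_COLLATION_ID = 0
val INDETERMINATE_COLLATION_ID = -1 // fails only if it ever reaches the CollatorFactory

// Sketch: resolve the output collation of an expression from the collation ids
// of its children, split into explicit (COLLATE) and implicit (metadata) ones.
def resolveCollation(explicits: Seq[Int], implicits: Seq[Int]): Int = {
  val distinctExplicit = explicits.distinct
  val distinctImplicit = implicits.filterNot(_ == DEFAULT_COLLATION_ID).distinct
  if (distinctExplicit.size > 1) {
    throw new IllegalArgumentException("COLLATION_MISMATCH.EXPLICIT")
  } else if (distinctExplicit.size == 1) {
    distinctExplicit.head // an explicit collation always wins
  } else if (distinctImplicit.size > 1) {
    // Comparisons surface this as COLLATION_MISMATCH.IMPLICIT; combinable
    // operations such as concat may carry it forward as indeterminate.
    INDETERMINATE_COLLATION_ID
  } else {
    distinctImplicit.headOption.getOrElse(DEFAULT_COLLATION_ID)
  }
}
```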

### Why are the changes needed?
We need to be able to compare columns and values with different collations and set a way of explicitly changing the collation we want to use.

### Does this PR introduce _any_ user-facing change?
Yes. We add 3 new errors and enable collation casting.

### How was this patch tested?
Tests in `CollationSuite` were done to check code validity.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#45383 from mihailom-db/SPARK-47210.

Authored-by: Mihailo Milosevic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

Currently, if there is any gRPC exception, PySpark's gRPC error handler prints it to stderr instead of just handling it, which prevents the user from cleanly ignoring the exception with a try/except control-flow statement.

In this PR we remove the `logger.exception` call and rely on the downstream exception mechanism to report the error to the user.

### Why are the changes needed?

Without this change, there is no way for the user to ignore the gRPC error without piping stderr to /dev/null or equivalent.

### Does this PR introduce _any_ user-facing change?

Yes, stderr will no longer have the exception trace written twice.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45840 from nemanja-boric-databricks/no-log.

Authored-by: Nemanja Boric <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…park.speculation.quantile` to 0.9

### What changes were proposed in this pull request?

This PR aims to adjust the following in order to make Spark speculative execution behavior less aggressive from Apache Spark 4.0.0.
- `spark.speculation.multiplier`: 1.5 -> 3
- `spark.speculation.quantile`: 0.75 -> 0.9
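
For users who prefer the previous, more aggressive behaviour, the old defaults can still be set explicitly; a minimal sketch using only the config keys and old values listed above:

```scala
import org.apache.spark.SparkConf

// Restore the pre-4.0 speculation defaults (speculation itself stays opt-in).
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.multiplier", "1.5")
  .set("spark.speculation.quantile", "0.75")
```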

### Why are the changes needed?

Although `spark.speculation` is disabled by default, this has been used in many production use cases.

### Does this PR introduce _any_ user-facing change?

This will make speculative execution less aggressive.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45858 from dongjoon-hyun/SPARK-47720.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…e package

### What changes were proposed in this pull request?

This PR proposes to release a separate `pyspark-connect` package, see also [SPIP: Pure Python Package in PyPI (Spark Connect)](https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing).

Today's PySpark package is roughly as follows:

```
pyspark
├── *.py               # *Core / No Spark Connect support*
├── mllib              # MLlib / No Spark Connect support
├── resource           # Resource profile API / No Spark Connect support
├── streaming          # DStream (deprecated) / No Spark Connect support
├── ml                 # ML
│   └── connect            # Spark Connect for ML
├── pandas             # API on Spark with/without Spark Connect support
└── sql                # SQL
    └── connect            # Spark Connect for SQL
        └── streaming      # Spark Connect for Structured Streaming
```

There will be two packages available, `pyspark` and `pyspark-connect`.

#### `pyspark`

Same as today's PySpark, but the core module is factored out to `pyspark.core.*`. The user-facing interface stays the same at `pyspark.*`.

```
pyspark
├── core               # *Core / No Spark Connect support*
├── mllib              # MLlib / No Spark Connect support
├── resource           # Resource profile API / No Spark Connect support
├── streaming          # DStream (deprecated) / No Spark Connect support
├── ml                 # ML
│   └── connect            # Spark Connect for ML
├── pandas             # API on Spark with/without Spark Connect support
└── sql                # SQL
    └── connect            # Spark Connect for SQL
        └── streaming      # Spark Connect for Structured Streaming
```

#### `pyspark-connect`

The package after excluding modules that do not support Spark Connect, and also excluding jars (that is, `ml` is included but without jars):

```
pyspark
├── ml
│   └── connect
├── pandas
└── sql
    └── connect
        └── streaming
```

### Why are the changes needed?

To provide a pure Python library that does not depend on JVM.

See also [SPIP: Pure Python Package in PyPI (Spark Connect)](https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing).

### Does this PR introduce _any_ user-facing change?

Yes, users can install pure Python library via `pip install pyspark-connect`.

### How was this patch tested?

Manually tested the basic set of tests.

```bash
./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar`
```

```bash
cd python
python packaging/connect/setup.py sdist
cd dist
conda create -y -n clean-py-3.11 python=3.11
conda activate clean-py-3.11
pip install pyspark-connect-4.0.0.dev0.tar.gz
python
```

```python
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
>>> spark.range(10).show()
```
```
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
```

These tests will be added separately and set up as a scheduled job in CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45053 from HyukjinKwon/refactoring-core.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… test to ensure it

### What changes were proposed in this pull request?
This PR aims to fix a bug in apache#45857.

### Why are the changes needed?
In fact, `LogKey.values.toSeq.sorted` did not sort alphabetically as expected.
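
A minimal illustration of the pitfall with made-up enumeration values: `scala.Enumeration` values are ordered by their numeric id, so `.sorted` reflects declaration order rather than name order.

```scala
object DemoKey extends Enumeration {
  val TASK_ID, BLOCK_ID, APP_ID = Value
}

DemoKey.values.toSeq.sorted             // TASK_ID, BLOCK_ID, APP_ID (declaration order)
DemoKey.values.toSeq.sortBy(_.toString) // APP_ID, BLOCK_ID, TASK_ID (alphabetical)
```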

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#45864 from panbingkun/fix_sort_logkey.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
…remote pure Python library

### What changes were proposed in this pull request?

This PR proposes to add an environment variable called `SPARK_CONNECT_TESTING_REMOTE` to set `remote` URL.

### Why are the changes needed?

In order to test the pure Python library against a remote server.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45868 from HyukjinKwon/SPARK-47724.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…f pandas is not available

### What changes were proposed in this pull request?

This is a follow-up of the followings to skip `pandas`-related tests if pandas is not available.
- apache#44852
- apache#45232

### Why are the changes needed?

`pandas` is an optional dependency. We had better skip the related tests without causing failures.

To recover the PyPy 3.8 CI,
- https://github.com/apache/spark/actions/runs/8541011879/job/23421483071

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually test.
```
$ python/run-tests --modules=pyspark-resource --parallelism=1 --python-executables=python3.10
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.10']
Will test the following Python modules: ['pyspark-resource']
python3.10 python_implementation is CPython
python3.10 version is: Python 3.10.13
Starting test(python3.10): pyspark.resource.profile (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/021bc7bb-242f-4cb4-8584-11ed6e711f78/python3.10__pyspark.resource.profile__jn89f1hh.log)
Finished test(python3.10): pyspark.resource.profile (1s)
Starting test(python3.10): pyspark.resource.tests.test_connect_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/244d6c6f-8799-4a2a-b7a7-20d7c50d643d/python3.10__pyspark.resource.tests.test_connect_resources__5ta1tf6e.log)
Finished test(python3.10): pyspark.resource.tests.test_connect_resources (0s) ... 1 tests were skipped
Starting test(python3.10): pyspark.resource.tests.test_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/671e7afa-e764-443f-bc40-7e940d7342ea/python3.10__pyspark.resource.tests.test_resources__lhbp6y5f.log)
Finished test(python3.10): pyspark.resource.tests.test_resources (2s) ... 1 tests were skipped
Tests passed in 4 seconds

Skipped tests in pyspark.resource.tests.test_connect_resources with python3.10:
      test_profile_before_sc_for_connect (pyspark.resource.tests.test_connect_resources.ResourceProfileTests) ... skip (0.005s)

Skipped tests in pyspark.resource.tests.test_resources with python3.10:
      test_profile_before_sc_for_sql (pyspark.resource.tests.test_resources.ResourceProfileTests) ... skip (0.001s)
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45869 from dongjoon-hyun/SPARK-46812.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…red logging framework

### What changes were proposed in this pull request?

Migrate `logError` calls with variables in the core module to the structured logging framework. This is part 1, which transforms `logError` entries from the following API
```
def logError(msg: => String): Unit
```
to
```
def logError(entry: LogEntry): Unit
```
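
A hedged sketch of the call-site migration, assuming the `log` string interpolator and the `MDC`/`LogKey` helpers from the structured logging framework inside a class that mixes in `org.apache.spark.internal.Logging` (the class and key names are illustrative):

```scala
import org.apache.spark.internal.{LogKey, Logging, MDC}

class ExecutorMonitor extends Logging {  // illustrative class name
  def reportLoss(executorId: String): Unit = {
    // Before: plain string interpolation, no structured fields.
    // logError(s"Failed to remove executor $executorId")

    // After: the log"..." interpolator builds a LogEntry and records the value
    // under an MDC key, so it can be queried as a structured field.
    logError(log"Failed to remove executor ${MDC(LogKey.EXECUTOR_ID, executorId)}")
  }
}
```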

### Why are the changes needed?

To enhance Apache Spark's logging system by implementing structured logging.
### Does this PR introduce _any_ user-facing change?

Yes, Spark core logs will contain additional MDC fields.

### How was this patch tested?

Compiler and Scala style checks, as well as code review.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#45834 from gengliangwang/coreError.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?
This patch adds a new mechanism to push query execution progress for batch queries. We add a new response message type and periodically push query progress to the client. The client can consume this data to, for example, display a progress bar.

This patch adds support for displaying a progress bar in the PySpark shell when started with Spark Connect.

The proto message is defined as follows:

```
  // This message is used to communicate progress about the query progress during the execution.
  message ExecutionProgress {
    // Captures the progress of each individual stage.
    repeated StageInfo stages = 1;

    // Captures the currently in progress tasks.
    int64 num_inflight_tasks = 2;

    message StageInfo {
      int64 stage_id = 1;
      int64 num_tasks = 2;
      int64 num_completed_tasks = 3;
      int64 input_bytes_read = 4;
      bool done = 5;
    }
  }
```

Clients can simply ignore the messages or consume them. On top of that, this adds the capability to register a callback for progress tracking on the SparkSession.

```
handler = lambda **kwargs: print(kwargs)
spark.register_progress_handler(handler)
spark.range(100).collect()
spark.remove_progress_handler(handler)
```

#### Example 1
![progress_medium_query_multi_stage mp4](https://github.com/apache/spark/assets/3421/5eff1ec4-def2-4d39-8a75-13a6af784c99)

#### Example 2
![progress_bar mp4](https://github.com/apache/spark/assets/3421/20638511-2da4-4bd6-83f2-da3b9f500bde)

### Why are the changes needed?
Usability and Experience

### Does this PR introduce _any_ user-facing change?
When the user opens the PySpark shell with Spark Connect mode, it will use the progress bar by default.

### How was this patch tested?
Added new tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#45150 from grundprinzip/SPARK-47081.

Authored-by: Martin Grund <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

PySpark worker processes may die while they are idling. Here we aim to provide some resilience by validating process and selection-key aliveness before returning the process from the idle pool.
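
A hedged sketch of the kind of check this implies; the names are illustrative and not the actual `PythonWorkerFactory` code:

```scala
import java.nio.channels.SelectionKey

// Before handing an idle worker back out, verify that the Python process is
// still alive and that its channel/selection key is still usable.
def isWorkerUsable(process: Process, key: SelectionKey): Boolean = {
  process.isAlive && key.isValid && key.channel().isOpen
}
```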

### Why are the changes needed?

To avoid failing queries when a Python worker process crashed while idling.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added an appropriate test case.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#45635 from sebastianhillig-db/python-worker-factory-crash.

Authored-by: Ubuntu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… PySpark

This reverts commit d87ac8e.

It turns out the Spark Connect logger is disabled by default. This makes us lose the ability to log the error message when required (by explicitly turning on the logger).

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#45878 from nemanja-boric-databricks/revert-logger.

Authored-by: Nemanja Boric <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ILED error

### What changes were proposed in this pull request?

This is a follow-up of apache#45797. Instead of detecting query execution errors and not wrapping them, it is better to do the error wrapping only in the data writer, which has more context. We can then provide the specific file path where the error happened, instead of the destination directory name.
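
A hedged sketch of the idea; the error text and helper are illustrative, and only the point that the writer knows the concrete file path comes from this PR:

```scala
// Sketch only: wrap failures where the concrete file path is known, instead of
// detecting them later with only the destination directory at hand.
def writeOneFile(path: String)(write: => Unit): Unit = {
  try {
    write
  } catch {
    case e: Exception =>
      // The real change raises a TASK_WRITE_FAILED-style error; this sketch only
      // shows that the message can now name the specific file being written.
      throw new RuntimeException(s"Task failed while writing rows to $path", e)
  }
}
```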

### Why are the changes needed?

better error message

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#45844 from cloud-fan/write-error.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…connect testcases

### What changes were proposed in this pull request?

This PR proposes to get the proper default port for `pyspark-connect` testcases.

### Why are the changes needed?

`pyspark-connect` cannot access the JVM, so it cannot get the randomized port assigned by the JVM.

### Does this PR introduce _any_ user-facing change?

No, `pyspark-connect` is not published yet, and this is a test-only change.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45875 from HyukjinKwon/SPARK-47729.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This is to document G1 Concurrent GC metrics introduced with https://issues.apache.org/jira/browse/SPARK-44162

### Why are the changes needed?
To improve the documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#45874 from LucaCanali/documentG1GCCurrentMetrics.

Authored-by: Luca Canali <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…sible=true for spark-connect-scala-client

### What changes were proposed in this pull request?

Add `-Dio.netty.tryReflectionSetAccessible=true` for `spark-connect-scala-client`

### Why are the changes needed?

The previous change missed `spark-connect-scala-client`, maybe due to a bad IDEA index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#45860 from pan3793/SPARK-47610-followup.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…red logging framework

### What changes were proposed in this pull request?
This PR aims to migrate `logError` calls with variables in the `MLlib` module to the structured logging framework.

### Why are the changes needed?
To enhance Apache Spark's logging system by implementing structured logging.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#45837 from panbingkun/SPARK-47598.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
…pyarrow to skip tests

### What changes were proposed in this pull request?

This is a follow-up of SPARK-46812 to skip the tests more robustly and to recover PyPy CIs.
- https://github.com/apache/spark/actions/runs/8556900899/job/23447948557

### Why are the changes needed?

- `should_test_connect` covers more edge cases than `have_pandas`.

- `test_resources.py` has Arrow usage too.
https://github.com/apache/spark/blob/25fc67fa114d2c34099c3ab50396870f543c338b/python/pyspark/resource/tests/test_resources.py#L85

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested with `pandas` and without `pyarrow`.

```
$ pip3 freeze | grep pyarrow
$ pip3 freeze | grep pandas
pandas==2.2.1
pandas-stubs==1.2.0.53

$ python/run-tests --modules=pyspark-resource --parallelism=1 --python-executables=python3.10
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.10']
Will test the following Python modules: ['pyspark-resource']
python3.10 python_implementation is CPython
python3.10 version is: Python 3.10.13
Starting test(python3.10): pyspark.resource.profile (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/db9cb886-2698-49d9-a663-9b8bea79caba/python3.10__pyspark.resource.profile__8mg46xru.log)
Finished test(python3.10): pyspark.resource.profile (1s)
Starting test(python3.10): pyspark.resource.tests.test_connect_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/53f979bd-1073-41e6-99ba-8e787edc415b/python3.10__pyspark.resource.tests.test_connect_resources__hrgrs5sk.log)
Finished test(python3.10): pyspark.resource.tests.test_connect_resources (0s) ... 1 tests were skipped
Starting test(python3.10): pyspark.resource.tests.test_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/2b06c671-0199-4827-a0e5-f852a28313fd/python3.10__pyspark.resource.tests.test_resources__jis6mk9a.log)
Finished test(python3.10): pyspark.resource.tests.test_resources (2s) ... 1 tests were skipped
Tests passed in 4 seconds

Skipped tests in pyspark.resource.tests.test_connect_resources with python3.10:
      test_profile_before_sc_for_connect (pyspark.resource.tests.test_connect_resources.ResourceProfileTests) ... skip (0.002s)

Skipped tests in pyspark.resource.tests.test_resources with python3.10:
      test_profile_before_sc_for_sql (pyspark.resource.tests.test_resources.ResourceProfileTests) ... skip (0.001s)
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45880 from dongjoon-hyun/SPARK-46812-2.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…lly enumeration field in `LogEntry` automatically

### What changes were proposed in this pull request?
This PR aims to introduce a tool that can automatically sort the enumeration fields in `LogEntry` alphabetically.

### Why are the changes needed?
This enables developers to more conveniently keep the enumeration values in `LogEntry` in alphabetical order, as required by the structured logging development guidelines.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Manually test.
  ```
  SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "common-utils/testOnly *LogKeySuite -- -t \"LogKey enumeration fields are correctly sorted\""
  ```
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#45867 from panbingkun/SPARK-47723.

Lead-authored-by: panbingkun <[email protected]>
Co-authored-by: panbingkun <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
…ckage

### What changes were proposed in this pull request?

This PR is a followup of apache#45150 that adds the new `shell` module into PyPI package.

### Why are the changes needed?

So that the PyPI package contains the `shell` module.

### Does this PR introduce _any_ user-facing change?

No, the main change has not been released yet.

### How was this patch tested?

The test case will be added in apache#45870; the issue was found while working on that PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45882 from HyukjinKwon/SPARK-47081-followup.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@ericm-db ericm-db closed this Apr 10, 2024
ericm-db pushed a commit that referenced this pull request Apr 16, 2024
### What changes were proposed in this pull request?

In the `Window` node, both `partitionSpec` and `orderSpec` must be orderable, but the current type check only verifies that `orderSpec` is orderable. This can cause an error in later optimization phases.

Given a query:

```
with t as (select id, map(id, id) as m from range(0, 10))
select rank() over (partition by m order by id) from t
```

Before the PR, it fails with an `INTERNAL_ERROR`:

```
org.apache.spark.SparkException: [INTERNAL_ERROR] grouping/join/window partition keys cannot be map type. SQLSTATE: XX000
at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.needNormalize(NormalizeFloatingNumbers.scala:103)
at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.org$apache$spark$sql$catalyst$optimizer$NormalizeFloatingNumbers$$needNormalize(NormalizeFloatingNumbers.scala:94)
...
```

After the PR, it fails with an `EXPRESSION_TYPE_IS_NOT_ORDERABLE` error, which is expected:

```
  org.apache.spark.sql.catalyst.ExtendedAnalysisException: [EXPRESSION_TYPE_IS_NOT_ORDERABLE] Column expression "m" cannot be sorted because its type "MAP<BIGINT, BIGINT>" is not orderable. SQLSTATE: 42822; line 2 pos 53;
Project [RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4]
+- Project [id#1L, m#0, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4]
   +- Window [rank(id#1L) windowspecdefinition(m#0, id#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4], [m#0], [id#1L ASC NULLS FIRST]
      +- Project [id#1L, m#0]
         +- SubqueryAlias t
            +- SubqueryAlias t
               +- Project [id#1L, map(id#1L, id#1L) AS m#0]
                  +- Range (0, 10, step=1, splits=None)
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
...
```

### How was this patch tested?

Unit test.

Closes apache#45730 from chenhao-db/SPARK-47572.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
ericm-db pushed a commit that referenced this pull request Jun 26, 2024
… throw internal error

### What changes were proposed in this pull request?

This PR fixes the error messages and classes when Python UDFs are used in higher order functions.

### Why are the changes needed?

To show the proper user-facing exceptions with error classes.

### Does this PR introduce _any_ user-facing change?

Yes, previously it threw an internal error such as:

```python
from pyspark.sql.functions import transform, udf, col, array
spark.range(1).select(transform(array("id"), lambda x: udf(lambda y: y)(x))).collect()
```

Before:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o74.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 0.0 failed 1 times, most recent failure: Lost task 15.0 in stage 0.0 (TID 15) (ip-192-168-123-103.ap-northeast-2.compute.internal executor driver): org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: <lambda>(lambda x_0#3L)#2 SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
```

After:

```
pyspark.errors.exceptions.captured.AnalysisException: [INVALID_LAMBDA_FUNCTION_CALL.UNEVALUABLE] Invalid lambda function call. Python UDFs should be used in a lambda function at a higher order function. However, "<lambda>(lambda x_0#3L)" was a Python UDF. SQLSTATE: 42K0D;
Project [transform(array(id#0L), lambdafunction(<lambda>(lambda x_0#3L)#2, lambda x_0#3L, false)) AS transform(array(id), lambdafunction(<lambda>(lambda x_0#3L), namedlambdavariable()))#4]
+- Range (0, 1, step=1, splits=Some(16))
```

### How was this patch tested?

A unit test was added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47079 from HyukjinKwon/SPARK-48706.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Kent Yao <[email protected]>