
Conversation

@Ronserruya Ronserruya commented Jun 6, 2024

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-48555
For PySpark, added the ability to use the Column type or column names as the parameters of the following functions:

  • array_remove
  • array_position
  • map_contains_key
  • substring

For Scala, added the ability to use the Column type as parameters of the substring function.

This functionality already exists in the SQL syntax:

select array_remove(col1, col2) from values (array(1,2,3), 2)

However, it isn't possible to do the same in Python:

df.select(F.array_remove(F.col("col1"), F.col("col2")))

Note that in Scala, the functions other than substring already accepted Column params (or rather, they accept Any and pass whatever the param is to lit, so it ends up working), so I only needed to change substring on the Scala side.
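
For example, after this change the SQL query above can be expressed directly in PySpark. A minimal sketch of the intended usage (assuming an active SparkSession):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same data as the SQL example above
df = spark.sql("select array(1, 2, 3) as col1, 2 as col2")

# The element argument can now be a Column instead of only a literal value
df.select(F.array_remove(F.col("col1"), F.col("col2"))).show()

# Likewise for substring: pos and len can now be Columns
df2 = spark.sql("select 'Spark SQL' as s, 2 as p, 3 as n")
df2.select(F.substring(df2.s, df2.p, df2.n)).show()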

Why are the changes needed?

To align the Scala/Python API with the SQL one.

Does this PR introduce any user-facing change?

Yes, added new overloaded functions in Scala and changed type hints/docstrings in Python.

How was this patch tested?

Added doctests for the Python changes and tests in the Scala test suites, then tested both manually and via the CI.

Was this patch authored or co-authored using generative AI tooling?

No.

Notes:

  • I opened the related JIRA ticket, but it looks like I don't have the option to assign it to myself, so if that is required and any reviewer has the permissions for it, I'd appreciate it
  • The "Build" workflow passed successfully, but the "report test results" one failed due to an authorization issue; I see the same failure on many other open PRs right now, so I assume it's OK.
  • For the Python changes, I tried to follow the convention used by other functions (such as array_contains or when) of using value._jc if isinstance(value, Column) else value (see the sketch after these notes)
  • I'm not really familiar with the Connect functions, but it seems that on the Python side they already supported the use of Columns, so no extra changes were needed there
  • On the Scala side, this is the first time I'm touching Scala; I think I covered it all, as I tried to match the changes done in a similar PR
  • The same issue also exists for substring_index; however, I wasn't able to fix it the same way I did for substring. Calling it with a lit for the count arg worked, but using a col fails with a NumberFormatError for "project_value_3". I assume the error is related to trying to parse the Int here. At that point I got lost in the Scala code and decided to drop it, but if anyone knows how to fix this error, I could change that function as well.
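
A minimal illustration of that conversion convention in Classic (py4j) mode. The helper name to_java_arg is made up for readability; the real functions inline this expression before handing arguments to the internal _invoke_function helper:

from typing import Any

from pyspark.sql import SparkSession, Column

spark = SparkSession.builder.getOrCreate()

def to_java_arg(value: Any) -> Any:
    # Unwrap a Column into its underlying py4j JavaObject; leave native Python
    # values (int, str, ...) untouched so they are sent to the JVM as literals.
    return value._jc if isinstance(value, Column) else value

df = spark.createDataFrame([([1, 2, 3], 2)], ["col1", "col2"])
to_java_arg(df.col2)  # JavaObject wrapping the column expression
to_java_arg(2)        # plain int, passed through unchanged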

The contribution is my original work and I license the work to the project under the project’s open source license.

@Ronserruya Ronserruya changed the title [SPARK-48555] Support using Columns as parameters for several functions in pyspark/scala [SPARK-48555][SQL][PYTHON][CONNECT] Support using Columns as parameters for several functions in pyspark/scala Jun 6, 2024
@Ronserruya
Contributor Author

Tagging @CTCC1 @LuciferYang @zhengruifeng, as you wrote/reviewed the similar PR I mentioned 😄

"""
from pyspark.sql.classic.column import _to_java_column

value = value._jc if isinstance(value, Column) else value
Contributor

ditto

Contributor

The "Example 3: Check for key using a column" case was already supported

Contributor Author

Is it?
In Spark 3.5.1 this gives me an error:

df = spark.sql("select map(1, 2, 3, 4) as m, 1 as k")
df.select(F.map_contains_key(df.m, df.k))

# pyspark.errors.exceptions.base.PySparkTypeError: [NOT_ITERABLE] Column is not iterable.

which makes sense, since this passes a Column to _invoke_function, which only accepts native types or JavaObject args

Contributor

@zhengruifeng zhengruifeng Jun 14, 2024

I see, it was not supported in Classic mode, but supported in Connect mode.

Classic:

In [2]: df = spark.sql("select map(1, 2, 3, 4) as m, 1 as k")
   ...: df.select(F.map_contains_key(df.m, df.k))
---------------------------------------------------------------------------
PySparkTypeError                          Traceback (most recent call last)
Cell In[2], line 2
      1 df = spark.sql("select map(1, 2, 3, 4) as m, 1 as k")
----> 2 df.select(F.map_contains_key(df.m, df.k))

...

File ~/Dev/spark/python/pyspark/sql/classic/column.py:415, in Column.__iter__(self)
    414 def __iter__(self) -> None:
--> 415     raise PySparkTypeError(
    416         error_class="NOT_ITERABLE", message_parameters={"objectName": "Column"}
    417     )

PySparkTypeError: [NOT_ITERABLE] Column is not iterable.

Connect:

In [1]: from pyspark.sql import functions as F

In [2]: df = spark.sql("select map(1, 2, 3, 4) as m, 1 as k")
   ...: df.select(F.map_contains_key(df.m, df.k))
Out[2]: DataFrame[map_contains_key(m, k): boolean]

There is a slight difference in the handling of value: Any typed parameters: Spark Connect always converts value: Any to a Column/Expression (because of the requirement of the UnresolvedFunction proto), while some functions (e.g. map_contains_key) in Classic don't do this.

We will need to revisit all the Any typed parameters in functions. cc @HyukjinKwon
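
A conceptual sketch of the difference (the function names here are illustrative, not the actual internals on either side):

from typing import Any

from pyspark.sql import Column, functions as F

def connect_style_arg(value: Any) -> Column:
    # Connect coerces every argument into a Column expression (literals via lit),
    # because the UnresolvedFunction proto can only carry expressions.
    return value if isinstance(value, Column) else F.lit(value)

def classic_style_arg(value: Any) -> Any:
    # Some Classic wrappers (e.g. map_contains_key before this PR) forward the raw
    # value to _invoke_function, which only accepts native types or a JavaObject,
    # so a user-supplied Column raises an error.
    return value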

Contributor

This is an interesting behavior difference. What's the reason for not converting the classic PySpark value to a column/expression?

Contributor

I guess it was not by design; it seems to be just due to the type mismatch in the internal helper functions.

@Ronserruya Ronserruya requested a review from zhengruifeng June 12, 2024 23:51
.. versionchanged:: 3.4.0
Supports Spark Connect.

.. versionchanged:: 4.0.0
Contributor

Let's move it under the parameters section; you may refer to:

scale : :class:`~pyspark.sql.Column` or int, optional
An optional parameter to control the rounding behavior.
.. versionchanged:: 4.0.0
Support Column type.

.. versionchanged:: 3.4.0
Supports Spark Connect.

.. versionchanged:: 4.0.0
Contributor

ditto

.. versionchanged:: 3.4.0
Supports Spark Connect.

.. versionchanged:: 4.0.0
Contributor

ditto

.. versionchanged:: 3.4.0
Supports Spark Connect.

.. versionchanged:: 4.0.0
Contributor

ditto

@zhengruifeng
Contributor

LGTM, only needs a few minor doc changes. Thanks for working on this.

@Ronserruya
Contributor Author

@zhengruifeng Fixed the docstrings, thanks for reviewing :)
I don't see the merge button, so I assume I don't have the permission for it.

@HyukjinKwon
Member

Merged to master.
