
Conversation

CTCC1 (Contributor) commented Apr 15, 2024

What changes were proposed in this pull request?

https://issues.apache.org/jira/projects/SPARK/issues/SPARK-47845

Support `Column` type in the `split` function for Scala and Python. SQL already supports this, but the Scala/Python functions currently don't.

A hypothetical example to illustrate:

```scala
import org.apache.spark.sql.functions.{col, split}

val example = spark.createDataFrame(
    Seq(
      ("Doe, John", ", ", 2),
      ("Smith,Jane", ",", 2),
      ("Johnson", ",", 1)
    )
  )
  .toDF("name", "delim", "expected_parts_count")

example.createOrReplaceTempView("test_data")

// works for SQL
spark.sql("SELECT split(name, delim, expected_parts_count) AS name_parts FROM test_data").show()

// currently doesn't compile for Scala, but will be supported after this PR
example.withColumn("name_parts", split(col("name"), col("delim"), col("expected_parts_count"))).show()
```
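
The corresponding PySpark call enabled by this PR would look like this (a sketch mirroring the Scala example above, assuming an existing `spark` session):

```python
from pyspark.sql import functions as F

example = spark.createDataFrame(
    [("Doe, John", ", ", 2), ("Smith,Jane", ",", 2), ("Johnson", ",", 1)],
    ["name", "delim", "expected_parts_count"],
)

# After this PR, pattern and limit can be Column expressions too:
example.withColumn(
    "name_parts",
    F.split(F.col("name"), F.col("delim"), F.col("expected_parts_count")),
).show()
```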

Why are the changes needed?

I have a use case that requires splitting a String-typed column using delimiters defined in other columns of the DataFrame.

I am also following this guidance:

* This function APIs usually have methods with `Column` signature only because it can support not
* only `Column` but also other types such as a native string. The other variants currently exist
* for historical reasons.

We might want to do this for a number of other functions as well, but I will start with this one, and maybe do more if I have time and people generally think it's useful.

Does this PR introduce any user-facing change?

Yes, a new feature (via function overloading in Scala, and broader type handling in Python).

How was this patch tested?

Added unit tests with `Column` input types to the Scala functions.
Added doctests for PySpark and PySpark Connect.
Added the explain output and plan protos generated for the JVM connect test.

Was this patch authored or co-authored using generative AI tooling?

No.

Others

The contribution is my original work and I license the work to the project under the project’s open source license.

CTCC1 changed the title from "[SPARK-47845][SQL][PYTHON][CONNECT] Support column type in split function for scala and python" to "[SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python" on Apr 15, 2024
CTCC1 marked this pull request as draft on April 15, 2024 01:46
CTCC1 (Contributor, Author) commented Apr 15, 2024

We might want to consider renaming variables that shadow Python builtins, such as `str`. It's annoying here because it breaks the usage of `isinstance(pattern, str)`.
Given that renaming the variable would be a backwards-incompatible change (for user code that passes arguments by keyword) and needs further discussion, I will work around it by aliasing the Python builtin (see the workaround comment in the code).
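
To make the shadowing concrete, a minimal sketch (the parameter name `str` mirrors the existing signature; the alias is the workaround):

```python
from builtins import str as py_str  # alias the builtin before it is shadowed

def split(str, pattern, limit=-1):  # `str` kept for backwards compatibility
    # Inside this body, `str` refers to the argument, not the builtin type,
    # so isinstance(pattern, str) would raise TypeError; the alias still works:
    return isinstance(pattern, py_str)

print(split("a,b", ","))  # True
```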

CTCC1 force-pushed the SPARK-47845-support-column-type-in-split-function branch from c7aa7af to dd68c9f on April 15, 2024 08:26
CTCC1 marked this pull request as ready for review on April 15, 2024 08:36
CTCC1 force-pushed the SPARK-47845-support-column-type-in-split-function branch from dd68c9f to a48100d on April 15, 2024 17:17
CTCC1 (Contributor, Author) commented Apr 16, 2024

This is actually my first PR here :)
@zhengruifeng, based on git blame, you did something very similar before. Do you want to take a look? Thanks in advance!

Contributor commented:

I think the `lit` function accepts both `Column` and `int`.

CTCC1 (Contributor, Author) replied:

`limit` can also be of type `str`, referring to a column, hence the check, to avoid turning a column name into a string literal.

Contributor replied:

Yeah, we can use it for `pattern`.

Comment on lines 2484 to 2498
Contributor commented:

Suggested change:
```diff
-# work around shadowing of str in the input variable name
-from builtins import str as py_str
-if isinstance(pattern, py_str):
-    _pattern = lit(pattern)
-elif isinstance(pattern, Column):
-    _pattern = pattern
-else:
-    raise PySparkTypeError(
-        error_class="NOT_COLUMN_OR_STR",
-        message_parameters={"arg_name": "pattern", "arg_type": type(pattern).__name__},
-    )
-limit = lit(limit) if isinstance(limit, int) else _to_col(limit)
-return _invoke_function("split", _to_col(str), _pattern, limit)
+limit = lit(limit) if isinstance(limit, int) else _to_col(limit)
+return _invoke_function("split", _to_col(str), lit(pattern), limit)
```
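
For context on why the simplification works: `lit` passes a `Column` argument through unchanged and wraps a plain string as a literal, so `lit(pattern)` covers both cases. A minimal sketch, assuming classic PySpark behavior:

```python
from pyspark.sql import Column, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A Column passes through lit unchanged, so no isinstance dispatch is needed:
assert isinstance(F.lit(F.col("delim")), Column)
# A plain string becomes a string-literal Column:
assert isinstance(F.lit(","), Column)
```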

CTCC1 (Contributor, Author) replied Apr 16, 2024:

Thanks for the suggestion, this is simpler for sure! The only concern is that we will no longer raise `PySparkTypeError` if `pattern` is passed with a type other than `Column` or `str`; it will still form an `UnresolvedFunction`. Is raising such an error early a requirement for Connect?

For now I simplified the code but kept the error raise. Let me know if we want to skip raising the error.

Contributor replied:

Only a few functions have such a check, and most functions don't check the types.
We might need to figure out an easy way to do type checking.

As for this function, let's keep it simpler for now.

CTCC1 (Contributor, Author) replied:

Sure, I removed the type check for now. Maybe in the future we can standardize this, e.g. with a decorator that inspects the function signature and does type checking accordingly.
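
A hypothetical sketch of that decorator idea (illustrative only, not an existing PySpark API):

```python
import functools
import inspect

def check_arg_types(func):
    """Validate arguments against plain class annotations before calling func."""
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            annotation = sig.parameters[name].annotation
            # Only enforce plain class annotations; skip unannotated params.
            if (
                annotation is not inspect.Parameter.empty
                and isinstance(annotation, type)
                and not isinstance(value, annotation)
            ):
                raise TypeError(
                    f"{name} should be {annotation.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper

@check_arg_types
def repeat(s: str, n: int) -> str:
    return s * n

repeat("ab", 3)       # ok
# repeat("ab", "3")   # raises TypeError: n should be int, got str
```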

Contributor commented:

Please add more doctests in the Examples section to test the newly supported types.

Those doctests will automatically be reused in the Spark Connect Python Client.
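
For example, something along these lines (a sketch; values and column names are illustrative):

```python
>>> df = spark.createDataFrame([("a,b,c", ",", 2)], ["s", "delim", "n"])
>>> df.select(split(df.s, df.delim, df.n).alias("parts")).collect()
[Row(parts=['a', 'b,c'])]
```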

zhengruifeng (Contributor) commented:

also cc @HyukjinKwon and @LuciferYang

Contributor commented:

Let's add some new test cases for the connect module in PlanGenerationTestSuite.

Contributor commented:

@CTCC1 You also need to run `connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite` to generate the golden files needed for reverse validation testing.

CTCC1 (Contributor, Author) replied:

Thanks for the pointer, fixed.

HyukjinKwon (Member) commented:

I am fine with this change

CTCC1 force-pushed the SPARK-47845-support-column-type-in-split-function branch from 3d70901 to 46355bb on April 20, 2024 04:20
zhengruifeng (Contributor) commented:

thank you @CTCC1

merged to master

HyukjinKwon pushed a commit that referenced this pull request Jun 17, 2024
…rs for several functions in pyspark/scala

### What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-48555
For PySpark, added the ability to use the `Column` type or column names for the parameters of the following functions:

- array_remove
- array_position
- map_contains_key
- substring

For Scala, added the ability to use the `Column` type for the parameters of the `substring` function.

This functionality already exists in the SQL syntax:
```
select array_remove(col1, col2) from values (array(1,2,3), 2)
```

however, it isn't possible to do the same in python
```python
df.select(F.array_remove(F.col("col1"), F.col("col2")))
```
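
For reference, a minimal runnable sketch of the call above (hypothetical data, assuming this change):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 3], 2)], ["col1", "col2"])

# With this change, the element to remove can be a Column instead of a literal:
df.select(F.array_remove(F.col("col1"), F.col("col2")).alias("removed")).show()
```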

Note that in Scala, the functions other than `substring` already accepted `Column` params (or rather, they accept `Any` and pass whatever the param is to `lit`, so it ends up working), so I only needed to change `substring` on the Scala side.

### Why are the changes needed?
To align the Scala/Python API with the SQL one.

### Does this PR introduce _any_ user-facing change?
Yes, added new overloaded functions in Scala and changed type hints/docstrings in Python.

### How was this patch tested?
Added doctests for the python changes, and tests in the scala test suites, then tested both manually and using the CI.

### Was this patch authored or co-authored using generative AI tooling?
No.

### Notes:

- I opened the related JIRA ticket, but it looks like I don't have the option to assign it to myself, so if that is required and any reviewer has the permissions for it, I'd appreciate it.
- The "Build" workflow passed successfully, but the "report test results" one didn't, due to some authorization issue; I see that this is the same for many other open PRs right now, so I assume it's OK.
- For the Python changes, I tried to follow the convention used by other functions (such as `array_contains` or `when`) of using `value._jc if isinstance(value, Column) else value`.
- I'm not really familiar with the `connect` functions, but it seems that on the Python side they already supported the use of columns, so no extra changes were needed there.
- For the Scala side, this is the first time I'm touching Scala; I think I covered it all, as I tried to match similar changes done in [a similar PR](#46045).
- The same issue also exists for `substring_index`; however, I wasn't able to fix that one the same way I did for `substring`. Calling it with a `lit` for the `count` arg worked, but using a `col` fails with a `NumberFormatException` for "project_value_3". I assume the error is related to trying to parse the Int [here](https://github.com/apache/spark/blob/7cba1ab4d6acef4e9d73a8e6018b0902aac3a18d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L1449). At that point I got lost in the Scala code and decided to drop it, but if anyone knows how to fix this error, I could change that function as well.

The contribution is my original work and I license the work to the project under the project’s open source license.

Closes #46901 from Ronserruya/support_columns_in_pyspark_functions.

Authored-by: Ron Serruya <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
