
Conversation

@Ronserruya Ronserruya commented Jun 6, 2024

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-48555
For PySpark, added the ability to use the Column type or column names as the parameters of the following functions:

  • array_remove
  • array_position
  • map_contains_key
  • substring

For Scala, added the ability to use the Column type as parameters of the substring function.

This functionality already exists in the SQL syntax:

select array_remove(col1, col2) from values (array(1,2,3), 2)

However, it isn't possible to do the same in Python:

df.select(F.array_remove(F.col("col1"), F.col("col2")))

Note that in Scala, the functions other than substring already accepted Column params (or rather, they accept Any and pass whatever the param is to lit, so it ends up working), so I only needed to change substring on the Scala side.
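
For example, after this change the SQL query above can be expressed directly in PySpark. A minimal sketch of the intended usage (assuming an active SparkSession):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same data as the SQL example above
df = spark.sql("select array(1, 2, 3) as col1, 2 as col2")

# The element argument can now be a Column instead of only a literal value
df.select(F.array_remove(F.col("col1"), F.col("col2"))).show()

# Likewise for substring: pos and len can now be Columns
df2 = spark.sql("select 'Spark SQL' as s, 2 as p, 3 as n")
df2.select(F.substring(df2.s, df2.p, df2.n)).show()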

Why are the changes needed?

To align the Scala/Python API with the SQL one.

Does this PR introduce any user-facing change?

Yes, added new overloaded functions in Scala and changed type hints/docstrings in Python.

How was this patch tested?

Added doctests for the Python changes and tests in the Scala test suites, then tested both manually and via the CI.

Was this patch authored or co-authored using generative AI tooling?

No.

Notes:

  • I opened the related JIRA ticket, but it looks like I don't have the option to assign it to myself, so if that is required and any reviewer has the permissions for it, I'd appreciate it
  • The "Build" workflow passed successfully, but the "report test results" one failed due to an authorization issue; I see the same failure on many other open PRs right now, so I assume it's OK.
  • For the Python changes, I tried to follow the convention used by other functions (such as array_contains or when) of using value._jc if isinstance(value, Column) else value (see the sketch after these notes)
  • I'm not really familiar with the Connect functions, but it seems that on the Python side they already supported the use of Columns, so no extra changes were needed there
  • On the Scala side, this is the first time I'm touching Scala; I think I covered it all, as I tried to match the changes done in a similar PR
  • The same issue also exists for substring_index; however, I wasn't able to fix it the same way I did for substring. Calling it with a lit for the count arg worked, but using a col fails with a NumberFormatError for "project_value_3". I assume the error is related to trying to parse the Int here. At that point I got lost in the Scala code and decided to drop it, but if anyone knows how to fix this error, I could change that function as well.
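
A minimal illustration of that conversion convention in Classic (py4j) mode. The helper name to_java_arg is made up for readability; the real functions inline this expression before handing arguments to the internal _invoke_function helper:

from typing import Any

from pyspark.sql import SparkSession, Column

spark = SparkSession.builder.getOrCreate()

def to_java_arg(value: Any) -> Any:
    # Unwrap a Column into its underlying py4j JavaObject; leave native Python
    # values (int, str, ...) untouched so they are sent to the JVM as literals.
    return value._jc if isinstance(value, Column) else value

df = spark.createDataFrame([([1, 2, 3], 2)], ["col1", "col2"])
to_java_arg(df.col2)  # JavaObject wrapping the column expression
to_java_arg(2)        # plain int, passed through unchanged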

The contribution is my original work and I license the work to the project under the project’s open source license.

@Ronserruya Ronserruya changed the title [SPARK-48555] Support using Columns as parameters for several functions in pyspark/scala [SPARK-48555][SQL][PYTHON][CONNECT] Support using Columns as parameters for several functions in pyspark/scala Jun 6, 2024
@Ronserruya
Contributor Author

Tagging @CTCC1 @LuciferYang @zhengruifeng, as you wrote/reviewed the similar PR I mentioned 😄

"""
from pyspark.sql.classic.column import _to_java_column

value = value._jc if isinstance(value, Column) else value
Contributor

ditto

Contributor

The "Example 3: Check for key using a column" case was already supported

Contributor Author

Is it?
In Spark 3.5.1 this gives me an error:

df = spark.sql("select map(1, 2, 3, 4) as m, 1 as k")
df.select(F.map_contains_key(df.m, df.k))

# pyspark.errors.exceptions.base.PySparkTypeError: [NOT_ITERABLE] Column is not iterable.

which makes sense, since this passes a Column to _invoke_function, which only accepts native types or JavaObject args

Contributor

@zhengruifeng zhengruifeng Jun 14, 2024

I see, it was not supported in Classic mode, but supported in Connect mode.

Classic:

In [2]: df = spark.sql("select map(1, 2, 3, 4) as m, 1 as k")
   ...: df.select(F.map_contains_key(df.m, df.k))
---------------------------------------------------------------------------
PySparkTypeError                          Traceback (most recent call last)
Cell In[2], line 2
      1 df = spark.sql("select map(1, 2, 3, 4) as m, 1 as k")
----> 2 df.select(F.map_contains_key(df.m, df.k))

...

File ~/Dev/spark/python/pyspark/sql/classic/column.py:415, in Column.__iter__(self)
    414 def __iter__(self) -> None:
--> 415     raise PySparkTypeError(
    416         error_class="NOT_ITERABLE", message_parameters={"objectName": "Column"}
    417     )

PySparkTypeError: [NOT_ITERABLE] Column is not iterable.

Connect:

In [1]: from pyspark.sql import functions as F

In [2]: df = spark.sql("select map(1, 2, 3, 4) as m, 1 as k")
   ...: df.select(F.map_contains_key(df.m, df.k))
Out[2]: DataFrame[map_contains_key(m, k): boolean]

There is a slight difference in the handling of value: Any typed parameters: Spark Connect always converts value: Any to a Column/Expression (because of the requirement of the UnresolvedFunction proto), while some functions (e.g. map_contains_key) in Classic don't do this.

We will need to revisit all the Any typed parameters in functions. cc @HyukjinKwon
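
A conceptual sketch of the difference (the function names here are illustrative, not the actual internals on either side):

from typing import Any

from pyspark.sql import Column, functions as F

def connect_style_arg(value: Any) -> Column:
    # Connect coerces every argument into a Column expression (literals via lit),
    # because the UnresolvedFunction proto can only carry expressions.
    return value if isinstance(value, Column) else F.lit(value)

def classic_style_arg(value: Any) -> Any:
    # Some Classic wrappers (e.g. map_contains_key before this PR) forward the raw
    # value to _invoke_function, which only accepts native types or a JavaObject,
    # so a user-supplied Column raises an error.
    return value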

Contributor

This is an interesting behavior difference. What's the reason for not converting the classic PySpark value to a column/expression?

Contributor

I guess it was not by design; it seems to be just due to the type mismatch in the internal helper functions.

@Ronserruya Ronserruya requested a review from zhengruifeng June 12, 2024 23:51
.. versionchanged:: 3.4.0
Supports Spark Connect.

.. versionchanged:: 4.0.0
Contributor

Let's move it under the parameters section; you may refer to:

scale : :class:`~pyspark.sql.Column` or int, optional
An optional parameter to control the rounding behavior.
.. versionchanged:: 4.0.0
Support Column type.

.. versionchanged:: 3.4.0
Supports Spark Connect.

.. versionchanged:: 4.0.0
Contributor

ditto

.. versionchanged:: 3.4.0
Supports Spark Connect.

.. versionchanged:: 4.0.0
Contributor

ditto

.. versionchanged:: 3.4.0
Supports Spark Connect.

.. versionchanged:: 4.0.0
Contributor

ditto

@zhengruifeng
Contributor

LGTM, only needs a few minor doc changes. Thanks for working on this.

@Ronserruya
Contributor Author

@zhengruifeng Fixed the docstrings, thanks for reviewing :)
I don't see the merge button, so I assume I don't have the permission for it.

@HyukjinKwon
Member

Merged to master.
