[SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions #42864
Conversation
| abc| value|
+------------------+------+
+---------------+------+
|substr(l, 1, 3)|d[key]|
To avoid this change, I think we can call self.substring in this method. df.l[slice(1, 3)] does not require a certain function name, so keeping it as it was is better.
Ping @peter-toth
// scalastyle:on line.size.limit
case class In(value: Expression, list: Seq[Expression]) extends Predicate {

  def this(valueAndList: Seq[Expression]) = {
does it mean we support select in(a, b, c) now?
Yes, but it was not a goal; it is just a useful side effect.
I'm not sure if this is useful. in(a, b, c) looks pretty weird to me. Can we revert it and treat def in as a special case? The SQL parser also treats IN as a special case and has dedicated syntax for it.
All right, reverted in 8bf64a7.
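For context, a minimal sketch (simplified, not the exact PR code) of why the extra constructor mattered: FunctionRegistry builds expressions from a flat Seq[Expression] of arguments, so a Seq-taking constructor is what would make a SQL-level in(a, b, c) call resolvable. As agreed above, this was reverted and In remains a special case in the Dataset API.

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, In}

object InBuilder {
  // Equivalent of the reverted auxiliary constructor: the first argument is
  // the tested value, the remaining ones form the IN list.
  def fromArgs(valueAndList: Seq[Expression]): In =
    In(valueAndList.head, valueAndList.tail)
}
```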
def count(e: Column): Column = withAggregateFunction {
  e.expr match {

def count(e: Column): Column = {
  val withoutStar = e.expr match {
    // Turn count(*) into count(1)
Unrelated issue: this is not the right place to do this conversion; it should be done in the analyzer. cc @zhengruifeng, does Spark Connect hit the same issue?
No, Spark Connect doesn't hit such an issue.
spark/python/pyspark/sql/connect/functions.py, lines 1014 to 1015 in 89041a4:

def count(col: "ColumnOrName") -> Column:
    return _invoke_function_over_columns("count", col)

spark/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala, line 391 in 89041a4:

def count(e: Column): Column = Column.fn("count", e)
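For reference, a rough sketch of the SQL-side shape under discussion (helper names are assumptions, not the exact PR code; Column.fn is assumed to build an UnresolvedFunction, mirroring the Connect client's version above, and may be package-private depending on the Spark version):

```scala
import org.apache.spark.sql.{functions => F, Column}
import org.apache.spark.sql.catalyst.analysis.UnresolvedStar

object CountSketch {
  def count(e: Column): Column = {
    // Turn count(*) into count(1): a star is not a valid aggregate argument,
    // so replace it with a literal before resolution.
    val withoutStar = e.expr match {
      case _: UnresolvedStar => F.lit(1)
      case _ => e
    }
    Column.fn("count", withoutStar)
  }
}
```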
def schema_of_json(json: Column, options: java.util.Map[String, String]): Column = {
  withExpr(SchemaOfJson(json.expr, options.asScala.toMap))
}

def schema_of_json(json: Column, options: java.util.Map[String, String]): Column =
I think these should be among the few exceptions that we keep as they were, because the expression (SchemaOfJson in this case) takes non-expression inputs, and it's tricky to define a protocol that can pass non-expression inputs as expressions.
Not sure I get your point. SchemaOfJson seems to accept options as expressions: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L776-L778. We just follow Connect here.
Oh I see, then I'm fine with it. It's inevitable that we need to create a map expression from language-native (Scala or Python) map values.
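To make the map-expression point concrete, a minimal sketch (helper name and exact shapes are assumptions, mirroring what the Connect client does rather than this PR's code) of turning a language-native options map into a single map-typed argument:

```scala
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.catalyst.expressions.CreateMap

object OptionsAsExpression {
  // Flatten the options into alternating key/value literals so the whole map
  // can travel through function resolution as one expression.
  def toMapColumn(options: java.util.Map[String, String]): Column = {
    val kvs = options.asScala.toSeq.flatMap { case (k, v) => Seq(lit(k).expr, lit(v).expr) }
    new Column(CreateMap(kvs))
  }
}
```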
case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
  UnresolvedAlias(a, Some(Column.generateAlias))
case u: UnresolvedFunction => UnresolvedAlias(expr, None)
case e if !e.resolved => UnresolvedAlias(expr, None)
Why is this change needed?
Without this change some Python tests fail like sum_udf(col("v2")) + 5 here: https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py#L404C17-L404C40.
I debugged this today and it seems these are AggregateExpressions but the aggregate function is PythonUDAF so they don't match the previous case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] => case. Shall we remove the if condition? Column.generateAlias seems to take care of it: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L44-L50
I changed the AggregateExpression path to case a: AggregateExpression => UnresolvedAlias(a, Some(Column.generateAlias)) but still kept the case _ if !expr.resolved => UnresolvedAlias(expr, None) because in the plus_one(sum_udf(col("v1"))) case an unresolved PythonUDF is passed here.
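In other words, the conversion ends up roughly like the sketch below (simplified; Column.generateAlias is package-private in Spark, so its visibility here is an assumption, and the final fallback case is illustrative only):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedAlias
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression

object NamedExpressionSketch {
  def toNamed(expr: Expression): NamedExpression = expr match {
    // Aggregates (including PythonUDAF-backed ones) get a generated alias.
    case a: AggregateExpression => UnresolvedAlias(a, Some(Column.generateAlias))
    // Anything still unresolved (e.g. an unresolved PythonUDF call) is aliased
    // later, once the analyzer has resolved it.
    case e if !e.resolved => UnresolvedAlias(e, None)
    // Resolved non-aggregate expressions get a plain alias (illustrative only).
    case e => Alias(e, e.sql)()
  }
}
```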
also cc @beliefer @panbingkun
@zhengruifeng Thank you for your ping.
The ultimate goal is to move the Dataset and Connect functions into the sql-api module.
To add more color to @peter-toth's comment: think about a base trait to define these function APIs. Then on the Spark Connect side, we override how each function call is turned into a Column.
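A minimal sketch of that idea (the names below are illustrative, not Spark's actual API):

```scala
import org.apache.spark.sql.Column

// Shared surface: every backend gets the same function definitions for free.
trait FunctionApi {
  // Single extension point: how a function name plus arguments becomes a Column.
  protected def fn(name: String, args: Column*): Column

  def count(e: Column): Column = fn("count", e)
  def upper(e: Column): Column = fn("upper", e)
}

// The classic SQL module would implement fn with an UnresolvedFunction
// expression, while the Connect client would implement it with an
// UnresolvedFunction protobuf message.
```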
This reverts commit d611fd5.
override def builder(e: Seq[Expression]): Expression = {
  assert(e.length == 1, "Defined UDF only has one column")
  val expr = e.head
  assert(expr.resolved, "column should be resolved to use the same type " +
I'm a bit confused about this change. IIUC we always call builder with resolved expressions.
And(
  LessThan(lit(1).expr, lit(5).expr),
  LessThan(lit(6).expr, lit(7).expr)),
EqualTo(lit(0).expr, lit(-1).expr))
Shall we use the DSL to build expressions?
fixed in 580f97b
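For reference, a sketch of the DSL style suggested above (test context assumed; the operator enclosing the original tree is not shown):

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Literal

object DslSketch {
  // The same sub-trees as the explicit constructors above, written with the
  // catalyst expression DSL.
  val conjunction = Literal(1) < Literal(5) && Literal(6) < Literal(7)
  val equality = Literal(0) === Literal(-1)
}
```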
Thanks, merging to master!
Thanks for the review!
Referenced follow-up (#43146, SPARK-45354):

What changes were proposed in this pull request?
This PR proposes bottom-up resolution in ResolveFunctions, which is much faster (it needs fewer resolution rounds) when there are deeply nested UnresolvedFunctions. Such structures are more likely to occur after #42864.

Why are the changes needed?
Performance optimization.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Existing UTs.

Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43146 from peter-toth/SPARK-45354-resolve-functions-bottom-up.
Authored-by: Peter Toth <[email protected]>
Signed-off-by: Peter Toth <[email protected]>
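Illustratively (this is a sketch, not the actual ResolveFunctions rule), bottom-up resolution means a nested call such as f(g(h(x))) can be resolved in a single pass, because each child is resolved before its parent is visited:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedFunction
import org.apache.spark.sql.catalyst.expressions.Expression

object BottomUpSketch {
  // Resolve nested UnresolvedFunctions in one bottom-up traversal, given a
  // lookup that turns a single unresolved call into a resolved expression.
  def resolve(e: Expression, lookup: UnresolvedFunction => Expression): Expression =
    e.transformUp { case u: UnresolvedFunction => lookup(u) }
}
```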
What changes were proposed in this pull request?
This is the first cleanup PR of the ticket to use UnresolvedFunction/FunctionRegistry based resolution in SQL Dataset functions, similar to what Spark Connect does.

Why are the changes needed?
If we can make the SQL and Connect Dataset functions similar, then we can move the functions to the sql-api module.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs.
Was this patch authored or co-authored using generative AI tooling?
No.
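As a final illustration, a minimal sketch (assumed shapes, simplified) of what the change means for a single Dataset function: instead of constructing the concrete catalyst expression eagerly, the function records only a registered name and its arguments, and the analyzer resolves it through the FunctionRegistry later, the way Spark Connect already does.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.analysis.UnresolvedFunction
import org.apache.spark.sql.catalyst.expressions.Upper

object ResolutionStyles {
  // Before: build the catalyst expression directly in the Dataset API.
  def upperEager(e: Column): Column = new Column(Upper(e.expr))

  // After: defer to the analyzer via an UnresolvedFunction lookup in the
  // FunctionRegistry.
  def upperDeferred(e: Column): Column =
    new Column(UnresolvedFunction("upper", Seq(e.expr), isDistinct = false))
}
```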