[SPARK-26493][SQL] Allow multiple spark.sql.extensions #23398

jamisonbennett · 2018-12-28T17:13:07Z

What changes were proposed in this pull request?

Allow multiple spark.sql.extensions to be specified in the
configuration.

How was this patch tested?

New tests are added.

jamisonbennett · 2018-12-28T17:14:56Z

This is my original work and I license the work to the project under the project’s open source license.
I have signed an ICLA with apache.org.

jamisonbennett · 2018-12-28T17:42:45Z

@sameeragarwal, @gatorsmile, @RussellSpitzer, please review my changes.

gatorsmile · 2018-12-28T18:24:18Z

ok to test

RussellSpitzer · 2018-12-28T18:28:32Z

Lgtm +1

RussellSpitzer · 2018-12-28T18:30:04Z

Actually how does this work from a spark defaults perspective, does a comma separated string work?

gatorsmile · 2018-12-28T18:35:54Z

@RussellSpitzer See this example.

    val conf = new SparkConf()
    val seq = ConfigBuilder(testKey("seq")).stringConf.toSequence.createWithDefault(Seq())
    conf.set(seq.key, "1,,2, 3 , , 4")
    assert(conf.get(seq) === Seq("1", "2", "3", "4"))
    conf.set(seq, Seq("1", "2"))
    assert(conf.get(seq) === Seq("1", "2"))
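The assertions above imply that a comma-separated value is split on commas, that each entry is trimmed, and that empty entries are dropped. A minimal sketch of that behaviour in plain Scala follows. It is not Spark's implementation: `ConfListSketch` and `parseList` are hypothetical names used only for illustration.

```scala
// Hypothetical helper, not part of Spark: mimics the behaviour that the
// assertions above imply for a comma-separated configuration value.
object ConfListSketch {
  def parseList(raw: String): Seq[String] =
    raw.split(",")        // split on commas
      .map(_.trim)        // drop whitespace around each entry
      .filter(_.nonEmpty) // discard empty entries such as the one in ",,"
      .toSeq
}
```

Under these assumptions, a value such as "org.example.ExtA, org.example.ExtB" (made-up class names) would yield two entries, in the order written.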

dongjoon-hyun · 2018-12-28T20:05:08Z

This PR looks useful. Thank you for your first contribution, @jamisonbennett.

dongjoon-hyun · 2018-12-28T20:13:14Z

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

+      assert(session.sessionState.functionRegistry
+        .lookupFunction(MyExtensions.myFunction._1).isDefined)
+      assert(session.sessionState.functionRegistry
+        .lookupFunction(MyExtensions2.myFunction._1).isDefined)


So, now we have multiple extension registrations. The order of extension names might have side-effects.

Can we have a test case for duplicated extension names? MyExtension2 and MyExtension2?

Can we have a negative test case for function name conflicts? MyExtension2.myFunction and MyExtension3.myFunction?

I think the order matters, but we need to discuss and document the behavior when we have name conflicts.

For example, the same rule will be added twice in extendedResolutionRules. Is it desired?

    class MyExtensions extends (SparkSessionExtensions => Unit) {
      def apply(e: SparkSessionExtensions): Unit = {
        e.injectResolutionRule(MyRule)
      }
    }

    class MyExtensions2 extends (SparkSessionExtensions => Unit) {
      def apply(e: SparkSessionExtensions): Unit = {
        e.injectResolutionRule(MyRule)
      }
    }

Yep. If there is no reason to allow that, we had better disallow that by design before this PR.

There are use cases where you want to execute rules in a certain order. So I think it is reasonable to add the same rule multiple times. If you want more control you could even create 'micro' optimizer batches by calling multiple rules from one rule.

I think this is more a matter of proper documentation than one where we should explicitly block things. Also note that this is a pretty advanced feature and by this stage users are expected to know what they are doing.

Prior to this change, it was possible to register multiple extensions programmatically, but not through the spark.sql.extensions configuration. The programmatic path wasn't documented or tested until this pull request. For example, the following works without this pull request:

    SparkSession.builder()
      .master("..")
      .withExtensions(sparkSessionExtensions1)
      .withExtensions(sparkSessionExtensions2)
      .getOrCreate()

So I think conflicting function names are already possible (but not documented) in the following cases:

Conflicting function names are registered by calling .withExtensions() multiple times

An extension accidentally registers a function that was already registered with the builtin functions

An extension accidentally registers a function multiple times by calling injectFunction(myFunction)

As for the order, it looks to me like the last function to be stored with conflicting names is the one which is retrieved:

    class SimpleFunctionRegistry extends FunctionRegistry {

      @GuardedBy("this")
      private val functionBuilders =
        new mutable.HashMap[FunctionIdentifier, (ExpressionInfo, FunctionBuilder)]

      override def registerFunction(
          name: FunctionIdentifier,
          info: ExpressionInfo,
          builder: FunctionBuilder): Unit = synchronized {
        functionBuilders.put(normalizeFuncName(name), (info, builder))
      }

I will update this PR to document the order of operations and what happens on conflicts. If we need to explicitly block duplicate functions from being registered, I can temporarily drop this PR and see about making those changes first.

Thanks for explaining it. We do not need to block it, but we might need to detect and throw a warning message at least.

More importantly, we need to document the current behavior and also add a test case to ensure the future changes will not break it. In the future, we can revisit the current behavior and make a change if needed.

Can we have a test case for duplicated extension names? Done
Can we have a negative test case for function name conflicts? MyExtension2.myFunction and MyExtension3.myFunction? Done

I added documentation for the behavior.
I added a warning message if a registered function is replaced.
I added a test case for the ordering.

hvanhovell · 2018-12-28T21:39:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala


  val SPARK_SESSION_EXTENSIONS = buildStaticConf("spark.sql.extensions")
-    .doc("Name of the class used to configure Spark Session extensions. The class should " +
+    .doc("List of the class names used to configure Spark Session extensions. The classes should " +


Please document in what order rules from multiple extensions are executed, how listeners are invoked and how functions are registered (last one wins?). Be sure to cover what happens if you add duplicate listeners/rules/functions etc...

A suggestion for updating the comment: replace 'List of the class names used to configure Spark Session extensions. The classes should implement Function1[SparkSessionExtension, Unit], and must have a no-args constructor.' with 'A comma-separated list of classes that implement Function1[SparkSessionExtension, Unit] used to configure Spark Session extensions. The classes must have a no-args constructor.'

I updated the documentation as suggested with respect to the ordering. I think the comment about "listeners" is related to code that is close in proximity to the code for this pull request, but it is a different configuration item. So I didn't update the spark.sql.queryExecutionListeners documentation. If you think that documentation should be updated, let me know if I should include it as a part of this pull request or as a separate pull request.

SparkQA · 2018-12-28T22:25:17Z

Test build #100511 has finished for PR 23398 at commit cef8eb0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
.doc(\"List of the class names used to configure Spark Session extensions. The classes should \" +

SparkQA · 2018-12-30T07:54:18Z

Test build #100544 has finished for PR 23398 at commit 689a4d2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class SimpleFunctionRegistry extends FunctionRegistry with Logging

hvanhovell · 2018-12-31T16:50:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala

      builder: FunctionBuilder): Unit = synchronized {
-    functionBuilders.put(normalizeFuncName(name), (info, builder))
+    val normalizedName = normalizeFuncName(name)
+    if (functionBuilders.put(normalizedName, (info, builder)).isDefined) {


It would be great if we can check if the new function and the old function are different. This will help to increase the signal of the error message.

I added a check which will only log if different function objects are registered. The "allow an extension to be duplicated" unit test that I previously added registers the same object twice, so it no longer prints the warning. The "use the last registered function name when there are duplicates" unit test that I previously added registers different functions with the same name, so it prints the warning.
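The behaviour described in this thread (the last registration wins, and a warning is emitted only when a different function object replaces an existing one) can be sketched in plain Scala. This is an illustration only, not Spark's FunctionRegistry: `RegistrySketch`, its `warnings` buffer, and the lower-casing used as name normalization are all assumptions made for the example.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a function registry; names and types are
// illustrative and do not match Spark's FunctionRegistry API.
class RegistrySketch[F] {
  private val builders = mutable.HashMap.empty[String, F]

  // Warnings are collected rather than logged so they are easy to inspect.
  val warnings = mutable.ArrayBuffer.empty[String]

  def register(name: String, fn: F): Unit = synchronized {
    val normalized = name.toLowerCase // assumed normalization rule
    // put always stores the new value and returns the previous one, if any.
    builders.put(normalized, fn) match {
      case Some(previous) if previous != fn =>
        warnings += s"The function $normalized replaced a previously registered function."
      case _ => // first registration, or the same object registered again
    }
  }

  def lookup(name: String): Option[F] = synchronized {
    builders.get(name.toLowerCase)
  }
}
```

Registering the same object twice leaves `warnings` empty, while registering a different object under the same name records one warning and makes the newer object the one returned by `lookup`.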

SparkQA · 2019-01-02T16:20:34Z

Test build #100643 has finished for PR 23398 at commit d89cfd9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

bersprockets · 2019-01-02T18:02:49Z

retest this please

SparkQA · 2019-01-02T22:02:32Z

Test build #100647 has finished for PR 23398 at commit d89cfd9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala

HyukjinKwon

Looks okay. It's unstable + experimental FWIW.

felixcheung · 2019-01-05T05:12:04Z

@hvanhovell @gatorsmile ?

SparkQA · 2019-01-05T22:21:54Z

Test build #100800 has finished for PR 23398 at commit fb4ad34.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2019-01-08T20:46:06Z

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

+      assert(session.sessionState.analyzer.extendedCheckRules.containsSlice(orderedCheckRules))
+      assert(session.sessionState.optimizer.batches.flatMap(_.rules).filter(orderedRules.contains)
+        .containsSlice(orderedRules ++ orderedRules)) // The optimizer rules are duplicated
+      assert(session.sessionState.sqlParser == parser)


In all these asserts, use === and !==.

That's actually arguable, Vanzin. Some people prefer === whereas some prefer ==. Given my tests, === doesn't always seem to report a better error message. See also databricks/scala-style-guide#36.

Based on databricks/scala-style-guide#36, it looks like == might now be preferred over ===. For what it's worth, it seems that in the cases for this test == produces reasonable error messages such as MyParser(org.apache.spark.sql.SparkSession@6e8a9c30,org.apache.spark.sql.catalyst.parser.CatalystSqlParser$@5d01ea21) did not equal IntentionalErrorThatIInsertedHere and 2 did not equal 3. So please let me know if there is newer guidance to use === and I can make the changes.

== and === are not equivalent, and we use === in tests. The latter, for example, handles arrays correctly, which the former does not.
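The array point can be illustrated with the standard library alone. In Scala, == on two arrays compares references, so an element-wise comparison has to be requested explicitly. The sketch below does not use ScalaTest; `ArrayEqualitySketch` is a hypothetical name, and the statement that ScalaTest's === compares arrays element by element is taken from the comment above rather than demonstrated here.

```scala
// Hypothetical illustration using only the Scala standard library.
object ArrayEqualitySketch {
  val a: Array[Int] = Array(1, 2, 3)
  val b: Array[Int] = Array(1, 2, 3)

  // == on arrays is reference equality, so two distinct arrays with the
  // same contents are not considered equal.
  val referenceEqual: Boolean = a == b

  // sameElements compares the contents position by position.
  val elementsEqual: Boolean = a.sameElements(b)
}
```

Here `referenceEqual` is false and `elementsEqual` is true, which is the gap an assertion on arrays written with plain == would fall into.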

Thanks for the follow up, I was not aware of that difference. I updated the tests to use === and !== as originally recommended.

@vanzin, where does it say we use ===? If that is the practice, let's document it.

HyukjinKwon · 2019-01-09T02:45:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala

+    functionBuilders.put(normalizedName, newFunction) match {
+      case Some(previousFunction) if previousFunction != newFunction =>
+        logWarning(s"The function $normalizedName replaced a previously registered function.")
+      case _ => Unit


This returns the Unit type object. It can just be removed.

I made a change here which I think was what you were recommending. I didn't want to remove the case _ =>, because otherwise I think it may result in a non-exhaustive match exception. If you wanted me to remove the entire match statement, just let me know.
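Both concerns in this exchange can be shown in a small, self-contained sketch. `MatchSketch` and its method names are hypothetical, and the code is not taken from the patch. It assumes only standard Scala semantics: the term `Unit` names the companion object, `()` is the unit value, and a match with no applicable case throws scala.MatchError.

```scala
// Hypothetical illustration, not the code from this pull request.
object MatchSketch {
  // A catch-all case keeps the match total. Its body is the unit value ();
  // writing `Unit` there would name the companion object instead.
  def warnIfReplaced(previous: Option[Int], current: Int, log: String => Unit): Unit =
    previous match {
      case Some(p) if p != current => log("replaced")
      case _ => ()
    }

  // Without the catch-all case, any input that matches no case throws
  // scala.MatchError at runtime. @unchecked only silences the
  // exhaustiveness warning; it does not make the match total.
  def withoutDefault(previous: Option[Int], current: Int, log: String => Unit): Unit =
    (previous: @unchecked) match {
      case Some(p) if p != current => log("replaced")
    }
}
```

Calling `withoutDefault(None, 1, ...)` fails with a MatchError, whereas `warnIfReplaced(None, 1, ...)` simply does nothing, which is why keeping some form of the default case is the safer choice.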

HyukjinKwon · 2019-01-09T02:52:40Z

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

-    new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended usage" ),
+    new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended usage"),
    (myArgs: Seq[Expression]) => Literal(5, IntegerType))
+


I would remove this newline. Looks unrelated.

HyukjinKwon · 2019-01-09T02:54:38Z

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

+    e.injectPostHocResolutionRule(MyRule2)
+    e.injectCheckRule(MyCheckRule2)
+    e.injectOptimizerRule(MyRule2)
+    e.injectParser((_, _) => CatalystSqlParser)


nit: e.injectParser((_: SparkSession, _: ParserInterface) => CatalystSqlParser)

I also made the suggested change in 2 other places in this file so that it is consistent.

HyukjinKwon · 2019-01-09T02:55:29Z

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

  val myFunction = (FunctionIdentifier("myFunction"),
-    new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended usage" ),
+    new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended usage"),
    (myArgs: Seq[Expression]) => Literal(5, IntegerType))


nit: (_: Seq[Expression]) => Literal(5, IntegerType))

HyukjinKwon · 2019-01-09T02:56:43Z

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

+object MyExtensions2Duplicate {
+
+  val myFunction = (FunctionIdentifier("myFunction2"),
+    new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "last wins" ),


nit "last wins" -> "extended usage"

nit: " ) -> ")

I made both changes and also updated one of the tests to validate the ExpressionInfo object rather than the extended usage text.

HyukjinKwon · 2019-01-09T02:57:00Z

sql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala

+object MyExtensions2 {
+
+  val myFunction = (FunctionIdentifier("myFunction2"),
+    new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended usage" ),


nit: " ) -> ")

HyukjinKwon · 2019-01-09T03:00:07Z

sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala

 * }}}
 *
+ * The extensions can also be used by setting the Spark SQL configuration property
+ * spark.sql.extensions, for example:


nit: spark.sql.extensions -> 'spark.sql.extensions'

Shall we also mention that multiple rules can be set via a comma-separated list?

I added the recommended documentation updates and also added code marks around withExtensions and fixed the typo there. I didn't think it was necessary to provide the example of using the comma-separated string but I did note it in the documentation.

SparkQA · 2019-01-09T15:56:17Z

Test build #100971 has finished for PR 23398 at commit 9c0181d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-01-09T16:00:37Z

retest this please

SparkQA · 2019-01-09T17:21:30Z

Test build #100968 has finished for PR 23398 at commit 65a5f3f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-01-09T17:45:45Z

Test build #100977 has finished for PR 23398 at commit 9c0181d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-01-09T23:13:55Z

Test build #100986 has finished for PR 23398 at commit deaf73e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-01-10T02:21:18Z

Merged to master.

For === vs ==, if that's not documented, let's document it in databricks/scala-style-guide, which is the official style guide to refer to. If that's not documented, I think both are okay. Both are already being used here and there.

## What changes were proposed in this pull request?

Allow multiple spark.sql.extensions to be specified in the configuration.

## How was this patch tested?

New tests are added.

Closes apache#23398 from jamisonbennett/SPARK-26493.

Authored-by: Jamison Bennett <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

[SPARK-26493][SQL] Allow multiple spark.sql.extensions

cef8eb0

dongjoon-hyun reviewed Dec 28, 2018

hvanhovell reviewed Dec 28, 2018

Address comments from dongjoon-hyun, hvanhovell, beliefer, gatorsmile

689a4d2

This addresses the comments for apache#23398

hvanhovell reviewed Dec 31, 2018

Address comment from hvanhovell

d89cfd9

This addresses the comments for apache#23398

HyukjinKwon reviewed Jan 4, 2019

HyukjinKwon approved these changes Jan 4, 2019

Address comments from HyukjinKwon

fb4ad34

This addresses the comments for apache#23398

vanzin reviewed Jan 8, 2019

HyukjinKwon reviewed Jan 9, 2019

jamisonbennett added 2 commits January 8, 2019 23:19

Address comments from HyukjinKwon, vanzin

65a5f3f

This addresses the comments for apache#23398

Fix a typo

9c0181d

This addresses the comments for apache#23398

Address comments from vanzin

deaf73e

This addresses the comments for apache#23398

asfgit closed this in 1a47233 Jan 10, 2019
