
Conversation

@GideonPotok (Contributor) commented May 6, 2024

What changes were proposed in this pull request?

SPARK-47353

Pull requests:

  • Scala TreeMap (RB Tree) (this PR)
  • GroupMapReduce <- most performant
  • Comparing Experimental Approaches

Central Change to Mode eval Algorithm:

  • Update to eval Method: The eval method now checks whether the aggregated column is a string type with a non-default collation and, if so, uses a TreeMap to manage keys and counts. Because TreeMap takes a comparator to determine key equality, it can efficiently combine counts for keys that are equal under that comparator but were treated as distinct by the OpenHashMap, which uses the default physical equality. A minimal sketch of the idea follows.
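
Below is a hedged, self-contained sketch of this merging step (not the PR's exact code). String.CASE_INSENSITIVE_ORDER stands in for CollationFactory.fetchCollation(collationId).comparator, and the data is illustrative only:

```scala
import scala.collection.immutable.TreeMap

object TreeMapModeSketch {
  def main(args: Array[String]): Unit = {
    // Counts as a physical-equality OpenHashMap would have produced them.
    val physicalCounts = Seq("Def" -> 2L, "def" -> 3L, "DEF" -> 2L, "abc" -> 3L)

    // Stand-in for CollationFactory.fetchCollation(collationId).comparator.
    val collationOrdering: Ordering[String] =
      Ordering.comparatorToOrdering(String.CASE_INSENSITIVE_ORDER)

    // Fold into a TreeMap keyed by the collation ordering, summing the counts
    // of keys that the comparator considers equal.
    val merged = physicalCounts.foldLeft(TreeMap.empty[String, Long](collationOrdering)) {
      case (acc, (key, count)) => acc.updated(key, acc.getOrElse(key, 0L) + count)
    }

    println(merged)                // "Def"/"def"/"DEF" collapse into one entry with count 7
    println(merged.maxBy(_._2)._1) // the mode under the case-insensitive ordering
  }
}
```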

Minor Change to Mode:

  • Introduction of collationId: A new lazy value collationId is computed from the dataType of the child expression, used to fetch the appropriate collation comparator when collationEnabled is true.

Unit Test Enhancements: Significant additions to CollationStringExpressionsSuite to test new functionality including:

  • Tests for the Mode function when handling strings with different collation settings.

Benchmark Updates:

  • Enhanced the CollationBenchmark classes to include benchmarks for the new mode functionality with and without collation settings, as well as numerical types.

Why are the changes needed?

  1. Ensures consistency in handling string comparisons under various collation settings.
  2. Improves global usability by enabling compatibility with different collation standards.

Does this PR introduce any user-facing change?

Yes, this PR introduces the following user-facing changes:

  1. Adds a new collationEnabled property to the Mode expression.
  2. Users can now specify collation settings for the Mode expression to customize its behavior.

How was this patch tested?

This patch was tested through a combination of new and existing unit and end-to-end SQL tests.

  1. Unit Tests:

    • CollationStringExpressionsSuite:
      • TODO/WIP: Align the newly added tests with the design pattern of the existing tests
    • Added multiple test cases to verify that the Mode function correctly handles strings with different collation settings.
    • Specifically, tests include:
      • UTF8_BINARY_LCASE Collation: Verified that the function behaves as expected under the lowercase (case-insensitive) collation.
      • Null Handling: Included tests for correct handling of null values with various collation settings.
  2. Benchmark Tests:

    • TODO DISCUSS PERFORMANCE
  3. Manual Testing:

 ./build/mvn -DskipTests clean package 
export SPARK_HOME=/Users/gideon/repos/spark
$SPARK_HOME/bin/spark-shell
   spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
    import org.apache.spark.sql.types.StringType
    import org.apache.spark.sql.functions
    import spark.implicits._
    val data = Seq(("Def"), ("def"), ("DEF"), ("abc"), ("abc"))
    val df = data.toDF("word")
    val dfLC = df.withColumn("word",
      col("word").cast(StringType("UTF8_BINARY_LCASE")))
    val dfLCA = dfLC.agg(org.apache.spark.sql.functions.mode(functions.col("word")).as("count"))
    dfLCA.show()
/*
BEFORE:
+-----+
|count|
+-----+
|  abc|
+-----+

AFTER:
+-----+
|count|
+-----+
|  Def|
+-----+

*/
  4. Continuous Integration (CI):
    • The patch passed all relevant Continuous Integration (CI) checks, including:
      • Unit test suite
      • Benchmark suite [TODO RUN AND ATTACH RESULTS]

Was this patch authored or co-authored using generative AI tooling?

No, but the PR description was co-authored with ChatGPT.

What is left to do before taking off the WIP tag?

  • Discuss/analyze performance in PR Description
  • Make sure benchmarks invoke the function enough times to have non-zero values in the benchmark results
  • Run benchmarks in GHA and add resultant benchmark reports to Pull Request
  • Make the newly added tests more in the same design pattern as the existing tests
  • Make the newly added tests cover all collations
  • Figure out the within-group return value (e.g. with a lowercase collation, for Seq("Def", "def", "def", "DEF", "abc", "abc", "abc"), is Def, DEF, or def the answer?). Choices are:
  1. First in the data frame (Def)
  2. Within-group mode
  3. Arbitrary/undefined
  4. Physical ordering
    Need to check, but I think it is arbitrary in Mode.

@github-actions github-actions bot added the SQL label May 6, 2024
@GideonPotok (Contributor, Author) commented May 6, 2024

@uros-db This is ready for the initial round of review. Let me know what you think!

@dongjoon-hyun dongjoon-hyun changed the title SPARK-47353: Enable collation support for the Mode expression in Spark. [SPARK-47353][SQL] Enable collation support for the Mode expression May 6, 2024
@GideonPotok GideonPotok marked this pull request as ready for review May 7, 2024 00:58
@GideonPotok GideonPotok force-pushed the spark_47353_2 branch 3 times, most recently from 43b4177 to 4214dd2 on May 7, 2024 02:34
@HyukjinKwon (Member):

Can you fill the PR description please

@GideonPotok GideonPotok changed the title [SPARK-47353][SQL] Enable collation support for the Mode expression [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression May 7, 2024
@GideonPotok (Contributor, Author):

Can you fill the PR description please

@HyukjinKwon done.

@GideonPotok (Contributor, Author):

@cloud-fan @MaxGekk @dbatomic just letting y'all know this is ready for first round of review. Thanks!

val buffer = if (child.dataType.isInstanceOf[StringType] && collationEnabled) {
  val modeMap = buff.foldLeft(
    new TreeMap[org.apache.spark.unsafe.types.UTF8String, Long]()(Ordering.comparatorToOrdering(
      CollationFactory.fetchCollation(collationId).comparator
Contributor:

I like the general approach of using TreeMap with custom comparator - however, I do worry about the performance

it's good that you added the benchmark, so we can take a look at the overall results?

Contributor Author:

I will be running it soon in GHA and will then add the ...results.txt report to the code change.

Here is a sneak peek.

Note that once I make your other suggested changes and remove the collationEnabled flag from the Mode class signature, the UTF8_BINARY case will become the "Collation Not Enabled" case...

[info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4.1
[info] Apple M3 Max
[info] collation unit benchmarks - mode:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE                                  1027           1198         241          0.1       10272.2       1.0X
[info] UNICODE                                            3249           3262          19          0.0       32485.8       0.3X
[info] UTF8_BINARY                                         805            806           1          0.1        8050.6       1.3X
[info] UNICODE_CI                                         3623           3632          13          0.0       36233.6       0.3X
[info] Collation Not Enabled                                28             30           1          3.5         284.6      36.1X
[info] Numerical Type                                       28             29           1          3.6         277.5      37.0X

As you can see, the performance is impacted!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my plan:

  1. Get your thoughts on these performance results...
  2. Analyze the computational complexity in theory. TreeMap and OpenHashMap are standard collection classes, so their documentation states their known big-O complexity; I will add that analysis to the PR description.
  3. Analyze the computational complexity empirically: either extend the benchmark or run an ad-hoc benchmark to measure how the runtime grows as the input grows.

Do you have any thoughts on what else I might want to look into?

Contributor Author:

@uros-db see above comments -- I forgot to tag you

Contributor:

so I think we can agree that a 130x slowdown isn't exactly great... perhaps we should look into other approaches - I would suggest that you open another PR for that tho

Contributor Author:

@uros-db org.apache.spark.util.collection.OpenHashMap is actually a Spark data structure, so I could try to modify it in a second candidate PR and then see about having Mode rely on an OpenHashMap configured to be collation-aware.

The way I could do it would be to modify org.apache.spark.util.collection.OpenHashSet.Hasher.hash(): specifically, override hash within a new class (something like Hasher[String with Collation...]) so that it branches to an alternative, collation-sensitive hashing method; see the sketch after this comment.

(I would not be modifying org.apache.spark.util.collection.OpenHashSet#hashcode, which redistributes the hashes more evenly and is called on the values returned by Hasher.hash.)

This should work because OpenHashMap relies internally on OpenHashSet, which uses org.apache.spark.util.collection.OpenHashSet.Hasher.
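
For illustration, here is a hedged, self-contained sketch of that idea. LocalHasher and the lowercase-based hash are stand-ins for Spark's OpenHashSet.Hasher and a real collation key from CollationFactory; this is not actual Spark code:

```scala
import java.util.Locale

// Stand-in for org.apache.spark.util.collection.OpenHashSet.Hasher.
abstract class LocalHasher[T] extends Serializable {
  def hash(o: T): Int
}

// A collation-aware hash: hashing the lowercased form makes "Def", "def" and
// "DEF" collide, so an OpenHashMap using it (plus collation-aware equality)
// could merge their counts.
class CaseInsensitiveStringHasher extends LocalHasher[String] {
  override def hash(o: String): Int = o.toLowerCase(Locale.ROOT).hashCode
}

object CollationAwareHasherSketch {
  def main(args: Array[String]): Unit = {
    val hasher = new CaseInsensitiveStringHasher
    println(hasher.hash("Def") == hasher.hash("DEF")) // true: same bucket under this collation
  }
}
```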

Contributor Author:

@uros-db I basically implemented a proof of concept here.

I believe the benchmarking I did there (see that PR's description) shows roughly a 28x slowdown rather than an 89x slowdown. (I know we saw a 130x slowdown before, but I modified the benchmarks slightly, and the current numbers are below.)

Here are the results for the originally proposed implementation (this PR), the red-black tree / TreeMap:

[info] collation unit benchmarks - mode - 10000 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 10000 elements                     13             15           5          7.6         130.9       1.0X
[info] UNICODE - mode - 10000 elements                               39             41           2          2.6         392.0       0.3X
[info] UTF8_BINARY - mode - 10000 elements                            1              1           0        129.0           7.7      16.9X
[info] UNICODE_CI - mode - 10000 elements                            39             41           2          2.6         391.4       0.3X
[info] Numerical Type - mode - 10000 elements                         0              1           0        204.5           4.9      26.8X

[info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4.1
[info] Apple M3 Max
[info] collation unit benchmarks - mode - 5000 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 5000 elements                      4              5           0         11.1          89.7       1.0X
[info] UNICODE - mode - 5000 elements                               13             14           1          3.8         266.2       0.3X
[info] UTF8_BINARY - mode - 5000 elements                            0              0           0        140.5           7.1      12.6X
[info] UNICODE_CI - mode - 5000 elements                            13             14           1          3.8         265.1       0.3X
[info] Numerical Type - mode - 5000 elements                         0              0           0        251.4           4.0      22.5X

And here are the results for the OpenHashMap implementation seen here:

collation unit benchmarks - mode - 10000 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 10000 elements                      8              9           1         12.6          79.5       1.0X
[info] UNICODE - mode - 10000 elements                               13             14           1          7.4         135.0       0.6X
[info] UTF8_BINARY - mode - 10000 elements                            1              1           0        120.0           8.3       9.5X
[info] UNICODE_CI - mode - 10000 elements                            14             14           1          7.3         137.9       0.6X
[info] Numerical Type - mode - 10000 elements                         0              1           0        218.0           4.6      17.3X


[info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4.1
[info] Apple M3 Max
[info] collation unit benchmarks - mode - 5000 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 5000 elements                      3              3           0         18.6          53.6       1.0X
[info] UNICODE - mode - 5000 elements                                5              6           0          9.5         104.9       0.5X
[info] UTF8_BINARY - mode - 5000 elements                            0              0           0        148.4           6.7       8.0X
[info] UNICODE_CI - mode - 5000 elements                             5              5           0         11.1          90.3       0.6X
[info] Numerical Type - mode - 5000 elements                         0              0           0        267.6           3.7      14.3X

The thing I notice, though, is that the slowdown is most acute for the Unicode collations. Is there some underlying reason they are so much less performant than either UTF8_BINARY collation? Is it because they can't compare byte by byte? I understand why that would affect the TreeMap implementation, but if it also affects the prototype that uses OpenHashMap with a different hashing function, maybe the cost comes from the conversion to UTF8String, which both implementations perform and which is not related to the choice of data structure.

Please let me know what you think.

Contributor Author (@GideonPotok, May 8, 2024):

Actually, the numerical benchmark case should be ignored for this analysis; the baseline to compare the collated cases against is the UTF8_BINARY case. On that basis, the slowdown for this implementation ranges from 16.9x to 56x, and for the prototype from 9.5x to 15x.

Contributor Author:

I am happy to try additional alternative approaches, BTW.

Contributor Author:

@uros-db

I also added a third approach to the experimental PR #46488.

What kinds of cardinalities are expected? E.g. even if the data frame has one trillion rows, wouldn't it be a highly rare use case to have one trillion unique values? That is worth considering when choosing the range of inputs to benchmark.

And what percentage of values in the benchmarked data frames should be duplicates within the same collation? We are really benchmarking just one very specific case right now, and at least in the experimental branch it could be useful to vary that while we evaluate different algorithms.

Comment on lines +98 to +99
case (_, _) =>
throw new IllegalArgumentException("Mode expects non-null string type.")
Contributor (@uros-db, May 7, 2024):

just checking - is this how Mode handles null values currently in Spark?
(without the collation changes in this PR)

Contributor Author:

@uros-db Yep.

The buffer should never contain a key that is null, because it is not added to the map during Mode.update(). The behavior this supports is that, according to the docs: "NULL values are ignored. If all the values are NULL, or there are 0 rows, returns NULL."

The subtleties of all this are fresh in my mind from looking into it recently, so just let me know if there is any aspect you need more detail on and I am happy to explain. A hedged illustration of the null handling is below.
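
For illustration, a hedged sketch of the null handling described above (a plain mutable.HashMap stands in for OpenHashMap[AnyRef, Long]; this is not Spark's actual Mode.update):

```scala
import scala.collection.mutable

object ModeNullHandlingSketch {
  def main(args: Array[String]): Unit = {
    val buffer = mutable.HashMap.empty[String, Long]
    val inputs: Seq[String] = Seq("abc", null, "abc", null, "def")

    inputs.foreach { key =>
      if (key != null) { // NULL values are ignored, per the documented behaviour
        buffer.update(key, buffer.getOrElse(key, 0L) + 1L)
      }
    }

    println(buffer) // e.g. HashMap(abc -> 2, def -> 1): no null key ever enters the buffer
    val mode = if (buffer.isEmpty) null else buffer.maxBy(_._2)._1
    println(mode)   // "abc"; all-NULL or empty input would return null instead
  }
}
```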

Contributor @uros-db left a comment:

kudos for your effort on this ticket! I left some comments

also, please try to ensure that your code doesn't introduce any breaking changes (for example, I see that SQLQueryTestSuite has a failing test for Mode) before requesting thorough review

however, it's also good to verify the approach as soon as possible - so I would advise making these fixes and committing the benchmark results

@uros-db (Contributor) commented May 7, 2024

just another quick note based on your previous PRs - no need to force push and disrupt the commit history, in general I think it's easier to follow changes in this PR as we comment & resolve conversations, and look at commits as they come

@GideonPotok (Contributor, Author) commented May 7, 2024

kudos for your effort on this ticket! I left some comments

also, please try to ensure that your code doesn't introduce any breaking changes (for example, I see that SQLQueryTestSuite has a failing test for Mode) before requesting thorough review

however, it's also good to verify the approach as soon as possible - so I would advise making these fixes and committing the benchmark results

@uros-db Thanks for the review! I hear you that I should have waited before bringing in the other reviewers.

Thanks for initial feedback!

@GideonPotok GideonPotok requested a review from uros-db May 8, 2024 22:12

override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
if (buffer.isEmpty) {
override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
Contributor:

Could we avoid the change?

Contributor Author:

Either I rename here or on line 83. If I change line 83 instead, I will have to change all of the references to buffer on lines 85 and 86. Do you have a particular preference, and if so, why?

Contributor:

Some changes may be unavoidable, but since we can choose - let's opt for keeping the method signature intact

return null
}
val buffer = if (isCollatedString(child)) {
val modeMap = buff.foldLeft(
Contributor:

Place val collationId: Int = child.dataType.asInstanceOf[StringType].collationId here, so that we avoid the lazy val.

Contributor Author (@GideonPotok, May 9, 2024):

I am copying a pattern I have seen in other collation-aware Expression implementations, but I am fine with making the suggested change.

Contributor:

I think this is fine either way. We should generally aim to include collationId as a field of the expression (as we've done for other expressions), and there should be no harm since it's a lazy val.

on the other hand, collationId is used only in one place here, but still I'd keep it as it is


@GideonPotok GideonPotok changed the title [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using a scala TreeMap (RB Tree) May 10, 2024
cloud-fan pushed a commit that referenced this pull request Jun 26, 2024
### What changes were proposed in this pull request?

[SPARK-47353](https://issues.apache.org/jira/browse/SPARK-47353)

#### Pull requests

[Scala TreeMap (RB Tree)](#46404)
[GroupMapReduce](#46526) <- Most performant
[GroupMapReduce (Cleaned up) (This PR)](#46597) <- Most performant
 [Comparing Experimental Approaches ](#46488)

https://github.com/apache/spark/pull/46597/files#r1626058908 -> #46917 (comment)

#### Central Change to Mode `eval` Algorithm:
 - Update to `eval` Method: The `eval` method now checks whether the aggregated column is a string type with a non-default collation and, if so, groups the buffer counts with `groupMapReduce`:
 ```
buff.toSeq.groupMapReduce {
  case (key: String, _) =>
    CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
  case (key: UTF8String, _) =>
    CollationFactory.getCollationKey(key, collationId)
  case (key, _) => key
}(x => x)((x, y) => (x._1, x._2 + y._2)).values
```
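
For clarity, here is a hedged, self-contained illustration of what this grouping does (Scala 2.13+), with `toLowerCase` standing in for `CollationFactory.getCollationKey`; the names and data are illustrative only, not taken from the PR:

```scala
object GroupMapReduceModeSketch {
  def main(args: Array[String]): Unit = {
    // Counts as a physical-equality OpenHashMap would have produced them.
    val buff = Seq("Def" -> 2L, "def" -> 3L, "DEF" -> 2L, "abc" -> 3L)

    val merged = buff.groupMapReduce {
      case (key, _) => key.toLowerCase          // group by a collation key (lowercase stand-in)
    }(x => x)((x, y) => (x._1, x._2 + y._2))    // keep one spelling, sum counts within a group
      .values.toList

    println(merged)                // e.g. List((Def,7), (abc,3)): Def/def/DEF are merged
    println(merged.maxBy(_._2)._1) // "Def" with this input order, matching the AFTER output above
  }
}
```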

#### Minor Change to Mode:
- Introduction of `collationId`: A new lazy value `collationId` is computed from the `dataType` of the `child` expression, used to fetch the appropriate collation comparator when `collationEnabled` is true.

This PR will fail for complex types containing collated strings; a follow-up PR will implement support for them.

 #### Unit Test Enhancements: Significant additions to `CollationStringExpressionsSuite` to test new functionality including:
- Tests for the `Mode` function when handling strings with different collation settings.

 #### Benchmark Updates:
- Enhanced the `CollationBenchmark` classes to include benchmarks for the new mode functionality with and without collation settings, as well as numerical types.

### Why are the changes needed?

1. Ensures consistency in handling string comparisons under various collation settings.
2. Improves global usability by enabling compatibility with different collation standards.

### Does this PR introduce _any_ user-facing change?

Yes, this PR introduces the following user-facing changes:

1. Adds a new `collationEnabled` property to the `Mode` expression.
2. Users can now specify collation settings for the `Mode` expression to customize its behavior.

### How was this patch tested?

This patch was tested through a combination of new and existing unit and end-to-end SQL tests.

1. **Unit Tests:**
   - **CollationStringExpressionsSuite:**
     - The newly added tests follow the same design pattern as the existing tests
     - Added multiple test cases to verify that the `Mode` function correctly handles strings with different collation settings.

Out of scope: special Unicode cases (higher planes).

Tests do not need to include null handling.

2. **Benchmark Tests:**

3. **Manual Testing:**

```
 ./build/mvn -DskipTests clean package
export SPARK_HOME=/Users/gideon/repos/spark
$SPARK_HOME/bin/spark-shell
   spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
    import org.apache.spark.sql.types.StringType
    import org.apache.spark.sql.functions
    import spark.implicits._
    val data = Seq(("Def"), ("def"), ("DEF"), ("abc"), ("abc"))
    val df = data.toDF("word")
    val dfLC = df.withColumn("word",
      col("word").cast(StringType("UTF8_BINARY_LCASE")))
    val dfLCA = dfLC.agg(org.apache.spark.sql.functions.mode(functions.col("word")).as("count"))
    dfLCA.show()
/*
BEFORE:
+-----+
|count|
+-----+
|  abc|
+-----+

AFTER:
+-----+
|count|
+-----+
|  Def|
+-----+

*/
```

4. **Continuous Integration (CI):**
   - The patch passed all relevant Continuous Integration (CI) checks, including:
     - Unit test suite
     - Benchmark suite
     - Consider moving the new benchmark to the catalyst module

### Was this patch authored or co-authored using generative AI tooling?
Nope!

Closes #46597 from GideonPotok/spark_47353_3_clean.

Lead-authored-by: GideonPotok <[email protected]>
Co-authored-by: Gideon Potok <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
