-
Notifications
You must be signed in to change notification settings - Fork 29k
[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce #46526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce #46526
Changes from all commits
Commits
Show all changes
44 commits
Select commit
Hold shift + click to select a range
d88555c
Factored in null count.:
GideonPotok 0f0eedf
fixed up tests according to expectations surrounding nulls.
GideonPotok e1a533f
mode benchmark
GideonPotok a98eebe
mode benchmark
GideonPotok e4bf907
remove class member for collatrion enabled
GideonPotok c89af54
remove class member for collatrion enabled
GideonPotok 794b20a
remove class member for collatrion enabled
GideonPotok 3e891df
dataType check can be incorporated into the previous test, so this te…
GideonPotok 0849a21
dataType check can be incorporated into the previous test, so this te…
GideonPotok e79e14e
scalastyle
GideonPotok 9a32243
scalastyle
GideonPotok 506f7fc
fix add back ._1
GideonPotok 16ed98f
fix up
GideonPotok 6891c9b
tests pass
GideonPotok 3a0edac
tests pass
GideonPotok abea836
test
GideonPotok 7fc7561
import _
GideonPotok 05fe1a9
added bm results
GideonPotok a5710d1
tests pass
GideonPotok 76f089f
tests pass
GideonPotok d8ea771
tests pass
GideonPotok cc63899
Merge branch 'master' into spark_47353_3
GideonPotok af187ac
Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expr…
GideonPotok 184317d
removed withCollatedString
GideonPotok cd74bc8
buff->buffer
GideonPotok 8f2525a
added the jdk 17 benchmarks.
GideonPotok 7aca2e3
better benchmark
GideonPotok 81e2b44
better benchmark
GideonPotok dec21a5
better benchmark
GideonPotok 045c007
Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expr…
GideonPotok d85c052
up to date benchmarks
GideonPotok af016f4
scalastyle
GideonPotok 3fbe2b2
scalastyle
GideonPotok ab7fa8e
tests with higher unicode planes - corner cases
GideonPotok c86e01a
removed those unicode tests for now at least
GideonPotok 08a3e0a
removed extra benchmarks
GideonPotok 6ff346d
Merge branch 'master' into spark_47353_3
GideonPotok d3911cb
move tests to CollationSQLExpressionsSuite
GideonPotok 904b9c5
undo inadvertant change
GideonPotok 4ae4534
imports
GideonPotok 0fac136
Add back old benchmark logic, too.
GideonPotok 9b60a31
Update sql/core/src/test/scala/org/apache/spark/sql/CollationStringEx…
GideonPotok 36f38e3
wip
GideonPotok 0f2a456
wip
GideonPotok File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going back to the original issue (why Mode doesn't work already, while otherwise Aggregation generally works with collated strings in Spark), here's what I'm interested in: why does
PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]]not work here automatically?afaik, ordering for
PhysicalStringTypeis defined correctly:so one would naturally expect Mode to work "as is"
did you investigate this maybe?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@uros-db While the ordering for PhysicalStringType is correctly established using
private[sql] val ordering = CollationFactory.fetchCollation(collationId).comparator.compare(_, _), this does not automatically resolve the issue with Mode. To illustrate, consider the example of UTF8_BINARY_LCASE where an input likeMap("a" -> 3L, "b" -> 2L, "B" -> 2L)results in evaluating the maximum over the tuples(2L, "B"), (2L, "b"), (3L, "a")rather than the expected(3L, "a"), (4L, "b"). This indicates that the current approach doesn't aggregate values as required for Mode to operate correctly. Unit tests confirm that Mode otherwise won't work for such cases.