-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-47353][SQL] Enable collation support for the Mode expression #46597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
0bab248
f054589
5d171d6
a49ccd4
4574546
66457b1
4303b72
e1d7f51
39e2a1b
81cd946
835a450
603b834
e66ec1a
87d8699
357fa3f
82a89a3
254dc1f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,13 +18,14 @@ | |
| package org.apache.spark.sql.catalyst.expressions.aggregate | ||
|
|
||
| import org.apache.spark.sql.catalyst.InternalRow | ||
| import org.apache.spark.sql.catalyst.analysis.{ExpressionBuilder, UnresolvedWithinGroup} | ||
| import org.apache.spark.sql.catalyst.analysis.{ExpressionBuilder, TypeCheckResult, UnresolvedWithinGroup} | ||
| import org.apache.spark.sql.catalyst.expressions.{Ascending, Descending, Expression, ExpressionDescription, ImplicitCastInputTypes, SortOrder} | ||
| import org.apache.spark.sql.catalyst.trees.UnaryLike | ||
| import org.apache.spark.sql.catalyst.types.PhysicalDataType | ||
| import org.apache.spark.sql.catalyst.util.GenericArrayData | ||
| import org.apache.spark.sql.catalyst.util.{CollationFactory, GenericArrayData, UnsafeRowUtils} | ||
| import org.apache.spark.sql.errors.QueryCompilationErrors | ||
| import org.apache.spark.sql.types.{AbstractDataType, AnyDataType, ArrayType, BooleanType, DataType} | ||
| import org.apache.spark.sql.types.{AbstractDataType, AnyDataType, ArrayType, BooleanType, DataType, StringType} | ||
| import org.apache.spark.unsafe.types.UTF8String | ||
| import org.apache.spark.util.collection.OpenHashMap | ||
|
|
||
| case class Mode( | ||
|
|
@@ -48,6 +49,21 @@ case class Mode( | |
|
|
||
| override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) | ||
|
|
||
| override def checkInputDataTypes(): TypeCheckResult = { | ||
| if (UnsafeRowUtils.isBinaryStable(child.dataType) || child.dataType.isInstanceOf[StringType]) { | ||
GideonPotok marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| /* | ||
| * The Mode class uses collation awareness logic to handle string data. | ||
| * Complex types with collated fields are not yet supported. | ||
| */ | ||
| // TODO: SPARK-48700: Mode expression for complex types (all collations) | ||
| super.checkInputDataTypes() | ||
| } else { | ||
| TypeCheckResult.TypeCheckFailure("The input to the function 'mode' was" + | ||
| " a type of binary-unstable type that is " + | ||
| s"not currently supported by ${prettyName}.") | ||
| } | ||
| } | ||
|
|
||
| override def prettyName: String = "mode" | ||
|
|
||
| override def update( | ||
|
|
@@ -74,16 +90,38 @@ case class Mode( | |
| if (buffer.isEmpty) { | ||
| return null | ||
| } | ||
|
|
||
| /* | ||
| * The Mode class uses special collation awareness logic | ||
| * to handle string data types with various collations. | ||
| * | ||
| * For string types that don't support binary equality, | ||
| * we create a new map where the keys are the collation keys of the original strings. | ||
| * | ||
| * Keys from the original map are aggregated based on the corresponding collation keys. | ||
| * The groupMapReduce method groups the entries by collation key and maps each group | ||
| * to a single value (the sum of the counts), and finally reduces the groups to a single map. | ||
| * | ||
| * The new map is then used in the rest of the Mode evaluation logic. | ||
| */ | ||
| val collationAwareBuffer = child.dataType match { | ||
GideonPotok marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| case c: StringType if | ||
| !CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality => | ||
| val collationId = c.collationId | ||
| val modeMap = buffer.toSeq.groupMapReduce { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not expert in this part of code but I wander if we could do better than this. On collation level we do have
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dbatomic What you are proposing would make sense. The complexity is increased. But i can whip up a draft PR and we can see whether it makes sense to proceed.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dbatomic @uros-db here is a mockup/proof of concept of this proposal: #46917. The relevant unit test has passed, which indicates that this approach is viable! Now, we need to consider whether to advance and determine how to integrate the relevant information about key datatype into the OpenHashMap. What are your thoughts on the feasibility of moving forward? I'm primarily concerned about the risks involved: Integrating collation with specialized types and complex hash functions might lead to subtle bugs. Considering the crucial nature of this data structure, we should approach any changes with a detailed plan for validation and with caution. It may be wise to consider less invasive modifications , such as the one proposed in this PR (#46597). Despite these concerns, this approach is functioning, and it touches on a particularly intriguing part of the codebase that I am eager to work on. If you think it's a promising route, I'm ready to complete the implementation and perform further benchmarks. However, I would appreciate some design suggestions as mentioned below. To effectively implement this, I see two possible directions:
Lastly, while I am eager to complete the implementation, I hope to ensure that this is something you would definitively want to pursue, barring any significant performance setbacks revealed by benchmarking. I've developed this proof of concept and it's operational, but a full implementation should ideally be something you are confident is the right direction.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I left some comments in #46917 - overall, I'm liking the new approach, but I agree with you that group mapReduce was far less invasive... Personally, I'm fine with either approach and don't have a hunch on why one might obviously prevail over the other at this point, but there might be good value in drilling deeper and running benchmarks on #46917 - then we would be able to compare, discuss, and opt for our winner After all, we shouldn't spend much more time here, so I suggest the following - if you're feeling up for it @GideonPotok please continue investigating further with the new approach (OHM) until we reach benchmark results for a final decision. If not, I would be happy with cleaning everything up with the old approach (GMR). After all, we can always re-visit this in the future How does this sound? @GideonPotok @dbatomic
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can create another PR for change the method signature.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @beliefer sounds good.. Once it is ready I will tag you in that one, assuming we proceed with that approach.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dbatomic see comment here #46917 (comment) And let us know if on board to continue with this approach. It's always an option to not have mode support collations. But if supporting collations, this is the best way to go. We have tried a bunch of approaches and this way is simple, tested, decoupled and at the correct layer of abstraction, and easy to modify. I can also add support for complex types and for PandasMode in this pr once I know you are on board with this approach. |
||
| case (k, _) => CollationFactory.getCollationKey(k.asInstanceOf[UTF8String], collationId) | ||
| }(x => x)((x, y) => (x._1, x._2 + y._2)).values | ||
| modeMap | ||
| case _ => buffer | ||
| } | ||
| reverseOpt.map { reverse => | ||
| val defaultKeyOrdering = if (reverse) { | ||
| PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]].reverse | ||
| } else { | ||
| PhysicalDataType.ordering(child.dataType).asInstanceOf[Ordering[AnyRef]] | ||
| } | ||
| val ordering = Ordering.Tuple2(Ordering.Long, defaultKeyOrdering) | ||
| buffer.maxBy { case (key, count) => (count, key) }(ordering) | ||
| }.getOrElse(buffer.maxBy(_._2))._1 | ||
| collationAwareBuffer.maxBy { case (key, count) => (count, key) }(ordering) | ||
| }.getOrElse(collationAwareBuffer.maxBy(_._2))._1 | ||
| } | ||
|
|
||
| override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: Int): Mode = | ||
|
|
@@ -128,6 +166,7 @@ case class Mode( | |
| copy(child = newChild) | ||
| } | ||
|
|
||
| // TODO: SPARK-48701: PandasMode (all collations) | ||
| // scalastyle:off line.size.limit | ||
| @ExpressionDescription( | ||
| usage = """ | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.