[SPARK-53559][SQL][CATALYST] Fix HLL sketch updates to use raw collation key bytes
### What changes were proposed in this pull request?
- Extract the input UTF8String.
- Ignore strings that are collation-equal to the empty string when updating the sketch.
Before:
```
val cKey = CollationFactory.getCollationKey(v.asInstanceOf[UTF8String], st.collationId)
sketch.update(cKey.toString)
```
After:
```
val collation = CollationFactory.fetchCollation(st.collationId)
val str = v.asInstanceOf[UTF8String]
if (!collation.equalsFunction(str, UTF8String.EMPTY_UTF8)) {
  sketch.update(collation.sortKeyFunction.apply(str))
}
```
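The guard in the "After" snippet can be illustrated in isolation. The following is a minimal sketch, not Spark code: `ToyCollation`, `ToySketch`, and `countDistinct` are hypothetical stand-ins, and the toy collation (case-insensitive, trailing spaces ignored) is an assumption chosen only to make the behavior visible. It shows inputs that are collation-equal to the empty string being skipped, and the remaining inputs being recorded by their sort-key bytes rather than by a decoded string.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Locale
import scala.collection.mutable

object CollationAwareUpdate {
  // Hypothetical stand-in for a collation: an equality test plus a sort-key function.
  final case class ToyCollation(
      equalsFunction: (String, String) => Boolean,
      sortKeyFunction: String => Array[Byte])

  // Assumed toy rule: ignore case and trailing spaces.
  private def normalize(s: String): String =
    s.replaceAll(" +$", "").toLowerCase(Locale.ROOT)

  val toy: ToyCollation = ToyCollation(
    (a, b) => normalize(a) == normalize(b),
    s => normalize(s).getBytes(UTF_8))

  // Hypothetical stand-in for the sketch: remembers the distinct keys it has seen.
  final class ToySketch {
    private val seen = mutable.Set.empty[Seq[Byte]]
    def update(key: Array[Byte]): Unit = seen += key.toSeq
    def distinctKeys: Int = seen.size
  }

  // Mirrors the shape of the "After" snippet: skip empty-equal strings,
  // otherwise update with the raw sort-key bytes.
  def updateSketch(sketch: ToySketch, collation: ToyCollation, str: String): Unit = {
    if (!collation.equalsFunction(str, "")) {
      sketch.update(collation.sortKeyFunction(str))
    }
  }

  def countDistinct(inputs: Seq[String]): Int = {
    val sketch = new ToySketch
    inputs.foreach(updateSketch(sketch, toy, _))
    sketch.distinctKeys
  }

  def main(args: Array[String]): Unit = {
    // "" and " " are skipped; "a", "A", and "a " share one key; "b" has its own.
    println(countDistinct(Seq("", " ", "a", "A", "a ", "b"))) // prints 2
  }
}
```

Under this toy collation, a single space is treated like the empty string and is therefore skipped; whether a given real collation behaves that way depends on that collation's own rules.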
### Why are the changes needed?
As discussed in #51298 (comment), collation keys are arbitrary byte sequences that are not guaranteed to be valid UTF-8. Converting such a key to a Java String replaces each invalid byte sequence with U+FFFD (the replacement character). This can collapse distinct keys into identical strings, causing the sketch to treat different strings as the same. In addition, string collations must be considered when updating a sketch, so strings that are collation-equal to the empty string are now skipped.
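The collapse described above can be reproduced with plain JDK strings. This is a small sketch that assumes `new String(bytes, UTF_8)` is a fair stand-in for converting a collation key to a Java String; the object name and the two byte arrays are hypothetical illustrations, not real collation keys.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

object ReplacementCharCollapse {
  // Two distinct byte sequences, neither of which is valid UTF-8.
  val keyA: Array[Byte] = Array(0xC1.toByte)
  val keyB: Array[Byte] = Array(0x80.toByte)

  // Decoding substitutes U+FFFD for each malformed sequence.
  val strA: String = new String(keyA, UTF_8)
  val strB: String = new String(keyB, UTF_8)

  def main(args: Array[String]): Unit = {
    println(Arrays.equals(keyA, keyB)) // false: the raw bytes differ
    println(strA == strB)              // true: both decode to "\uFFFD"
  }
}
```

Updating the sketch with the raw bytes keeps the two keys distinct, whereas updating it with the decoded strings would count them as one value.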
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Built the repository and ran the test suites.
### Was this patch authored or co-authored using generative AI tooling?
No
### Jira
https://issues.apache.org/jira/browse/SPARK-53559

Closes #52316 from cboumalh/SPARK-53559_refactor_hll_sketch_update.
Lead-authored-by: Chris Boumalhab <[email protected]>
Co-authored-by: Chris Boumalhab <[email protected]>
Signed-off-by: Daniel Tenedorio <[email protected]>
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala
9 additions & 2 deletions
@@ -130,6 +130,10 @@ case class HllSketchAgg(
    * Evaluate the input row and update the HllSketch instance with the row's value. The update
    * function only supports a subset of Spark SQL types, and an exception will be thrown for
    * unsupported types.
+   * Notes:
+   * - Null values are ignored.
+   * - Empty byte arrays are ignored.
+   * - Strings that are collation-equal to the empty string are ignored.
+INSERT INTO hll_binary_test VALUES (X''), (CAST(' ' AS BINARY)), (X'e280'), (X'c1'), (X'c120')
+-- !query analysis
+InsertIntoHadoopFsRelationCommand file:[not included in comparison]/{warehouse_dir}/hll_binary_test, false, Parquet, [path=file:[not included in comparison]/{warehouse_dir}/hll_binary_test], Append, `spark_catalog`.`default`.`hll_binary_test`, org.apache.spark.sql.execution.datasources.InMemoryFileIndex(file:[not included in comparison]/{warehouse_dir}/hll_binary_test), [bytes]
++- Project [col1#x AS bytes#x]
+   +- LocalRelation [col1#x]
+
+
+-- !query
+INSERT INTO hll_string_test VALUES (''), (' '), (CAST(X'C1' AS STRING)), (CAST(X'80' AS STRING)), ('\uFFFD'), ('Å'), ('å'), ('a\u030A'), ('Å '), ('å '), ('a\u030A ')
+-- !query analysis
+InsertIntoHadoopFsRelationCommand file:[not included in comparison]/{warehouse_dir}/hll_string_test, false, Parquet, [path=file:[not included in comparison]/{warehouse_dir}/hll_string_test], Append, `spark_catalog`.`default`.`hll_string_test`, org.apache.spark.sql.execution.datasources.InMemoryFileIndex(file:[not included in comparison]/{warehouse_dir}/hll_string_test), [s]
++- Project [col1#x AS s#x]
+   +- LocalRelation [col1#x]
+
+
 -- !query
 SELECT hll_sketch_estimate(hll_sketch_agg(col)) AS result FROM t1