[SPARK-48681][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings #47043

uros-db · 2024-06-20T11:19:54Z

What changes were proposed in this pull request?

Update the Lower and Upper Spark expressions to use ICU case mappings for the UTF8_BINARY collation, instead of the JVM case mappings used so far. The new behaviour is gated by the ICU_CASE_MAPPINGS_ENABLED flag in SQLConf, which is true by default.

Why are the changes needed?

To keep collations consistent with each other: all collations should use ICU-based case mappings, including the UTF8_BINARY collation.

Does this PR introduce any user-facing change?

Yes, the lower and upper string functions for UTF8_BINARY now rely on ICU-based case mappings. Users can still get the old JVM-based case mappings by turning the ICU_CASE_MAPPINGS_ENABLED flag off. Note that the difference between the two is subtle.
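For readers who want to see what "JVM-based case mappings" look like in practice, here is a minimal sketch in plain Java. It is not Spark code: the class `JvmCaseMappingDemo` and its helper methods are hypothetical names used only for illustration. It also assumes `Locale.ROOT` to keep the output reproducible, and Spark's own JVM code path may pass a different locale. It shows only the JVM side and makes no claim about which inputs ICU maps differently.

```java
import java.util.Locale;

// Hypothetical demo class, not part of Spark.
class JvmCaseMappingDemo {

    // JVM-based upper-casing; Locale.ROOT is an assumption made for reproducibility.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // JVM-based lower-casing; Locale.ROOT is an assumption made for reproducibility.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Render a string as a space-separated list of code points, e.g. "U+0041".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Samples include characters whose case mapping changes the string length:
        // U+00DF (sharp s) and U+0130 (capital I with dot above).
        String[] samples = {"Spark", "stra\u00DFe", "\u0130stanbul"};
        for (String s : samples) {
            System.out.println(codePoints(s));
            System.out.println("  upper: " + codePoints(upper(s)));
            System.out.println("  lower: " + codePoints(lower(s)));
        }
    }
}
```

Printing code points rather than the strings themselves makes one-to-many mappings visible, which is where case-mapping implementations are most likely to disagree.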

How was this patch tested?

Existing tests, with extended CollationSupport unit tests for Lower/Upper to verify both ICU and JVM behaviour.

Was this patch authored or co-authored using generative AI tooling?

No.

mkaravel

LGTM.

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

mkaravel · 2024-06-21T07:07:51Z

Please also update the PR description and explain why we are making this change.

cloud-fan · 2024-06-24T08:19:33Z

thanks, merging to master!

### What changes were proposed in this pull request?

Update `InitCap` Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the `ICU_CASE_MAPPINGS_ENABLED` flag in SQLConf, which is true by default. Note: the same flag is used for `Lower` & `Upper` expressions, with changes introduced in: #47043.

### Why are the changes needed?

To keep collations consistent with each other: all collations should use ICU-based case mappings, including the UTF8_BINARY collation.

### Does this PR introduce _any_ user-facing change?

Yes, the behaviour of the `initcap` string function for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the `ICU_CASE_MAPPINGS_ENABLED` flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

### How was this patch tested?

Existing tests, with extended `CollationSupport` unit tests for InitCap to verify both ICU and JVM behaviour.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47100 from uros-db/change-initcap.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

github-actions bot added the SQL label Jun 20, 2024

uros-db changed the title ~~[WIP][SQL] Use ICU for Lower/Upper for JVM version 17+~~ [WIP][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings Jun 20, 2024

Use ICU

194177d

uros-db force-pushed the change-lower-upper branch from ca93b88 to 194177d Compare June 20, 2024 15:11

uros-db changed the title ~~[WIP][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings~~ [SPARK-48681][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings Jun 21, 2024

mkaravel approved these changes Jun 21, 2024

Add some comments

e170e45

cloud-fan approved these changes Jun 24, 2024

cloud-fan closed this in a7dc020 Jun 24, 2024

uros-db mentioned this pull request Jun 26, 2024

[SPARK-48682][SQL] Use ICU in InitCap expression for UTF8_BINARY strings #47100

Closed
