Skip to content

Conversation

@uros-db
Copy link
Contributor

@uros-db uros-db commented Jun 20, 2024

What changes were proposed in this pull request?

Update Lower & Upper Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the ICU_CASE_MAPPINGS_ENABLED flag in SQLConf, which is true by default.

Why are the changes needed?

To keep the consistency between collations - all collations shouls use ICU-based case mappings, including the UTF8_BINARY collation.

Does this PR introduce any user-facing change?

Yes, the behaviour of lower & upper string functions for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the ICU_CASE_MAPPINGS_ENABLED flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

How was this patch tested?

Existing tests, with extended CollationSupport unit tests for Lower/Upper to verify both ICU and JVM behaviour.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 20, 2024
@uros-db uros-db changed the title [WIP][SQL] Use ICU for Lower/Upper for JVM version 17+ [WIP][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings Jun 20, 2024
@uros-db uros-db force-pushed the change-lower-upper branch from ca93b88 to 194177d Compare June 20, 2024 15:11
@uros-db uros-db changed the title [WIP][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings [SPARK-48681][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings Jun 21, 2024
Copy link
Contributor

@mkaravel mkaravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@mkaravel
Copy link
Contributor

Please also update the PR description and explain why we are making this change.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a7dc020 Jun 24, 2024
cloud-fan pushed a commit that referenced this pull request Jun 28, 2024
### What changes were proposed in this pull request?
Update `InitCap` Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the `ICU_CASE_MAPPINGS_ENABLED` flag in SQLConf, which is true by default. Note: the same flag is used for `Lower` & `Upper` expressions, with changes introduced in: #47043.

### Why are the changes needed?
To keep the consistency between collations - all collations shouls use ICU-based case mappings, including the UTF8_BINARY collation.

### Does this PR introduce _any_ user-facing change?
Yes, the behaviour of `initcap` string function for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the `ICU_CASE_MAPPINGS_ENABLED` flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

### How was this patch tested?
Existing tests, with extended `CollationSupport` unit tests for InitCap to verify both ICU and JVM behaviour.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47100 from uros-db/change-initcap.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants