[SPARK-48712][SQL] Perf Improvement for encode with empty values or UTF-8 charset #47096

yaooqinn · 2024-06-26T03:02:46Z

What changes were proposed in this pull request?

This PR adds a short-circuit that returns the underlying byte array directly and bypasses the encoding process in two cases: encoding a UTF8String instance with the UTF-8 charset, or encoding UTF8String.EMPTY_STRING with any charset.
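
The idea can be sketched with plain JDK classes. This is only an illustration under stated assumptions, not the code in this PR: the class `EncodeShortCircuit` and its `encode` method are hypothetical, and a raw `byte[]` stands in for the bytes held by a UTF8String.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the encode path; not Spark's actual implementation.
final class EncodeShortCircuit {
    // `utf8Bytes` plays the role of the bytes backing a UTF8String.
    static byte[] encode(byte[] utf8Bytes, String charsetName) {
        // Fast path: empty input, or the target charset is already UTF-8.
        if (utf8Bytes.length == 0 || "UTF-8".equalsIgnoreCase(charsetName)) {
            return utf8Bytes;
        }
        // Slow path: decode to a String, then encode into the target charset.
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] input = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // The fast path hands back the same array without copying.
        System.out.println(encode(input, "utf-8") == input);
        // The slow path produces a freshly encoded array.
        System.out.println(encode(input, "UTF-16BE").length);
    }
}
```

The fast path skips both the decode and the encode step, which is where a speedup would come from in this sketch.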

Why are the changes needed?

Performance improvement of 10x to 20x, according to the benchmark results.

Does this PR introduce any user-facing change?

no

How was this patch tested?

new unit tests
benchmark tests

Was this patch authored or co-authored using generative AI tooling?

no

yaooqinn · 2024-06-26T08:50:09Z

cc @cloud-fan @HyukjinKwon @dongjoon-hyun please help review this PR when you are available, thank you in advance

yaooqinn · 2024-06-27T02:24:21Z

Merged to master
Thank you @cloud-fan

cloud-fan · 2024-09-20T20:01:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

      legacyCharsets: Boolean,
      legacyErrorAction: Boolean): Array[Byte] = {
    val toCharset = charset.toString
+    if (input.numBytes == 0 || "UTF-8".equalsIgnoreCase(toCharset)) {


This is actually a behavior change. If the input bytes are not a valid UTF-8 encoding, the result previously differed from the input bytes, but now it is identical to them.

We should either remove this UTF-8 shortcut or first check whether the input bytes are valid UTF-8.

cc @yaooqinn

yaooqinn · 2024-09-21

Do you mean that before this PR we encoded unmappable characters to mojibake, but now we return the input as-is?

Do you think we can call input.isValid to check here?

cloud-fan · 2024-09-21

Yes, I think so. On the happy path it is still faster than doing the actual encoding, and invalid UTF-8 bytes should be rare, so an extra isValid call is acceptable.
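
The concern in this thread can be reproduced with the JDK alone. The sketch below is an assumption-laden illustration, not Spark code: `ValidatedUtf8Shortcut`, `isValidUtf8`, and `encodeToUtf8` are hypothetical names, the strict JDK decoder stands in for the `input.isValid` call suggested above, and the fallback uses the JDK's replacement behavior, which may differ in detail from Spark's encode path.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a validity-guarded UTF-8 shortcut; not Spark's code.
final class ValidatedUtf8Shortcut {
    // Strict check: the decoder reports malformed input instead of replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static byte[] encodeToUtf8(byte[] bytes) {
        // Shortcut only when returning the input cannot change the result.
        if (bytes.length == 0 || isValidUtf8(bytes)) {
            return bytes;
        }
        // Fallback: decode with replacement (malformed bytes become U+FFFD),
        // then re-encode, so the output differs from the invalid input.
        return new String(bytes, StandardCharsets.UTF_8)
            .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] invalid = {0x61, (byte) 0xFF, 0x62}; // 0xFF never occurs in valid UTF-8
        System.out.println(isValidUtf8(invalid));          // false
        System.out.println(encodeToUtf8(invalid).length);  // 5: 'a', U+FFFD (3 bytes), 'b'
    }
}
```

Without the validity check, the shortcut would return the three invalid bytes unchanged, which is the behavior change described above.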

yaooqinn added 3 commits June 25, 2024 20:27

[SPARK-48712][SQL] Perf Improvement for encode with empty value or UT…

8d36c1e

…F-8 charset

add more case

4d6c49d

add tests

64dccfe

github-actions bot added the SQL label Jun 26, 2024

yaooqinn added 3 commits June 26, 2024 13:36

update golden files for 17

8903ab6

update golden files for 21

7db3f31

add more tests

5c4b5d0

Merge branch 'master' into SPARK-48712

9a8c302

cloud-fan approved these changes Jun 27, 2024

yaooqinn closed this in 7c7c196 Jun 27, 2024

yaooqinn deleted the SPARK-48712 branch June 27, 2024 02:24

cloud-fan reviewed Sep 20, 2024

yaooqinn Sep 21, 2024

cloud-fan Sep 21, 2024
