@yaooqinn
Member

@yaooqinn yaooqinn commented Jun 26, 2024

What changes were proposed in this pull request?

This PR adds a short-circuit that returns the underlying byte array directly and bypasses the encoding process when encoding UTF8String instances with the UTF-8 charset, or UTF8String.EMPTY_STRING with any charset.
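
The idea can be sketched with plain JDK classes. This is only an illustration, not the code from the PR: `EncodeShortCircuit`, `encodeSlow`, and `encode` are hypothetical names, and a raw `Array[Byte]` assumed to hold UTF-8 data stands in for a `UTF8String`.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical stand-in: a raw byte array plays the role of a UTF8String.
object EncodeShortCircuit {
  // Slow path: decode the UTF-8 bytes to a String, then re-encode to the target charset.
  def encodeSlow(utf8Bytes: Array[Byte], charset: String): Array[Byte] =
    new String(utf8Bytes, StandardCharsets.UTF_8).getBytes(Charset.forName(charset))

  // Short-circuit: empty input, or a UTF-8 target, needs no re-encoding,
  // so the bytes are handed back as they are.
  def encode(utf8Bytes: Array[Byte], charset: String): Array[Byte] =
    if (utf8Bytes.length == 0 || "UTF-8".equalsIgnoreCase(charset)) utf8Bytes
    else encodeSlow(utf8Bytes, charset)
}
```

The sketch returns the same array instance on the fast path; whether the real implementation copies the bytes is not something this example tries to settle.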

Why are the changes needed?

Performance improvement: 10x to 20x according to the benchmark results.

Does this PR introduce any user-facing change?

no

How was this patch tested?

new unit tests
benchmark tests

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Jun 26, 2024
@yaooqinn
Member Author

cc @cloud-fan @HyukjinKwon @dongjoon-hyun please help review this PR when you are available, thank you in advance

@yaooqinn yaooqinn closed this in 7c7c196 Jun 27, 2024
@yaooqinn yaooqinn deleted the SPARK-48712 branch June 27, 2024 02:24
@yaooqinn
Member Author

Merged to master
Thank you @cloud-fan

legacyCharsets: Boolean,
legacyErrorAction: Boolean): Array[Byte] = {
val toCharset = charset.toString
if (input.numBytes == 0 || "UTF-8".equalsIgnoreCase(toCharset)) {
Contributor

This is actually a behavior change. If the input bytes are not valid UTF-8, the result previously differed from the input bytes, but now the input bytes are returned unchanged.

We should either remove this UTF-8 shortcut, or first check whether the input bytes are valid UTF-8.

cc @yaooqinn
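
The concern raised in this comment can be reproduced with the JDK alone. The sketch below is an assumption-laden illustration, not Spark's code path: `InvalidUtf8Demo` is a hypothetical name, and a decode-then-encode round trip through `java.lang.String` stands in for "doing the actual encoding".

```scala
import java.nio.charset.StandardCharsets

object InvalidUtf8Demo {
  // 'a' followed by 0xFF; the byte 0xFF never appears in well-formed UTF-8.
  val invalid: Array[Byte] = Array(0x61.toByte, 0xFF.toByte)

  // Decode then encode: the malformed byte is replaced by U+FFFD,
  // whose UTF-8 encoding is the three bytes EF BF BD.
  val roundTripped: Array[Byte] =
    new String(invalid, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8)

  // Identity shortcut: the malformed byte is passed through untouched.
  val shortcut: Array[Byte] = invalid

  // Render bytes as upper-case hex for easy comparison.
  def hex(bs: Array[Byte]): String = bs.map(b => f"${b & 0xFF}%02X").mkString(" ")
}

// InvalidUtf8Demo.hex(InvalidUtf8Demo.roundTripped) == "61 EF BF BD"
// InvalidUtf8Demo.hex(InvalidUtf8Demo.shortcut)     == "61 FF"
```

The two results differ only when the input is malformed, which is why the shortcut looks harmless in tests that use valid strings.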

Member Author

Do you mean that before this PR we encoded unmappable characters to mojibake, but now we return the input bytes as-is?

Do you think we can call input.isValid to check here?

Contributor

Yes, I think so. On the happy path it is still faster than doing the actual encoding, and invalid UTF-8 bytes should be rare, so an extra isValid call is acceptable.
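
A guarded shortcut along the lines agreed here might look like the following. This is a sketch under stated assumptions: `GuardedEncode` and `isValidUtf8` are hypothetical names, and a strict JDK `CharsetDecoder` stands in for the `input.isValid` call mentioned in the thread.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction, StandardCharsets}

object GuardedEncode {
  // Hypothetical stand-in for the validity check: a strict decoder that
  // reports malformed input instead of silently replacing it.
  def isValidUtf8(bytes: Array[Byte]): Boolean =
    try {
      StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .decode(ByteBuffer.wrap(bytes))
      true
    } catch {
      case _: CharacterCodingException => false
    }

  // Take the shortcut only for empty input or for valid UTF-8 with a UTF-8 target;
  // everything else falls back to decode-then-encode.
  def encode(bytes: Array[Byte], charset: String): Array[Byte] = {
    val targetIsUtf8 = "UTF-8".equalsIgnoreCase(charset)
    if (bytes.length == 0 || (targetIsUtf8 && isValidUtf8(bytes))) bytes
    else new String(bytes, StandardCharsets.UTF_8).getBytes(Charset.forName(charset))
  }
}
```

The validity check is a single pass over the bytes with no output buffer to fill for the caller, so valid input still avoids the re-encoding step, while malformed input takes the slow path.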
