-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-48712][SQL] Perf Improvement for encode with empty values or UTF-8 charset #47096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @cloud-fan @HyukjinKwon @dongjoon-hyun please help review this PR when you are available, thank you in advance |
|
Merged to master |
| legacyCharsets: Boolean, | ||
| legacyErrorAction: Boolean): Array[Byte] = { | ||
| val toCharset = charset.toString | ||
| if (input.numBytes == 0 || "UTF-8".equalsIgnoreCase(toCharset)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually a behavior change. If the input bytes are not valid utf 8 encoding, previously the result was not the same as the input bytes, but now it is.
We should either remove this utf 8 shortcut, or check the input bytes to see if it's valid utf8 encoding first.
cc @yaooqinn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that we will encode the unmappable characters to mojibakes before this PR, but now we use its identity?
Do you think we can call input.isValid to check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea I think so. For the happy path it's still faster than doing the actual encoding, and invalid utf8 bytes should be rare so it's ok to have an extra isValid call.
What changes were proposed in this pull request?
This PR makes a short-circuit, which gets the underlying byte array directly and bypasses the encoding progress, for 'encoding UTF8String instances w/ UTF-8 charset' or 'UTF8String.EMPTY_STRING w/ any charset'.
Why are the changes needed?
Performance improvement, 10x~20x according to benchmark results
Does this PR introduce any user-facing change?
no
How was this patch tested?
new unit tests
benchmark tests
Was this patch authored or co-authored using generative AI tooling?
no