[SPARK-48573][SQL] Upgrade ICU version #47011
Conversation
This reverts commit d54453e.
@dbatomic Could you review this PR?
I think we should update the benchmark result of
should run
@LuciferYang Regenerated the golden file. As for benchmarks, we are currently working on a new plan to update them, as they are pretty unstable, so those files will be regenerated in a separate PR.
Convert to draft first to avoid being merged unexpectedly
@mihailom-db fyi: good to go
lgtm, just update benchmarks
Can we reverse the order of these cases, and then the
@yaooqinn do you mean printing out 1/Relative, i.e. representing relative time instead of speed? Otherwise, if you ask me, it makes no difference whether we sort fastest to slowest or slowest to fastest.
@mihailom-db @yaooqinn this is an interesting topic (though it has nothing to do with this PR). One pro of using relative time (essentially the inverse of relative speed in this context) would be better precision: no loss of decimals. However, all other benchmarks in Spark rely on BenchmarkBase and compute relative speed, so I would suggest adding a parameter to
No, the current relative column is okay to me, but CollationBenchmark is not. Some of the relative values are rounded
Will run benchmarks now; once I do, I will upload them and mark this as ready.
Actually, @yaooqinn I am not quite sure what you are referring to. Our control row is UTF8_BINARY, as it is the default, backwards-compatible implementation of string collators. UTF8_LCASE is our own implementation of a collation that is expected to be faster than the ICU-implemented collations, and UNICODE and UNICODE_CI are completely ICU-implemented. What ordering exactly would you like to see: ordered on which column, and in ascending or descending order?
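For illustration, here is a hedged sketch of how such a Relative column reads with UTF8_BINARY as the control row. The collation names come from this thread; the timings are entirely made up for the example and are not measured results:

```python
# Hypothetical per-op timings in nanoseconds; invented numbers for
# illustration only, NOT measured benchmark results.
timings = {
    "UTF8_BINARY": 120,   # default, backwards-compatible collator (baseline)
    "UTF8_LCASE": 180,    # in-house lowercase collation
    "UNICODE": 2400,      # ICU-implemented
    "UNICODE_CI": 2520,   # ICU-implemented, case-insensitive
}

baseline = timings["UTF8_BINARY"]

# Relative speed = baseline time / case time, so the baseline row reads
# 1.0x and every slower case reads below 1.0x.
for name, t in timings.items():
    print(f"{name:<12} {t:>5} ns/op   relative {baseline / t:.2f}x")
```

With this convention the sort order is a presentation choice only; the relative values themselves do not change.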
Using the run order such as
@yaooqinn I wouldn't say that would be a good way to compare collations right now: most or all of these collations are still under development, and it would only make sense to compare them against
As for the "x0.0" problem, this stems from the fact that some collations are very slow compared to others (with the current implementation), but this precision loss can be solved by simply computing the inverse value: instead of saying that this is a "x0.0" speed-up, let's say it's a "x21.0" slow-down (that is, let's compute
thoughts?
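The precision point can be sketched numerically. With hypothetical timings (invented for the example, not measured), relative speed rounds a much slower case down to "x0.0", while its inverse, a relative-time slow-down factor, keeps the information:

```python
baseline_ns = 100   # hypothetical control-row time per op (invented number)
case_ns = 2100      # hypothetical much-slower case (invented number)

# Relative speed (baseline / case) collapses at one decimal of precision:
speedup = baseline_ns / case_ns
print(f"speed-up:  x{speedup:.1f}")   # prints "speed-up:  x0.0"

# The inverse (case / baseline) reports the same fact without losing it:
slowdown = case_ns / baseline_ns
print(f"slow-down: x{slowdown:.1f}")  # prints "slow-down: x21.0"
```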
If
However, what we have been discussing is not a blocker for merging this PR.
@yaooqinn or @LuciferYang could we move forward with merging this PR? We will create a PR for the benchmark reorganisation under a separate ticket.
Merged to master, thank you all |
What changes were proposed in this pull request?
Upgrade the ICU version from 72.1 to 75.1.
Why are the changes needed?
We need to keep the version up-to-date.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests were not broken.
Was this patch authored or co-authored using generative AI tooling?
No.