[SPARK-48573][SQL] Upgrade ICU version #47011

mihailomilosevic2001 · 2024-06-18T11:07:15Z

What changes were proposed in this pull request?

Upgrade of ICU version from 72.1 -> 75.1

Why are the changes needed?

We need to keep the version up-to-date.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests were not broken.

Was this patch authored or co-authored using generative AI tooling?

No.

This reverts commit d54453e.

mihailomilosevic2001 · 2024-06-18T11:07:45Z

@dbatomic Could you review this PR?

LuciferYang · 2024-06-18T12:15:10Z

I think we should update the benchmark result of CollationBenchmark and CollationNonASCIIBenchmark

LuciferYang · 2024-06-19T03:30:19Z

[info] - ICU locales map breaking change *** FAILED *** (3 milliseconds)
[info]   scala.Predef.wrapRefArray[(String, Int)](input).sameElements[(String, Int)](scala.Predef.wrapRefArray[(String, Int)](scala.Predef.refArrayOps[String](org.apache.spark.sql.catalyst.util.CollationFactory.getICULocaleNames()).zipWithIndex)) was false (ICUCollationsMapSuite.scala:67)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.sql.ICUCollationsMapSuite.$anonfun$new$2(ICUCollationsMapSuite.scala:67)
[info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info]   at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
[info]   at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
[info]   at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)

should run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.ICUCollationsMapSuite" to re-generate collations map golden file

mihailomilosevic2001 · 2024-06-19T07:45:50Z

@LuciferYang Regenerated golden file. As for benchmarks, we are currently working on a new plan to update them, as they are pretty unstable, so those files will be regenerated with some separate PR.

mihailomilosevic2001 · 2024-06-19T10:29:17Z

@uros-db @mkaravel discussed offline, we will wait for expressions changes to come in first, so we can rerun benchmarks there first and then we will run them again for this PR, so we have a clear idea of what impact did this upgrade have on the collations

LuciferYang · 2024-06-21T09:53:31Z

@uros-db @mkaravel discussed offline, we will wait for expressions changes to come in first, so we can rerun benchmarks there first and then we will run them again for this PR, so we have a clear idea of what impact did this upgrade have on the collations

Convert to draft first to avoid being merged unexpectedly

uros-db · 2024-06-24T07:12:27Z

@mihailom-db fyi: good to go

uros-db

lgtm, just update benchmarks

…de-icu

yaooqinn · 2024-06-24T13:08:47Z

Can we reverse the order of these cases, and then the Relative columns can produce more reasonable relative values

mihailomilosevic2001 · 2024-06-24T13:31:13Z

@yaooqinn do you mean printing out 1/Relative, and represent relative time instead of speed? Otherwise if you ask me it makes no difference if we sort fastest to slowest or slowest to fastest.

uros-db · 2024-06-24T13:38:48Z

@mihailom-db @yaooqinn this is an interesting topic (which has nothing to do with CollationBenchmark) - we could compute either relative speed or relative time, and the way BenchmarkBase works now is computing relative speed (only)

one pro of using relative time (essentially the inverse of relative speed in this context) would be better precision - no loss of decimals. However, all other benchmarks in spark rely on BenchmarkBase and compute relative speed, so I would suggest adding a paramter to BenchmarkBase which would allow computing either RELATIVE_SPEED (default value) or RELATIVE_TIME (inverse that we want for our benchmark)

yaooqinn · 2024-06-24T13:42:27Z

Do you mean printing out 1/Relative

No, the current relative column is Okay to me, but CollationBenchmark is not. Some of the relative values are rounded 0.0 as they are pretty small. But If we can make it the head row, which plays the role of the control group, it makes more sense

mihailomilosevic2001 · 2024-06-24T13:54:54Z

Will run benchmarks now, when I do will upload them and mark this as ready.

mihailomilosevic2001 · 2024-06-25T08:05:49Z

Actually, @yaooqinn I am not quite sure what you are referring to. Our row that is a group control is UTF8_BINARY as it is the default, backwards compatible implementation of string collators. UTF8_LCASE is our implementation of a collation that is expected to be faster than ICU implemented collations, and UNICODE and UNICODE_CI are completely ICU implemented. What ordering exactly would you like to see, ordered on what column and in asc or desc order?

yaooqinn · 2024-06-25T09:59:40Z

Using the run order such as Seq("UNICODE", "UNICODE_CI", "UTF8_BINARY", "UTF8_LCASE").forEach(addBenchmark) to apply UNICODE in the head row will preserve our observation purposes, right? It will also show relatively how much faster UTF8_LCASE could be compared to the current 0.0 values.

uros-db · 2024-06-25T10:12:42Z

@yaooqinn I wouldn't say that would be a good way to compare collations right now - most / all of these collations are still under development, and it would only make sense to compare them agains UTF8_BINARY (I believe nothing else can really be a reasonable baseline). Putting UNICODE as the baseline doesn't make much sense imo, for example: we may eventually change how UNICODE or UTF8_LCASE work, and then the relative comparison of the two no longer makes sense

As for the "x0.0" problem, this stems from the fact that some collations are very slow compared to others (with the current implementation), but this problem of precision loss can simply be solved by just computing the inverse value: instead of saying that this is a "x0.0" speed-up, let's say it's a "x21.0" slow-down (that is, let's compute RELATIVE_TIME instead of RELATIVE_SPEED in our Relative benchmark results column)

thoughts?

yaooqinn · 2024-06-25T10:30:53Z

it would only make sense to compare them agains UTF8_BINARY

If UTF8_BINARY is still included in the group, we are able to maintain relative consistency compared to others.

However, what we have been discussing is not a blocker for merging this PR.

mihailomilosevic2001 · 2024-06-25T10:48:17Z

I agree, lets for now leave this benchmark reorganisation as a separate PR, so we can check in upgrade for ICU version. @yaooqinn @uros-db

…de-icu

mihailomilosevic2001 · 2024-06-25T19:22:39Z

@yaooqinn or @LuciferYang could we move forward with merging this PR in, we will create a PR for benchmark reorganisation in a separate ticket

yaooqinn · 2024-06-26T02:14:06Z

Merged to master, thank you all

mihailomilosevic2001 added 8 commits May 30, 2024 09:51

Fix OracleIntegrationSuite for JDK8

d54453e

Revert "Fix OracleIntegrationSuite for JDK8"

7d00764

This reverts commit d54453e.

Merge remote-tracking branch 'upstream/master'

ad01ff9

Merge remote-tracking branch 'upstream/master'

1592c78

Merge remote-tracking branch 'upstream/master'

7fe2a4d

Merge remote-tracking branch 'upstream/master'

8e811d4

Merge remote-tracking branch 'upstream/master'

7586b5f

Upgrade icu version

fe94569

github-actions bot added the BUILD label Jun 18, 2024

dbatomic approved these changes Jun 18, 2024

View reviewed changes

Fix deps file

e042f7f

Regenerate golden file

92b4d26

github-actions bot added the SQL label Jun 19, 2024

LuciferYang approved these changes Jun 19, 2024

View reviewed changes

LuciferYang marked this pull request as draft June 21, 2024 09:53

Merge branch 'apache:master' into upgrade-icu

2de46e1

uros-db approved these changes Jun 24, 2024

View reviewed changes

mihailomilosevic2001 added 3 commits June 24, 2024 12:07

Merge remote-tracking branch 'upstream/master' into upgrade-icu

005d6db

Add benchmarks

84ba74b

Merge branch 'upgrade-icu' of github.com:mihailom-db/spark into upgra…

9a115b3

…de-icu

mihailomilosevic2001 marked this pull request as ready for review June 24, 2024 10:19

Reverse benchmark order

7d8876e

mihailomilosevic2001 marked this pull request as draft June 24, 2024 13:54

Merge branch 'apache:master' into upgrade-icu

5a13a1f

mihailomilosevic2001 marked this pull request as ready for review June 25, 2024 10:48

mihailomilosevic2001 added 2 commits June 25, 2024 12:49

Revert order change

5587aeb

Merge branch 'upgrade-icu' of github.com:mihailom-db/spark into upgra…

3d9505d

…de-icu

yaooqinn closed this in c459afb Jun 26, 2024

[SPARK-48573][SQL] Upgrade ICU version #47011

[SPARK-48573][SQL] Upgrade ICU version #47011

Uh oh!

Conversation

mihailomilosevic2001 commented Jun 18, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

mihailomilosevic2001 commented Jun 18, 2024

Uh oh!

LuciferYang commented Jun 18, 2024

Uh oh!

LuciferYang commented Jun 19, 2024

Uh oh!

mihailomilosevic2001 commented Jun 19, 2024

Uh oh!

mihailomilosevic2001 commented Jun 19, 2024

Uh oh!

LuciferYang commented Jun 21, 2024

Uh oh!

uros-db commented Jun 24, 2024

Uh oh!

uros-db left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yaooqinn commented Jun 24, 2024

Uh oh!

mihailomilosevic2001 commented Jun 24, 2024

Uh oh!

uros-db commented Jun 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yaooqinn commented Jun 24, 2024

Uh oh!

mihailomilosevic2001 commented Jun 24, 2024

Uh oh!

mihailomilosevic2001 commented Jun 25, 2024

Uh oh!

yaooqinn commented Jun 25, 2024

Uh oh!

uros-db commented Jun 25, 2024

Uh oh!

yaooqinn commented Jun 25, 2024

Uh oh!

mihailomilosevic2001 commented Jun 25, 2024

Uh oh!

mihailomilosevic2001 commented Jun 25, 2024

Uh oh!

yaooqinn commented Jun 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

uros-db left a comment •

edited

Loading

uros-db commented Jun 24, 2024 •

edited

Loading