Skip to content

Conversation

@mihailomilosevic2001
Copy link
Contributor

What changes were proposed in this pull request?

Upgrade of ICU version from 72.1 -> 75.1

Why are the changes needed?

We need to keep the version up-to-date.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests were not broken.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the BUILD label Jun 18, 2024
@mihailomilosevic2001
Copy link
Contributor Author

@dbatomic Could you review this PR?

@LuciferYang
Copy link
Contributor

I think we should update the benchmark result of CollationBenchmark and CollationNonASCIIBenchmark

@LuciferYang
Copy link
Contributor

[info] - ICU locales map breaking change *** FAILED *** (3 milliseconds)
[info]   scala.Predef.wrapRefArray[(String, Int)](input).sameElements[(String, Int)](scala.Predef.wrapRefArray[(String, Int)](scala.Predef.refArrayOps[String](org.apache.spark.sql.catalyst.util.CollationFactory.getICULocaleNames()).zipWithIndex)) was false (ICUCollationsMapSuite.scala:67)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.sql.ICUCollationsMapSuite.$anonfun$new$2(ICUCollationsMapSuite.scala:67)
[info]   at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info]   at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
[info]   at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
[info]   at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
[info]   at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)

should run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.ICUCollationsMapSuite" to re-generate collations map golden file

@github-actions github-actions bot added the SQL label Jun 19, 2024
@mihailomilosevic2001
Copy link
Contributor Author

@LuciferYang Regenerated golden file. As for benchmarks, we are currently working on a new plan to update them, as they are pretty unstable, so those files will be regenerated with some separate PR.

@mihailomilosevic2001
Copy link
Contributor Author

@uros-db @mkaravel discussed offline, we will wait for expressions changes to come in first, so we can rerun benchmarks there first and then we will run them again for this PR, so we have a clear idea of what impact did this upgrade have on the collations

@LuciferYang
Copy link
Contributor

@uros-db @mkaravel discussed offline, we will wait for expressions changes to come in first, so we can rerun benchmarks there first and then we will run them again for this PR, so we have a clear idea of what impact did this upgrade have on the collations

Convert to draft first to avoid being merged unexpectedly

@LuciferYang LuciferYang marked this pull request as draft June 21, 2024 09:53
@uros-db
Copy link
Contributor

uros-db commented Jun 24, 2024

@mihailom-db fyi: good to go

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just update benchmarks

@mihailomilosevic2001 mihailomilosevic2001 marked this pull request as ready for review June 24, 2024 10:19
@yaooqinn
Copy link
Member

Can we reverse the order of these cases, and then the Relative columns can produce more reasonable relative values

@mihailomilosevic2001
Copy link
Contributor Author

@yaooqinn do you mean printing out 1/Relative, and represent relative time instead of speed? Otherwise if you ask me it makes no difference if we sort fastest to slowest or slowest to fastest.

@uros-db
Copy link
Contributor

uros-db commented Jun 24, 2024

@mihailom-db @yaooqinn this is an interesting topic (which has nothing to do with CollationBenchmark) - we could compute either relative speed or relative time, and the way BenchmarkBase works now is computing relative speed (only)

one pro of using relative time (essentially the inverse of relative speed in this context) would be better precision - no loss of decimals. However, all other benchmarks in spark rely on BenchmarkBase and compute relative speed, so I would suggest adding a paramter to BenchmarkBase which would allow computing either RELATIVE_SPEED (default value) or RELATIVE_TIME (inverse that we want for our benchmark)

@yaooqinn
Copy link
Member

Do you mean printing out 1/Relative

No, the current relative column is Okay to me, but CollationBenchmark is not. Some of the relative values are rounded 0.0 as they are pretty small. But If we can make it the head row, which plays the role of the control group, it makes more sense

@mihailomilosevic2001
Copy link
Contributor Author

Will run benchmarks now, when I do will upload them and mark this as ready.

@mihailomilosevic2001 mihailomilosevic2001 marked this pull request as draft June 24, 2024 13:54
@mihailomilosevic2001
Copy link
Contributor Author

Actually, @yaooqinn I am not quite sure what you are referring to. Our row that is a group control is UTF8_BINARY as it is the default, backwards compatible implementation of string collators. UTF8_LCASE is our implementation of a collation that is expected to be faster than ICU implemented collations, and UNICODE and UNICODE_CI are completely ICU implemented. What ordering exactly would you like to see, ordered on what column and in asc or desc order?

@yaooqinn
Copy link
Member

Using the run order such as Seq("UNICODE", "UNICODE_CI", "UTF8_BINARY", "UTF8_LCASE").forEach(addBenchmark) to apply UNICODE in the head row will preserve our observation purposes, right? It will also show relatively how much faster UTF8_LCASE could be compared to the current 0.0 values.

@uros-db
Copy link
Contributor

uros-db commented Jun 25, 2024

@yaooqinn I wouldn't say that would be a good way to compare collations right now - most / all of these collations are still under development, and it would only make sense to compare them agains UTF8_BINARY (I believe nothing else can really be a reasonable baseline). Putting UNICODE as the baseline doesn't make much sense imo, for example: we may eventually change how UNICODE or UTF8_LCASE work, and then the relative comparison of the two no longer makes sense

As for the "x0.0" problem, this stems from the fact that some collations are very slow compared to others (with the current implementation), but this problem of precision loss can simply be solved by just computing the inverse value: instead of saying that this is a "x0.0" speed-up, let's say it's a "x21.0" slow-down (that is, let's compute RELATIVE_TIME instead of RELATIVE_SPEED in our Relative benchmark results column)

thoughts?

@yaooqinn
Copy link
Member

it would only make sense to compare them agains UTF8_BINARY

If UTF8_BINARY is still included in the group, we are able to maintain relative consistency compared to others.


However, what we have been discussing is not a blocker for merging this PR.

@mihailomilosevic2001
Copy link
Contributor Author

I agree, lets for now leave this benchmark reorganisation as a separate PR, so we can check in upgrade for ICU version. @yaooqinn @uros-db

@mihailomilosevic2001 mihailomilosevic2001 marked this pull request as ready for review June 25, 2024 10:48
@mihailomilosevic2001
Copy link
Contributor Author

@yaooqinn or @LuciferYang could we move forward with merging this PR in, we will create a PR for benchmark reorganisation in a separate ticket

@yaooqinn yaooqinn closed this in c459afb Jun 26, 2024
@yaooqinn
Copy link
Member

Merged to master, thank you all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants