Conversation

@viirya
Member

viirya commented Mar 20, 2018

What changes were proposed in this pull request?

As stated in the Jira ticket, the current Uuid expression is problematic because it uses java.util.UUID.randomUUID for UUID generation.

This patch uses the newly added RandomUUIDGenerator for UUID generation instead, so that Uuid is deterministic between retries.
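
For illustration, here is a minimal sketch of why a seeded generator makes the result repeatable across retries. `SeededUuidGenerator` is a hypothetical stand-in built on `scala.util.Random`; it is not Spark's RandomUUIDGenerator, whose implementation is not shown in this conversation.

```scala
import java.util.UUID
import scala.util.Random

// Hypothetical stand-in for a seeded UUID generator (assumption: not Spark's class).
final class SeededUuidGenerator(seed: Long) {
  private val random = new Random(seed)

  /** Returns the next version-4 UUID drawn from the seeded random stream. */
  def nextUuid(): UUID = {
    // Clear the version nibble, then set it to 4 (randomly generated UUID).
    val most = (random.nextLong() & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L
    // Clear the top two bits, then set the IETF variant (binary 10).
    val least = (random.nextLong() & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L
    new UUID(most, least)
  }
}
```

Two generators created with the same seed produce the same sequence of UUIDs, whereas java.util.UUID.randomUUID offers no way to replay a value when a task is retried.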

How was this patch tested?

Added unit tests.

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88399 has finished for PR 20861 at commit 306dbe8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class Uuid(randomSeed: Option[Long] = None) extends LeafExpression with Nondeterministic

this(catalog, conf, conf.optimizerMaxIterations)
}

private lazy val random = new Random()
Contributor

Shall we put random in ResolvedUuidExpressions? That would make it a little easier to follow.

val uuids = new ArrayBuffer[Uuid]()
plan.transformUp {
case p =>
p.transformExpressionsUp {
Contributor

NIT: use collect?


private def getUuidExpressions(plan: LogicalPlan): Seq[Uuid] = {
val uuids = new ArrayBuffer[Uuid]()
plan.transformUp {
Contributor

Nit: use flatMap?
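
As a rough sketch of what the two nits above suggest, the mutable ArrayBuffer plus transformUp can be replaced by a collect over each expression tree and a flatMap over the plan. `Expr`, `Plan`, and `UuidCollector` are toy stand-ins invented for this example; they are not Spark's Expression, LogicalPlan, or TreeNode classes.

```scala
// Toy expression tree (hypothetical, for illustration only).
sealed trait Expr { def children: Seq[Expr] }
case class Uuid(randomSeed: Option[Long] = None) extends Expr {
  def children: Seq[Expr] = Nil
}
case class Literal(value: Any) extends Expr {
  def children: Seq[Expr] = Nil
}
case class Concat(children: Seq[Expr]) extends Expr

// Toy plan node: holds its own expressions plus child plans.
case class Plan(expressions: Seq[Expr], children: Seq[Plan])

object UuidCollector {
  // Pre-order traversal that keeps every node the partial function matches.
  def collect[A](e: Expr)(pf: PartialFunction[Expr, A]): Seq[A] =
    pf.lift(e).toSeq ++ e.children.flatMap(c => collect(c)(pf))

  // Child plans first, then this node's expressions; no mutable buffer needed.
  def uuidExpressions(plan: Plan): Seq[Uuid] =
    plan.children.flatMap(uuidExpressions) ++
      plan.expressions.flatMap(e => collect[Uuid](e) { case u: Uuid => u })
}
```

The traversal only reads the tree, so a read-only collect states the intent more directly than a transform that is used for its side effects.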

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88414 has finished for PR 20861 at commit 8676495.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class Uuid(randomSeed: Option[Long] = None) extends LeafExpression with Nondeterministic

@viirya
Member Author

viirya commented Mar 20, 2018

@hvanhovell Thanks! Your comments are addressed.

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88422 has finished for PR 20861 at commit 37a7c8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

hvanhovell left a comment

LGTM - merging to master. Thanks!

@asfgit asfgit closed this in 4d37008 Mar 22, 2018
@hvanhovell
Contributor

@viirya can you create a backport for 2.3?

@viirya
Member Author

viirya commented Mar 23, 2018

@hvanhovell Ok. But this needs #20817. Since #20817 just adds a new class and doesn't change existing code, I think it can be merged directly into 2.3. Should I create a backport PR for it too, or can you backport it directly?

@hvanhovell
Contributor

@viirya I have backported #20817 to 2.3

override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
case p if p.resolved => p
case p => p transformExpressionsUp {
case Uuid(None) => Uuid(Some(random.nextLong()))
Contributor

Shall we do the same thing for Rand?

Member Author

Hmm, if we want to make it deterministic between retries of the same query, I think we should do it. I can make a PR for it, WDYT?

Contributor

SGTM. We can even create a base trait for these random functions.

Member Author

I have checked, and Rand and Randn actually already behave this way: they are deterministic between retries, though their random seeds are determined at construction rather than at analysis.

So I'm wondering whether we should do the same thing (initialize the random seed at analysis) for Rand and Randn. Besides being consistent with Uuid, is there any good reason to do this?

Member Author

One thing I'm wondering: is there any use case where we need to re-initialize the random seed for Rand? Maybe streaming? For a streaming query, I think Rand should use a different random seed in each execution. For now, the random seed is initialized at construction, so even if we re-analyze the query, it still uses the same seed.
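
To make the distinction concrete, here is a small sketch contrasting a seed fixed at construction with a seed assigned at analysis. `Expr`, `Rand`, `Uuid`, and `ResolveSeeds` are simplified toy definitions written for this example, loosely modeled on the rule shown in the diff above; they are not Spark's actual classes.

```scala
import scala.util.Random

// Toy expressions (hypothetical, for illustration only).
sealed trait Expr
// Seed is picked once, when the expression object is constructed.
case class Rand(seed: Long = Random.nextLong()) extends Expr
// Seed stays empty until an analysis rule fills it in.
case class Uuid(randomSeed: Option[Long] = None) extends Expr

object ResolveSeeds {
  // Assigns a seed to every unseeded Uuid; all other expressions pass through unchanged.
  def analyze(exprs: Seq[Expr], random: Random): Seq[Expr] = exprs.map {
    case Uuid(None) => Uuid(Some(random.nextLong()))
    case other      => other
  }
}
```

Under these assumptions, analyzing the same parsed expressions twice keeps the Rand seed but can give Uuid a different seed each time, while retries of an already analyzed plan reuse the same seeds in both cases.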

Contributor

What's the current behavior for rand in streaming?

Member Author

I created PR #21854, which shows the behavior of Uuid in streaming. I think rand should be the same.

@viirya viirya deleted the SPARK-23599-2 branch December 27, 2023 18:21
4 participants