-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-24896][SQL] Uuid should produce different values for each execution in streaming query #21854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp { | ||
| case p => p transformExpressionsUp { | ||
| case Uuid(_) if p.isStreaming => Uuid(Some(random.nextLong())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a big deal at all but can we remove (_) like _: Uuid ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sure.
|
Actually I think |
|
Test build #93476 has finished for PR 21854 at commit
|
|
Test build #93479 has finished for PR 21854 at commit
|
|
retest this please |
|
Test build #93491 has finished for PR 21854 at commit
|
|
Test build #93517 has finished for PR 21854 at commit
|
|
regardless of the implementation, is it expected to produce different UUID for different micro batches? Personally I think it's reasonable, micro batch and continuous execution should produce same result. |
|
ping @tdas @zsxwing @jose-torres |
zsxwing
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Left some minor comments.
|
|
||
| override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp { | ||
| case p => p transformExpressionsUp { | ||
| case _: Uuid if p.isStreaming => Uuid(Some(random.nextLong())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add this into IncrementalExecution? You can put this close to the rule for CurrentBatchTimestamp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. This rule is only needed for streaming queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. It should be more clear.
| case p => p transformExpressionsUp { | ||
| // Produces a placeholder random seed for streaming query, the real random seed | ||
| // is given at the beginning of Optimizer. | ||
| case Uuid(None) if p.isStreaming => Uuid(Some(-1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is not necessary. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uuid need to have a random seed initialized to make it resolved. This gives it a fake seed. Since we assign random seeds at optimizer, we can get rid of it. The intent here is to have a placeholder seed shown in analyzed plan. Not a big deal, so I'm going to remove it.
67a9387 to
c127053
Compare
|
LGTM pending tests |
|
Test build #93850 has finished for PR 21854 at commit
|
|
ping @cloud-fan @zsxwing Is this ready to merge? Thanks. |
|
Thanks! Merging to master. |
What changes were proposed in this pull request?
Uuid's results depend on random seed given during analysis. Thus under streaming query, we will have the same uuids in each execution. This seems to be incorrect for streaming query execution.How was this patch tested?
Added test.