-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-48428][SQL]: Fix IllegalStateException in NestedColumnAliasing #46756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@cloud-fan @dongjoon-hyun @viirya @ulysses-you have been involved in earlier PRs on this code. Do you know who would be a good reviewer of this change? |
In apache#35170 SPARK-37855 and apache#32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect rewrite. This fixes these family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`.
| .select( | ||
| GetStructField(GetStructField(CreateStruct(Seq($"id", $"employer")), 1), 0), | ||
| $"employer.id") | ||
| .analyze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I haven't touched this part of code for a while. Can you give a detailed explanation about how IllegalStateException can be thrown, with this test case, to help reviewers understand the bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the test case with the code on master collectRootReferenceAndExtractValue will collect to candidates for aliasing $"employer.id" and GetStructField(GetStructField(CreateStruct(Seq($"id", $"employer")), 1), 0) and no root references. The second one gets filtered out due to this condition
https://github.com/apache/spark/blob/v3.5.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L253
Then the logic after this will only consider that we are accessing only $"employer.id" and will miss/be unaware that we are accessing $employer inside the other expression. This will cause it to perform the transformation/optimization incorrectly which leads to the IllegalStateException.
With the fix collectRootReferenceAndExtractValue will return $"employer.id" as a candidate and $id and $employer as root references. This means that when evaluating $"employer.id this time it will be discarded due to this filter: https://github.com/apache/spark/blob/v3.5.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L264
Does that help understand the issue?
|
@eejbyfeldt thanks for the fix! Can you re-trigger the GA jobs? The failure seems flaky. |
@cloud-fan done. |
|
thanks, merging to master/3.5! |
### What changes were proposed in this pull request? In #35170 SPARK-37855 and #32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`. ### Why are the changes needed? The current code leads to `IllegalStateException` runtime failures. ### Does this PR introduce _any_ user-facing change? Yes, fixes a bug. ### How was this patch tested? Existing and new unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46756 from eejbyfeldt/SPARK-48428. Authored-by: Emil Ejbyfeldt <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit b11608c) Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? In apache#35170 SPARK-37855 and apache#32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after `collectRootReferenceAndExtractValue` and just removes these candidates. This is problematic since these expressions might have contained `AttributeReference` that needed to not do an incorrect aliasing. This fixes this family of bugs by moving the conditions into the function `collectRootReferenceAndExtractValue`. ### Why are the changes needed? The current code leads to `IllegalStateException` runtime failures. ### Does this PR introduce _any_ user-facing change? Yes, fixes a bug. ### How was this patch tested? Existing and new unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#46756 from eejbyfeldt/SPARK-48428. Authored-by: Emil Ejbyfeldt <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit b11608c) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In #35170 SPARK-37855 and #32301 SPARK-35194 introduced conditions for ExtractValues that can currently not be handled. The considtion is introduced after
collectRootReferenceAndExtractValueand just removes these candidates. This is problematic since these expressions might have containedAttributeReferencethat needed to not do an incorrect aliasing. This fixes this family of bugs by moving the conditions into the functioncollectRootReferenceAndExtractValue.Why are the changes needed?
The current code leads to
IllegalStateExceptionruntime failures.Does this PR introduce any user-facing change?
Yes, fixes a bug.
How was this patch tested?
Existing and new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.