[SPARK-41660][SQL] Only propagate metadata columns if they are used #39152
      plan: LogicalPlan,
      isRequired: Attribute => Boolean): LogicalPlan = plan match {
    case s: ExposesMetadataColumns if s.metadataOutput.exists(isRequired) =>
      s.withMetadataColumns()
Since we are doing this, shall we propagate the required metadata columns only?
This makes sense, but we would need to change the ExposesMetadataColumns.withMetadataColumns API to accept the required attributes, which may break custom logical plans. Since the benefit is small, this may not be worth it.
      s.withMetadataColumns()
    case p: Project if p.metadataOutput.exists(isRequired) =>
      val newProj = p.copy(
        projectList = p.projectList ++ p.metadataOutput,
Ditto: shall we propagate the required metadata columns only?
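The difference between the behavior in the diff and the suggestion in the two comments above can be sketched with a small toy model. `Attr`, `Project`, `propagateAll`, and `propagateRequired` below are hypothetical stand-ins written for illustration; they are not Spark's classes and not code from this PR.

```scala
// Hypothetical sketch: simplified stand-ins, not Spark's Attribute or Project.
object RequiredOnlySketch {
  final case class Attr(name: String)

  // A projection that knows which metadata columns its child could still expose.
  final case class Project(projectList: Seq[Attr], metadataOutput: Seq[Attr])

  // Shape of the diff: once any metadata column is required, append all of them.
  def propagateAll(p: Project, isRequired: Attr => Boolean): Project =
    if (p.metadataOutput.exists(isRequired)) {
      p.copy(projectList = p.projectList ++ p.metadataOutput)
    } else {
      p
    }

  // Shape of the suggestion: append only the metadata columns that are required.
  def propagateRequired(p: Project, isRequired: Attr => Boolean): Project =
    p.copy(projectList = p.projectList ++ p.metadataOutput.filter(isRequired))
}
```

With a project list of `id` and two candidate metadata columns of which only one is required, `propagateAll` yields three output columns and `propagateRequired` yields two. In this toy model the extra column is harmless, which matches the point made in the reply that the benefit of the stricter variant is small.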
LGTM except minor comments
viirya left a comment
This rule only adds metadata columns when a node is resolved but is missing input from its
children. This ensures that metadata columns are not added to the plan unless they are used.
Based on the rule description, I think this change makes it more correct now.
Thanks for the review, merging to master!
…tadataColumnSuite

### What changes were proposed in this pull request?
Move the new test case for Metadata column in #39081 to `MetadataColumnSuite`

### Why are the changes needed?
All metadata column related test cases should go into `MetadataColumnSuite`. For example:
- #37758
- #39152

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA tests

Closes #39425 from gengliangwang/moveTest.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?
backporting #39152 to 3.3

### Why are the changes needed?
bug fixing

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #40889 from huaxingao/metadata.

Authored-by: huaxingao <[email protected]>
Signed-off-by: huaxingao <[email protected]>
What changes were proposed in this pull request?
Ideally it's OK to always propagate metadata columns, as column pruning will kick in later and prune them away if they are not used. However, it may cause problems in cases such as CTEs. #39081 fixed such a bug.
This PR only propagates metadata columns if they are used, to keep the analyzed plan simple and reliable.
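A minimal sketch of the idea, assuming a toy plan model: `Attr`, `Plan`, `Relation`, `Project`, and `addMetadataCol` below are simplified illustrations written for this example. They only mirror the shape of the diff in this PR and are not Spark's implementation.

```scala
// Toy model only: illustrative stand-ins, not Spark's logical plan classes.
object MetadataPropagationSketch {
  final case class Attr(name: String)

  sealed trait Plan {
    def output: Seq[Attr]
    // Metadata columns this node could still expose to its parent on request.
    def metadataOutput: Seq[Attr]
  }

  // A leaf that hides its metadata columns until `exposed` is set.
  final case class Relation(data: Seq[Attr], metadata: Seq[Attr], exposed: Boolean = false)
      extends Plan {
    def output: Seq[Attr] = if (exposed) data ++ metadata else data
    def metadataOutput: Seq[Attr] = if (exposed) Nil else metadata
    def withMetadataColumns(): Relation = copy(exposed = true)
  }

  final case class Project(projectList: Seq[Attr], child: Plan) extends Plan {
    def output: Seq[Attr] = projectList
    // Child metadata columns that are not already projected.
    def metadataOutput: Seq[Attr] = child.metadataOutput.filterNot(projectList.contains)
  }

  // Rewrite a node only when at least one of its metadata columns is required.
  def addMetadataCol(plan: Plan, isRequired: Attr => Boolean): Plan = plan match {
    case r: Relation if r.metadataOutput.exists(isRequired) =>
      r.withMetadataColumns()
    case p: Project if p.metadataOutput.exists(isRequired) =>
      // Read p.metadataOutput before the child is rewritten and stops hiding it.
      p.copy(
        projectList = p.projectList ++ p.metadataOutput,
        child = addMetadataCol(p.child, isRequired))
    case other =>
      other // nothing required here: leave the plan untouched
  }
}
```

In this model, calling `addMetadataCol` with a predicate that matches no metadata column returns the plan unchanged, so a query that never references metadata keeps its original shape. When the predicate matches, the relation exposes its metadata columns and the projection passes them through.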
Why are the changes needed?
To avoid potential bugs, such as the CTE issue fixed in #39081.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New tests.