-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-45920][SQL] group by ordinal should be idempotent #43797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| case Literal(_: Int, IntegerType) => | ||
| Literal(index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only for IntegerType? If ordinalExpr is something like cast(long as int) that is foldable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm only fixing repeated analysis (analyzing an analyzed plan). It's a much bigger topic to support analyzing an optimized plan, so constant folding is not considered here.
| // analyze the plan. Using a different integer literal may lead to | ||
| // a repeat GROUP BY ordinal resolution which is wrong. GROUP BY | ||
| // constant is meaningless so whatever value does not matter here. | ||
| // TODO: GROUP BY ordinal should pull out grouping expressions to a Project, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is a temporary fix only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to backport this surgical fix, and only do the refactoring in the master branch. So yes, this is a temporary fix for the master branch.
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay for the targeted issue.
| } | ||
| } | ||
|
|
||
| test("group by ordinal repeated analysis") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add a test prefix SPARK-45920:?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| // analyze the plan. Using a different integer literal may lead to | ||
| // a repeat GROUP BY ordinal resolution which is wrong. GROUP BY | ||
| // constant is meaningless so whatever value does not matter here. | ||
| // TODO: GROUP BY ordinal should pull out grouping expressions to a Project, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you file an official JIRA issue and make this as the IDed TODO, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making this fix, @cloud-fan .
Given the following warning, this sounds like this could cause a correctness issue. Did I understand correctly?
a repeat GROUP BY ordinal resolution which is wrong.
Yes, but only when you manipulate logical plans directly. SQL/DataFrame API can't trigger it. |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @cloud-fan .
|
BTW, the first commit failed with the following relevant failure. Could you double-check that because the last commit seems irrelevant with that failure, @cloud-fan ?
|
|
Merged to master first. |
|
In you don't mind, could you make a backporting PR to branch-3.5/3.4/3.3 separately? For this PR, I believe we should be more careful due to the different golden files per branches, @cloud-fan . |
### What changes were proposed in this pull request? GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#43797 from cloud-fan/group. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. No new test no Closes apache#43797 from cloud-fan/group. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. No new test no Closes apache#43797 from cloud-fan/group. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
backport #43797 ### What changes were proposed in this pull request? GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #43836 from cloud-fan/3.5-port. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
backport #43797 ### What changes were proposed in this pull request? GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #43838 from cloud-fan/3.4-port. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
backport #43797 ### What changes were proposed in this pull request? GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #43839 from cloud-fan/3.3-port. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
backport apache#43797 ### What changes were proposed in this pull request? GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#43838 from cloud-fan/3.4-port. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This is a followup of #43797 . GROUP BY ALL has the same bug and this PR applies the same fix to GROUP BY ALL ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #46113 from cloud-fan/group-all. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This is a followup of #43797 . GROUP BY ALL has the same bug and this PR applies the same fix to GROUP BY ALL ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #46113 from cloud-fan/group-all. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit b5bb75c) Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This is a followup of #43797 . GROUP BY ALL has the same bug and this PR applies the same fix to GROUP BY ALL ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #46113 from cloud-fan/group-all. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit b5bb75c) Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This is a followup of #43797 . GROUP BY ALIAS has the same bug and this PR applies the same fix to GROUP BY ALIAS ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? Yes, we will fix the error. ### How was this patch tested? Test added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50461 from mihailom-db/mihailom-db/fixGroupBy. Authored-by: Mihailo Milosevic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This is a followup of #43797 . GROUP BY ALIAS has the same bug and this PR applies the same fix to GROUP BY ALIAS ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? Yes, we will fix the error. ### How was this patch tested? Test added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50461 from mihailom-db/mihailom-db/fixGroupBy. Authored-by: Mihailo Milosevic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 02db872) Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This is a followup of apache/spark#43797 . GROUP BY ALIAS has the same bug and this PR applies the same fix to GROUP BY ALIAS ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? Yes, we will fix the error. ### How was this patch tested? Test added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50461 from mihailom-db/mihailom-db/fixGroupBy. Authored-by: Mihailo Milosevic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…sis to avoid issue with group by ordinal ### What changes were proposed in this pull request? This is a followup to #43797 and #50461 where hacks were introduced in order to solve the issue of repeated analysis on plans that have a group by ordinal. The latter PR caused a regression so the hack needs to be removed. This PR proposed a move of `UnresolvedOrdinal` construction before Analyzer runs. ### Why are the changes needed? We are reverting a hack introduced in the previous PRs to improve the behavior of group by ordinal and additionally fix the issue that #50461 was trying to solve. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #50606 from mihailotim-db/mihailotim-db/new_group_by_ordinal. Authored-by: Mihailo Timotic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…tent (apache#333) [SPARK-45920][SQL][3.5] group by ordinal should be idempotent backport apache#43797 ### What changes were proposed in this pull request? GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal. ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#43836 from cloud-fan/3.5-port. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> Co-authored-by: Wenchen Fan <[email protected]>
This is a followup of apache#43797 . GROUP BY ALL has the same bug and this PR applies the same fix to GROUP BY ALL For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. no new test no Closes apache#46113 from cloud-fan/group-all. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit b5bb75c) Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This is a followup of apache#43797 . GROUP BY ALIAS has the same bug and this PR applies the same fix to GROUP BY ALIAS ### Why are the changes needed? For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable. ### Does this PR introduce _any_ user-facing change? Yes, we will fix the error. ### How was this patch tested? Test added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#50461 from mihailom-db/mihailom-db/fixGroupBy. Authored-by: Mihailo Milosevic <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 9cc8b18) Signed-off-by: Wenchen Fan <[email protected]>

What changes were proposed in this pull request?
GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.
Why are the changes needed?
For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.
Does this PR introduce any user-facing change?
No
How was this patch tested?
new test
Was this patch authored or co-authored using generative AI tooling?
no