Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.

Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Nov 14, 2023
@cloud-fan
Copy link
Contributor Author

cc @gengliangwang @viirya @beliefer

Comment on lines +2010 to +2011
case Literal(_: Int, IntegerType) =>
Literal(index)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only for IntegerType? If ordinalExpr is something like cast(long as int) that is foldable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm only fixing repeated analysis (analyzing an analyzed plan). It's a much bigger topic to support analyzing an optimized plan, so constant folding is not considered here.

// analyze the plan. Using a different integer literal may lead to
// a repeat GROUP BY ordinal resolution which is wrong. GROUP BY
// constant is meaningless so whatever value does not matter here.
// TODO: GROUP BY ordinal should pull out grouping expressions to a Project, then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a temporary fix only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to backport this surgical fix, and only do the refactoring in the master branch. So yes, this is a temporary fix for the master branch.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay for the targeted issue.

}
}

test("group by ordinal repeated analysis") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a test prefix SPARK-45920:?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

// analyze the plan. Using a different integer literal may lead to
// a repeat GROUP BY ordinal resolution which is wrong. GROUP BY
// constant is meaningless so whatever value does not matter here.
// TODO: GROUP BY ordinal should pull out grouping expressions to a Project, then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you file an official JIRA issue and make this as the IDed TODO, please?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making this fix, @cloud-fan .

Given the following warning, this sounds like this could cause a correctness issue. Did I understand correctly?

a repeat GROUP BY ordinal resolution which is wrong.

@cloud-fan
Copy link
Contributor Author

Given the following warning, this sounds like this could cause a correctness issue. Did I understand correctly?

Yes, but only when you manipulate logical plans directly. SQL/DataFrame API can't trigger it.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @cloud-fan .

@dongjoon-hyun
Copy link
Member

BTW, the first commit failed with the following relevant failure. Could you double-check that because the last commit seems irrelevant with that failure, @cloud-fan ?

Screenshot 2023-11-14 at 10 25 16 PM

@dongjoon-hyun
Copy link
Member

Merged to master first.

@dongjoon-hyun
Copy link
Member

In you don't mind, could you make a backporting PR to branch-3.5/3.4/3.3 separately? For this PR, I believe we should be more careful due to the different golden files per branches, @cloud-fan .

cloud-fan added a commit to cloud-fan/spark that referenced this pull request Nov 16, 2023
### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.

### Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#43797 from cloud-fan/group.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Nov 16, 2023
GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

No

new test

no

Closes apache#43797 from cloud-fan/group.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Nov 16, 2023
GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

No

new test

no

Closes apache#43797 from cloud-fan/group.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Nov 16, 2023
backport #43797

### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.

### Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43836 from cloud-fan/3.5-port.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Nov 16, 2023
backport #43797

### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.

### Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43838 from cloud-fan/3.4-port.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Nov 16, 2023
backport #43797

### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.

### Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43839 from cloud-fan/3.3-port.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
backport apache#43797

### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.

### Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#43838 from cloud-fan/3.4-port.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
cloud-fan added a commit that referenced this pull request Apr 18, 2024
### What changes were proposed in this pull request?

This is a followup of #43797 . GROUP BY ALL has the same bug and this PR applies the same fix to GROUP BY ALL

### Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46113 from cloud-fan/group-all.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Apr 18, 2024
### What changes were proposed in this pull request?

This is a followup of #43797 . GROUP BY ALL has the same bug and this PR applies the same fix to GROUP BY ALL

### Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46113 from cloud-fan/group-all.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b5bb75c)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Apr 18, 2024
### What changes were proposed in this pull request?

This is a followup of #43797 . GROUP BY ALL has the same bug and this PR applies the same fix to GROUP BY ALL

### Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46113 from cloud-fan/group-all.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b5bb75c)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Mar 31, 2025
### What changes were proposed in this pull request?
This is a followup of #43797 . GROUP BY ALIAS has the same bug and this PR applies the same fix to GROUP BY ALIAS

### Why are the changes needed?
For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?
Yes, we will fix the error.

### How was this patch tested?
Test added.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50461 from mihailom-db/mihailom-db/fixGroupBy.

Authored-by: Mihailo Milosevic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Mar 31, 2025
### What changes were proposed in this pull request?
This is a followup of #43797 . GROUP BY ALIAS has the same bug and this PR applies the same fix to GROUP BY ALIAS

### Why are the changes needed?
For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?
Yes, we will fix the error.

### How was this patch tested?
Test added.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50461 from mihailom-db/mihailom-db/fixGroupBy.

Authored-by: Mihailo Milosevic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 02db872)
Signed-off-by: Wenchen Fan <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Mar 31, 2025
### What changes were proposed in this pull request?
This is a followup of apache/spark#43797 . GROUP BY ALIAS has the same bug and this PR applies the same fix to GROUP BY ALIAS

### Why are the changes needed?
For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?
Yes, we will fix the error.

### How was this patch tested?
Test added.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50461 from mihailom-db/mihailom-db/fixGroupBy.

Authored-by: Mihailo Milosevic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Apr 18, 2025
…sis to avoid issue with group by ordinal

### What changes were proposed in this pull request?
This is a followup to #43797 and #50461 where hacks were introduced in order to solve the issue of repeated analysis on plans that have a group by ordinal. The latter PR caused a regression so the hack needs to be removed. This PR proposed a move of `UnresolvedOrdinal` construction before Analyzer runs.

### Why are the changes needed?
We are reverting a hack introduced in the previous PRs to improve the behavior of group by ordinal and additionally fix the issue that #50461 was trying to solve.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #50606 from mihailotim-db/mihailotim-db/new_group_by_ordinal.

Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…tent (apache#333)

[SPARK-45920][SQL][3.5] group by ordinal should be idempotent

backport apache#43797

### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another integer literal and the plan get analyzed again, we will re-do the ordinal resolution which can lead to wrong result or index out-of-bound error. This PR fixes it by using a hack: if the ordinal points to another integer literal, don't replace the ordinal.

### Why are the changes needed?

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#43836 from cloud-fan/3.5-port.

Authored-by: Wenchen Fan <[email protected]>

Signed-off-by: Dongjoon Hyun <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
This is a followup of apache#43797 . GROUP BY ALL has the same bug and this PR applies the same fix to GROUP BY ALL

For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

no

new test

no

Closes apache#46113 from cloud-fan/group-all.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b5bb75c)
Signed-off-by: Wenchen Fan <[email protected]>
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
### What changes were proposed in this pull request?
This is a followup of apache#43797 . GROUP BY ALIAS has the same bug and this PR applies the same fix to GROUP BY ALIAS

### Why are the changes needed?
For advanced users or Spark plugins, they may manipulate the logical plans directly. We need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?
Yes, we will fix the error.

### How was this patch tested?
Test added.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#50461 from mihailom-db/mihailom-db/fixGroupBy.

Authored-by: Mihailo Milosevic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 9cc8b18)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants