[SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState() #35066

JoshRosen · 2021-12-30T03:14:19Z

What changes were proposed in this pull request?

This PR fixes a correctness issue in the CodeGenerator.addBufferedState() helper method (which is used by the SortMergeJoinExec operator).

The addBufferedState() method generates code for buffering values that come from a row in an operator's input iterator, performing any necessary copying so that the buffered values remain correct after the input iterator advances to the next row.

The current logic does not correctly handle UDTs: these fall through to the match statement's default branch, causing UDT values to be buffered without copying. This is problematic if the UDT's underlying SQL type is an array, map, struct, or string type (since those types require copying). Failing to copy values can lead to correctness issues or crashes.

This patch's fix is simple: when the dataType is a UDT, use its underlying sqlType for determining whether values need to be copied. I used an existing helper function to perform this type unwrapping.
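The effect of the unwrapping can be sketched with a small toy model. This is not Spark's code: the `DataType` class, the `COPY_REQUIRED` set, and the functions `unwrap_udt`, `needs_copy_before_fix`, and `needs_copy_after_fix` are hypothetical names invented for illustration, and the model assumes a single level of UDT wrapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataType:
    """Toy stand-in for a SQL data type (hypothetical, not Spark's class)."""
    name: str
    # Set only for UDTs: the underlying SQL type used to store the value.
    sql_type: Optional["DataType"] = None


# Types whose values must be copied before buffering, per the PR description.
COPY_REQUIRED = {"string", "array", "map", "struct"}


def unwrap_udt(dt: DataType) -> DataType:
    """Return the underlying SQL type for a UDT, or the type itself otherwise."""
    return dt.sql_type if dt.sql_type is not None else dt


def needs_copy_before_fix(dt: DataType) -> bool:
    # A UDT's own name matches none of the copy-required cases, so it
    # falls through to the default "no copy" branch.
    return dt.name in COPY_REQUIRED


def needs_copy_after_fix(dt: DataType) -> bool:
    # Decide based on the underlying SQL type instead of the UDT wrapper.
    return unwrap_udt(dt).name in COPY_REQUIRED


# A hypothetical UDT that is stored as a struct.
vector_udt = DataType("vector_udt", sql_type=DataType("struct"))
print(needs_copy_before_fix(vector_udt), needs_copy_after_fix(vector_udt))
```

In this model the struct-backed UDT is buffered without a copy before the fix and with a copy after it, while non-UDT types behave the same either way.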

Why are the changes needed?

Fix a correctness issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I manually tested this change by re-running a workload which failed with a segfault prior to this patch. See JIRA for more details: https://issues.apache.org/jira/browse/SPARK-37784

So far I have been unable to come up with a CI-runnable regression test which would have failed prior to this change (my only working reproduction runs in a pre-production environment and does not fail in my development environment).

JoshRosen · 2021-12-30T03:23:35Z

Please let me know if you have suggestions for good ways to write a regression test for this bug. So far I've been unable to adapt my existing reproduction into something which fails in CI.

Given enough time, I might be able to contrive a failing regression test by manually instantiating a SortMergeJoinExec operator and controlling its input iterators such that the non-copied values are mutated when the iterator advances (I'd use the SparkPlanTest helpers for this).
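The shape of such a test can be sketched outside Spark with a toy iterator that reuses one mutable buffer for every row. This is only an illustration of the idea, not a SparkPlanTest-based test: `reusing_iterator` and `buffer_first` are hypothetical helpers, and a `bytearray` stands in for the memory that an input iterator might overwrite when it advances.

```python
def reusing_iterator(values):
    """Yield each value through one shared, mutable buffer.

    Hypothetical stand-in for an input iterator that reuses memory,
    so earlier rows are overwritten when the iterator advances.
    """
    buf = bytearray()
    for value in values:
        buf[:] = value  # overwrite the shared buffer in place
        yield buf


def buffer_first(rows, copy):
    """Buffer the first row, advance the iterator once, return what was buffered."""
    it = iter(rows)
    first = next(it)
    # With copy=False the buffered value aliases the iterator's shared buffer.
    buffered = bytes(first) if copy else first
    next(it)  # advancing mutates the shared buffer
    return bytes(buffered)


rows = [b"left", b"right"]
# Without a copy, the buffered value silently becomes the next row's data.
assert buffer_first(reusing_iterator(rows), copy=False) == b"right"
# With a copy, the buffered value survives the iterator advancing.
assert buffer_first(reusing_iterator(rows), copy=True) == b"left"
```

A real regression test would need the operator's input to behave like `reusing_iterator` for a UDT-typed column, which is the part that is hard to arrange in CI.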

OTOH this particular helper function changes very infrequently, so I think the risk of a future regression is small enough that it may be okay to forgo writing the more complicated test. If anyone has strong opinions here then please let me know.

I'm now curious about whether there could be other similar UDT-related bugs in our code generation. I plan to search through the code for all other places where we generate copy() / clone() logic to check whether they properly handle UDTs.

JoshRosen · 2022-01-04T18:37:28Z

I'm going to merge this to master, branch-3.2, branch-3.1, and branch-3.0.

…State()

Closes #35066 from JoshRosen/SPARK-37784.
Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Josh Rosen <[email protected]>
(cherry picked from commit eeef48f)
Signed-off-by: Josh Rosen <[email protected]>

dongjoon-hyun · 2022-01-04T19:16:21Z

Thank you!

…State()

Closes apache#35066 from JoshRosen/SPARK-37784.
Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Josh Rosen <[email protected]>
(cherry picked from commit eeef48f)
Signed-off-by: Josh Rosen <[email protected]>

…State()

Closes apache#35066 from JoshRosen/SPARK-37784.
Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Josh Rosen <[email protected]>
(cherry picked from commit eeef48f)
Signed-off-by: Josh Rosen <[email protected]>
(cherry picked from commit 45b7b7e)
Signed-off-by: Dongjoon Hyun <[email protected]>

Correctly handle UDTs in CodeGenerator.addBufferedState()

0de6445

JoshRosen changed the title ~~[SPARK-37784] Correctly handle UDTs in CodeGenerator.addBufferedState()~~ [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState() Dec 30, 2021

github-actions bot added the SQL label Dec 30, 2021

HyukjinKwon approved these changes Dec 30, 2021

dongjoon-hyun approved these changes Dec 30, 2021

viirya approved these changes Jan 2, 2022

JoshRosen closed this in eeef48f Jan 4, 2022

JoshRosen deleted the SPARK-37784 branch January 4, 2022 19:05
