-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-38075][SQL] Fix hasNext in HiveScriptTransformationExec's process output iterator
#35368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-38075][SQL] Fix hasNext in HiveScriptTransformationExec's process output iterator
#35368
Conversation
| if (scriptOutputReader != null) { | ||
| if (scriptOutputReader.next(scriptOutputWritable) <= 0) { | ||
| checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) | ||
| completed = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, I could just set scriptOutputWritable to null here, and I wouldn't need the if statement at the top. That seemed to work. However, it feels a little unhygienic to read from an inputStream that has already returned EOF, so I added the completed flag instead.
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
Outdated
Show resolved
Hide resolved
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @bersprockets . Do you know when this bug started?
cc @viirya , too.
Not sure how far back, but at I can reproduce in 2.4.8, 3.1.2, 3.2.0, 3.2.1, and master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that ORDER BY clause issue and irrelevant to LIMIT clause. Could you confirm the following?
scala> spark.version
res19: String = 3.2.1
scala> sql("SELECT transform(a) USING 'cat' AS (a int) FROM VALUES (1) t(a) ORDER BY a").show
+----+
| a|
+----+
|null|
| 1|
+----+|
Oh, never mind. I realized that the |
hasNext in HiveScriptTransformationExec's process output iterator
| outputSoi: StructObjectInspector, | ||
| hadoopConf: Configuration): Iterator[InternalRow] = { | ||
| new Iterator[InternalRow] with HiveInspectors { | ||
| var curLine: String = null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is irrelevant to this correctness issue, the clean-up looks okay.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
| hadoopConf: Configuration): Iterator[InternalRow] = { | ||
| new Iterator[InternalRow] with HiveInspectors { | ||
| var curLine: String = null | ||
| var completed = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: private
|
Thank you for updates, @bersprockets . After compilation, I'll merge this. |
…process output iterator ### What changes were proposed in this pull request? Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. ### Why are the changes needed? When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` ### Does this PR introduce _any_ user-facing change? No, other than removing the correctness issue. ### How was this patch tested? New unit test. Closes #35368 from bersprockets/script_transformation_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 46885be) Signed-off-by: Dongjoon Hyun <[email protected]>
|
Merged to master/3.2. For 3.1, there is a conflict. |
I will take a look at what needs to be done. |
…process output iterator Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` No, other than removing the correctness issue. New unit test. Closes apache#35368 from bersprockets/script_transformation_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…c`'s process output iterator Backport #35368 to 3.1. ### What changes were proposed in this pull request? Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. ### Why are the changes needed? When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` ### Does this PR introduce _any_ user-facing change? No, other than removing the correctness issue. ### How was this patch tested? New unit test. Closes #35375 from bersprockets/SPARK-38075_3.1. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…c`'s process output iterator Backport apache#35368 to 3.1. ### What changes were proposed in this pull request? Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. ### Why are the changes needed? When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` ### Does this PR introduce _any_ user-facing change? No, other than removing the correctness issue. ### How was this patch tested? New unit test. Closes apache#35375 from bersprockets/SPARK-38075_3.1. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…process output iterator ### What changes were proposed in this pull request? Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. ### Why are the changes needed? When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` ### Does this PR introduce _any_ user-facing change? No, other than removing the correctness issue. ### How was this patch tested? New unit test. Closes apache#35368 from bersprockets/script_transformation_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 46885be) Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit babde31)
What changes were proposed in this pull request?
Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.
Why are the changes needed?
When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.
The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with
order byandlimit. For example:This returns:
Does this PR introduce any user-facing change?
No, other than removing the correctness issue.
How was this patch tested?
New unit test.