[SPARK-24259][SQL] ArrayWriter for Arrow produces wrong output #21312
| override def reset(): Unit = { | ||
| super.reset() | ||
| elementWriter.reset() | ||
| valueVector.clear() |
It looks like @BryanCutler added a reset() interface in 0.9.0, as mentioned in:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala
Line 132 in eb386be
| // TODO: reset() should be in a common interface |
at apache/arrow@4dbce60 and https://issues.apache.org/jira/browse/ARROW-1962
but if we are thinking about backporting, I guess we can go this way as a bug fix as is? It roughly makes sense to me.
Would it be also safe to do:
valueVector match {
case fixedWidthVector: BaseFixedWidthVector => fixedWidthVector.reset()
case variableWidthVector: BaseVariableWidthVector => variableWidthVector.reset()
case repeatedValueVector: BaseRepeatedValueVector => repeatedValueVector.clear()
case _ =>
}
? @icexelloss, @BryanCutler and @viirya?
Yeah, I think so.
I've also noticed that @BryanCutler added reset to ListVector. But we can only use clear for now.
HyukjinKwon
left a comment
LGTM from my side but let me leave this to @BryanCutler and @icexelloss.
|
Thanks @HyukjinKwon |
|
Test build #90544 has finished for PR 21312 at commit
|
|
Test build #90547 has finished for PR 21312 at commit
|
|
Thanks for catching this @viirya! Looks good from a first glance, but my only concern is that |
|
@viirya Thanks for catching this! I think we have many tests that exercise the array types. I am curious why this is not caught by existing tests, e.g: |
|
@icexelloss It only happens when there is more than one batch in each partition. Existing tests do not hit this condition. That is why the added test here is doing a
|
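The multi-batch condition described above can be illustrated with a toy model. This is only a sketch: `ToyListVector`, `ToyArrayWriter` and `BatchReuseDemo` are hypothetical names invented here, not Spark or Arrow classes, and the model only assumes that the writer's row counter is reset between batches while the underlying list storage may or may not be.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a list vector: offsets index into one shared child buffer.
final class ToyListVector {
  val child: ArrayBuffer[Int] = ArrayBuffer.empty[Int]
  val offsets: ArrayBuffer[Int] = ArrayBuffer(0)

  def append(values: Seq[Int]): Unit = {
    child ++= values
    offsets += child.length
  }

  def reset(): Unit = {
    child.clear()
    offsets.clear()
    offsets += 0
  }
}

// Hypothetical writer that is reused for every batch of a partition.
final class ToyArrayWriter(resetVector: Boolean) {
  val vector = new ToyListVector
  private var count = 0

  def write(values: Seq[Int]): Unit = {
    vector.append(values)
    count += 1
  }

  // Rows of the current batch as a reader would see them: the first `count` offset pairs.
  def finishBatch(): Seq[Seq[Int]] =
    (0 until count).map { i =>
      vector.child.slice(vector.offsets(i), vector.offsets(i + 1)).toSeq
    }

  def reset(): Unit = {
    count = 0
    // Without this step the old offsets and child data stay in place.
    if (resetVector) vector.reset()
  }
}

object BatchReuseDemo {
  def run(resetVector: Boolean): Seq[Seq[Seq[Int]]] = {
    val writer = new ToyArrayWriter(resetVector)
    val batches = Seq(Seq(Seq(1, 2), Seq(3)), Seq(Seq(4), Seq(5, 6)))
    batches.map { rows =>
      rows.foreach(writer.write)
      val out = writer.finishBatch()
      writer.reset()
      out
    }
  }

  def main(args: Array[String]): Unit = {
    println(run(resetVector = false))
    println(run(resetVector = true))
  }
}
```

In this model, a single batch always reads back correctly, which is consistent with existing tests not catching the problem. With `resetVector = false` the second batch reads back the first batch's rows; with `resetVector = true` it reads back its own.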
|
@viirya I looked into it a bit more and calling Once we upgrade to Arrow 0.10.0, this can be cleaned up because there is a common interface to |
|
Agree on both points. |
|
@BryanCutler I had the same thought but wondered whether it is a good idea. If you, @HyukjinKwon and @icexelloss also agree on a manual reset like this, I'm fine with it. |
|
I'm okay with either way. |
|
Ok. I will use manual reset for now and leave a TODO comment. |
|
It looks like the |
|
Not sure why, but previously calling
override def reset(): Unit = {
+ elementWriter.reset()
super.reset()
- elementWriter.reset()
}
Now with manual reset, this order doesn't affect the test result anymore. I kept the original order and restored it. |
|
Test build #90618 has finished for PR 21312 at commit
|
|
Test build #90621 has finished for PR 21312 at commit
|
|
retest this please. |
| valueVector match { | ||
| case fixedWidthVector: BaseFixedWidthVector => fixedWidthVector.reset() | ||
| case variableWidthVector: BaseVariableWidthVector => variableWidthVector.reset() | ||
| case listVector: ListVector => |
Good catch! So this bug has been there since day 1, when the Arrow writer was introduced, right?
Yeah, I think so.
|
Test build #90626 has finished for PR 21312 at commit
|
icexelloss
left a comment
LGTM!
|
thanks, merging to master/2.3! |
## What changes were proposed in this pull request?
Right now `ArrayWriter` used to output Arrow data for array type, doesn't do `clear` or `reset` after each batch. It produces wrong output.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <[email protected]>
Closes #21312 from viirya/SPARK-24259.
(cherry picked from commit d610d2a)
Signed-off-by: Wenchen Fan <[email protected]>
@viirya I looked into this and found it to be a bug in Arrow when clearing, then reusing a vector. I filed https://issues.apache.org/jira/browse/ARROW-2594. Changing the order of the elements only masked the problem, because otherwise it would reuse a buffer with incorrect values. This won't happen with the change here using the manual reset, so we should be good. |
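The difference between releasing a buffer and resetting it in place can be sketched with a toy model. Everything here is hypothetical (`ToyAllocator`, `ToyVector`, `ClearVsResetDemo`), and it is not Arrow's implementation or a reproduction of ARROW-2594; it only assumes an allocator that may hand back a previously released buffer without zeroing it.

```scala
// Hypothetical allocator that recycles released arrays without zeroing them.
final class ToyAllocator {
  private var free: List[Array[Int]] = Nil

  def allocate(size: Int): Array[Int] = free match {
    case head :: tail if head.length == size =>
      free = tail
      head // recycled buffer, old contents still present
    case _ =>
      new Array[Int](size) // fresh buffer, zero-initialized by the JVM
  }

  def release(buf: Array[Int]): Unit = {
    free = buf :: free
  }
}

// Hypothetical vector with two ways of being prepared for reuse.
final class ToyVector(allocator: ToyAllocator, capacity: Int) {
  private var buf: Array[Int] = null

  private def ensure(): Array[Int] = {
    if (buf == null) buf = allocator.allocate(capacity)
    buf
  }

  def set(i: Int, v: Int): Unit = ensure()(i) = v
  def get(i: Int): Int = ensure()(i)

  // clear: give the buffer back; the next write re-allocates.
  def clear(): Unit = {
    if (buf != null) allocator.release(buf)
    buf = null
  }

  // reset: keep the buffer and zero it in place.
  def reset(): Unit = java.util.Arrays.fill(ensure(), 0)
}

object ClearVsResetDemo {
  private def fill(v: ToyVector): Unit = {
    v.set(0, 7)
    v.set(1, 9)
  }

  // Slot 1 is never rewritten after clear(), so the recycled value shows through.
  def staleAfterClear(): Int = {
    val v = new ToyVector(new ToyAllocator, 2)
    fill(v)
    v.clear()
    v.set(0, 1)
    v.get(1)
  }

  // After reset() the same slot reads as zero.
  def valueAfterReset(): Int = {
    val v = new ToyVector(new ToyAllocator, 2)
    fill(v)
    v.reset()
    v.set(0, 1)
    v.get(1)
  }
}
```

In this model `staleAfterClear()` returns the old value 9 while `valueAfterReset()` returns 0, which mirrors why a manual reset sidesteps reuse of a buffer with incorrect values.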
|
@BryanCutler Thanks! I'm glad we could identify this possible bug. Looking forward to the fix. |
What changes were proposed in this pull request?
Right now `ArrayWriter`, which is used to output Arrow data for array types, doesn't do `clear` or `reset` after each batch, so it produces wrong output.
How was this patch tested?
Added test.