Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented May 13, 2018

What changes were proposed in this pull request?

Right now ArrayWriter used to output Arrow data for array type, doesn't do clear or reset after each batch. It produces wrong output.

How was this patch tested?

Added test.

@viirya
Copy link
Member Author

viirya commented May 13, 2018

cc @HyukjinKwon @BryanCutler

override def reset(): Unit = {
super.reset()
elementWriter.reset()
valueVector.clear()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks @BryanCutler added reset() interface in 0.9.0 mentioned in:

// TODO: reset() should be in a common interface

at apache/arrow@4dbce60 and https://issues.apache.org/jira/browse/ARROW-1962

but if we think about backporting, probably I guess we can go this way as a bug fix as is? Roughly looks making sense.

Would it be also safe to do:

    valueVector match {
      case fixedWidthVector: BaseFixedWidthVector => fixedWidthVector.reset()
      case variableWidthVector: BaseVariableWidthVector => variableWidthVector.reset()
      case repeatedValueVector: BaseRepeatedValueVector => repeatedValueVector.clear()
      case _ =>
    }

? @icexelloss, @BryanCutler and @viirya?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also noticed that @BryanCutler added reset to ListVector. But we can only use clear for now.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from my side but let me leave this to @BryanCutler and @icexelloss.

@viirya
Copy link
Member Author

viirya commented May 13, 2018

Thanks @HyukjinKwon

@SparkQA
Copy link

SparkQA commented May 13, 2018

Test build #90544 has finished for PR 21312 at commit 093728e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 13, 2018

Test build #90547 has finished for PR 21312 at commit 48ef560.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member

Thanks for catching this @viirya! Looks good from a first glance, but my only concern is that clear() will release the vector buffers, where reset() just zeros them out. Let me look into that a little further tomorrow and make sure it won't cause a problem.

@icexelloss
Copy link
Contributor

@viirya Thanks for catching this!

I think we have many tests that excise the array types. I am curious why this is not caught by existing tests, e.g:
https://github.com/apache/spark/blob/master/python/pyspark/sql/tests.py#L4645

@viirya
Copy link
Member Author

viirya commented May 14, 2018

@icexelloss It only happens when there are more than one batch in each partition. Existing tests do not hit this condition. That is why the added test here is doing a repartition:

df = self.data.withColumn("arr", array(col("id"))).repartition(1, "id")

@BryanCutler
Copy link
Member

@viirya I looked into it a bit more and calling clear() won't cause any problems but it does trigger a reallocation of the vector buffers the next time writing. What do you think about changing this to do a manual "reset" so that the buffers can be reused? It just needs to zero out the buffers and set the value count to 0, so something like this:

val buffers = repeatedValueVector.getBuffers(false)
buffers.foreach(buf => buf.setZero(0, buf.capacity()))
repeatedValueVector.setValueCount(0)

Once we upgrade to Arrow 0.10.0, this can be cleaned up because there is a common interface to reset(). I think we should definitely get this backported to the 2.3 branch too.

@icexelloss
Copy link
Contributor

Agree on both points.

@viirya
Copy link
Member Author

viirya commented May 14, 2018

@BryanCutler I have such thought but wondered if it is good to do that. If you @HyukjinKwon @icexelloss are also agreed on manual reset like this, I'm fine with it.

@HyukjinKwon
Copy link
Member

I'm okay with either way.

@viirya
Copy link
Member Author

viirya commented May 15, 2018

Ok. I will use manual reset for now and leave a TODO comment.

@BryanCutler
Copy link
Member

It looks like the ListVector also needs setLastSet to be called with 0, which is only in ListVector. This is fine though, since ListVector is the only vector extending BaseRepeatedValueVector

case listVector: ListVector =>
        val buffers = listVector.getBuffers(false)
        buffers.foreach(buf => buf.setByte(0, buf.capacity()))
        listVector.setValueCount(0)
        listVector.setLastSet(0)

@viirya
Copy link
Member Author

viirya commented May 15, 2018

Not sure why, but previously calling ListVector.clear, I must change the reset order to reset element writer first to pass the test:

 override def reset(): Unit = {
+    elementWriter.reset()
     super.reset()
-    elementWriter.reset()
   }

Now with manual reset, this order doesn't affect test result anymore. I respect original order and restore it back.

@SparkQA
Copy link

SparkQA commented May 15, 2018

Test build #90618 has finished for PR 21312 at commit a2d1444.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2018

Test build #90621 has finished for PR 21312 at commit 400f605.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented May 15, 2018

retest this please.

valueVector match {
case fixedWidthVector: BaseFixedWidthVector => fixedWidthVector.reset()
case variableWidthVector: BaseVariableWidthVector => variableWidthVector.reset()
case listVector: ListVector =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! So this bug was there since the day 1 when we have arrow writer, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think so.

@SparkQA
Copy link

SparkQA commented May 15, 2018

Test build #90626 has finished for PR 21312 at commit 400f605.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@asfgit asfgit closed this in d610d2a May 15, 2018
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.3!

@viirya
Copy link
Member Author

viirya commented May 15, 2018

asfgit pushed a commit that referenced this pull request May 15, 2018
## What changes were proposed in this pull request?

Right now `ArrayWriter` used to output Arrow data for array type, doesn't do `clear` or `reset` after each batch. It produces wrong output.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <[email protected]>

Closes #21312 from viirya/SPARK-24259.

(cherry picked from commit d610d2a)
Signed-off-by: Wenchen Fan <[email protected]>
icexelloss pushed a commit to twosigma/spark that referenced this pull request May 16, 2018
## What changes were proposed in this pull request?

Right now `ArrayWriter` used to output Arrow data for array type, doesn't do `clear` or `reset` after each batch. It produces wrong output.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#21312 from viirya/SPARK-24259.
@BryanCutler
Copy link
Member

Not sure why, but previously calling ListVector.clear, I must change the reset order to reset element writer first to pass the test

@viirya I looked into this and found it to be a bug in Arrow when clearing, then reusing a vector. I filed https://issues.apache.org/jira/browse/ARROW-2594. Changing the order of the elements only masked the problem, because otherwise it would reuse a buffer with incorrect values.

This won't happen with the change here using the manual reset, so we should be good.

@viirya
Copy link
Member Author

viirya commented May 16, 2018

@BryanCutler Thanks! I'm happy we can identify this possible bug. Looking forward to the fixing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants