
Conversation

@kiszk (Member) commented Feb 19, 2018

What changes were proposed in this pull request?

This PR addresses two issues in BufferHolderSparkSubmitSuite.

  1. Although BufferHolderSparkSubmitSuite was meant to allocate a large object several times, it actually allocated the object only once and then reused it.
  2. BufferHolderSparkSubmitSuite may fail due to a timeout.

Assigning a small object before each large allocation solves issue 1 by preventing reuse (see the sketch below).
Increasing the heap size from 4g to 7g solves issue 2; it also avoids an OOM once issue 1 is fixed.
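
To see why reuse hides allocations, here is a minimal, self-contained Scala model of the pitfall (it only mimics the grow-or-reuse behavior; it does not use the real BufferHolder class):

  // Toy model of the reuse pitfall (not the real BufferHolder).
  class ToyHolder(var buffer: Array[Byte] = new Array[Byte](64)) {
    // Returns true only when a new array is actually allocated.
    def grow(neededSize: Int): Boolean =
      if (buffer.length < neededSize) { buffer = new Array[Byte](neededSize); true }
      else false
  }

  object ReuseDemo extends App {
    val holder = new ToyHolder()
    println(holder.grow(1000000))       // true: the first call allocates a large array
    println(holder.grow(500000))        // false: the large array is reused, nothing is allocated
    holder.buffer = new Array[Byte](64) // the fix: assign a small object before each grow
    println(holder.grow(500000))        // true: the allocation path is exercised again
  }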

How was this patch tested?

Updated the existing BufferHolderSparkSubmitSuite.

@SparkQA commented Feb 19, 2018

Test build #87541 has finished for PR 20636 at commit 39a715c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Feb 19, 2018

cc @dongjoon-hyun

@dongjoon-hyun (Member)

Thank you for pinging me and working on this issue, @kiszk.

Member

Ping, @liufengdb and @gatorsmile.

@kiszk (Member, Author) commented Mar 9, 2018

ping @liufengdb and @gatorsmile

@kiszk (Member, Author) commented Mar 19, 2018

ping @gatorsmile and @liufengdb

@kiszk (Member, Author) commented Mar 26, 2018

ping @gatorsmile and @hvanhovell

@kiszk (Member, Author) commented Apr 10, 2018

ping @hvanhovell

@kiszk (Member, Author) commented Apr 10, 2018

retest this please

Contributor

Do we still support this?

Member Author

Good question. Several tests still seem to use local-cluster. Would it be better to use local, even though it may require more memory?

@HyukjinKwon (Member) Aug 1, 2018

I think we support this for testing purposes since, IIRC, it creates separate processes for the workers.

@SparkQA commented Apr 10, 2018

Test build #89146 has finished for PR 20636 at commit 39a715c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Apr 17, 2018

retest this please

@SparkQA commented Apr 17, 2018

Test build #89447 has finished for PR 20636 at commit 39a715c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Apr 18, 2018

retest this please

@SparkQA commented Apr 18, 2018

Test build #89475 has finished for PR 20636 at commit 21b3708.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Apr 18, 2018

retest this please

@SparkQA commented Apr 18, 2018

Test build #89477 has finished for PR 20636 at commit 21b3708.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 18, 2018

Test build #89485 has finished for PR 20636 at commit 21b3708.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Apr 18, 2018

retest this please

@SparkQA commented Apr 18, 2018

Test build #89494 has finished for PR 20636 at commit 21b3708.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell (Contributor)

@kiszk Why do we need to allocate a large array several times? I thought the objective of this test was to check whether we can safely grow to int max. I don't really see the need for resetting the array before every grow call. Am I missing something here?

@kiszk (Member, Author) commented Apr 19, 2018

Ah, I see. I thought the objective was to check whether we can safely allocate a buffer of each size. If I understand correctly, you consider reusing the buffer to be the intention of this test.

Although buffer.reset() existed before I submitted this PR, we do not need it since we do not update the cursor. I also removed setBuffer(holder, smallBuffer) to enable reuse of the buffer.

@kiszk (Member, Author) commented Apr 19, 2018

@hvanhovell BTW, could I ask your opinion on something?
I realized that grow(roundToWord(Integer.MAX_VALUE)) does nothing; there is an integer overflow here.

In holder.grow(roundToWord(Integer.MAX_VALUE)), the argument Integer.MAX_VALUE is passed as roundToWord(0x7FFF_FFFF) -> ByteArrayMethods.roundNumberOfBytesToNearestWord(0x7FFF_FFFF).

In roundNumberOfBytesToNearestWord, the return value is 0x7FFF_FFFF + (8 - (0x7FFF_FFFF & 7)) = 0x8000_0000, which wraps around to a negative int.

  public static int roundNumberOfBytesToNearestWord(int numBytes) {
    int remainder = numBytes & 0x07;  // This is equivalent to `numBytes % 8`
    if (remainder == 0) {
      return numBytes;
    } else {
      return numBytes + (8 - remainder);
    }
  }

Then we execute grow(0x8000_0000). Since neededSize is negative, neither of the two if blocks in grow() is executed.

  void grow(int neededSize) {
    if (neededSize > ARRAY_MAX - totalSize()) {
      throw new UnsupportedOperationException(
        "Cannot grow BufferHolder by size " + neededSize + " because the size after growing " +
          "exceeds size limitation " + ARRAY_MAX);
    }
    final int length = totalSize() + neededSize;
    if (buffer.length < length) {
      // This will not happen frequently, because the buffer is re-used.
      int newLength = length < ARRAY_MAX / 2 ? length * 2 : ARRAY_MAX;
      final byte[] tmp = new byte[newLength];
      Platform.copyMemory(
        buffer,
        Platform.BYTE_ARRAY_OFFSET,
        tmp,
        Platform.BYTE_ARRAY_OFFSET,
        totalSize());
      buffer = tmp;
      row.pointTo(buffer, buffer.length);
    }
  }

As a result, when roundToWord(Integer.MAX_VALUE) is passed to grow(), no allocation happens.

Should we keep this behavior, or should we add a check that neededSize is zero or positive? What do you think?
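
A self-contained way to confirm this overflow (it re-implements the quoted rounding logic locally rather than calling the Spark class):

  object OverflowCheck extends App {
    // Mirrors the quoted roundNumberOfBytesToNearestWord logic.
    def roundToWord(numBytes: Int): Int = {
      val remainder = numBytes & 0x07        // numBytes % 8 for non-negative values
      if (remainder == 0) numBytes else numBytes + (8 - remainder)
    }

    val rounded = roundToWord(Integer.MAX_VALUE)
    println(rounded)                         // -2147483648: the int addition wraps around
    println(rounded == Integer.MIN_VALUE)    // true, so grow() sees a negative neededSize
  }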

@SparkQA commented Apr 20, 2018

Test build #89653 has finished for PR 20636 at commit f946631.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Apr 21, 2018

@hvanhovell When I added new check code to reject a negative growth value, we see the following error: Integer.MAX_VALUE ends up being turned into a negative value.

How should we handle this? Should we pass Integer.MAX_VALUE - n (where n is 64 or something) instead of Integer.MAX_VALUE? WDYT?

 Exception in thread "main" java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size -2147483648 because the size is nevative
        at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:65)
        at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolderSparkSubmitSuite$.main(BufferHolderSparkSubmitSuite.scala:69)
        at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolderSparkSubmitSuite.main(BufferHolderSparkSubmitSuite.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:838)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:166)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:193)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:913)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:924)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
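
For what it is worth, a rough arithmetic check of the Integer.MAX_VALUE - n idea (n = 64 is only an example, and ARRAY_MAX is assumed to be Integer.MAX_VALUE - 15 as in ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH):

  val candidate = Integer.MAX_VALUE - 64          // 2147483583; candidate & 7 is non-zero here
  val rounded = candidate + (8 - (candidate & 7)) // 2147483584: rounds up to a word boundary without overflowing
  assert(rounded > 0 && rounded <= Integer.MAX_VALUE - 15)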

@kiszk (Member, Author) commented May 2, 2018

ping @hvanhovell

2 similar comments
@kiszk (Member, Author) commented May 9, 2018

ping @hvanhovell

@kiszk (Member, Author) commented May 15, 2018

ping @hvanhovell

@kiszk (Member, Author) commented Jun 18, 2018

cc @cloud-fan

@HyukjinKwon (Member)

retest this please

1 similar comment
@kiszk (Member, Author) commented Jul 16, 2018

retest this please

@SparkQA commented Jul 16, 2018

Test build #93123 has finished for PR 20636 at commit a134091.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Jul 17, 2018

retest this please

@SparkQA commented Jul 17, 2018

Test build #93147 has finished for PR 20636 at commit a134091.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Jul 17, 2018

Test build #93153 has finished for PR 20636 at commit a134091.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Jul 17, 2018

retest this please

@SparkQA commented Jul 17, 2018

Test build #93158 has finished for PR 20636 at commit a134091.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Jul 17, 2018

cc @cloud-fan

1 similar comment
@kiszk (Member, Author) commented Jul 28, 2018

cc @cloud-fan

holder.grow(ARRAY_MAX + 1 - holder.totalSize())
assert(false)
} catch {
case _: UnsupportedOperationException => assert(true)
Member

Fix the indents here. assert(true) is a no-op, so just omit it. assert(false) is less useful than fail(...message...), above. Let an unexpected Throwable just fly out of the method to fail it rather than swallow it. But do you really just want to use intercept here?
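
For reference, an intercept-based version of the quoted snippet would look roughly like this (a sketch only, not the exact merged change):

  // ScalaTest's intercept fails the test if no exception is thrown and
  // returns the caught exception, replacing the try/catch + assert pattern.
  val e = intercept[UnsupportedOperationException] {
    holder.grow(ARRAY_MAX + 1 - holder.totalSize())
  }
  assert(e.getMessage.contains("exceeds size limitation"))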

"--master", "local-cluster[2,1,1024]",
"--driver-memory", "4g",
"--master", "local-cluster[1,1,7168]",
"--driver-memory", "7g",
Member

Hm, just wondering whether it will be problematic that the test now spawns a job that needs more than 7G of RAM. Maybe I misunderstand.

@SparkQA commented Aug 8, 2018

Test build #94426 has finished for PR 20636 at commit 84940ee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 8, 2018

Test build #94434 has finished for PR 20636 at commit 2dd1b82.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 8, 2018

Test build #94436 has finished for PR 20636 at commit 5e59aca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Aug 9, 2018

retest this please

@SparkQA commented Aug 9, 2018

Test build #94463 has finished for PR 20636 at commit 5e59aca.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 9, 2018

Test build #94465 has finished for PR 20636 at commit 81d6477.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Aug 9, 2018

retest this please

@SparkQA commented Aug 9, 2018

Test build #94472 has finished for PR 20636 at commit 81d6477.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Aug 9, 2018

retest this please

@cloud-fan (Contributor)

LGTM

@SparkQA commented Aug 9, 2018

Test build #94477 has finished for PR 20636 at commit 81d6477.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit asfgit closed this in 386fbd3 Aug 9, 2018