Skip to content

Conversation

@Victsm
Copy link
Contributor

@Victsm Victsm commented Sep 22, 2019

What changes were proposed in this pull request?

This patch builds on top of #23762 to resolve the memory leak issue with SMJ. Specifically, when the underlying iterator from UnsafeExternalRowSorter is not fully consumed by SortMergeJoinExec, the requested memory by UnsafeExternalSorter will not be released, which leads to OOM, spill, and task failures. On top of the change introduced in #23762, this patch fixes the issue for when whole-stage codegen is enabled.

How was this patch tested?

Manually tested against the example provided in SPARK-21492 to verify both inner join and non-inner join work when whole-stage codegen is either enabled or disabled.

@dongjoon-hyun
Copy link
Member

Hi, @Victsm .
Could you run dev/scalastyle in your PR directory and fix all errors?

@Victsm
Copy link
Contributor Author

Victsm commented Sep 22, 2019

@dongjoon-hyun
One of the failed Java style issue about the unused import was there before this patch.
Should that also be fixed?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Sep 22, 2019

Of course if that is caused by this PR.

@maropu
Copy link
Member

maropu commented Sep 26, 2019

ok to test

@SparkQA
Copy link

SparkQA commented Sep 26, 2019

Test build #111398 has finished for PR 25888 at commit e081989.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Sep 26, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Sep 27, 2019

Test build #111457 has finished for PR 25888 at commit e081989.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Sep 27, 2019

@Victsm Can you check the failure? It seems that is related to this pr.

@SparkQA
Copy link

SparkQA commented Sep 27, 2019

Test build #111488 has finished for PR 25888 at commit 9f78b70.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Victsm
Copy link
Contributor Author

Victsm commented Sep 28, 2019

@maropu the test failure should be fixed now

@SparkQA
Copy link

SparkQA commented Sep 28, 2019

Test build #111512 has finished for PR 25888 at commit c20b3b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Victsm
Copy link
Contributor Author

Victsm commented Sep 28, 2019

A bit more context on the 2 corner cases that were fixed in the recent commits:
The first corner case is when two SMJs inner joins are stacked together in a task like the following and they both drain the right table iterator but not the left table iterator.

                   SMJ (inner)
                  /         \
               Sort       SMJ (inner)
                            /      \
                          Sort       Sort

When this happens, the generated code for the top SMJ inner join might invoke hasNext() twice on the iterator of the bottom SMJ inner join when no more records can be retrieved. The first time this happens, the iterators of both the bottom SMJ inner join's left and right child will be freed up. The second time this happens, it could lead to NPE.

The second corner case is when one of a SMJ inner join's child operator is not codegened, e.g. a SMJ inner join on top of a SMJ left semi join:

                   SMJ (inner)
                   /         \
               Sort       SMJ (leftsemi)

In this case, when the top SMJ inner join finishes the join and attempts to release the resources of both iterators of its child operator, in the previous version of this PR it would attempt to cast the iterators of both children as ScalaIteratorWithBufferedIterator. However, since the right child operator is not codegened, the casting would fail.

@SparkQA
Copy link

SparkQA commented Oct 14, 2019

Test build #112046 has finished for PR 25888 at commit 7ebfb3e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SMJMemoryLeakTestSuite extends SparkPlanTest with SharedSQLContext

@SparkQA
Copy link

SparkQA commented Oct 14, 2019

Test build #112048 has finished for PR 25888 at commit 830adff.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 15, 2019

Test build #112062 has finished for PR 25888 at commit 265cd93.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SMJMemoryLeakTestSuite extends SparkPlanTest with SharedSparkSession

@xuanyuanking
Copy link
Member

@Victsm After checking your PR in detail, I believe the changes can fix the leak issue. But instead of fixing this for SortMergeJoin particularly, I recommend introducing a new mechanism during the fix, which can involve a resource clean change for all SparkPlan nodes. Please have a look for #26164 when you have time. Thanks :)

*/
private final int numElementsForSpillThreshold;

private boolean resourceCleand = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleand -> cleaned, but is it simpler to call this 'closed'?

@HyukjinKwon
Copy link
Member

Closing in favour of #26164

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants