[SPARK-21492] [SQL] Fix memory leak issue in SMJ #25888
Conversation
|
Hi, @Victsm.
|
@dongjoon-hyun |
|
Of course if that is caused by this PR. |
|
ok to test |
|
Test build #111398 has finished for PR 25888 at commit
|
|
retest this please |
|
Test build #111457 has finished for PR 25888 at commit
|
|
@Victsm Can you check the failure? It seems to be related to this PR.
|
Test build #111488 has finished for PR 25888 at commit
|
|
@maropu The test failure should be fixed now.
|
Test build #111512 has finished for PR 25888 at commit
|
|
A bit more context on the 2 corner cases that were fixed in the recent commits:
When this happens, the generated code for the top SMJ inner join might invoke hasNext() twice on the iterator of the bottom SMJ inner join when no more records can be retrieved. The first call frees the iterators of both the left and right children of the bottom SMJ inner join. The second call could then lead to an NPE.
The second corner case is when one of an SMJ inner join's child operators is not codegened, e.g. an SMJ inner join on top of an SMJ left semi join:
In this case, when the top SMJ inner join finishes the join and attempts to release the resources of the iterators of both child operators, the previous version of this PR would attempt to cast the iterators of both children to ScalaIteratorWithBufferedIterator. However, since the right child operator is not codegened, the cast would fail.
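The two corner cases above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `CleanableIterator` and `cleanupIfSupported` are hypothetical names standing in for the codegen-side iterator wrapper, and the `resourceCleaned` flag mirrors the idea of the flag discussed in the review. The sketch assumes that cleanup drops the underlying iterator, so a second hasNext() must be guarded, and that a child iterator may be of a different type, so the cleanup path checks the type instead of casting unconditionally.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class IdempotentCleanupSketch {

    /** Hypothetical wrapper (not a Spark class) that frees resources once its delegate is exhausted. */
    static final class CleanableIterator<T> implements Iterator<T> {
        private Iterator<T> delegate;
        private final Runnable releaseResources;
        private boolean resourceCleaned = false;

        CleanableIterator(Iterator<T> delegate, Runnable releaseResources) {
            this.delegate = delegate;
            this.releaseResources = releaseResources;
        }

        @Override
        public boolean hasNext() {
            // Guard for corner case 1: after cleanup the delegate is null,
            // so a repeated hasNext() must return early instead of dereferencing it.
            if (resourceCleaned) {
                return false;
            }
            if (delegate.hasNext()) {
                return true;
            }
            cleanupResources();
            return false;
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return delegate.next();
        }

        /** Idempotent: the release callback runs at most once. */
        void cleanupResources() {
            if (resourceCleaned) {
                return;
            }
            resourceCleaned = true;
            delegate = null;
            releaseResources.run();
        }
    }

    /**
     * Guard for corner case 2: only iterators of the expected type are cleaned up.
     * Returns true if cleanup was attempted, false if the iterator was of another type.
     */
    static boolean cleanupIfSupported(Iterator<?> it) {
        if (it instanceof CleanableIterator) {
            ((CleanableIterator<?>) it).cleanupResources();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int[] released = {0};
        CleanableIterator<Integer> it =
            new CleanableIterator<>(Arrays.asList(1, 2).iterator(), () -> released[0]++);
        while (it.hasNext()) {
            it.next();
        }
        // A second hasNext() after exhaustion is safe and does not release twice.
        System.out.println(it.hasNext() + " " + released[0]);
        // A plain iterator (standing in for a non-codegened child) is skipped, not cast.
        System.out.println(cleanupIfSupported(Arrays.asList(3).iterator()));
    }
}
```

Checking the type rather than casting keeps the cleanup path safe when only some children are codegened, at the cost of silently skipping iterators that do not support cleanup.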
…after the resources have been cleaned.
|
Test build #112046 has finished for PR 25888 at commit
|
|
Test build #112048 has finished for PR 25888 at commit
|
|
Test build #112062 has finished for PR 25888 at commit
|
|
@Victsm After checking your PR in detail, I believe the changes can fix the leak issue. But instead of fixing this for SortMergeJoin specifically, I recommend introducing a new mechanism as part of the fix, which would add resource cleanup for all SparkPlan nodes. Please have a look at #26164 when you have time. Thanks :)
   */
  private final int numElementsForSpillThreshold;

  private boolean resourceCleand = false;
cleand -> cleaned, but is it simpler to call this 'closed'?
|
Closing in favour of #26164
What changes were proposed in this pull request?
This patch builds on top of #23762 to resolve the memory leak issue with SMJ. Specifically, when the underlying iterator from UnsafeExternalRowSorter is not fully consumed by SortMergeJoinExec, the memory requested by UnsafeExternalSorter is not released, which leads to OOM, spill, and task failures. On top of the change introduced in #23762, this patch fixes the issue when whole-stage codegen is enabled.
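The leak described above can be reproduced in miniature. The following sketch is a toy model, not Spark code: `ToySortedIterator` is a hypothetical stand-in for a sorter-backed iterator that holds memory until it is exhausted, and `countMatches` is a simplified inner merge join over sorted, unique keys. It assumes memory is released automatically only on exhaustion, so a join that stops early on one side leaves the other side's memory held unless it calls cleanup explicitly.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class EarlyTerminationLeakSketch {

    /** Toy stand-in for a sorter-backed iterator that holds memory until exhausted. */
    static final class ToySortedIterator implements Iterator<Integer> {
        private final int[] sortedKeys;
        private int pos = 0;
        private long heldBytes;

        ToySortedIterator(int[] sortedKeys, long heldBytes) {
            this.sortedKeys = sortedKeys;
            this.heldBytes = heldBytes;
        }

        @Override
        public boolean hasNext() {
            if (pos < sortedKeys.length) {
                return true;
            }
            // Memory is released automatically only once the iterator is exhausted.
            cleanupResources();
            return false;
        }

        @Override
        public Integer next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return sortedKeys[pos++];
        }

        void cleanupResources() {
            heldBytes = 0;
        }

        long heldBytes() {
            return heldBytes;
        }
    }

    /**
     * Simplified inner merge join over sorted, unique keys; returns the match count.
     * The join stops as soon as either side runs out, so the other side may be
     * left partially consumed. With cleanupOnFinish, both sides are released anyway.
     */
    static int countMatches(ToySortedIterator left, ToySortedIterator right,
                            boolean cleanupOnFinish) {
        int matches = 0;
        if (left.hasNext() && right.hasNext()) {
            int l = left.next();
            int r = right.next();
            while (true) {
                if (l == r) {
                    matches++;
                    if (!left.hasNext() || !right.hasNext()) break;
                    l = left.next();
                    r = right.next();
                } else if (l < r) {
                    if (!left.hasNext()) break;
                    l = left.next();
                } else {
                    if (!right.hasNext()) break;
                    r = right.next();
                }
            }
        }
        if (cleanupOnFinish) {
            left.cleanupResources();
            right.cleanupResources();
        }
        return matches;
    }

    public static void main(String[] args) {
        ToySortedIterator left = new ToySortedIterator(new int[] {1, 2}, 1024);
        ToySortedIterator right = new ToySortedIterator(new int[] {1, 2, 3, 4, 5}, 1024);
        countMatches(left, right, false);
        // The right side was not fully consumed, so its memory is still held.
        System.out.println("held without cleanup: " + right.heldBytes());
    }
}
```

In this toy model the explicit cleanup at the end of the join is what prevents the leak; the real fix has to make the same call reachable from generated code as well as from the interpreted path.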
How was this patch tested?
Manually tested against the example provided in SPARK-21492 to verify both inner join and non-inner join work when whole-stage codegen is either enabled or disabled.