[SPARK-4495] Fix memory leak in JobProgressListener #3372
Conversation
This commit fixes a memory leak in JobProgressListener that I introduced in SPARK-2321 and adds a testing framework to ensure that it’s very difficult to inadvertently introduce new memory leaks.
|
/cc @kayousterhout @pwendell. This might be over-engineered, but I think it's a pretty bulletproof way to ensure that we never have a memory leak here. |
|
Test build #23639 has started for PR 3372 at commit
|
So, this kinda threw me off a bit. The code is correct and the test works as it should, but the logic is a little weird because this might remove more elements than needed to satisfy the limits.
This method is called on every change to the passed jobs list, so at most jobs.size - retainedJobs will be 1. If retainedJobs >= 20, you'll remove more elements than needed to satisfy the limit.
This is fine, but it would be nice if this behavior were documented (even if it's just a comment here somewhere), and if the test actually triggered it (by using a value for retainedJobs that would trigger this condition, instead of 5).
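For readers following along, here is a minimal sketch of the batch-trimming pattern being discussed; the names (`retainedJobs`, `jobIdToData`) mirror the listener's configuration and fields, but this is an illustration, not the exact Spark implementation:

```scala
import scala.collection.mutable

// Sketch of the FIFO trimming pattern discussed above (illustrative only).
class TrimSketch(retainedJobs: Int) {
  private val jobs = mutable.ListBuffer[Int]()                 // stand-in for JobUIData entries
  private val jobIdToData = mutable.HashMap[Int, String]()     // per-job state to clean up

  def addJob(jobId: Int): Unit = {
    jobs += jobId
    jobIdToData(jobId) = s"data for job $jobId"
    trimJobsIfNecessary()
  }

  private def trimJobsIfNecessary(): Unit = {
    if (jobs.size > retainedJobs) {
      // Removes a batch (10% of the limit, at least one element) rather than
      // exactly one element, which is the "more than needed" behavior noted above.
      val toRemove = math.max(retainedJobs / 10, 1)
      jobs.take(toRemove).foreach(jobId => jobIdToData.remove(jobId))
      jobs.trimStart(toRemove)
    }
  }
}
```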
I agree that this is a little puzzling (this was copied over from the old code). It looks like the pattern here is essentially to create some size-limited collections with a FIFO eviction policy plus some callbacks when items are evicted. A more bulletproof approach would be to create our own size-limited collection wrapper / subclass with these eviction callbacks, since this would prevent mistakes where someone adds an item to the collection but forgets to call trim*IfNecessary. I think we should do this as part of a separate commit, though, since I want to limit the scope of this change and want to get this in now to unblock a different patch.
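To illustrate the wrapper idea described above, here is a hypothetical sketch of a bounded FIFO map with an eviction callback; the name `BoundedFifoMap` and its API are made up for this example and are not part of this patch:

```scala
import scala.collection.mutable

// Hypothetical sketch: a size-limited map that evicts oldest entries and
// invokes a callback so dependent state can be cleaned up automatically.
class BoundedFifoMap[K, V](maxSize: Int)(onEvict: (K, V) => Unit) {
  private val underlying = mutable.LinkedHashMap[K, V]()

  def put(key: K, value: V): Unit = {
    underlying(key) = value
    while (underlying.size > maxSize) {
      val (k, v) = underlying.head   // oldest insertion
      underlying.remove(k)
      onEvict(k, v)                  // caller cleans up any related collections here
    }
  }

  def get(key: K): Option[V] = underlying.get(key)
  def size: Int = underlying.size
}

// Usage sketch: eviction automatically removes per-stage side state.
// val stageData = new BoundedFifoMap[Int, String](100)((id, _) => println(s"evicted stage $id"))
```

The point of this design is that callers can no longer forget to trim, because the bound is enforced inside `put`.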
|
Personally, I think this is fine to add. IMO it's slightly over-engineered given how much additional safety it provides. For instance, someone could add a new piece of state and still forget to clean it up. For this test to catch it, they'd have to explicitly add their state to these test fixtures. However, maybe having the code in-line there makes it more obvious that they should do it. So happy to have this. |
|
LGTM. The extra code is not nearly as complex as the commit description might suggest, so this is fine to add. |
|
Test build #23639 has finished for PR 3372 at commit
|
|
Test PASSed. |
|
Alright, I'm going to merge this to unblock 1.2 and another patch of mine. Thanks for looking this over! |
This commit fixes a memory leak in JobProgressListener that I introduced in SPARK-2321 and adds a testing framework to ensure that it's very difficult to inadvertently introduce new memory leaks.

This solution might be overkill, but the main idea is to partition JobProgressListener's state into three buckets: collections that should be empty once Spark is idle, collections that must obey some hard size limit, and collections that have a soft size limit (they can grow arbitrarily large when Spark is active but must shrink to fit within some bound after Spark becomes idle).

Based on this, we can write fairly generic tests that run workloads that submit more than `spark.ui.retainedStages` stages and `spark.ui.retainedJobs` jobs then check that these various collections' sizes obey their contracts.

Author: Josh Rosen <[email protected]>

Closes #3372 from JoshRosen/SPARK-4495 and squashes the following commits:

c73fab5 [Josh Rosen] "data structures" -> collections
be72e81 [Josh Rosen] [SPARK-4495] Fix memory leaks in JobProgressListener

(cherry picked from commit 04d462f)
Signed-off-by: Josh Rosen <[email protected]>
Nit: comment a little unnecessary given that's what this section is normally for?
|
By the way, the test infrastructure added in this patch was really useful for preventing memory leaks when I added new collections as part of my web UI job page PR. If you're interested in reviewing JobProgressListener changes, check out #3009. If there are any nits / issues here, I can touch them up as part of that PR. |
This commit fixes a memory leak in JobProgressListener that I introduced in SPARK-2321 and adds a testing framework to ensure that it’s very difficult to inadvertently introduce new memory leaks.
This solution might be overkill, but the main idea is to partition JobProgressListener's state into three buckets: collections that should be empty once Spark is idle, collections that must obey some hard size limit, and collections that have a soft size limit (they can grow arbitrarily large when Spark is active but must shrink to fit within some bound after Spark becomes idle).
Based on this, we can write fairly generic tests that run workloads that submit more than `spark.ui.retainedStages` stages and `spark.ui.retainedJobs` jobs then check that these various collections' sizes obey their contracts.
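To make the three-bucket idea above concrete, here is a hedged sketch of what such generic contract checks could look like; the helper and its parameters are illustrative and do not reflect the actual JobProgressListenerSuite code:

```scala
// Illustrative sketch of the "contract" checks described above. The groupings
// mirror the three buckets in the description; names here are placeholders.
object LeakContractSketch {

  def checkContracts(
      emptyWhenIdle: Seq[Iterable[_]],        // collections tracking in-flight work
      hardLimited: Seq[(Iterable[_], Int)],   // (collection, hard size bound)
      softLimited: Seq[(Iterable[_], Int)]    // (collection, bound that applies once idle)
    ): Unit = {
    // 1. Everything tracking active work must be empty when nothing is running.
    emptyWhenIdle.foreach(c => assert(c.isEmpty, s"expected empty, found ${c.size} entries"))
    // 2. Hard-limited collections may never exceed their configured retention.
    hardLimited.foreach { case (c, limit) => assert(c.size <= limit) }
    // 3. Soft-limited collections must have shrunk back under their bound by now.
    softLimited.foreach { case (c, limit) => assert(c.size <= limit) }
  }
}
```

A test in this style would run a workload that submits more jobs and stages than the retention limits allow, wait for the listener to process all events, and then call a check like the one sketched above against each of the listener's collections.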