
Conversation

@JoshRosen
Contributor

This commit fixes a memory leak in JobProgressListener that I introduced in SPARK-2321 and adds a testing framework to ensure that it’s very difficult to inadvertently introduce new memory leaks.

This solution might be overkill, but the main idea is to partition JobProgressListener's state into three buckets: collections that should be empty once Spark is idle, collections that must obey some hard size limit, and collections that have a soft size limit (they can grow arbitrarily large when Spark is active but must shrink to fit within some bound after Spark becomes idle).

Based on this, we can write fairly generic tests that run workloads that submit more than `spark.ui.retainedStages` stages and `spark.ui.retainedJobs` jobs, then check that these various collections' sizes obey their contracts.
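To make these contracts concrete, here is a minimal, purely illustrative sketch of the kind of generic check such a test can run once a workload has finished; the class and field names below are hypothetical stand-ins, not the actual members of JobProgressListener.

```scala
import scala.collection.mutable

// Illustrative stand-ins for the three buckets of listener state described
// above; these are not the real JobProgressListener fields.
class ListenerStateSketch(retainedJobs: Int, retainedStages: Int) {
  // Bucket 1: must be empty whenever Spark is idle.
  val activeJobs = mutable.HashMap[Int, String]()
  // Bucket 2: hard size limit that must hold at all times.
  val completedJobs = mutable.ListBuffer[Int]()
  // Bucket 3: soft size limit that only needs to hold once Spark is idle.
  val stageIdToInfo = mutable.HashMap[Int, String]()

  /** Generic contract check, run after the submitted jobs have completed. */
  def assertInvariantsWhenIdle(): Unit = {
    assert(activeJobs.isEmpty, "idle-empty collections must drain completely")
    assert(completedJobs.size <= retainedJobs,
      s"hard-limited collection exceeded retainedJobs = $retainedJobs")
    assert(stageIdToInfo.size <= retainedStages,
      s"soft-limited collection did not shrink to retainedStages = $retainedStages")
  }
}
```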

@JoshRosen changed the title from "[SPARK-4495] Fix memory leaks in JobProgressListener" to "[SPARK-4495] Fix memory leak in JobProgressListener" on Nov 19, 2014
@JoshRosen
Contributor Author

/cc @kayousterhout @pwendell. This might be over-engineered, but I think it's a pretty bulletproof way to ensure that we never have a memory leak here.

@SparkQA

SparkQA commented Nov 19, 2014

Test build #23639 has started for PR 3372 at commit c73fab5.

  • This patch merges cleanly.

Contributor

So, this kinda threw me off a bit. The code is correct and the test works as it should, but the logic is a little weird because this might remove more elements than needed to satisfy the limits.

This method is called on every change to the passed jobs list, so jobs.size - retainedJobs will be at most 1. If retainedJobs >= 20, you'll remove more elements than needed to satisfy the limit.

This is fine, but it would be nice if this behavior were documented (even if it's just a comment here somewhere), and if the test actually triggered it (by using a value for retainedJobs that would trigger this condition, instead of 5).
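For readers without the diff at hand, the batch-eviction pattern being discussed is roughly the following; this is a paraphrased sketch consistent with the behavior described above, not the exact method from the patch.

```scala
import scala.collection.mutable.ListBuffer

// Sketch of the trimming pattern: when the buffer exceeds its bound, evict a
// batch of the oldest entries rather than only the single overflowing element.
def trimIfNecessary[T](buffer: ListBuffer[T], retained: Int)(onEvict: T => Unit): Unit = {
  if (buffer.size > retained) {
    // Evicting retained / 10 entries (at least 1) at a time is what removes
    // more elements than strictly required once retained >= 20.
    val toRemove = math.max(retained / 10, 1)
    buffer.take(toRemove).foreach(onEvict)
    buffer.trimStart(toRemove)
  }
}
```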

Contributor Author

I agree that this is a little puzzling (this was copied over from the old code). It looks like the pattern here is essentially to create some size-limited collections with a FIFO eviction policy plus some callbacks when items are evicted. A more bulletproof approach would be to create our own size-limited collection wrapper / subclass with these eviction callbacks, since this would prevent mistakes where someone adds an item to the collection but forgets to call trim*IfNecessary. I think we should do this as part of a separate commit, though, since I want to limit the scope of this change and want to get this in now to unblock a different patch.
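The wrapper idea floated above could look something like this hypothetical sketch (nothing like this class is added by this patch): a size-limited FIFO map that runs an eviction callback automatically, so callers can never forget to trim.

```scala
import scala.collection.mutable

// Hypothetical size-limited FIFO map with an eviction callback.
class BoundedFifoMap[K, V](maxSize: Int)(onEvict: (K, V) => Unit) {
  // LinkedHashMap preserves insertion order, which gives FIFO eviction.
  private val entries = mutable.LinkedHashMap[K, V]()

  def put(key: K, value: V): Unit = {
    entries(key) = value
    while (entries.size > maxSize) {
      val (oldestKey, oldestValue) = entries.head
      entries.remove(oldestKey)
      onEvict(oldestKey, oldestValue)
    }
  }

  def get(key: K): Option[V] = entries.get(key)
  def size: Int = entries.size
}
```

With something like this, the bound is enforced on every insertion and the cleanup logic lives in one place, which is the "bulletproof" property described above.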

@pwendell
Contributor

Personally, I think this is fine to add. IMO it's slightly over-engineered relative to how much additional safety it provides. For instance, someone could add a new piece of state and still forget to clean it up. For this test to catch it, they'd have to explicitly add their state to these test fixtures. However, maybe having the code in-line there makes it more obvious that they should do it.

So, happy to have this.

@vanzin
Contributor

vanzin commented Nov 19, 2014

LGTM. The extra code is not nearly as complex as the commit description might suggest, so this is fine to add.

@SparkQA

SparkQA commented Nov 20, 2014

Test build #23639 has finished for PR 3372 at commit c73fab5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23639/

@JoshRosen
Contributor Author

Alright, I'm going to merge this to unblock 1.2 and another patch of mine. Thanks for looking this over!

asfgit pushed a commit that referenced this pull request Nov 20, 2014
This commit fixes a memory leak in JobProgressListener that I introduced in SPARK-2321 and adds a testing framework to ensure that it’s very difficult to inadvertently introduce new memory leaks.

This solution might be overkill, but the main idea is to partition JobProgressListener's state into three buckets: collections that should be empty once Spark is idle, collections that must obey some hard size limit, and collections that have a soft size limit (they can grow arbitrarily large when Spark is active but must shrink to fit within some bound after Spark becomes idle).

Based on this, we can write fairly generic tests that run workloads that submit more than `spark.ui.retainedStages` stages and `spark.ui.retainedJobs` jobs then check that these various collections' sizes obey their contracts.

Author: Josh Rosen <[email protected]>

Closes #3372 from JoshRosen/SPARK-4495 and squashes the following commits:

c73fab5 [Josh Rosen] "data structures" -> collections
be72e81 [Josh Rosen] [SPARK-4495] Fix memory leaks in JobProgressListener

(cherry picked from commit 04d462f)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit closed this in 04d462f on Nov 20, 2014
Contributor

Nit: this comment seems a little unnecessary, given that's what this section is normally for?

@JoshRosen
Contributor Author

By the way, the test infrastructure added in this patch was really useful for preventing memory leaks when I added new collections as part of my web UI job page PR. If you're interested in reviewing JobProgressListener changes, check out #3009. If there are any nits / issues here, I can touch them up as part of that PR.
