Skip to content

Conversation

@tsliwowicz
Copy link
Contributor

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2854 except it's on master

…ster without a remove in between. The cause for that is unknown, and assumed a temp network issue.

    However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

    The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

    Also - added some logging for register and unregister.
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@andrewor14
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Oct 22, 2014

QA tests have started for PR 2886 at commit f48bce9.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 22, 2014

QA tests have finished for PR 2886 at commit f48bce9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22016/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already exists, so remove it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually what I meant was to add a comma between "exists" and "so"... It's ok I can fix this myself when I merge it

@andrewor14
Copy link
Contributor

Hey @tsliwowicz there are a few style and wording issues that I'd like to see fixed. Also, I would prefer to have the whitespace changes reverted. However, I think the fix is correct and this LGTM logically.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2886 at commit df9d98f.

  • This patch merges cleanly.

@tsliwowicz
Copy link
Contributor Author

@andrewor14 - thanks for the comments. I believe I fixed them all. Let me know!

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2886 at commit 094d508.

  • This patch merges cleanly.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22071/
Test FAILed.

@tsliwowicz
Copy link
Contributor Author

the failure seems technical (not related to my fix), I think. Local maven build works fine for me.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2886 at commit df9d98f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22070/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2886 at commit 094d508.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22072/
Test PASSed.

@andrewor14
Copy link
Contributor

Ok I have merged this into master. It doesn't merge cleanly into 1.1, so @tsliwowicz can you create a new PR against that branch? Thanks.

@andrewor14
Copy link
Contributor

(also branch 1.0)

@asfgit asfgit closed this in 6b48522 Oct 23, 2014
@tsliwowicz
Copy link
Contributor Author

will do. Can you also merge into the 0.9 branch? I will update the PR I already have for it. #2854

@tsliwowicz
Copy link
Contributor Author

@andrewor14 I created PR for 1.0, 1.1 and updated the 0.9 PR - can you please review and merge if ok? Non of the merges were clean, so I decided to do it for each branch.

@andrewor14
Copy link
Contributor

Thanks a lot @tsliwowicz I'll do that shortly.

asfgit pushed a commit that referenced this pull request Oct 24, 2014
…f d...

...ouble registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.1

Author: Tal Sliwowicz <[email protected]>

Closes #2915 from tsliwowicz/branch-1.1-block-mgr-removal and squashes the following commits:

d122236 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
asfgit pushed a commit that referenced this pull request Dec 18, 2014
…f d...

...ouble registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.0

Author: Tal Sliwowicz <[email protected]>

Closes #2914 from tsliwowicz/branch-1.0-block-mgr-removal and squashes the following commits:

1014493 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants