[SPARK-4006] In long running contexts, we encountered the situation of double registe... #2886
…ster without a remove in between. The cause is unknown and is assumed to be a temporary network issue.
However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.
The fix: simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts() by doing local cleanup of the maps before trying to add new ones.
Also added some logging for register and unregister.
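The failure mode and the fix described above can be sketched in isolation. This is a minimal illustration, not the actual Spark patch: `BlockManagerRegistry`, the simplified `BlockManagerId` case class, and the `Long` memory value stored in `blockManagerInfo` are assumptions made for the example; only the two map names come from the description.

```scala
import scala.collection.mutable

// Simplified stand-in for Spark's BlockManagerId (assumption: three fields are enough here).
final case class BlockManagerId(executorId: String, host: String, port: Int)

// Hypothetical registry that mirrors only the two maps named in the description.
class BlockManagerRegistry {
  // Keyed by the full id, so a new port means a new key.
  val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, Long]
  // Keyed by executor only, so a re-register finds the stale id here.
  val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]

  // Local cleanup of both maps for one id.
  def removeBlockManager(id: BlockManagerId): Unit = {
    blockManagerInfo.remove(id)
    blockManagerIdByExecutor.remove(id.executorId)
  }

  def register(id: BlockManagerId, maxMem: Long): Unit = {
    if (!blockManagerInfo.contains(id)) {
      // The executor is already known under a different id (for example a new port):
      // drop the stale entry from both maps instead of treating it as fatal.
      blockManagerIdByExecutor.get(id.executorId).foreach { oldId =>
        removeBlockManager(oldId)
      }
      blockManagerIdByExecutor(id.executorId) = id
      blockManagerInfo(id) = maxMem
    }
  }
}
```

In this sketch, registering the same executor twice with two different ports leaves only the second id in both maps, so the two lookups agree again.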
Can one of the admins verify this patch?
ok to test
QA tests have started for PR 2886 at commit
QA tests have finished for PR 2886 at commit
Test PASSed.
already exists, so remove it
actually what I meant was to add a comma between "exists" and "so"... It's ok I can fix this myself when I merge it
Hey @tsliwowicz there are a few style and wording issues that I'd like to see fixed. Also, I would prefer to have the whitespace changes reverted. However, I think the fix is correct and this LGTM logically.
QA tests have started for PR 2886 at commit
@andrewor14 - thanks for the comments. I believe I fixed them all. Let me know!
QA tests have started for PR 2886 at commit
Test FAILed.
the failure seems technical (not related to my fix), I think. Local maven build works fine for me.
QA tests have finished for PR 2886 at commit
Test PASSed.
QA tests have finished for PR 2886 at commit
Test PASSed.
Ok I have merged this into master. It doesn't merge cleanly into 1.1, so @tsliwowicz can you create a new PR against that branch? Thanks.
(also branch 1.0)
will do. Can you also merge into the 0.9 branch? I will update the PR I already have for it. #2854
@andrewor14 I created PRs for 1.0 and 1.1 and updated the 0.9 PR - can you please review and merge if ok? None of the merges were clean, so I decided to do it for each branch.
Thanks a lot @tsliwowicz I'll do that shortly.
…f double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.
The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.
Also - added some logging for register and unregister.
This is just like #2886 except it's on branch-1.1
Author: Tal Sliwowicz <[email protected]>
Closes #2915 from tsliwowicz/branch-1.1-block-mgr-removal and squashes the following commits:
d122236 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
…f double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.
The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.
Also - added some logging for register and unregister.
This is just like #2886 except it's on branch-1.0
Author: Tal Sliwowicz <[email protected]>
Closes #2914 from tsliwowicz/branch-1.0-block-mgr-removal and squashes the following commits:
1014493 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.
However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.
The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.
Also - added some logging for register and unregister.
This is just like #2854 except it's on master