Conversation

@nraychaudhuri
Contributor

Fix for SPARK-7831

@nraychaudhuri
Contributor Author

@dragos @tnachen @skyluc Could you please take a look at this one?


Instead of killing without failover, we could also start it without failover.

In the start method, it could be used like this:

val driver = createSchedulerDriver(
  master,
  MesosClusterScheduler.this,
  Utils.getCurrentUserName(),
  appName,
  conf,
  Some(frameworkUrl),
  Some(driverFailOver),                               // <-- with or without checkpoint data
  Some(if (driverFailOver) Double.MaxValue else 0.0), // <-- timeout for failover recovery
  fwId)

Contributor Author

Great find @skyluc

I will make the change

@tnachen
Contributor

tnachen commented Jan 14, 2016

jenkins please test

@tnachen
Contributor

tnachen commented Jan 14, 2016

Besides @skyluc's comments and mine, I think this patch LGTM. Have you tested this, by the way?

@nraychaudhuri
Contributor Author

Yes, I have tested this and it seems to work. I will make the necessary changes.

@dragos
Contributor

dragos commented Jan 26, 2016

ok to test

@dragos
Contributor

dragos commented Jan 26, 2016

I confirm that the framework deregisters from Mesos. However, I don't see the old behavior anymore, where the framework stays even after stopping it. The new flag seems to have no effect.

$ sbin/start-mesos-dispatcher.sh  --master mesos://lausanne1.local:5050
starting org.apache.spark.deploy.mesos.MesosClusterDispatcher, logging to /Users/dragos/workspace/Spark/dev/spark/logs/spark-dragos-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-sagitarius.local.out
$ sbin/stop-mesos-dispatcher.sh 
stopping org.apache.spark.deploy.mesos.MesosClusterDispatcher

The framework is gone.

@SparkQA

SparkQA commented Jan 26, 2016

Test build #50096 has finished for PR 10701 at commit 9002258.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tnachen
Contributor

tnachen commented Feb 14, 2016

@dragos do you mean the framework no longer shows up in the UI? The console output doesn't seem to suggest it's gone.

@dragos
Contributor

dragos commented Feb 14, 2016

Yes, that's what I mean.


@tnachen
Contributor

tnachen commented Mar 2, 2016

I've tested this myself, and it is indeed now showing the correct behavior when the flag is not added. I'll need to dig more. @nraychaudhuri, have you tried this as well?

@tnachen
Contributor

tnachen commented Mar 4, 2016

I just found out that this is actually a bug in Mesos: it cannot store a duration larger than int64_t. I filed a Mesos JIRA for this (https://issues.apache.org/jira/browse/MESOS-4862).
As a workaround, please don't use Double.MAX_VALUE; use Integer.MAX_VALUE instead, which is what I did before (I forgot about hitting this in the past). We should also leave a comment to make sure we don't change this until it's fixed.
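The workaround could be sketched as follows. This is a hypothetical adaptation of the createSchedulerDriver call quoted earlier in the thread, not the merged code; the surrounding names (master, appName, driverFailOver, fwId, etc.) all come from that snippet:

```scala
val driver = createSchedulerDriver(
  master,
  MesosClusterScheduler.this,
  Utils.getCurrentUserName(),
  appName,
  conf,
  Some(frameworkUrl),
  Some(driverFailOver),  // with or without checkpoint data
  // MESOS-4862: Mesos stores durations as int64_t nanoseconds, so a
  // failover timeout of Double.MaxValue seconds overflows. Int.MaxValue
  // seconds (~68 years) is effectively "forever" and fits safely.
  // Do not change this back to Double.MaxValue until MESOS-4862 is fixed.
  Some(if (driverFailOver) Int.MaxValue.toDouble else 0.0),
  fwId)
```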

@andrewor14
Contributor

OK, let's not add a flag if it's a bug in Mesos. In the meantime, before they fix it downstream, we can use the workaround @tnachen suggested.

@dragos
Contributor

dragos commented Mar 30, 2016

Sounds good. Who can close this PR?

@andrewor14
Contributor

@nraychaudhuri can you close this PR?

@tnachen
Contributor

tnachen commented Apr 19, 2016

@andrewor14 @nraychaudhuri @dragos Sorry, I'm not suggesting we close this PR; we still need the flag, since we want to be able to choose whether or not to fail over automatically. We only need to revert the particular line where the PR changed the timeout to Double.MAX_VALUE.

@srowen
Member

srowen commented May 6, 2016

@nraychaudhuri can you update or close this PR then?

@tnachen
Contributor

tnachen commented May 10, 2016

Seems like @nraychaudhuri is busy, I'll take this PR and update it myself. We definitely need this to be merged as it's quite useful for testing.

vanzin pushed a commit to vanzin/spark that referenced this pull request Aug 4, 2016
Closing the following PRs due to requests or unresponsive users.

Closes apache#13923
Closes apache#14462
Closes apache#13123
Closes apache#14423 (requested by srowen)
Closes apache#14424 (requested by srowen)
Closes apache#14101 (requested by jkbradley)
Closes apache#10676 (requested by srowen)
Closes apache#10943 (requested by yhuai)
Closes apache#9936
Closes apache#10701
asfgit closed this in 53e766c on Aug 4, 2016
7 participants