
Conversation

@aarondav
Contributor

@aarondav aarondav commented Jul 7, 2014

Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst case scenarios occur if the RDD is cached (guaranteed to load whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead.

Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or understand what is occurring.

This PR adds a flag to disable local execution, which is turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.
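For illustration only (not part of the original discussion), here is a minimal sketch of the failure mode described above: under driver-local execution, take() on a cached RDD with a highly selective filter forces an entire partition's data to be processed on the driver. The object name, app name, and element counts are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TakeLocalExecutionExample {
  def main(args: Array[String]): Unit = {
    // local[2] is used only so the snippet runs standalone; the failure mode
    // matters on a real cluster, where the driver and executors are separate.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("take-local-exec-example"))
    try {
      // A cached RDD with fairly large partitions.
      val big = sc.parallelize(1 to 10000000, numSlices = 4).cache()
      big.count() // materialize the cache

      // A highly selective filter: to find even one match, take(1) must scan a
      // whole partition, and with local execution that partition (for a cached
      // RDD, the whole in-memory block) is handled on the driver itself.
      val few = big.filter(_ % 1000000 == 0).take(1)
      println(few.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

With the flag off (the new default), the same take() runs as a normal job on the cluster and shows up in the web UI like any other job.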

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16384/

@rxin
Contributor

rxin commented Jul 8, 2014

Maybe we should also fix the problem that local execution transfers the whole in-memory block (in fact, perhaps local execution should just bypass the in-memory data entirely)?
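A rough, self-contained sketch of the two behaviors being contrasted here; the names (LocalReadStrategy, fetchBlock, recompute) are illustrative stand-ins, not Spark's DAGScheduler or BlockManager APIs.

```scala
object LocalReadSketch {
  // Two ways a driver-local take() could obtain a partition's data when the
  // RDD is cached on a remote executor.
  sealed trait LocalReadStrategy
  case object FetchWholeCachedBlock extends LocalReadStrategy // ship the entire in-memory block to the driver
  case object RecomputeFromLineage extends LocalReadStrategy  // bypass the cached copy, recompute lazily

  def localIterator[T](
      strategy: LocalReadStrategy,
      fetchBlock: () => Seq[T],             // eager: whole block transferred before iteration starts
      recompute: () => Iterator[T]): Iterator[T] = strategy match {
    case FetchWholeCachedBlock => fetchBlock().iterator // worst case for driver memory
    case RecomputeFromLineage  => recompute()           // the suggested alternative
  }
}
```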

@pwendell
Contributor

@rxin is there a case where you think local execution will yield a relevant performance improvement? I don't see why shipping a task to the cluster for a few milliseconds is a big deal. The main use case I see for this is people running take() in a REPL; in that case the cluster scheduler is not backlogged, because they can't access the REPL at all until the prior command has finished anyway.

@rxin
Contributor

rxin commented Jul 15, 2014

When the cluster is busy and backlogged ...

@aarondav
Contributor Author

I think it makes more sense for a command to simply not run than for certain commands to happen to be runnable while there are no cluster resources. This sort of execution also puts more stress on the driver, and an OutOfMemoryError on the driver is far more serious than one on an executor (for example, this issue).

My hypothesis is that this feature is rarely useful, and often leads to more confusion for users and potentially less stability.

@rxin
Contributor

rxin commented Aug 14, 2014

Now that I think about it more, LGTM.

@mengxr
Contributor

mengxr commented Aug 14, 2014

:)

@mengxr
Contributor

mengxr commented Aug 14, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 14, 2014

QA tests have started for PR 1321. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18523/consoleFull

@aarondav aarondav changed the title [RFC] Disable local execution of Spark jobs by default [SPARK-3029] Disable local execution of Spark jobs by default Aug 14, 2014
@SparkQA

SparkQA commented Aug 14, 2014

QA results for PR 1321:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18523/consoleFull

@rxin
Contributor

rxin commented Aug 14, 2014

You need to update the test suites.
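For context, a hedged sketch of how a suite could opt back into the local-execution code path under the new default. The config key spark.localExecution.enabled is assumed to be the flag added by this PR; the suite name and test body are illustrative and are not the actual DAGSchedulerSuite fix.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

// Hypothetical suite; not the DAGSchedulerSuite change referenced in this PR.
class LocalExecutionOptInSuite extends FunSuite {
  test("take() with local execution explicitly re-enabled") {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("local-exec-opt-in")
      .set("spark.localExecution.enabled", "true") // assumed config key
    val sc = new SparkContext(conf)
    try {
      assert(sc.parallelize(1 to 100, 4).take(3).toSeq === Seq(1, 2, 3))
    } finally {
      sc.stop()
    }
  }
}
```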

@SparkQA

SparkQA commented Aug 14, 2014

QA tests have started for PR 1321. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18531/consoleFull

@SparkQA

SparkQA commented Aug 14, 2014

QA results for PR 1321:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18531/consoleFull

@rxin
Contributor

rxin commented Aug 14, 2014

Merging this in master and branch-1.1. Thanks!

@asfgit asfgit closed this in d069c5d Aug 14, 2014
asfgit pushed a commit that referenced this pull request Aug 14, 2014
Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst case scenarios occur if the RDD is cached (guaranteed to load whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead.

Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or understand what is occurring.

This PR adds a flag to disable local execution, which is turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.

Author: Aaron Davidson <[email protected]>

Closes #1321 from aarondav/allowlocal and squashes the following commits:

136b253 [Aaron Davidson] Fix DAGSchedulerSuite
5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default

(cherry picked from commit d069c5d)
Signed-off-by: Reynold Xin <[email protected]>
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…e being rolled (apache#1321)

### What changes were proposed in this pull request?

This PR aims to support a new configuration to support the minimum number of tasks per executor before being selected as the executor rolling target.

### Why are the changes needed?

Newly created executors might have a long initial setup time during its initial tasks.
In this case, some rolling policies like `AVERAGE_DURATION` might kill those newly created executors.
This PR aims to protect newly created executors until `totalTasks` reaches the minimum number of tasks.

### Does this PR introduce _any_ user-facing change?

No. The default value is 0.

### How was this patch tested?

Pass the CIs with the newly added test case.