3 changes: 3 additions & 0 deletions docs/job-scheduling.md
@@ -215,6 +215,9 @@ pool), but inside each pool, jobs run in FIFO order. For example, if you create
means that each user will get an equal share of the cluster, and that each user's queries will run in
order instead of later queries taking resources from that user's earlier ones.

If jobs are not explicitely set to use a given pool, they end up in the default pool. This means that even if

Hi @Alexis-D, there are a few minor typos here:
'explicitely' -> 'explicitly'
'ran' -> 'run'

Author

Right, my bad -- I updated the PR.

`spark.scheduler.mode` is set to `FAIR` those jobs will be ran in `FIFO` order (within the default pool).

Contributor

This is not actually correct. There is no reason why you can't define a default pool that uses FAIR scheduling.

Author

I assume you mean that the second sentence is incorrect? I drew that conclusion from empirical observations plus:

```scala
private def buildDefaultPool() {
  if (rootPool.getSchedulableByName(DEFAULT_POOL_NAME) == null) {
    val pool = new Pool(DEFAULT_POOL_NAME, DEFAULT_SCHEDULING_MODE,
      DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT)
    rootPool.addSchedulable(pool)
    logInfo("Created default pool: %s, schedulingMode: %s, minShare: %d, weight: %d".format(
      DEFAULT_POOL_NAME, DEFAULT_SCHEDULING_MODE, DEFAULT_MINIMUM_SHARE, DEFAULT_WEIGHT))
  }
}
```

However, I might very well be missing something?

Contributor

You seem to be missing a few somethings:

1. You can define your own default pool that does FAIR scheduling within that pool, so blanket statements about "the" default pool are dangerous.
2. `spark.scheduler.mode` controls the setup of the `rootPool`, not the scheduling within any pool.
3. If you don't define your own pool with a name corresponding to `DEFAULT_POOL_NAME` (i.e. `"default"`), then you are going to get a default construction of `"default"`, which does use FIFO scheduling within that pool.

So, item 2 effectively means that `spark.scheduler.mode` controls whether fair scheduling is possible at all, and it also defines the kind of scheduling used among the schedulable entities contained in the root pool -- i.e. among the scheduling pools nested within `rootPool`. One of those nested pools will be `DEFAULT_POOL_NAME`/`"default"`, which will use FIFO scheduling for schedulable entities within that pool if you haven't defined it to use fair scheduling.

If you just want one scheduling pool that does fair scheduling among its schedulable entities, then you need to set `spark.scheduler.mode` to `FAIR` in your `SparkConf` and also define, in the pool configuration file, a `default` pool with `schedulingMode` FAIR. Alternatively, you could define such a fair-scheduling-inside pool named something other than `default` and then make sure that all of your jobs get assigned to that pool.
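
For illustration, a minimal sketch of that setup in a Scala application. The file path and the pool name `production` are examples, not anything from this thread, and `spark.scheduler.allocation.file` must point at a pool configuration file that actually defines the pools used:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fair-scheduling-sketch")
      .setMaster("local[4]") // example master, for local testing only
      // FAIR here governs scheduling among the pools nested in rootPool;
      // scheduling *within* each pool is set per pool in the allocation file.
      .set("spark.scheduler.mode", "FAIR")
      // Hypothetical path; to get fair scheduling inside the default pool,
      // this file must define a pool named "default" with
      // <schedulingMode>FAIR</schedulingMode>.
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // The alternative approach: route this thread's jobs to a custom
    // fair-scheduling-inside pool instead of relying on "default".
    sc.setLocalProperty("spark.scheduler.pool", "production")
    println(sc.parallelize(1 to 100).sum())

    sc.stop()
  }
}
```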

Author

Cool, thanks @markhamstra, I think I grasp what's going on now. Some form of your comment would be a useful addition to the documentation; the rationale being that there seems to be a (common?) misunderstanding about how to schedule jobs in a FAIR way, e.g. https://stackoverflow.com/a/37882686/2813687, or myself trying to do this, which led to this very PR. After reading your comment, the current documentation makes sense, and obviously this PR is incorrect (at the very least it doesn't underscore all the caveats/config knobs at play here). I'll take another look at improving the doc so that the actual behavior is obvious to Spark users who aren't familiar with the scheduling nitty-gritty and merely want to run a few jobs concurrently.
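
As a concrete sketch of that "just run a few jobs concurrently" use case (the pool names, thread count, and workload below are illustrative, not from this PR): jobs submitted from different threads can be routed to different pools via the thread-local `spark.scheduler.pool` property.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("concurrent-jobs-sketch")
      .setMaster("local[4]") // example master, for local testing only
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // spark.scheduler.pool is a thread-local property, so each thread's
    // jobs land in their own pool and the pools share resources fairly.
    val threads = (1 to 3).map { i =>
      new Thread(() => {
        sc.setLocalProperty("spark.scheduler.pool", s"pool$i")
        val sum = sc.parallelize(1L to 1000000L).map(_ * 2).sum()
        println(s"pool$i finished: $sum")
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}
```

Note that, per the discussion above, this gives fair scheduling *among* the pools; scheduling *within* each pool stays FIFO unless the pool is configured otherwise in the allocation file.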

## Configuring Pool Properties

Specific pools' properties can also be modified through a configuration file. Each pool supports three