Closed
[SPARK-19755][Mesos] adding comment regarding failures of mesos task …
…failures and linking to relevant jira
IgorBerman committed Jun 15, 2018
commit 2c47271176b82e4859667ede9bb02b28b8fba50e
@@ -568,6 +568,10 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
cpus + totalCoresAcquired <= maxCores &&
mem <= offerMem &&
numExecutors < executorLimit &&
// nodeBlacklist() currently only gets updated based on failures in spark tasks.
// If a mesos task fails to even start -- that is,
// if a spark executor fails to launch on a node -- nodeBlacklist does not get updated
// see SPARK-24567 for details
!scheduler.nodeBlacklist().contains(offerHostname) &&
Contributor


I want to make sure everybody understands the big change in behavior here: nodeBlacklist() currently gets updated only on failures of Spark tasks. If a Mesos task fails to even start -- that is, if a Spark executor fails to launch on a node -- nodeBlacklist does not get updated. So a node that is misconfigured somehow could, after this change, keep receiving executor launch attempts, with the executor failing to start each time. That holds even with blacklisting turned on.

For the non-Mesos case this is SPARK-16630. That is being actively worked on now; however, the work there will probably have to be YARN-specific, so there will still be follow-up work to get the same thing for Mesos after that is in.
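
To make the gap concrete, here is a minimal toy model of the behavior described above. It is only a sketch: `ToyBlacklist`, `onTaskFailure`, `onExecutorLaunchFailure`, and the failure threshold are hypothetical names invented for illustration and are not Spark's real blacklist classes or API.

```scala
import scala.collection.mutable

// Toy model only: every name here is hypothetical, not Spark's real API.
object BlacklistGapSketch {
  final class ToyBlacklist(maxTaskFailures: Int) {
    // Per-host count of failed Spark tasks.
    private val taskFailures =
      mutable.Map.empty[String, Int].withDefaultValue(0)

    // A Spark task failed on an executor that did start on this host.
    def onTaskFailure(host: String): Unit =
      taskFailures(host) += 1

    // The Mesos task (the executor itself) failed to launch on this host.
    // Deliberately a no-op: this mirrors the gap described in the comment.
    def onExecutorLaunchFailure(host: String): Unit = ()

    // Hosts whose task-failure count reached the threshold.
    def nodeBlacklist(): Set[String] =
      taskFailures.iterator.collect {
        case (host, n) if n >= maxTaskFailures => host
      }.toSet
  }
}
```

In this model a host can fail to launch executors any number of times and still never appear in `nodeBlacklist()`, which is why the `contains(offerHostname)` check in the diff would not stop repeated launch attempts on a misconfigured node.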

Contributor

@skonto skonto May 19, 2018


@squito sounds reasonable. In the meantime we have to deal with a limitation on the Mesos side, where the value is hardcoded. So we can move on this incrementally.

Member


Maybe comment on this in the code here and add a JIRA for tracking?

Member


This check looks a little late. Can we decline faster, without calculating everything first?
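
One way to read this suggestion, sketched under assumptions: since `&&` short-circuits in Scala, putting the cheap hostname check first means a blacklisted host never reaches the costlier checks. `ToyOffer`, `resourcesFit`, `canLaunch`, and the counter below are hypothetical and exist only to show the ordering; whether this helps in the real method depends on where the resource values are actually computed.

```scala
// Illustrative sketch; ToyOffer is not the Mesos protobuf Offer type.
object EarlyDeclineSketch {
  final case class ToyOffer(hostname: String, cpus: Int, mem: Int)

  // Counts how often the costlier resource check actually runs.
  var resourceChecks = 0

  def resourcesFit(o: ToyOffer, neededCpus: Int, neededMem: Int): Boolean = {
    resourceChecks += 1
    o.cpus >= neededCpus && o.mem >= neededMem
  }

  def canLaunch(o: ToyOffer, blacklist: Set[String]): Boolean =
    // && short-circuits: a blacklisted host never reaches resourcesFit.
    !blacklist.contains(o.hostname) &&
      resourcesFit(o, neededCpus = 2, neededMem = 1024)
}
```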

meetsPortRequirements &&
satisfiesLocality(offerHostname)