Changes from 1 commit
31 commits in this pull request:
c8e8abe  SPARK-23429: Add executor memory metrics to heartbeat and expose in e…  (edwinalu, Mar 9, 2018)
5d6ae1c  modify MimaExcludes.scala to filter changes to SparkListenerExecutorM…  (edwinalu, Apr 2, 2018)
ad10d28  Address code review comments, change event logging to stage end.  (edwinalu, Apr 22, 2018)
10ed328  Add configuration parameter spark.eventLog.logExecutorMetricsUpdates.…  (edwinalu, May 15, 2018)
2d20367  wip on enum based metrics  (squito, May 23, 2018)
f904f1e  wip ... has both enum and non-enum version  (squito, May 23, 2018)
c502ec4  case objects, mostly complete  (squito, May 23, 2018)
7879e66  Merge pull request #1 from squito/metric_enums  (edwinalu, Jun 3, 2018)
2662f6f  Address comments (move heartbeater from DAGScheduler to SparkContext,…  (edwinalu, Jun 10, 2018)
2871335  SPARK-23429: Add executor memory metrics to heartbeat and expose in e…  (edwinalu, Mar 9, 2018)
da83f2e  modify MimaExcludes.scala to filter changes to SparkListenerExecutorM…  (edwinalu, Apr 2, 2018)
f25a44b  Address code review comments, change event logging to stage end.  (edwinalu, Apr 22, 2018)
ca85c82  Add configuration parameter spark.eventLog.logExecutorMetricsUpdates.…  (edwinalu, May 15, 2018)
8b74ba8  wip on enum based metrics  (squito, May 23, 2018)
036148c  wip ... has both enum and non-enum version  (squito, May 23, 2018)
91fb1db  case objects, mostly complete  (squito, May 23, 2018)
2d8894a  Address comments (move heartbeater from DAGScheduler to SparkContext,…  (edwinalu, Jun 10, 2018)
99044e6  Merge branch 'SPARK-23429.2' of https://github.com/edwinalu/spark int…  (edwinalu, Jun 14, 2018)
263c8c8  code review comments  (edwinalu, Jun 14, 2018)
812fdcf  code review comments:  (edwinalu, Jun 22, 2018)
7ed42a5  Address code review comments. Also make executorUpdates in SparkListe…  (edwinalu, Jun 28, 2018)
8d9acdf  Revert and make executorUpdates in SparkListenerExecutorMetricsUpdate…  (edwinalu, Jun 29, 2018)
20799d2  code review comments: hid array implementation of executor metrics, a…  (edwinalu, Jul 25, 2018)
8905d23  merge with master  (edwinalu, Jul 25, 2018)
a0eed11  address code review comments  (edwinalu, Aug 5, 2018)
03cd5bc  code review comments  (edwinalu, Aug 13, 2018)
10e7f15  Merge branch 'master' into SPARK-23429.2  (edwinalu, Aug 14, 2018)
a14b82a  merge conflicts  (edwinalu, Aug 14, 2018)
2897281  disable stage executor metrics logging by default  (edwinalu, Aug 16, 2018)
ee4aa1d  Merge branch 'master' into SPARK-23429.2  (edwinalu, Sep 6, 2018)
571285b  fix indentation  (edwinalu, Sep 7, 2018)
Add configuration parameter spark.eventLog.logExecutorMetricsUpdates.enabled to enable/disable executor metrics update logging.

Code review comments.

edwinalu committed Jun 14, 2018
commit ca85c8219f46e3265b8191e82a4017c2cb97fc49
5 changes: 3 additions & 2 deletions core/src/main/scala/org/apache/spark/Heartbeater.scala
@@ -26,11 +26,12 @@ import org.apache.spark.util.{ThreadUtils, Utils}
  * intervals of intervalMs.
  *
  * @param reportHeartbeat the heartbeat reporting function to call.
+ * @param name the thread name for the heartbeater.
  * @param intervalMs the interval between heartbeats.
  */
-private[spark] class Heartbeater(reportHeartbeat: () => Unit, intervalMs: Long) {
+private[spark] class Heartbeater(reportHeartbeat: () => Unit, name: String, intervalMs: Long) {
   // Executor for the heartbeat task
-  private val heartbeater = ThreadUtils.newDaemonSingleThreadScheduledExecutor("driver-heartbeater")
+  private val heartbeater = ThreadUtils.newDaemonSingleThreadScheduledExecutor(name)

   /** Schedules a task to report a heartbeat. */
   private[spark] def start(): Unit = {
Contributor: nit: whole class is private[spark] so you don't need to add this to individual methods

Contributor Author: Removed.
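The hunk ends before the body of start(). As a rough sketch of what such a scheduler typically does (hedged: the randomized initial delay and the Utils.logUncaughtExceptions wrapper mirror how Spark's Executor staggers its driver heartbeats, and may differ from the PR's exact code; per the nit above, the method-level private[spark] is dropped):

import java.util.concurrent.TimeUnit

// Sketch: schedule reportHeartbeat every intervalMs on the daemon thread above.
// The randomized initial delay avoids heartbeats from many processes lining up.
def start(): Unit = {
  val initialDelay = intervalMs + (math.random * intervalMs).asInstanceOf[Int]
  val heartbeatTask = new Runnable() {
    override def run(): Unit = Utils.logUncaughtExceptions(reportHeartbeat())
  }
  heartbeater.scheduleAtFixedRate(heartbeatTask, initialDelay, intervalMs, TimeUnit.MILLISECONDS)
}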
core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -155,7 +155,7 @@ private[spark] class Executor(
   private val runningTasks = new ConcurrentHashMap[Long, TaskRunner]

   // Executor for the heartbeat task.
-  private val heartbeater = new Heartbeater(reportHeartBeat,
+  private val heartbeater = new Heartbeater(reportHeartBeat, "executor-heartbeater",
     conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s"))

   // must be initialized before running startDriverHeartbeat()
core/src/main/scala/org/apache/spark/internal/config/package.scala
@@ -69,6 +69,11 @@ package object config {
     .bytesConf(ByteUnit.KiB)
     .createWithDefaultString("100k")

+  private[spark] val EVENT_LOG_EXECUTOR_METRICS_UPDATES =
+    ConfigBuilder("spark.eventLog.logExecutorMetricsUpdates.enabled")
+      .booleanConf
+      .createWithDefault(true)
Member: should this be "false" for now until we could test this out more, just to be on the safe side?

Contributor Author: That would be safer. I'll change to false, and we can change to true after people have had a chance to test it out.

+
   private[spark] val EVENT_LOG_OVERWRITE =
     ConfigBuilder("spark.eventLog.overwrite").booleanConf.createWithDefault(false)
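Once this flag exists, a user opts in per application. A minimal sketch of the driver-side configuration (the last key comes from the ConfigBuilder above; the app name and log directory are hypothetical, while spark.eventLog.enabled and spark.eventLog.dir are the standard event logging settings):

import org.apache.spark.SparkConf

// Sketch: turn on event logging plus the new executor metrics flag.
val conf = new SparkConf()
  .setAppName("metrics-logging-demo")             // hypothetical app name
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/tmp/spark-events") // hypothetical log directory
  .set("spark.eventLog.logExecutorMetricsUpdates.enabled", "true")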
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -211,7 +211,7 @@ class DAGScheduler(
   taskScheduler.setDAGScheduler(this)

   /** driver heartbeat for collecting metrics */
-  private val heartbeater: Heartbeater = new Heartbeater(reportHeartBeat,
+  private val heartbeater: Heartbeater = new Heartbeater(reportHeartBeat, "driver-heartbeater",
Contributor: let's not put this in the DAGScheduler please -- this class is fragile enough as it is :)

I think this should just go in SparkContext.

Contributor Author: Moved.

     sc.conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s"))

   /** BufferPoolMXBean for direct memory */
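As a hedged sketch of the move the reviewer asks for, the same wiring would sit in SparkContext roughly as follows (the _heartbeater field name and the exact initialization point are assumptions, not the PR's verified final code):

// Hypothetical placement in SparkContext, per the review comment above.
private var _heartbeater: Heartbeater = _

// ... during SparkContext initialization ...
_heartbeater = new Heartbeater(reportHeartBeat, "driver-heartbeater",
  conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s"))
_heartbeater.start()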
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala
@@ -52,6 +52,7 @@ import org.apache.spark.util.{JsonProtocol, Utils}
  * spark.eventLog.overwrite - Whether to overwrite any existing files.
  * spark.eventLog.dir - Path to the directory in which events are logged.
  * spark.eventLog.buffer.kb - Buffer size to use when writing to output streams
+ * spark.eventLog.logExecutorMetricsUpdates.enabled - Whether to log executor metrics updates
  */
 private[spark] class EventLoggingListener(
     appId: String,
@@ -70,6 +71,7 @@ private[spark] class EventLoggingListener(
   private val shouldCompress = sparkConf.get(EVENT_LOG_COMPRESS)
   private val shouldOverwrite = sparkConf.get(EVENT_LOG_OVERWRITE)
   private val shouldLogBlockUpdates = sparkConf.get(EVENT_LOG_BLOCK_UPDATES)
+  private val shouldLogExecutorMetricsUpdates = sparkConf.get(EVENT_LOG_EXECUTOR_METRICS_UPDATES)
   private val testing = sparkConf.get(EVENT_LOG_TESTING)
   private val outputBufferSize = sparkConf.get(EVENT_LOG_OUTPUT_BUFFER_SIZE).toInt
   private val fileSystem = Utils.getHadoopFileSystem(logBaseDir, hadoopConf)
@@ -82,7 +84,7 @@ private[spark] class EventLoggingListener(
   private val compressionCodecName = compressionCodec.map { c =>
     CompressionCodec.getShortName(c.getClass.getName)
   }
-
+logInfo("spark.eventLog.logExecutorMetricsUpdates.enabled is " + shouldLogExecutorMetricsUpdates)
Contributor: doesn't really seem necessary at all, definitely not at INFO level (and indentation is wrong).

Contributor Author: Removed. Thanks, I hadn't meant to push that.

   // Only defined if the file system scheme is not local
   private var hadoopDataStream: Option[FSDataOutputStream] = None

@@ -162,9 +164,11 @@ private[spark] class EventLoggingListener(
   // Events that do not trigger a flush
   override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
     logEvent(event)
-    // clear the peak metrics when a new stage starts
-    liveStageExecutorMetrics.put((event.stageInfo.stageId, event.stageInfo.attemptNumber()),
-      new mutable.HashMap[String, PeakExecutorMetrics]())
+    if (shouldLogExecutorMetricsUpdates) {
+      // record the peak metrics for the new stage
+      liveStageExecutorMetrics.put((event.stageInfo.stageId, event.stageInfo.attemptNumber()),
+        new mutable.HashMap[String, PeakExecutorMetrics]())
+    }
   }

   override def onTaskStart(event: SparkListenerTaskStart): Unit = logEvent(event)
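The map populated here is declared elsewhere in the listener and not shown in this hunk; from its usage, the declaration is presumably along these lines (a sketch, with types inferred from the keys and values above):

// Inferred: peaks tracked per (stageId, stageAttemptId), then per executor id.
private val liveStageExecutorMetrics =
  mutable.HashMap.empty[(Int, Int), mutable.HashMap[String, PeakExecutorMetrics]]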
@@ -179,22 +183,30 @@ private[spark] class EventLoggingListener(

   // Events that trigger a flush
   override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
-    // log the peak executor metrics for the stage, for each executor
-    val accumUpdates = new ArrayBuffer[(Long, Int, Int, Seq[AccumulableInfo])]()
-    val executorMap = liveStageExecutorMetrics.remove(
-      (event.stageInfo.stageId, event.stageInfo.attemptNumber()))
-    executorMap.foreach {
-      executorEntry => {
-        for ((executorId, peakExecutorMetrics) <- executorEntry) {
-          val executorMetrics = new ExecutorMetrics(-1, peakExecutorMetrics.jvmUsedHeapMemory,
-            peakExecutorMetrics.jvmUsedNonHeapMemory, peakExecutorMetrics.onHeapExecutionMemory,
-            peakExecutorMetrics.offHeapExecutionMemory, peakExecutorMetrics.onHeapStorageMemory,
-            peakExecutorMetrics.offHeapStorageMemory, peakExecutorMetrics.onHeapUnifiedMemory,
-            peakExecutorMetrics.offHeapUnifiedMemory, peakExecutorMetrics.directMemory,
-            peakExecutorMetrics.mappedMemory)
-          val executorUpdate = new SparkListenerExecutorMetricsUpdate(
-            executorId, accumUpdates, Some(executorMetrics))
-          logEvent(executorUpdate)
-        }
-      }
-    }
+    if (shouldLogExecutorMetricsUpdates) {
+      // clear out any previous attempts, that did not have a stage completed event
Contributor: One potential issue here -- even though there is a stage completed event, you can still have tasks running from a previous stage attempt (when there is a fetch failure, all existing tasks keep running). Those leftover tasks will affect the memory usage of other tasks which run on those executors.

That said, I dunno if we can do much better here. The alternative would be to track the task start & end events for each stage attempt.

Contributor Author: Tracking task start and end would be some amount of overhead. If it's a relatively unlikely corner case, and unlikely to have much impact on the numbers, it may be better to leave as is.

+      val prevAttemptId = event.stageInfo.attemptNumber() - 1
+      for (attemptId <- 0 to prevAttemptId) {
+        liveStageExecutorMetrics.remove((event.stageInfo.stageId, attemptId))
+      }
+
+      // log the peak executor metrics for the stage, for each executor
Contributor: I'd add a comment here that this will log metrics for all executors that were alive while the stage was running, whether or not they ran any tasks for that stage (I think that's what it will do here, right?)

Contributor Author: Yes, it's all running executors, and it does not filter based on whether they have tasks for the stage. I've updated the comment.

+      val accumUpdates = new ArrayBuffer[(Long, Int, Int, Seq[AccumulableInfo])]()
+      val executorMap = liveStageExecutorMetrics.remove(
+        (event.stageInfo.stageId, event.stageInfo.attemptNumber()))
+      executorMap.foreach {
+        executorEntry => {
+          for ((executorId, peakExecutorMetrics) <- executorEntry) {
Contributor: A couple of minor things -- I wouldn't use executorMap for the Option and then executorEntry for the Map. Also, we tend to prefer using foreach over Scala's for loop; the one exception is that a for comprehension can clean up extra nesting when you've got a bunch of loops (which you actually do here).

So I'd go with

val executorOpt = liveStageExecutorMetrics.remove(
  (event.stageInfo.stageId, event.stageInfo.attemptNumber()))
executorOpt.foreach { execMap =>
  execMap.foreach { case (executorId, peakExecutorMetrics) =>
    ...
  }
}

Or, if you want to use the for loop, use it around everything:

val execOpt = liveStageExecutorMetrics.remove(
    (event.stageInfo.stageId, event.stageInfo.attemptNumber()))
for {
  execMap <- execOpt
  (executorId, peakExecutorMetrics) <- execMap
} {
  ...
}

Contributor Author: Yes, the naming is confusing. Changed to the 1st option.

+            val executorMetrics = new ExecutorMetrics(-1, peakExecutorMetrics.jvmUsedHeapMemory,
+              peakExecutorMetrics.jvmUsedNonHeapMemory, peakExecutorMetrics.onHeapExecutionMemory,
+              peakExecutorMetrics.offHeapExecutionMemory, peakExecutorMetrics.onHeapStorageMemory,
+              peakExecutorMetrics.offHeapStorageMemory, peakExecutorMetrics.onHeapUnifiedMemory,
+              peakExecutorMetrics.offHeapUnifiedMemory, peakExecutorMetrics.directMemory,
+              peakExecutorMetrics.mappedMemory)
+            val executorUpdate = new SparkListenerExecutorMetricsUpdate(
+              executorId, accumUpdates, Some(executorMetrics))
+            logEvent(executorUpdate)
+          }
+        }
+      }
+    }
   }
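For orientation, the constructor call above implies roughly this shape for ExecutorMetrics (a sketch: the field names mirror the peakExecutorMetrics accessors, and reading the first argument as a timestamp is an inference from the -1 passed at stage end, where no single peak time exists; see the thread further down):

// Inferred sketch of the event payload class; the timestamp field name is assumed.
class ExecutorMetrics(
    val timestamp: Long,
    val jvmUsedHeapMemory: Long,
    val jvmUsedNonHeapMemory: Long,
    val onHeapExecutionMemory: Long,
    val offHeapExecutionMemory: Long,
    val onHeapStorageMemory: Long,
    val offHeapStorageMemory: Long,
    val onHeapUnifiedMemory: Long,
    val offHeapUnifiedMemory: Long,
    val directMemory: Long,
    val mappedMemory: Long) extends Serializable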
@@ -266,12 +278,14 @@ private[spark] class EventLoggingListener(
   }

   override def onExecutorMetricsUpdate(event: SparkListenerExecutorMetricsUpdate): Unit = {
-    // For the active stages, record any new peak values for the memory metrics for the executor
-    event.executorUpdates.foreach { executorUpdates =>
-      liveStageExecutorMetrics.values.foreach { peakExecutorMetrics =>
-        val peakMetrics = peakExecutorMetrics.getOrElseUpdate(
-          event.execId, new PeakExecutorMetrics())
-        peakMetrics.compareAndUpdate(executorUpdates)
-      }
-    }
+    if (shouldLogExecutorMetricsUpdates) {
+      // For the active stages, record any new peak values for the memory metrics for the executor
+      event.executorUpdates.foreach { executorUpdates =>
+        liveStageExecutorMetrics.values.foreach { peakExecutorMetrics =>
+          val peakMetrics = peakExecutorMetrics.getOrElseUpdate(
+            event.execId, new PeakExecutorMetrics())
+          peakMetrics.compareAndUpdate(executorUpdates)
Contributor: couldn't you get the right timestamp here to log, as you do for updating the live entity?

Contributor Author: What would be the right timestamp? Peaks for different metrics could have different timestamps.

+        }
+      }
+    }
   }
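compareAndUpdate is not shown in this diff; given how it is called, something along these lines is presumably intended (a sketch: only two fields are shown, and the Boolean return indicating whether any peak moved is an assumption):

// Hedged sketch of PeakExecutorMetrics.compareAndUpdate: fold a new sample into
// the running per-metric maxima. The real class tracks the full metric set above.
class PeakExecutorMetrics {
  var jvmUsedHeapMemory: Long = -1
  var jvmUsedNonHeapMemory: Long = -1

  def compareAndUpdate(metrics: ExecutorMetrics): Boolean = {
    var updated = false
    if (metrics.jvmUsedHeapMemory > jvmUsedHeapMemory) {
      jvmUsedHeapMemory = metrics.jvmUsedHeapMemory
      updated = true
    }
    if (metrics.jvmUsedNonHeapMemory > jvmUsedNonHeapMemory) {
      jvmUsedNonHeapMemory = metrics.jvmUsedNonHeapMemory
      updated = true
    }
    updated
  }
}

Each metric can hit its maximum at a different moment, which is why the author's question above about a single "right" timestamp has no clean answer.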