diff --git a/docs/monitoring.md b/docs/monitoring.md index 2717dd091c751..f6d52ef4597e9 100644 --- a/docs/monitoring.md +++ b/docs/monitoring.md @@ -388,6 +388,158 @@ value triggering garbage collection on jobs, and `spark.ui.retainedStages` that Note that the garbage collection takes place on playback: it is possible to retrieve more entries by increasing these values and restarting the history server. +### Executor Task Metrics + +The REST API exposes the values of the Task Metrics collected by Spark executors with the granularity +of task execution. The metrics can be used for performance troubleshooting and workload characterization. +A list of the available metrics, with a short description: + +
| Spark Executor Task Metric name | +Short description | +
|---|---|
| executorRunTime | +Elapsed time the executor spent running this task. This includes time fetching shuffle data. + The value is expressed in milliseconds. | +
| executorCpuTime | +CPU time the executor spent running this task. This includes time fetching shuffle data. + The value is expressed in nanoseconds. | +
| executorDeserializeTime | +Elapsed time spent to deserialize this task. The value is expressed in milliseconds. | +
| executorDeserializeCpuTime | +CPU time taken on the executor to deserialize this task. The value is expressed + in nanoseconds. | +
| resultSize | +The number of bytes this task transmitted back to the driver as the TaskResult. | +
| jvmGCTime | +Elapsed time the JVM spent in garbage collection while executing this task. + The value is expressed in milliseconds. | +
| resultSerializationTime | +Elapsed time spent serializing the task result. The value is expressed in milliseconds. | +
| memoryBytesSpilled | +The number of in-memory bytes spilled by this task. | +
| diskBytesSpilled | +The number of on-disk bytes spilled by this task. | +
| peakExecutionMemory | +Peak memory used by internal data structures created during shuffles, aggregations and + joins. The value of this accumulator should be approximately the sum of the peak sizes + across all such data structures created in this task. For SQL jobs, this only tracks all + unsafe operators and ExternalSort. | +
| inputMetrics.* | +Metrics related to reading data from [[org.apache.spark.rdd.HadoopRDD]] + or from persisted data. | +
| .bytesRead | +Total number of bytes read. | +
| .recordsRead | +Total number of records read. | +
| outputMetrics.* | +Metrics related to writing data externally (e.g. to a distributed filesystem), + defined only in tasks with output. | +
| .bytesWritten | +Total number of bytes written | +
| .recordsWritten | +Total number of records written | +
| shuffleReadMetrics.* | +Metrics related to shuffle read operations. | +
| .recordsRead | +Number of records read in shuffle operations | +
| .remoteBlocksFetched | +Number of remote blocks fetched in shuffle operations | +
| .localBlocksFetched | +Number of local (as opposed to read from a remote executor) blocks fetched + in shuffle operations | +
| .totalBlocksFetched | +Number of blocks fetched in shuffle operations (both local and remote) | +
| .remoteBytesRead | +Number of remote bytes read in shuffle operations | +
| .localBytesRead | +Number of bytes read in shuffle operations from local disk (as opposed to + read from a remote executor) | +
| .totalBytesRead | +Number of bytes read in shuffle operations (both local and remote) | +
| .remoteBytesReadToDisk | +Number of remote bytes read to disk in shuffle operations. + Large blocks are fetched to disk in shuffle read operations, as opposed to + being read into memory, which is the default behavior. | +
| .fetchWaitTime | +Time the task spent waiting for remote shuffle blocks. + This only includes the time blocking on shuffle input data. + For instance if block B is being fetched while the task is still not finished + processing block A, it is not considered to be blocking on block B. + The value is expressed in milliseconds. | +
| shuffleWriteMetrics.* | +Metrics related to operations writing shuffle data. | +
| .bytesWritten | +Number of bytes written in shuffle operations | +
| .recordsWritten | +Number of records written in shuffle operations | +
| .writeTime | +Time spent blocking on writes to disk or buffer cache. The value is expressed + in nanoseconds. | +