[SPARK-53157][CORE] Decouple driver and executor polling intervals #51885
Conversation
LGTM
Scratch that - added a comment below
mridulm left a comment:
+CC @dongjoon-hyun for thoughts as well.
    .createWithDefaultString("10s")

private[spark] val DRIVER_HEARTBEAT_INTERVAL =
  ConfigBuilder("spark.driver.heartbeatInterval")
Given this is mostly just for reporting metrics, I am more inclined to rename this from heartbeat to something that more clearly conveys the intent.
How about `spark.driver.metricsReportingInterval`?
+1 for @mridulm's comment.
Oh, I see we already differentiate with `spark.executor.metrics.pollingInterval`; shall I call this `spark.driver.metrics.pollingInterval`?
Thank you for pinging me, @mridulm. I'm catching up on the community reviews this week.
dongjoon-hyun left a comment:
In general, the background of the proposal sounds reasonable to me. I'll revisit later when the PR has the final code.
private[spark] val DRIVER_METRICS_POLLING_INTERVAL =
  ConfigBuilder("spark.driver.metrics.pollingInterval")
    .version("4.1.0")
    .fallbackConf(EXECUTOR_HEARTBEAT_INTERVAL)
Could you add more fields like `spark.executor.metrics.pollingInterval` has, e.g., `.doc` and `.timeConf`? In addition, it would be great if we could place this new configuration somewhere near where the driver configs live, instead of here.
spark/core/src/main/scala/org/apache/spark/internal/config/package.scala
Lines 369 to 376 in 86dad83
private[spark] val EXECUTOR_METRICS_POLLING_INTERVAL =
  ConfigBuilder("spark.executor.metrics.pollingInterval")
    .doc("How often to collect executor metrics (in milliseconds). " +
      "If 0, the polling is done on executor heartbeats. " +
      "If positive, the polling is done at this interval.")
    .version("3.0.0")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("0")
Added a doc string, but we can't add both `.timeConf` and `.fallbackConf`, and if a user has a non-default `EXECUTOR_HEARTBEAT_INTERVAL`, we should fall back to matching behavior. Thoughts?
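For reference, a sketch of the shape being discussed (hypothetical, not the merged code): since `.fallbackConf` inherits its value type, in this case a time config in milliseconds, from `EXECUTOR_HEARTBEAT_INTERVAL`, chaining `.timeConf` on the same builder is not possible, so only `.doc` and `.version` can be added alongside the fallback.

```scala
// Hypothetical sketch of the config definition under discussion; the doc
// wording here is illustrative, not the text that was merged.
private[spark] val DRIVER_METRICS_POLLING_INTERVAL =
  ConfigBuilder("spark.driver.metrics.pollingInterval")
    .doc("How often the driver polls and reports its metrics. " +
      "Falls back to the executor heartbeat interval when not set.")
    .version("4.1.0")
    .fallbackConf(EXECUTOR_HEARTBEAT_INTERVAL) // carries the time type; no .timeConf here
```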
dongjoon-hyun left a comment:
To @mridulm: sorry, but I'm not sure this PR is going in the right direction. I'll leave the rest of the review to you, since it seems that LinkedIn requires this feature.
Would love to hear your thoughts on the PR, @dongjoon-hyun!
pan3793 left a comment:
Increasing the heartbeat reporting frequency of the driver alone will improve the accuracy of driver metrics without introducing much overhead, and benefit diagnostic/automatic tuning tools.
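The sampling argument above can be made concrete with a small illustration (this is a standalone sketch, not Spark code; the memory curve and numbers are invented): a coarse polling interval can entirely miss a short-lived memory spike, such as one caused by materializing a broadcast, while a finer interval catches it.

```scala
// Illustrative sketch only: why a shorter polling interval observes
// transient peaks that coarse sampling misses. All values are synthetic.
object PollingIntervalSketch {
  // Synthetic driver heap usage in MiB at time t (seconds):
  // a brief 3-second spike on top of a 4000 MiB baseline.
  def heapUsage(t: Int): Double =
    if (t >= 31 && t <= 33) 8000.0 else 4000.0

  // Peak heap observed when sampling every `intervalSec` seconds
  // over a 2-minute window.
  def observedPeak(intervalSec: Int): Double =
    (0 until 120 by intervalSec).map(heapUsage).max

  def main(args: Array[String]): Unit = {
    // Samples at t = 0, 10, 20, 30, 40, ... never land inside the spike.
    println(s"10s polling observes peak: ${observedPeak(10)} MiB")
    // Samples every second do land inside the spike.
    println(s"1s polling observes peak:  ${observedPeak(1)} MiB")
  }
}
```

With a single driver there is no averaging across N executors to compensate, which is why the observed peak is so sensitive to the polling interval.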
Thanks, @pan3793.
Merged to master.
### What changes were proposed in this pull request?

This PR aims to document newly added `core` module configurations as a part of Apache Spark 4.1.0 preparation.

### Why are the changes needed?

To help the users use new features easily.

- #47856
- #51130
- #51163
- #51604
- #51630
- #51708
- #51885
- #52091
- #52382

### Does this PR introduce _any_ user-facing change?

No behavior change because this is a documentation update.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52626 from dongjoon-hyun/SPARK-53926.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Add a config `spark.driver.metrics.pollingInterval`, and schedule the driver metrics polling / heartbeat at that interval.

### Why are the changes needed?

Decouple driver and executor heartbeat intervals. Due to the sampling frequency of memory metric reporting, we do not have a 100% accurate view of stats at drivers and executors. This is particularly observed at the driver, where we don't have the benefit of the larger sample size of metrics from N executors in an application. This change provides a way to increase (or change in general) the rate of metric collection at the driver, to help overcome the sampling problem, without requiring users to also increase executor heartbeat frequencies.

### Does this PR introduce _any_ user-facing change?

Yes, it introduces a Spark config.

### How was this patch tested?

Verified that metric collection improved when sampling rates were increased, and verified that the number of events was as expected when the rate was changed.

Methodology for validating that an increased driver heartbeat frequency improves memory metric collection:

1. Using a 6 GB driver heap, wrote a job to broadcast a table, gradually increasing the size of the table until OOM.
2. Increased driver memory to 10 GB, large enough for the same broadcast to succeed.
3. Repeated this job and tracked the peak memory usage that was written to the event log.
4. After repeated experiments, observed that the typical median peak heap usage was tracked at <=5 GiB.
5. Added my change, and decreased the heartbeat interval.
6. Re-ran the same jobs with the 10 GB heap, and saw that the typical peak memory usage tracked was ~8 GiB, more accurately reflecting the increased memory needs.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51885 from ForVic/vsunderl/driver_polling_interval.

Authored-by: ForVic <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>