
Conversation


@ForVic ForVic commented Aug 6, 2025

What changes were proposed in this pull request?

Add a config `spark.driver.metrics.pollingInterval`, and schedule the driver's metrics polling / heartbeat at that interval.

Why are the changes needed?

Decouple driver and executor heartbeat intervals. Because of the sampling frequency of memory metric reporting, we do not have a fully accurate view of memory stats at drivers and executors. This is particularly visible at the driver, where we don't have the benefit of the larger sample size that N executors provide in an application.

Here we provide a way to increase (or change in general) the rate of metric collection at the driver, to help overcome the sampling problem without requiring users to also increase the executor heartbeat frequency.

Does this PR introduce any user-facing change?

Yes, it introduces a new Spark config, `spark.driver.metrics.pollingInterval`.
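For illustration, a user could opt into faster driver polling like this (a sketch; the `1s` value and app name are mine, not from the PR):

```scala
import org.apache.spark.sql.SparkSession

// Poll driver metrics every second, independently of the executor
// heartbeat interval (illustrative value only).
val spark = SparkSession.builder()
  .appName("driver-metrics-demo")
  .config("spark.driver.metrics.pollingInterval", "1s")
  .getOrCreate()
```

The same setting could also be passed via `--conf` on `spark-submit`.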

How was this patch tested?

Verified that metric collection improved when the sampling rate was increased, and verified that the number of events was as expected when the rate was changed.

Methodology for validating that more frequent driver heartbeats would improve memory metric collection:

  1. Using a 6 GB driver heap, wrote a job to broadcast a table, gradually increasing the size of the table until OOM.
  2. Increased driver memory to 10 GB, large enough for the same broadcast to succeed.
  3. Repeated this job and tracked the peak memory usage written to the event log.
  4. After repeated experiments, observed that the median tracked peak heap usage was <=5 GiB.
  5. Added my change and decreased the heartbeat interval.
  6. Re-ran the same jobs with the 10 GB heap and saw that the typical tracked peak memory usage was ~8 GiB, more accurately reflecting the increased memory needs.
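The sampling problem this methodology probes can be shown with a toy model (my own sketch, not code from the PR): a short memory spike between polls is invisible to a coarse poller, while a finer polling interval records a peak close to the true one.

```scala
// Toy memory-usage curve in MiB: a steady 4000 MiB baseline with a
// short broadcast spike to 8000 MiB between t = 495 and t = 505 ms.
val usage: Int => Int = t => if (t >= 495 && t <= 505) 8000 else 4000

// Peak observed by a poller sampling every `intervalMs` milliseconds.
def observedPeak(intervalMs: Int): Int =
  (0 until 1000 by intervalMs).map(usage).max

val coarse = observedPeak(100) // samples at 0, 100, ..., 900: misses the spike
val fine   = observedPeak(5)   // samples at 0, 5, ..., 995: catches the spike
```

Here the coarse poller reports a 4000 MiB peak while the fine poller reports 8000 MiB, mirroring the <=5 GiB vs ~8 GiB gap in the experiment above.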

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Aug 6, 2025
@ForVic ForVic marked this pull request as ready for review August 6, 2025 21:40

ForVic commented Aug 6, 2025

cc @mridulm @robreeves @shardulm94


@mridulm mridulm left a comment


LGTM

Scratch that - added a comment below


@mridulm mridulm left a comment


+CC @dongjoon-hyun for thoughts as well.

.createWithDefaultString("10s")

private[spark] val DRIVER_HEARTBEAT_INTERVAL =
  ConfigBuilder("spark.driver.heartbeatInterval")
Contributor


Given this is mostly just for reporting metrics, I am more inclined to rename this from heartbeat to something which more clearly conveys the intent.

Contributor Author


how about spark.driver.metricsReportingInterval

Member


+1 for @mridulm 's comment.

Contributor Author


Oh, I see we already differentiate with `spark.executor.metrics.pollingInterval`; I'll call this `spark.driver.metrics.pollingInterval`?

@mridulm mridulm self-requested a review August 24, 2025 20:36
@dongjoon-hyun

Thank you for pinging me, @mridulm. I'm catching up on community reviews this week.


@dongjoon-hyun dongjoon-hyun left a comment


In general, the background of the proposal sounds reasonable to me. I'll revisit later when the PR has the final code.

@ForVic ForVic changed the title [SPARK-53157][CORE] Decouple driver and executor heartbeat polling intervals [SPARK-53157][CORE] Decouple driver and executor polling intervals Aug 26, 2025
private[spark] val DRIVER_METRICS_POLLING_INTERVAL =
  ConfigBuilder("spark.driver.metrics.pollingInterval")
    .version("4.1.0")
    .fallbackConf(EXECUTOR_HEARTBEAT_INTERVAL)
Member


Could you add more fields like `spark.executor.metrics.pollingInterval` has, e.g., `.doc` and `.timeConf`? In addition, it would be great if we could place this new configuration somewhere near where the driver configs live, instead of here.

private[spark] val EXECUTOR_METRICS_POLLING_INTERVAL =
  ConfigBuilder("spark.executor.metrics.pollingInterval")
    .doc("How often to collect executor metrics (in milliseconds). " +
      "If 0, the polling is done on executor heartbeats. " +
      "If positive, the polling is done at this interval.")
    .version("3.0.0")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("0")

Contributor Author


Added a doc string, but we can't add both `.timeConf` and `.fallbackConf`, and if the user has a non-default `EXECUTOR_HEARTBEAT_INTERVAL` we should fall back to matching behavior. Thoughts?
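For reference, the `.fallbackConf` pattern under discussion resolves to the parent config's value (and its time type) when the new key is unset. A sketch of roughly what the merged definition could look like (the doc wording here is mine, not the committed text):

```scala
private[spark] val DRIVER_METRICS_POLLING_INTERVAL =
  ConfigBuilder("spark.driver.metrics.pollingInterval")
    .doc("How often to collect driver metrics. If not set, falls back " +
      "to the executor heartbeat interval.")
    .version("4.1.0")
    .fallbackConf(EXECUTOR_HEARTBEAT_INTERVAL)
```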


@dongjoon-hyun dongjoon-hyun left a comment


To @mridulm, sorry, but I'm not sure this PR is going in the right direction. I'll leave the rest of the reviews to you, since it seems that LinkedIn requires this feature.


mridulm commented Sep 1, 2025

Would love to hear your thoughts on the PR @dongjoon-hyun !
If this is not beneficial for the community, I would rather keep this internal than make a change which will not benefit all users.


@pan3793 pan3793 left a comment


Increasing the heartbeat reporting frequency of the driver alone will improve the accuracy of driver metrics without introducing much overhead, and benefit diagnostic/automatic tuning tools.


mridulm commented Sep 10, 2025

Thanks @pan3793.
I am inclined to merge this, Dongjoon, unless you think it is not a great idea!
Will circle back to this later this week, based on inputs. Thanks!

@mridulm mridulm closed this in 04a40ab Sep 12, 2025

mridulm commented Sep 12, 2025

Merged to master.
Thanks for working on this @ForVic !
Thanks for the reviews and feedback @dongjoon-hyun , @pan3793 :-)

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-53157][CORE] Decouple driver and executor polling intervals [SPARK-53157][CORE] Decouple driver and executor polling intervals Oct 15, 2025
dongjoon-hyun added a commit that referenced this pull request Oct 16, 2025
### What changes were proposed in this pull request?

This PR aims to document newly added `core` module configurations as a part of Apache Spark 4.1.0 preparation.

### Why are the changes needed?

To help the users use new features easily.

- #47856
- #51130
- #51163
- #51604
- #51630
- #51708
- #51885
- #52091
- #52382

### Does this PR introduce _any_ user-facing change?

No behavior change because this is a documentation update.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52626 from dongjoon-hyun/SPARK-53926.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
Closes apache#51885 from ForVic/vsunderl/driver_polling_interval.

Authored-by: ForVic <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025