Conversation

@amuraru
Contributor

@amuraru amuraru commented May 15, 2019

What changes were proposed in this pull request?

When running large shuffles (700 TB of input data, 200k map tasks, 50k reducers on a 300-node cluster) the job regularly OOMs in both the map and reduce phases.

IIUC, ShuffleExternalSorter (map side) and ExternalAppendOnlyMap and ExternalSorter (reduce side) try to max out the available execution memory. This in turn doesn't play nicely with the garbage collector, and executors fail with OutOfMemoryError when the memory allocated by these in-memory structures maxes out the available heap size (in our case we run with 9 cores/executor and 32 GB per executor).

To mitigate this, one can set spark.shuffle.spill.numElementsForceSpillThreshold to force spilling to disk. While this config works, it is not flexible enough: it is expressed in number of elements, and in our case we run multiple shuffles in a single job where element size differs from one stage to another.

This patch extends the spill-threshold behaviour and adds two new parameters to control spilling based on memory usage (see the usage sketch after this list):

  • spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold
  • spark.shuffle.spill.reduce.maxRecordsSizeForSpillThreshold
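For illustration, a minimal sketch of how these would be set at submit time (the threshold values below are invented placeholders, not recommendations):

```scala
import org.apache.spark.SparkConf

// Hedged example: trigger a forced spill once either the element count or the
// in-memory size threshold is crossed. All values are illustrative only.
val conf = new SparkConf()
  // existing count-based threshold (unchanged by this patch)
  .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")
  // new size-based thresholds added by this patch
  .set("spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold", "2g")
  .set("spark.shuffle.spill.reduce.maxRecordsSizeForSpillThreshold", "1g")
```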

How was this patch tested?

  • internal e2e testing using large jobs
    First, for our shuffle-heavy RDD jobs, without setting the existing record-count spill threshold (spark.shuffle.spill.numElementsForceSpillThreshold) the job is unstable and fails consistently with OOMEs in executors.
    Trying to find the right value for numElementsForceSpillThreshold proved to be impossible.
    Trying to maximize job throughput (i.e. memory usage) while ensuring stability left us with unbalanced usage across the job's stages (the in-memory cached "elements" vary in size between the map and reduce sides, compounded by multiple map-reduce shuffles where the "elements" differ).
    Overall, the best we could get in terms of memory usage is depicted in this snapshot:
    [snapshot: executor memory usage across the job's stages]

Working from here, and using spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold (map side) and spark.shuffle.spill.reduce.maxRecordsSizeForSpillThreshold (reduce side) for size-based spills, we could maximize memory usage (and in turn job runtime) while still keeping the job stable:
[snapshot: executor memory usage with the size-based spill thresholds applied]

  • Running existing unit-tests

@amuraru amuraru changed the title [SPARK-27734][CORE][SQL][WIP] Add memory based thresholds for shuffle spill [SPARK-27734][CORE][SQL] Add memory based thresholds for shuffle spill Jun 6, 2019
@holdenk
Contributor

holdenk commented Jun 12, 2019

Jenkins, ok to test.
@amuraru consider putting [WIP] back in the title if you still have outstanding TODOs.

@dacort

dacort commented Jun 12, 2019

Unit tests that need fixing/extending in spark-sql module:

  • UnsafeKVExternalSorterSuite
  • ExternalAppendOnlyUnsafeRowArraySuite
  • ExternalAppendOnlyUnsafeRowArrayBenchmark

@amuraru amuraru changed the title [SPARK-27734][CORE][SQL] Add memory based thresholds for shuffle spill [SPARK-27734][CORE][SQL][WIP] Add memory based thresholds for shuffle spill Jun 13, 2019
@amuraru amuraru force-pushed the size_based_spill branch 6 times, most recently from 3ae6fa0 to 4b52db8 on September 15, 2019 19:43
@github-actions

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 29, 2019
@github-actions github-actions bot closed this Dec 30, 2019
@amuraru amuraru changed the title [SPARK-27734][CORE][SQL][WIP] Add memory based thresholds for shuffle spill [SPARK-27734][CORE][SQL] Add memory based thresholds for shuffle spill Jan 14, 2020
@amuraru
Contributor Author

amuraru commented Jan 14, 2020

Removed the WIP tag - the PR is still valid IMO.
Could a committer have a look, please?

@amuraru
Contributor Author

amuraru commented Jan 14, 2020

/cc @dongjoon-hyun as a committer

@dongjoon-hyun
Member

Thank you for pinging me, @amuraru .
What about the TODO to extend the existing unit tests?
Do you have a plan to finish that, too?

@dongjoon-hyun
Member

BTW, this PR has been inactive for too long. Please resolve the conflicts and check the Jenkins result.

@amuraru
Contributor Author

amuraru commented Feb 17, 2020

@dongjoon-hyun sorry for dropping the ball here.
We have been running this patch in prod with very good results for over a year now - it would be helpful to have it integrated upstream.

I rebased on top of master and fixed all conflicts.
Also, regarding:

What about TODO: extend existing unit-tests?

I updated all the unit tests, but I'm not sure whether net-new UTs are required - the changes are well covered by the existing UTs. Let me know what you think.

@gatorsmile
Member

ok to test

@gatorsmile
Member

add to whitelist

@SparkQA

SparkQA commented Feb 27, 2020

Test build #119010 has finished for PR 24618 at commit ab410fc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member

Ngone51 commented Feb 27, 2020

retest this please.

@SparkQA

SparkQA commented Feb 27, 2020

Test build #119031 has finished for PR 24618 at commit ab410fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Mar 7, 2020

At a quick glance, the Java codegen test failure is probably unrelated. Jenkins, retest this please.

@SparkQA

SparkQA commented Mar 7, 2020

Test build #119496 has finished for PR 24618 at commit ab410fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@amuraru
Contributor Author

amuraru commented Mar 7, 2020

Looking into it

"until we reach some limitations, like the max page size limitation for the pointer " +
"array in the sorter.")
.bytesConf(ByteUnit.BYTE)
.createWithDefault(Long.MaxValue)
Member

@amanomer, you can make this configuration optional via createOptional to represent no limit.

Member

Just a reminder, we need to attach version info for the new configuration now. Just use .version().
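Concretely, the `.version()` reminder amounts to something like this (a sketch reusing the doc text from the excerpt above; the version string is a guess at whichever release the PR lands in):

```scala
private[spark] val SHUFFLE_SPILL_MAP_MAX_SIZE_FORCE_SPILL_THRESHOLD =
  ConfigBuilder("spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold")
    .doc("The maximum size in memory of map-side records before forcing a spill; " +
      "by default we never force-spill until we reach some limitations, like the max " +
      "page size limitation for the pointer array in the sorter.")
    .version("3.1.0") // hypothetical target version
    .bytesConf(ByteUnit.BYTE)
    .createWithDefault(Long.MaxValue)
```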

@dongjoon-hyun
Member

cc @dbtsai

Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
pageCursor += length;
inMemSorter.insertRecord(recordAddress, partitionId);
inMemRecordsSize += length;
Contributor

Should we also include the uaoSize?

Contributor

+1, the pageCursor is also increased by uaoSize and length
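In other words, a hedged sketch of the suggested accounting, reconstructing the surrounding ShuffleExternalSorter.insertRecord code from the excerpt above (assumes the usual length-prefixed record layout):

```java
// Each record is written as [length prefix: uaoSize bytes][payload: length bytes],
// so pageCursor advances by uaoSize + length in total.
final int uaoSize = UnsafeAlignedOffset.getUaoSize();
UnsafeAlignedOffset.putSize(base, pageCursor, length); // write the length prefix
pageCursor += uaoSize;
Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
pageCursor += length;
inMemSorter.insertRecord(recordAddress, partitionId);
inMemRecordsSize += uaoSize + length; // count the prefix, not just the payload
```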

numElementsForSpillThreshold);
spill();
} else if (inMemRecordsSize >= maxRecordsSizeForSpillThreshold) {
logger.info("Spilling data because size of spilledRecords crossed the threshold " +
Contributor

Should we also include the number of records and threshold here?
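For example, a hedged sketch of the enriched log line (SLF4J-style, matching the excerpt; `inMemSorter.numRecords()` is assumed available on the in-memory sorter):

```java
} else if (inMemRecordsSize >= maxRecordsSizeForSpillThreshold) {
  // Report both the size and the record count alongside the configured threshold.
  logger.info("Spilling data because size of in-memory records ({} bytes across {} records) " +
      "crossed the threshold of {} bytes",
      inMemRecordsSize, inMemSorter.numRecords(), maxRecordsSizeForSpillThreshold);
  spill();
}
```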

Contributor

.createWithDefault(Integer.MAX_VALUE)

private[spark] val SHUFFLE_SPILL_MAP_MAX_SIZE_FORCE_SPILL_THRESHOLD =
ConfigBuilder("spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold")
Contributor

What does the "map" mean inside this config name?

Contributor

Why is it necessary to have different threshold between map task and reduce task?

Contributor

I have the same question. What's the use case for configuring them separately?

@jiangxb1987
Contributor

jiangxb1987 commented Mar 19, 2020

This PR looks pretty good. It would be great if we could add a new benchmark case to ensure it doesn't introduce a performance regression when the config values are properly chosen.

@manuzhang
Member

@amuraru
May I ask how you would set those thresholds relative to spark.executor.memory? Would, say, 0.8 * spark.executor.memory be a good candidate for those values?

@amuraru
Contributor Author

amuraru commented Apr 2, 2020

Ack @manuzhang - that makes sense

@manuzhang
Member

manuzhang commented Apr 3, 2020

@amuraru I forgot spark.memory.fraction and spark.memory.storageFraction. How will they play together with the new configurations?

I've been testing Spark Adaptive Query Execution (AQE) recently, where contiguous shuffle partitions are coalesced to avoid too many small tasks. The problem is, IIUC, that AQE makes coalescing decisions based on the size of the serialized map outputs. When data from multiple map tasks gets deserialized into the memory of one reduce task, it can easily blow up. I have to set an extremely large spark.executor.memory to avoid being killed by YARN, which wastes resources. I think this patch is crucial for AQE to work stably.

cc @cloud-fan @maryannxue @JkSelf
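For a rough sense of scale, a back-of-envelope sketch under Spark's unified memory model (all figures assumed for illustration, not taken from this PR):

```scala
// Assumed: 32g executor heap, 9 concurrent task slots, default spark.memory.fraction.
val heapBytes     = 32L * 1024 * 1024 * 1024                   // spark.executor.memory = 32g
val reservedBytes = 300L * 1024 * 1024                         // Spark's fixed reserved memory
val unifiedBytes  = ((heapBytes - reservedBytes) * 0.6).toLong // spark.memory.fraction = 0.6
val perTaskBytes  = unifiedBytes / 9                           // roughly 2.1 GiB per running task
// spark.memory.storageFraction only draws an eviction boundary inside unifiedBytes;
// execution can still borrow from storage, so it doesn't shrink the pool above.
// A per-sorter spill threshold therefore has to be sized against perTaskBytes,
// which is why something like 0.8 * spark.executor.memory would never trigger.
```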

Platform.copyMemory(recordBase, recordOffset, base, pageCursor, length);
pageCursor += length;
inMemSorter.insertRecord(recordAddress, prefix, prefixIsNull);
inMemRecordsSize += length;
Contributor

ditto, uaoSize

@cloud-fan
Contributor

cloud-fan commented Apr 3, 2020

I'm not a big fan of having a static size limitation; can we follow the design of Spark memory management and make it more dynamic? E.g., these "memory consumers" should report their memory usage to the Spark memory manager and spill when the manager asks them to.
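For reference, a minimal sketch of the cooperative mechanism being described: Spark's MemoryConsumer already receives a spill callback from the TaskMemoryManager under memory pressure (the class and helper below are hypothetical, not code from this PR):

```java
import java.io.IOException;
import org.apache.spark.memory.MemoryConsumer;
import org.apache.spark.memory.MemoryMode;
import org.apache.spark.memory.TaskMemoryManager;

// Hypothetical consumer: the TaskMemoryManager invokes spill() when another
// consumer needs memory, so spilling happens dynamically under pressure
// rather than at a static, user-chosen size threshold.
final class SketchSorter extends MemoryConsumer {
  SketchSorter(TaskMemoryManager tmm, long pageSize) {
    super(tmm, pageSize, MemoryMode.ON_HEAP);
  }

  @Override
  public long spill(long size, MemoryConsumer trigger) throws IOException {
    // Flush the in-memory run to disk and report how many bytes were freed.
    return flushInMemoryRunToDisk(); // hypothetical helper
  }

  private long flushInMemoryRunToDisk() {
    return 0L; // placeholder for a real sorted-run writer
  }
}
```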

"until we reach some limitations, like the max page size limitation for the pointer " +
"array in the sorter.")
.bytesConf(ByteUnit.BYTE)
.createWithDefault(Long.MaxValue)
Member

Just a reminder, we need to attach version info for the new configuration now. Just use .version().

"until we reach some limitations, like the max page size limitation for the pointer " +
"array in the sorter.")
.bytesConf(ByteUnit.BYTE)
.createWithDefault(Long.MaxValue)
Member

ditto.

if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
// Check number of elements or memory usage limits, whichever is hit first
if (_elementsRead > numElementsForceSpillThreshold
|| currentMemory > maxSizeForceSpillThreshold) {
Member

I just wonder whether we need maxSizeForceSpillThreshold here, since we already have memory-based control at this point?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jul 17, 2020
@github-actions github-actions bot closed this Jul 18, 2020
@cxzl25
Contributor

cxzl25 commented Aug 23, 2024

Backported this PR to the Spark 3 line, fixing some compilation issues and adding SMJ codegen support. Verified in our production environment: task time is shortened, the number of spills to disk is reduced, there is a better chance to compress the shuffle data, and the size of the data spilled to disk is also significantly reduced.

Current

[screenshot: stage metrics before the patch]
24/08/19 07:02:54,947 [Executor task launch worker for task 0.0 in stage 53.0 (TID 1393)] INFO ShuffleExternalSorter: Thread 126 spilling sort data of 62.0 MiB to disk (11490  times so far)
24/08/19 07:02:55,029 [Executor task launch worker for task 0.0 in stage 53.0 (TID 1393)] INFO ShuffleExternalSorter: Thread 126 spilling sort data of 62.0 MiB to disk (11491  times so far)
24/08/19 07:02:55,093 [Executor task launch worker for task 0.0 in stage 53.0 (TID 1393)] INFO ShuffleExternalSorter: Thread 126 spilling sort data of 62.0 MiB to disk (11492  times so far)
24/08/19 07:08:59,894 [Executor task launch worker for task 0.0 in stage 53.0 (TID 1393)] INFO Executor: Finished task 0.0 in stage 53.0 (TID 1393). 7409 bytes result sent to driver

PR

[screenshot: stage metrics with the patch applied]

@mridulm
Contributor

mridulm commented Aug 23, 2024

Can you create a new PR against master, @cxzl25? We can evaluate it for inclusion in 4.0.

@cxzl25
Contributor

cxzl25 commented Sep 11, 2024

Can you create a new PR against master

#47856

Addressed some of the previous comments in the new PR, and added tests:

  1. inMemRecordsSize now accumulates uaoSize plus length
  2. Removed the separate configurations for map and reduce
  3. Added version info to the new config
  4. Added SMJ codegen support

Please help review, thanks in advance!

attilapiros pushed a commit that referenced this pull request Jul 4, 2025
Original author: amuraru

### What changes were proposed in this pull request?
This PR aims to add memory-based thresholds for shuffle spill.

It introduces the following configurations:
- spark.shuffle.spill.maxRecordsSizeForSpillThreshold
- spark.sql.windowExec.buffer.spill.size.threshold
- spark.sql.sessionWindow.buffer.spill.size.threshold
- spark.sql.sortMergeJoinExec.buffer.spill.size.threshold
- spark.sql.cartesianProductExec.buffer.spill.size.threshold

### Why are the changes needed?

#24618

Previously, spilling could only be controlled by element count, via `spark.shuffle.spill.numElementsForceSpillThreshold`. In some scenarios, a single row may be very large in memory.
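A hedged usage sketch for these configs (keys taken from the list above; the values are placeholders and assume byte-size strings are accepted, as with the shuffle threshold):

```scala
import org.apache.spark.SparkConf

// Illustrative values only; tune per workload.
val conf = new SparkConf()
  .set("spark.shuffle.spill.maxRecordsSizeForSpillThreshold", "2g")
  .set("spark.sql.windowExec.buffer.spill.size.threshold", "256m")
  .set("spark.sql.sortMergeJoinExec.buffer.spill.size.threshold", "256m")
```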

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Verified in the production environment: task time is shortened, the number of spills to disk is reduced, there is a better chance to compress the shuffle data, and the size of the data spilled to disk is also significantly reduced.

**Current**

![Spark UI: stage metrics before the patch](https://github.com/user-attachments/assets/b6e172b8-0da8-4b60-b456-024880d0987e)

```
24/08/19 07:02:54,947 [Executor task launch worker for task 0.0 in stage 53.0 (TID 1393)] INFO ShuffleExternalSorter: Thread 126 spilling sort data of 62.0 MiB to disk (11490  times so far)
24/08/19 07:02:55,029 [Executor task launch worker for task 0.0 in stage 53.0 (TID 1393)] INFO ShuffleExternalSorter: Thread 126 spilling sort data of 62.0 MiB to disk (11491  times so far)
24/08/19 07:02:55,093 [Executor task launch worker for task 0.0 in stage 53.0 (TID 1393)] INFO ShuffleExternalSorter: Thread 126 spilling sort data of 62.0 MiB to disk (11492  times so far)
24/08/19 07:08:59,894 [Executor task launch worker for task 0.0 in stage 53.0 (TID 1393)] INFO Executor: Finished task 0.0 in stage 53.0 (TID 1393). 7409 bytes result sent to driver
```

**PR**
![Spark UI: stage metrics with the patch](https://github.com/user-attachments/assets/aedb83a4-c8a1-4ac9-a805-55ba44ebfc9e)

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47856 from cxzl25/SPARK-27734.

Lead-authored-by: sychen <[email protected]>
Co-authored-by: Adi Muraru <[email protected]>
Signed-off-by: attilapiros <[email protected]>