[SPARK-21243][Core] Limit no. of map outputs in a shuffle fetch #18487
Conversation
Test build #78984 has finished for PR 18487 at commit
@rxin @cloud-fan Can you review this PR?
```scala
    .createWithDefault(3)

  private[spark] val REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS =
    ConfigBuilder("spark.reducer.maxBlocksInFlightPerAddress")
```
I'm not sure it's a good idea to do this on the reducer side, because there may be a lot of reducers fetching data from one shuffle service at the same time, and you wouldn't know that on the reducer side. cc @jinxing64
I agree this won't resolve all problems, but it is still good to add a limit to prevent fetching too many blocks from an address at a time.
When the shuffle service gets an OOM, there are always lots of (thousands of) reducers, maybe from different apps, fetching blocks. I'm not sure limiting in-flight blocks on the reducer side will help much.
Also, we already have maxReqsInFlight and maxBytesInFlight. Is it a little redundant to also have maxBlocksInFlightPerAddress?
Sorry for this comment.
@jinxing64 After your fix for lazily loading the open-blocks iterator, I am not seeing issues with the NM crashing on my end. However, requests made with a high no. of blocks, while still under the max constraints, caused increased load. This is an added layer of defense which can mitigate the issue.
I like the idea that we should prevent fetching too many blocks from an address at a time; I left a few comments.
```scala
  private[spark] val REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS =
    ConfigBuilder("spark.reducer.maxBlocksInFlightPerAddress")
      .doc("This configuration limits the number of remote blocks being fetched from a given " +
        " host port at any given point. When external shuffle is enabled and a large number of " +
```
How does whether the external shuffle service is enabled or not affect the behavior? AFAIK this has little relation to the external shuffle service.
In this case it would take down either the NM or the executor serving the map output tasks, based on the shuffle mode. We emphasize the external shuffle case because crashing the NM is more severe than losing an executor. I am open to re-wording it so that it is easier to understand.
At least we should state that the configuration doesn't necessarily go with the external shuffle service.
okay.
| " could crash the Node Manager under increased load. You can mitigate this issue by " + | ||
| " setting it to a lower value.") | ||
| .intConf | ||
| .createWithDefault(Int.MaxValue) |
nit: Should add checkValue to ensure this is above zero.
okay.
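Putting the review feedback together (reworded doc text, an added checkValue, and the default kept at Int.MaxValue), the config entry ends up looking roughly like the sketch below. It is assembled from the diff snippets quoted later in this thread, not from the authoritative merged source:

```scala
  private[spark] val REDUCER_MAX_BLOCKS_IN_FLIGHT_PER_ADDRESS =
    ConfigBuilder("spark.reducer.maxBlocksInFlightPerAddress")
      .doc("This configuration limits the number of remote blocks being fetched per reduce task" +
        " from a given host port. When a large number of blocks are being requested from a given" +
        " address in a single fetch or simultaneously, this could crash the serving executor or" +
        " Node Manager. This is especially useful to reduce the load on the Node Manager when" +
        " external shuffle is enabled. You can mitigate the issue by setting it to a lower value.")
      .intConf
      .checkValue(_ > 0, "The max no. of blocks in flight cannot be non-positive.")
      .createWithDefault(Int.MaxValue)
```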
```scala
  // Checks if sending a new fetch request will exceed the max no. of blocks being fetched from a
  // given remote address.
  def isRemoteAddressMaxedOut(remoteHost: BlockManagerId, request: FetchRequest): Boolean = {
```
Is this remoteHost or remoteAddress?
This should be remoteAddress.
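For context, the check under discussion is a simple per-address counter comparison. A minimal sketch, assuming the numBlocksInFlightPerAddress map and maxBlocksInFlightPerAddress field introduced by this PR (the exact body in the merged code may differ):

```scala
  // Sketch only: returns true if dispatching this request would push the number of
  // in-flight blocks for the address past spark.reducer.maxBlocksInFlightPerAddress.
  def isRemoteAddressMaxedOut(remoteAddress: BlockManagerId, request: FetchRequest): Boolean = {
    numBlocksInFlightPerAddress.getOrElse(remoteAddress, 0) + request.blocks.size >
      maxBlocksInFlightPerAddress
  }
```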
```scala
    if (deferredFetchRequests.nonEmpty) {
      for ((remoteAddress, defReqQueue) <- deferredFetchRequests) {
        while (isRemoteBlockFetchable(defReqQueue) &&
            !isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {
```
If request.blocks.size is above the config value, then isRemoteAddressMaxedOut() will always return true for that request, and it would never leave the deferred queue.
We check the no. of blocks being added to a fetch request. If it reaches the configured limit, we create a new request, so no single request can exceed the limit.
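A minimal sketch of that splitting logic, simplified from the splitLocalRemoteBlocks change quoted further down in this review; it assumes the surrounding method's iterator, address, targetRequestSize, and remoteRequests, and the real loop also skips zero-sized blocks:

```scala
    // Sketch only: group a host's blocks into FetchRequests, capping each request at
    // maxBlocksInFlightPerAddress blocks in addition to the existing size-based cap.
    var curBlocks = new ArrayBuffer[(BlockId, Long)]
    var curRequestSize = 0L
    while (iterator.hasNext) {
      val (blockId, size) = iterator.next()
      curBlocks += ((blockId, size))
      curRequestSize += size
      if (curRequestSize >= targetRequestSize ||
          curBlocks.size >= maxBlocksInFlightPerAddress) {
        remoteRequests += new FetchRequest(address, curBlocks)
        curBlocks = new ArrayBuffer[(BlockId, Long)]
        curRequestSize = 0
      }
    }
```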
Test build #79290 has finished for PR 18487 at commit
Test build #79292 has finished for PR 18487 at commit
@jiangxb1987 I have made the changes requested. Can you have a look? Thanks.
jiangxb1987
left a comment
Looks good overall, but we may need to add some test cases.
```scala
    // Process any outstanding deferred fetch requests if possible.
    if (deferredFetchRequests.nonEmpty) {
      for ((remoteAddress, defReqQueue) <- deferredFetchRequests) {
```
If the traffic is heavy, it may take some time to finish the iteration. Do you have any idea how to reduce that effort?
I didn't get you. We are just iterating to check if there are requests that can be scheduled, and they are handled asynchronously by the send calls. What effort are you referring to?
I was thinking maybe we can avoid adding the extra deferredFetchRequests structure to handle deferred fetch requests; instead we could iterate over fetchRequests and send the requests that are not maxed out. That way we might simplify the logic. Would you like to try?
Oh yes. That was the first choice, and I gave it a try to avoid adding any extra bookkeeping. There are issues with that approach. Say you have a request which has to be deferred: you just remove it, push it to the end of the queue, and continue.
- This is fine as long as you don't meet the deferred request again.
- If you do meet the deferred request again, it may or may not be schedulable, depending on whether the remote finished processing the earlier request. This could lead to going around in circles (wasted effort). To avoid that we have to know when to stop, so we would have to keep a marker for the request which was already deferred. But that marker would be only for a single request, which corresponds to one remote. In the meantime other remotes could have finished processing their earlier requests and we could schedule requests to them, so we can no longer stop at the first marker for a single address; we would have to check the requests again.
This makes it more complicated than scheduling everything that's possible in a single shot and deferring whatever it encounters on the way. On the next run, we first try to clear any backlog from the previous run and then proceed normally.
Ah, makes sense.
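For readers following along, here is a condensed sketch of the dispatch flow discussed above, stitched together from the diff snippets quoted in this review. The helper names (send, isRemoteBlockFetchable, isRemoteAddressMaxedOut) and log messages come from the PR; the exact merged method body may differ:

```scala
  // Sketch only: dispatch loop in ShuffleBlockFetcherIterator, per this PR's design.
  private def fetchUpToMaxBytes(): Unit = {
    // First, try to clear the backlog of deferred fetch requests from previous runs.
    if (deferredFetchRequests.nonEmpty) {
      for ((remoteAddress, defReqQueue) <- deferredFetchRequests) {
        while (isRemoteBlockFetchable(defReqQueue) &&
            !isRemoteAddressMaxedOut(remoteAddress, defReqQueue.front)) {
          val request = defReqQueue.dequeue()
          logDebug(s"Processing deferred fetch request for $remoteAddress with "
            + s"${request.blocks.length} blocks")
          send(remoteAddress, request)
          if (defReqQueue.isEmpty) {
            deferredFetchRequests -= remoteAddress
          }
        }
      }
    }

    // Then process regular fetch requests, deferring any that would max out their address.
    while (isRemoteBlockFetchable(fetchRequests)) {
      val request = fetchRequests.dequeue()
      val remoteAddress = request.address
      if (isRemoteAddressMaxedOut(remoteAddress, request)) {
        logDebug(s"Deferring fetch request for $remoteAddress with ${request.blocks.size} blocks")
        val defReqQueue = deferredFetchRequests.getOrElse(remoteAddress, new Queue[FetchRequest]())
        defReqQueue.enqueue(request)
        deferredFetchRequests(remoteAddress) = defReqQueue
      } else {
        send(remoteAddress, request)
      }
    }
  }
```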
This LGTM. @dhruve could you rebase it with the master branch please?
@jiangxb1987 I have resolved the merge conflicts and reworded the config description to make it clearer.
Test build #79469 has finished for PR 18487 at commit
Jenkins, test this please
Test build #79477 has finished for PR 18487 at commit
cc @cloud-fan
Will this be covered by #18388? And another concern is how we expect users to tune this config. Can users just tune
@jiangxb1987 Thanks for the review. @cloud-fan #18388 will reject any open-block connections when the NM is under memory pressure. The changes proposed here try to limit the concurrency, but they require the user to understand how the setting affects fetch failures as well. If a NM is under severe load, chances are that your fetch attempts would be closed multiple times and you would need to bump up the # of fetch retries, or else your job could fail because of fetch failures. 'spark.reducer.maxReqsInFlight' can be used to control the overall number of requests being sent out, but all of them can still go out to a single host and max it out. If you reduce it, you lose out on throughput, as it would take more time to fetch the results.
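To make the tuning discussion concrete, here is a hypothetical way a user might combine the two knobs; the values are illustrative assumptions, not recommendations from this PR:

```scala
import org.apache.spark.SparkConf

// Illustrative values only: cap the blocks in flight per remote address (new in this PR)
// while still bounding the total number of outstanding requests with the existing config.
val conf = new SparkConf()
  .set("spark.reducer.maxBlocksInFlightPerAddress", "20")
  .set("spark.reducer.maxReqsInFlight", "256")
```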
| " from a given host port. When a large number of blocks are being requested from a given" + | ||
| " address in a single fetch or simultaneously, this could crash the serving executor or" + | ||
| " Node Manager. This is especially useful to reduce the load on the Node Manager when" + | ||
| "external shuffle is enabled. You can mitigate the issue by setting it to a lower value.") |
space before external
```scala
  private[this] val fetchRequests = new Queue[FetchRequest]

  /**
   * Queue of fetch requests which could not be issued the first time they were dequed. These
```
s/dequed/dequeued/
Test build #79614 has finished for PR 18487 at commit
@jinxing64 I performed a few runs to see if we were observing any performance issues with the change. I ran a simple word count job over a random set of text - 3TB. I couldn't get hundreds of executors hammering a single NM, however each NM ended up serving approximately a gig of shuffle data (this might be smaller in magnitude compared to the job that you are running). I ran it for different maxBlocks values - 10, 20 and 50 - keeping the number of executors roughly the same, and didn't notice any performance difference in terms of the running time of the job across the runs. For a proper value of this config, since the changes are on the reducer side, I would recommend running your job against the changes with a low value for a couple of runs, to see if you observe any performance hit with a large no. of connections hitting a node, as I couldn't reproduce it with the 3TB word count.
@jinxing64 We have the default set to Int.MaxValue so that by default there is no performance penalty for users. We have done some testing as Dhruve mentioned, but we don't regularly hit the issue. My plan was to set it to 20, just like MapReduce does, as a starting point. The exact value is deployment dependent though, as it depends on how much memory you give to your NMs, how many requests you expect in parallel, etc. We do plan on doing some more testing, but it probably won't be for a couple of days.
```diff
        }
-       if (curRequestSize >= targetRequestSize) {
+       if (curRequestSize >= targetRequestSize ||
+           curBlocks.size >= maxBlocksInFlightPerAddress) {
```
We may have a lot of adjacent fetch requests in the queue, shall we shuffle the request queue before fetching?
| " external shuffle is enabled. You can mitigate the issue by setting it to a lower value.") | ||
| .intConf | ||
| .checkValue(_ > 0, "The max no. of blocks in flight cannot be non-positive.") | ||
| .createWithDefault(Int.MaxValue) |
cc @tgravescs shall we change the default value to 20 or something?
I'm fine leaving it at Int.MaxValue for now so that we don't change current behavior, just like we have done with some of the other related configs. I would like to get more runtime on this in production and then we can set it later, perhaps in 2.3. It would be nice to pull this back into branch-2.2 as well as master.
retest this please
+1, pending Jenkins build. If there are no further comments, I'm going to commit this later today.
jiangxb1987
left a comment
LGTM
Test build #79760 has finished for PR 18487 at commit
The cherry-pick to 2.2 wasn't clean, so can you please put up a separate PR against branch-2.2?
Hm, is this a bug fix? If not, we shouldn't cherry-pick it.
@rxin It's kind of a stability fix (it makes the shuffle service more stable), so I'm OK to backport if the conflict is small.
| .doc("This configuration limits the number of remote blocks being fetched per reduce task" + | ||
| " from a given host port. When a large number of blocks are being requested from a given" + | ||
| " address in a single fetch or simultaneously, this could crash the serving executor or" + | ||
| " Node Manager. This is especially useful to reduce the load on the Node Manager when" + |
shall we say shuffle service instead of Node Manager?
If the shuffle service fails it can take down the Node Manager, which is more severe, and hence I have used it. And in the following sentence I have mentioned external shuffle. If it is not clear, I am okay with changing it.
I think Node Manager is for YARN only? Shuffle service is more general
```scala
      result match {
        case r @ SuccessFetchResult(blockId, address, size, buf, isNetworkReqDone) =>
          if (address != blockManager.blockManagerId) {
            numBlocksInFlightPerAddress(address) = numBlocksInFlightPerAddress(address) - 1
```
Can we do this earlier? E.g. right after the fetch result is enqueued to results.
That is a good point. In fact, we could also move the other bookkeeping right after the fetch result is enqueued.
I would also want to look at the initialization of the BlockFetchingListener to see the effects of this, as it would increase the size of the closure. Can we have a separate JIRA filed for this?
yea sure.
@cloud-fan filed a JIRA for this => https://issues.apache.org/jira/browse/SPARK-21500
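For what that follow-up might look like, here is a hypothetical sketch of moving the decrement into the fetch listener. This is not part of this PR (it is only the idea tracked in SPARK-21500), and the surrounding listener body, field names, and locking are assumptions based on the snippets in this thread:

```scala
      // Hypothetical sketch only (SPARK-21500 follow-up, not in this PR): decrement the
      // per-address counter as soon as the successful result is enqueued, instead of in next().
      override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
        results.put(SuccessFetchResult(BlockId(blockId), address, sizeMap(blockId), buf,
          remainingBlocks.isEmpty))
        // Assumed change: free the per-address slot earlier; this would need the iterator's
        // lock since the listener runs on a network thread.
        numBlocksInFlightPerAddress(address) = numBlocksInFlightPerAddress(address) - 1
      }
```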
| + s"${request.blocks.length} blocks") | ||
| send(remoteAddress, request) | ||
| if (defReqQueue.isEmpty) { | ||
| deferredFetchRequests -= remoteAddress |
We can leave the empty queue here, as we may still have fetch requests to put into this queue.
Then we would have to unnecessarily iterate through the map over all the block manager ids for which we deferred fetch requests at an earlier point, only to find that they have no pending fetch requests.
I see.
| logDebug(s"Deferring fetch request for $remoteAddress with ${request.blocks.size} blocks") | ||
| val defReqQueue = deferredFetchRequests.getOrElse(remoteAddress, new Queue[FetchRequest]()) | ||
| defReqQueue.enqueue(request) | ||
| deferredFetchRequests(remoteAddress) = defReqQueue |
The defReqQueue is mutable, so we don't need to do this.
If it is the first time that we defer a request, defReqQueue has to be associated with its corresponding remoteAddress.
You are right.
For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration `spark.reducer.maxBlocksInFlightPerAddress` to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios - when external shuffle is enabled as well as disabled.

Ran the job with the default configuration, which does not change the existing behavior, and ran it with a few lower values - 10, 20, 50, 100. The job ran fine and there is no change in the output. (I will update the metrics related to NM in some time.)

Author: Dhruve Ashar <[email protected]>

Closes apache#18487 from dhruve/impr/SPARK-21243.
@cloud-fan replied to your comments.
@tgravescs Thanks for merging this. I have created a PR for 2.2: #18691. While resolving a merge conflict, I had to remove a couple of newer config entries that had landed in the meantime.
For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration `spark.reducer.maxBlocksInFlightPerAddress` to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios - when external shuffle is enabled as well as disabled.

Ran the job with the default configuration, which does not change the existing behavior, and ran it with a few lower values - 10, 20, 50, 100. The job ran fine and there is no change in the output. (I will update the metrics related to NM in some time.)

Author: Dhruve Ashar <dhruveashargmail.com>

Closes #18487 from dhruve/impr/SPARK-21243.

Author: Dhruve Ashar <[email protected]>

Closes #18691 from dhruve/branch-2.2.
What changes were proposed in this pull request?

For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration `spark.reducer.maxBlocksInFlightPerAddress` to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios - when external shuffle is enabled as well as disabled.

How was this patch tested?

Ran the job with the default configuration, which does not change the existing behavior, and ran it with a few lower values - 10, 20, 50, 100. The job ran fine and there is no change in the output. (I will update the metrics related to NM in some time.)