Skip to content

Conversation

@jinxing64
Copy link

What changes were proposed in this pull request?

When application contains large amount of shuffle blocks. NodeManager requires lots of memory to keep metadata(FileSegmentManagedBuffer) in StreamManager. When the number of shuffle blocks is big enough. NodeManager can run OOM. This pr proposes to do lazy initialization of FileSegmentManagedBuffer in shuffle service.

How was this patch tested?

Manually test.

@jinxing64
Copy link
Author

Spark jobs are running on yarn cluster in my warehouse. We enabled the external shuffle service(--conf spark.shuffle.service.enabled=true). Recently NodeManager runs OOM now and then. Dumping heap memory, we find that OneFroOneStreamManager's footprint is huge. NodeManager is configured with 5G heap memory. While OneForOneManager costs 2.5G and there are 5503233 FileSegmentManagedBuffer objects.

@SparkQA
Copy link

SparkQA commented Apr 24, 2017

Test build #76105 has finished for PR 17744 at commit 5fb91bf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 25, 2017

Test build #76131 has finished for PR 17744 at commit e38f0e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 25, 2017

Test build #76132 has finished for PR 17744 at commit a0a6594.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Override
public ManagedBuffer next() {
final ManagedBuffer block = blockManager.getBlockData(msg.appId, msg.execId,
msg.blockIds[index]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to look more to verify, but I don't think you can hang onto the msg here without duplicating it. TransportRequestHandler.processRpcRequest is going to release the request so I think it could get reused. @rxin can perhaps verify.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tgravescs
Thanks a lot for taking time looking into this :)
In my understanding, the OpenBlocks will be kept in heap after initialization(https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java#L84).
Yes, TransportRequestHandler.processRpcRequest will release the ByteBuf, but the OpenBlocks will not be released.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right I missed that when I did a quick skim originally of this.

@tgravescs
Copy link
Contributor

We see the same issue on some of our clusters. I was planning on doing 2 things. Something like this to reduce that memory usage and then on the other side you could change the shuffle fetcher to limit the # of blocks it fetches in one try. The second is a separate jira though and could have the affect of slowing down the shuffle and could also not be as effective if its a ton of different reducers fetching but still a good option to have.

I think this approach is good for now. I ran a bunch of manual tests on the cluster and memory has greatly improved.

+1. Thanks @jinxing64

@asfgit asfgit closed this in 85c6ce6 Apr 27, 2017
asfgit pushed a commit that referenced this pull request Apr 27, 2017
…ffle service.

## What changes were proposed in this pull request?
When application contains large amount of shuffle blocks. NodeManager requires lots of memory to keep metadata(`FileSegmentManagedBuffer`) in `StreamManager`. When the number of shuffle blocks is big enough. NodeManager can run OOM. This pr proposes to do lazy initialization of `FileSegmentManagedBuffer` in shuffle service.

## How was this patch tested?

Manually test.

Author: jinxing <[email protected]>

Closes #17744 from jinxing64/SPARK-20426.

(cherry picked from commit 85c6ce6)
Signed-off-by: Tom Graves <[email protected]>
@jinxing64
Copy link
Author

@tgravescs
Thanks a lot for merging.
I proposed to resolve this by "Lazy initialization of FileSegmentManagedBuffer" and simplify the change. But after checking the code, could we remove OpenBlocks. In my understanding, OpenBlocks is just a handshake before fetching remote blocks and could be removed conceptually(did I miss something?). This is just a brainstorming and will be a bigger change.

@tgravescs
Copy link
Contributor

Yes conceptually it could be removed but as you say is a bigger change. Are you still seeing memory issues after this change?

@jinxing64
Copy link
Author

Thanks again for help review this pr. Currently I'm not seeing memory issue on my nodemanagers. I'd report to community if there's new finding :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants