[SPARK-20426] Lazy initialization of FileSegmentManagedBuffer for shuffle service. #17744
Spark jobs run on a YARN cluster in my warehouse. We enabled the external shuffle service (--conf spark.shuffle.service.enabled=true). Recently the NodeManager has been running out of memory now and then. Dumping the heap, we found that OneForOneStreamManager's footprint is huge: the NodeManager is configured with a 5G heap, yet OneForOneStreamManager takes 2.5G and holds 5503233 FileSegmentManagedBuffer objects.
Test build #76105 has finished for PR 17744 at commit
Test build #76131 has finished for PR 17744 at commit
Test build #76132 has finished for PR 17744 at commit
    @Override
    public ManagedBuffer next() {
      final ManagedBuffer block = blockManager.getBlockData(msg.appId, msg.execId,
        msg.blockIds[index]);
I need to look more to verify, but I don't think you can hang onto the msg here without duplicating it. TransportRequestHandler.processRpcRequest is going to release the request so I think it could get reused. @rxin can perhaps verify.
@tgravescs
Thanks a lot for taking the time to look into this :)
In my understanding, the OpenBlocks message is kept on the heap after it is decoded (https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/OpenBlocks.java#L84).
Yes, TransportRequestHandler.processRpcRequest will release the ByteBuf, but the decoded OpenBlocks object will not be released.
You are right; I missed that when I originally did a quick skim of this.
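The point settled in this thread can be illustrated with a small sketch. It is not Spark or Netty code: it uses `java.nio.ByteBuffer` in place of Netty's `ByteBuf`, and `OpenBlocksLike`, `encode`, `decode`, and `simulateReuse` are hypothetical names invented for the illustration. It assumes, as described above, that decoding copies the fields into ordinary heap objects, so the decoded message stays valid after the network buffer is released or reused.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodedMessageSketch {
    /** Hypothetical stand-in for a decoded OpenBlocks message: plain heap fields only. */
    static final class OpenBlocksLike {
        final String appId;
        final String execId;
        final String[] blockIds;

        OpenBlocksLike(String appId, String execId, String[] blockIds) {
            this.appId = appId;
            this.execId = execId;
            this.blockIds = blockIds;
        }
    }

    static void writeString(ByteBuffer buf, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        buf.putInt(bytes.length);
        buf.put(bytes);
    }

    static String readString(ByteBuffer buf) {
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);                       // copies the bytes out of the buffer
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Serializes the message into a heap buffer (assumed to fit in 1024 bytes). */
    static ByteBuffer encode(String appId, String execId, String[] blockIds) {
        ByteBuffer buf = ByteBuffer.allocate(1024);
        writeString(buf, appId);
        writeString(buf, execId);
        buf.putInt(blockIds.length);
        for (String id : blockIds) {
            writeString(buf, id);
        }
        buf.flip();
        return buf;
    }

    /** Decodes into fresh heap objects that do not reference the buffer. */
    static OpenBlocksLike decode(ByteBuffer buf) {
        String appId = readString(buf);
        String execId = readString(buf);
        String[] blockIds = new String[buf.getInt()];
        for (int i = 0; i < blockIds.length; i++) {
            blockIds[i] = readString(buf);
        }
        return new OpenBlocksLike(appId, execId, blockIds);
    }

    /** Simulates the buffer being released and reused: its contents are wiped. */
    static void simulateReuse(ByteBuffer buf) {
        Arrays.fill(buf.array(), (byte) 0);
        buf.clear();
    }

    public static void main(String[] args) {
        ByteBuffer wire = encode("app-1", "exec-7", new String[] {"shuffle_0_0_0", "shuffle_0_1_0"});
        OpenBlocksLike msg = decode(wire);
        simulateReuse(wire);
        // The decoded message is still readable after the buffer was wiped.
        System.out.println(msg.appId + " " + msg.execId + " " + String.join(",", msg.blockIds));
    }
}
```

Under that assumption, an iterator can keep a reference to the decoded message and index into its block IDs later without copying it first.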
We see the same issue on some of our clusters. I was planning on doing two things: something like this to reduce the memory usage, and, on the other side, changing the shuffle fetcher to limit the number of blocks it fetches in one try. The second is a separate JIRA, though; it could have the effect of slowing down the shuffle, and it might not be as effective if a ton of different reducers are fetching, but it is still a good option to have. I think this approach is good for now. I ran a bunch of manual tests on the cluster and memory usage has greatly improved. +1. Thanks @jinxing64
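The second idea above, capping how many blocks a fetcher asks for in one request, is not part of this PR. A minimal sketch of what such a cap could look like on the client side follows; `splitIntoBatches` and `maxBlocksPerRequest` are hypothetical names, not existing Spark APIs or configuration keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlockBatchSketch {
    /**
     * Splits the requested block IDs into consecutive batches of at most
     * maxBlocksPerRequest entries, preserving order. Hypothetical helper.
     */
    static List<String[]> splitIntoBatches(String[] blockIds, int maxBlocksPerRequest) {
        if (maxBlocksPerRequest <= 0) {
            throw new IllegalArgumentException("maxBlocksPerRequest must be positive");
        }
        List<String[]> batches = new ArrayList<>();
        for (int start = 0; start < blockIds.length; start += maxBlocksPerRequest) {
            int end = Math.min(start + maxBlocksPerRequest, blockIds.length);
            batches.add(Arrays.copyOfRange(blockIds, start, end));
        }
        return batches;
    }

    public static void main(String[] args) {
        String[] ids = {"b0", "b1", "b2", "b3", "b4"};
        for (String[] batch : splitIntoBatches(ids, 2)) {
            System.out.println(Arrays.toString(batch));
        }
    }
}
```

Each batch would then become its own open-blocks request, so the server holds metadata for fewer blocks per stream at a time, at the cost of more round trips.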
…ffle service.

## What changes were proposed in this pull request?

When application contains large amount of shuffle blocks. NodeManager requires lots of memory to keep metadata(`FileSegmentManagedBuffer`) in `StreamManager`. When the number of shuffle blocks is big enough. NodeManager can run OOM. This pr proposes to do lazy initialization of `FileSegmentManagedBuffer` in shuffle service.

## How was this patch tested?

Manually test.

Author: jinxing <[email protected]>

Closes #17744 from jinxing64/SPARK-20426.

(cherry picked from commit 85c6ce6)
Signed-off-by: Tom Graves <[email protected]>
@tgravescs
Yes, conceptually it could be removed, but as you say, that is a bigger change. Are you still seeing memory issues after this change?
Thanks again for helping to review this PR. Currently I'm not seeing memory issues on my NodeManagers. I'll report to the community if there are any new findings :)
What changes were proposed in this pull request?
When an application contains a large number of shuffle blocks, the NodeManager requires a lot of memory to keep the metadata (`FileSegmentManagedBuffer`) in `StreamManager`. When the number of shuffle blocks is large enough, the NodeManager can run out of memory. This PR proposes lazy initialization of `FileSegmentManagedBuffer` in the shuffle service.
How was this patch tested?
Manually tested.
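The idea described above can be sketched in a few lines. This is a simplified illustration, not the patch itself: `SegmentBuffer` stands in for `FileSegmentManagedBuffer`, the `getBlockData` here is a hypothetical lookup, and the creation counter exists only to make the laziness observable. Instead of building one buffer object per block when the stream is registered, the iterator builds each one when `next()` is called.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicInteger;

class LazyBufferSketch {
    /** Hypothetical stand-in for FileSegmentManagedBuffer (per-block metadata). */
    static final class SegmentBuffer {
        static final AtomicInteger CREATED = new AtomicInteger();
        final String blockId;

        SegmentBuffer(String blockId) {
            this.blockId = blockId;
            CREATED.incrementAndGet();   // count instances to show when they are built
        }
    }

    /** Hypothetical stand-in for the block manager lookup. */
    static SegmentBuffer getBlockData(String appId, String execId, String blockId) {
        return new SegmentBuffer(blockId);
    }

    /** Returns an iterator that creates each buffer only when it is requested. */
    static Iterator<SegmentBuffer> lazyBuffers(String appId, String execId, String[] blockIds) {
        return new Iterator<SegmentBuffer>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < blockIds.length;
            }

            @Override
            public SegmentBuffer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // The buffer object is allocated here, not at registration time.
                return getBlockData(appId, execId, blockIds[index++]);
            }
        };
    }

    public static void main(String[] args) {
        Iterator<SegmentBuffer> it = lazyBuffers("app-1", "exec-7", new String[] {"b0", "b1", "b2"});
        System.out.println("created before iteration: " + SegmentBuffer.CREATED.get());
        it.next();
        System.out.println("created after one next(): " + SegmentBuffer.CREATED.get());
    }
}
```

With this shape, what stays resident per registered stream is the array of block IDs plus an index, and a buffer object exists only from the moment its chunk is actually requested.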