[SPARK-11419][STREAMING] Parallel recovery for FileBasedWriteAheadLog + minor recovery tweaks#9373
brkyvz wants to merge 21 commits into apache:master from brkyvz:par-recovery
Conversation
Test build #44665 has finished for PR 9373 at commit
NPE thrown when streaming context is stopped before recovery is complete
Test build #44666 has finished for PR 9373 at commit
@harishreedharan Here are some benchmark results:
Did you try HDFS? I am assuming we'd get similar speed ups there too, but in …

What I am wondering is if we'd actually ever have to deal with that many …

If this adds only a small cost or if it becomes faster, then let's keep …

On Sunday, November 1, 2015, Burak Yavuz <notifications@github.com>

Thanks,
Again, this should not use the default execution context. Please create a dedicated execution context for this.
The execution context was defined implicitly in the class definition. I made it non-implicit for better readability.
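The suggestion above can be sketched as follows. This is an illustration, not the PR's actual code; the pool size and all names (`walThreadPool`, `walExecutionContext`) are assumptions:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Back the WAL work with a dedicated fixed-size pool instead of the global
// ExecutionContext, so blocking file I/O cannot starve unrelated tasks.
val walThreadPool = Executors.newFixedThreadPool(8)
val walExecutionContext: ExecutionContext =
  ExecutionContext.fromExecutorService(walThreadPool)

// Pass the context explicitly (non-implicit) for readability:
val answer = Await.result(Future(21 * 2)(walExecutionContext), 10.seconds)
walThreadPool.shutdown()
```

Keeping the context non-implicit makes it obvious at each call site which pool a `Future` runs on.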
@harishreedharan I've been trying to test this patch, but I just couldn't set up HDFS to work with Spark using the spark-ec2 scripts. Could you please help me set up a cluster with HDFS so that I can benchmark this? That looks like a Protobuf version incompatibility. I launched the ec2 instances using: … I used to get the following when using …
@brkyvz I think there have been issues with Hadoop 2 related stuff in the master branch. Let's talk offline about how to fix it.
@harishreedharan I couldn't test this on HDFS properly. Instead I enabled the parallelization only when …
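Conditionally parallel recovery can be sketched like this. All names here (`closeFileAfterWrite`, `readSegment`, `recoverSegments`) are stand-ins for the real WAL pieces, not Spark's actual `FileBasedWriteAheadLog` code:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Sketch only: when files were closed after every write (many small files,
// the S3 case), read them in parallel; otherwise read them sequentially.
def recoverSegments[T](files: Seq[String],
                       closeFileAfterWrite: Boolean,
                       readSegment: String => Seq[T])
                      (implicit ec: ExecutionContext): Seq[T] = {
  if (closeFileAfterWrite) {
    val reads = files.map(f => Future(readSegment(f)))
    // Future.sequence preserves input order, so recovery order is stable.
    Await.result(Future.sequence(reads), Duration.Inf).flatten
  } else {
    files.flatMap(readSegment)
  }
}
```

Gating the parallel path on the flag keeps the HDFS case (few large files) on the cheap sequential path.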
@brkyvz Could you update this PR against master? The batching PR got merged, creating conflicts.
@brkyvz Sounds good, sir. I think the issue you saw is a Protobuf incompatibility; did you compile and run against the same hadoop-2 version (2.2+)?
Test build #45457 has finished for PR 9373 at commit
Test build #45467 has finished for PR 9373 at commit
Test build #45456 has finished for PR 9373 at commit
test this please
Test build #45650 has finished for PR 9373 at commit
Test build #45648 has finished for PR 9373 at commit
Can you make this 1000 instead of 8 * 8? Just to make sure that we are splitting things right.
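The point of the suggestion, as I read it, is that far more items than threads will surface any splitting bug. A hedged sketch of such a check (pool size and task body are illustrative, not the PR's test code):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// 1000 items over 8 threads: any off-by-one in how the work is split would
// drop or duplicate elements, which the equality check below would catch.
val pool = Executors.newFixedThreadPool(8)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

val out = Await.result(
  Future.sequence((1 to 1000).map(i => Future(i * i))), Duration.Inf)
assert(out == (1 to 1000).map(i => i * i))
pool.shutdown()
```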
Test build #45668 has finished for PR 9373 at commit
Could you just change toSeq to toArray? toArray will drain the Iterator at once.
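For context on this comment (a general Scala point, not this PR's exact code): on the Scala versions Spark supported at the time, `Iterator#toSeq` builds a lazy `Stream`, so the iterator is not fully consumed up front, while `toArray` drains it immediately:

```scala
// Count how many elements the iterator actually produces.
var produced = 0
def freshIter: Iterator[Int] = Iterator.tabulate(5) { i => produced += 1; i }

produced = 0
val lazySeq = freshIter.toSeq // on Scala 2.11/2.12 this is a Stream:
val afterToSeq = produced     // only the head has been forced so far

produced = 0
val eager = freshIter.toArray // the whole iterator is drained at once
assert(produced == 5)
```

Draining eagerly matters when the iterator holds an open file handle that should be released as soon as possible.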
nit: why rename this to walInfo?
`logInfo` is Spark's logging method.
Test build #45700 has finished for PR 9373 at commit
Test build #45712 has finished for PR 9373 at commit
Test build #45747 has finished for PR 9373 at commit
Test build #45781 has finished for PR 9373 at commit
LGTM. Merging this to master and 1.6. Thanks @brkyvz, @zsxwing and @harishreedharan
[SPARK-11419][STREAMING] Parallel recovery for FileBasedWriteAheadLog + minor recovery tweaks

The support for closing WriteAheadLog files after writes was just merged in. Closing every file after a write is a very expensive operation as it creates many small files on S3. It's not necessary to enable it on HDFS anyway.

However, when you have many small files on S3, recovery takes very long. In addition, files start stacking up pretty quickly, and deletes may not be able to keep up, so deletes can also be parallelized.

This PR adds support for the two parallelization steps mentioned above, in addition to fixing a couple more failures I encountered during recovery.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9373 from brkyvz/par-recovery.

(cherry picked from commit 7786f9c)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
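The "deletes can also be parallelized" idea from the commit message can be sketched as below. The pool, `deleteFile`, and `cleanupOldLogs` are illustrative assumptions, not the PR's actual code:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Issue all deletes concurrently so that slow per-file deletes (e.g. on S3)
// cannot fall behind the rate at which new log files are written.
def cleanupOldLogs(oldFiles: Seq[String], deleteFile: String => Unit)
                  (implicit ec: ExecutionContext): Unit = {
  val deletions = oldFiles.map(f => Future(deleteFile(f)))
  Await.result(Future.sequence(deletions), Duration.Inf)
}
```

With sequential deletes, total cleanup time grows linearly with the number of small files; issuing them concurrently bounds it roughly by the slowest delete per batch.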
