[SPARK-5147][Streaming] Delete the received data WAL log periodically #4037
|
Test build #25517 has started for PR 4037 at commit
|
|
Test build #25517 has finished for PR 4037 at commit
|
|
Test FAILed. |
|
Jenkins, retest this please. |
|
Test build #25531 has started for PR 4037 at commit
|
|
Test build #25531 has finished for PR 4037 at commit
|
|
Test PASSed. |
|
I'll just link to my comment on the JIRA for this PR. Edit: Not tested, but the code looks good to me. |
Since each such attempt can delete multiple batches (any batches older than the threshold time), it's better to name this DeleteOldBatches.
|
I took your PR and augmented it to submit a new PR, #4149. It fixes a very subtle bug in this PR and adds a unit test. |
This is a refactored fix based on jerryshao's PR #4037. This enabled deletion of old WAL files containing the received block data.

Improvements over #4037:
- Respecting the rememberDuration of all receiver streams. In #4037, if there were two receiver streams with different remember durations, the deletion would have been based on the shortest remember duration, thus deleting data prematurely for the receiver stream with the longer remember duration.
- Added unit test to test creation of receiver WAL, automatic deletion, and respecting of remember duration.

jerryshao I am going to merge this ASAP to make it 1.2.1. Thanks for the initial draft of this PR. Made my job much easier.

Author: Tathagata Das <[email protected]>
Author: jerryshao <[email protected]>

Closes #4149 from tdas/SPARK-5147 and squashes the following commits:

730798b [Tathagata Das] Added comments.
c4cf067 [Tathagata Das] Minor fixes
2579b27 [Tathagata Das] Refactored the fix to make sure that the cleanup respects the remember duration of all the receiver streams
2736fd1 [jerryshao] Delete the old WAL log periodically

(cherry picked from commit 3027f06)
Signed-off-by: Tathagata Das <[email protected]>
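The premature-deletion problem described in the commit message can be illustrated with a small sketch. This is not Spark's code: the function names, stream names, and millisecond values below are hypothetical and chosen only for illustration. It assumes the cleanup threshold is the current batch time minus a remember duration, and shows why that duration should be the longest one across all receiver streams.

```python
from typing import Dict


def cleanup_threshold(batch_time_ms: int, remember_ms: Dict[str, int]) -> int:
    """Return the time before which WAL data should be safe to delete.

    Uses the LONGEST remember duration, so no receiver stream loses
    data it may still need. (Hypothetical helper, not a Spark API.)
    """
    if not remember_ms:
        raise ValueError("at least one receiver stream is required")
    return batch_time_ms - max(remember_ms.values())


def premature_threshold(batch_time_ms: int, remember_ms: Dict[str, int]) -> int:
    """The problematic variant: uses the SHORTEST remember duration."""
    return batch_time_ms - min(remember_ms.values())


# Two hypothetical receiver streams with different remember durations.
streams = {"stream_a": 10_000, "stream_b": 60_000}
now_ms = 100_000

safe = cleanup_threshold(now_ms, streams)
unsafe = premature_threshold(now_ms, streams)

# Data for stream_b written between `safe` and `unsafe` would be
# deleted too early if the shortest duration were used.
print(safe, unsafe)
```

With the shortest duration, anything older than 10 seconds is deleted even though stream_b still needs the last 60 seconds of data.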
|
Mind closing this PR? |
|
OK. |
Currently, the received-data WAL is not deleted after the timeout, so the data accumulates in HDFS. This PR adds an Akka message that notifies the ReceiverSupervisorImpl to clean up the files accordingly.
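As a rough illustration of the message-driven cleanup described above, here is a minimal sketch in Python (not the PR's Scala code). The message class, the `LogSegment` record, and the supervisor class are all hypothetical stand-ins; the sketch assumes each WAL file covers a time range and is deleted once its end time falls before the threshold carried by the message. File deletion is simulated by recording the path.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CleanupOldLogs:
    """Hypothetical message asking the supervisor to drop old WAL files."""
    thresh_time_ms: int


@dataclass(frozen=True)
class LogSegment:
    """Hypothetical record of one WAL file and the time range it covers."""
    path: str
    start_ms: int
    end_ms: int


class SupervisorSketch:
    """Stand-in for a receiver supervisor that reacts to cleanup messages."""

    def __init__(self, segments: Iterable[LogSegment]) -> None:
        self.segments: List[LogSegment] = list(segments)
        self.deleted: List[str] = []

    def receive(self, message: object) -> None:
        # Only the cleanup message is handled; anything else is ignored.
        if isinstance(message, CleanupOldLogs):
            self._clean_old_logs(message.thresh_time_ms)

    def _clean_old_logs(self, thresh_time_ms: int) -> None:
        keep: List[LogSegment] = []
        for seg in self.segments:
            if seg.end_ms < thresh_time_ms:
                # A real implementation would delete the file from
                # storage here; this sketch only records the path.
                self.deleted.append(seg.path)
            else:
                keep.append(seg)
        self.segments = keep
```

A single message can remove several files at once, which matches the review remark that each cleanup attempt may delete multiple batches.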