[SPARK-26949][SS] Prevent 'purge' to remove needed batch files in CompactibleFileStreamLog #23850
…pactibleFileStreamLog
|
While I think the safest way is to let only CompactibleFileStreamLog maintain its logs, we have alternative options here:
Suppose the log has batches 0, 1, 2, 3, 4, of which batch 2 is the compaction batch. If 4 is given as the parameter, purge will try to remove 0, 1, 2, 3, and removing batch 2 (the latest compaction batch) would break the internal state. Instead, this method could be overridden to remove only batches 0 and 1 and silently skip removing 2 and 3.
This would throw an exception selectively, only when the purge can break the internal state of CompactibleFileStreamLog. Please let me know if an alternative would make more sense. Thanks in advance!
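To make the "remove only 0 and 1" alternative concrete, here is a minimal sketch. It is not Spark code: `SafePurgeSketch` and `purgeable` are hypothetical names, and the compaction rule `(batchId + 1) % compactInterval == 0` is assumed for illustration.

```scala
// Hypothetical helper, not part of Spark: picks the batch ids a "safe" purge may delete.
object SafePurgeSketch {
  // Assumption: batch b is a compaction batch when (b + 1) is a multiple of compactInterval.
  def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
    (batchId + 1) % compactInterval == 0

  // Returns the ids below thresholdBatchId that are also older than the latest compaction batch.
  def purgeable(existing: Seq[Long], thresholdBatchId: Long, compactInterval: Int): Seq[Long] = {
    val compactionBatches = existing.filter(isCompactionBatch(_, compactInterval))
    if (compactionBatches.isEmpty) {
      // Assumption: with no compaction batch yet, every batch file is still needed.
      Seq.empty
    } else {
      val latestCompaction = compactionBatches.max
      existing.filter(id => id < thresholdBatchId && id < latestCompaction)
    }
  }
}
```

With batches 0 to 4, a compact interval of 3 and a threshold of 4, the sketch selects only batches 0 and 1, which matches the example in the comment above.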
|
Test build #102575 has finished for PR 23850 at commit
|
|
It seems this PR is hitting the same Jenkins issue as mine and isn't starting a new build. |
|
As long as the functionality is not needed I'm fine with the As a potential user of this or any kind of function, I expect it to do what I want without having to think about the internals. Because of this, options 1 and 2 have side effects from my perspective. If it were really required, I think the compacted file could be deserialized => the batches removed => serialized again with the proper content (maybe you meant the same by |
|
Thanks @gaborgsomogyi for providing your opinion. Same here, and that's why I took this approach and left such approaches as alternatives.
What I meant by |
|
Btw, |
Yeah, that's true. Batch ID is not compacted into the file. |
|
Test build #102587 has finished for PR 23850 at commit
|
gaborgsomogyi
left a comment
LGTM.
|
cc. @zsxwing @brkyvz @jerryshao since they've authored parts of the file. |
|
Ping. |
|
Also cc-ing @tdas and @jose-torres since CompactibleFileStreamLog is only used for SS. |
|
Kind reminder. |
|
Ping again. |
|
Ping again, now that Spark+AI Summit 2019 in SF has ended. |
|
Where are we on this, @tdas and @jose-torres? |
|
Test build #106067 has finished for PR 23850 at commit
|
|
When this is revisited, please consider my other PRs as well: https://github.com/apache/spark/pulls/HeartSaVioR |
|
Retest this please. |
   * of given parameter, and let CompactibleFileStreamLog handles purging by itself.
   */
  override def purge(thresholdBatchId: Long): Unit = throw new UnsupportedOperationException(
    s"'purge' might break internal state of CompactibleFileStreamLog hence not supported")
nit: `s"` -> `"`? Also, `CompactibleFileStreamLog hence not supported` seems to need some revision.
Hi, @HeartSaVioR .
This looks like a mismatch between compactInterval and the parameter of purge.
I have a question. If CompactibleFileStreamLog calls purge only when isCompactionBatch returns true, does purge fail in that case?
|
Test build #106320 has finished for PR 23850 at commit
|
|
@dongjoon-hyun
Let me clarify the issue: the internal state breaks when the batches to purge include the latest compaction batch, because later batches refer to that compaction batch. I've described alternatives as well, so please take a look at the previous comment: #23850 (comment). Btw, even if we could purge batches earlier than the latest compaction batch, CompactibleFileStreamLog also does the clean up in |
|
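The dependency on the latest compaction batch can be sketched as follows. This is a self-contained illustration, not the Spark implementation: `ValidBatchesSketch` and `requiredBatches` are hypothetical names, and it assumes compaction batches are the ids where `(batchId + 1)` is a multiple of `compactInterval`.

```scala
// Hypothetical sketch: which batch files must exist to reconstruct the log at batchId.
object ValidBatchesSketch {
  def requiredBatches(batchId: Long, compactInterval: Int): List[Long] = {
    require(batchId >= 0 && compactInterval > 0)
    // Id of the latest compaction batch at or before batchId; -1 means none exists yet.
    val latestCompaction = (batchId + 1) / compactInterval * compactInterval - 1
    // Everything from the latest compaction batch (or batch 0) up to batchId is needed.
    (math.max(0L, latestCompaction) to batchId).toList
  }
}
```

Under these assumptions, reading the log at batch 4 with a compact interval of 3 needs batches 2, 3 and 4, so a purge that deletes batch 2 leaves batches 3 and 4 without the entries they rely on.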
Test build #106412 has finished for PR 23850 at commit
|
|
Test build #106422 has finished for PR 23850 at commit
|
This PR makes the abstract class CompactibleFileStreamLog more robust by preventing error situations. The derived classes like FileStreamSinkLog should override this method with a safe and correct implementation.
@tdas , @jose-torres , @cloud-fan , @gatorsmile , @HyukjinKwon . This PR looks reasonable to me. I'm supporting @HeartSaVioR 's suggestion. Please give us your comments (for better alternatives or any regressions)
If there are no more comments, I'll proceed to merge this improvement to master for Apache Spark 3.0.0 in a few days.
|
Merged to master. Thank you, @HeartSaVioR and @gaborgsomogyi . |
…pactibleFileStreamLog

## What changes were proposed in this pull request?

This patch proposes making `purge` in `CompactibleFileStreamLog` to throw `UnsupportedOperationException` to prevent purging necessary batch files, as well as adding javadoc to document its behavior. Actually it would only break when latest compaction batch is requested to be purged, but caller wouldn't be aware of this so safer to just prevent it.

## How was this patch tested?

Added UT.

Closes apache#23850 from HeartSaVioR/SPARK-26949.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
|
Thanks all for reviewing and merging! |
What changes were proposed in this pull request?
This patch proposes making `purge` in `CompactibleFileStreamLog` throw `UnsupportedOperationException` to prevent purging necessary batch files, as well as adding javadoc to document its behavior. Actually, it would only break when the latest compaction batch is requested to be purged, but the caller wouldn't be aware of this, so it is safer to just prevent it.
How was this patch tested?
Added UT.
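For readers who want to see the shape of such a check, here is a minimal sketch of asserting that `purge` is rejected. It does not reproduce the UT added in this PR: `MetadataLogSketch`, `CompactibleLogSketch` and `PurgeCheck` are hypothetical stand-ins that only mirror the overridden method shown in the diff.

```scala
import scala.util.{Failure, Try}

// Hypothetical stand-in for the base log interface; only purge is modelled.
abstract class MetadataLogSketch {
  def purge(thresholdBatchId: Long): Unit
}

// Hypothetical stand-in mirroring the override from the diff: purge is always rejected.
class CompactibleLogSketch extends MetadataLogSketch {
  override def purge(thresholdBatchId: Long): Unit =
    throw new UnsupportedOperationException(
      "'purge' might break internal state of CompactibleFileStreamLog hence not supported")
}

object PurgeCheck {
  // True only when calling purge fails with UnsupportedOperationException.
  def purgeIsRejected(log: MetadataLogSketch): Boolean =
    Try(log.purge(2L)) match {
      case Failure(_: UnsupportedOperationException) => true
      case _ => false
    }
}
```

In a real test suite the same check would more likely use the test framework's exception assertion; the `Try` match is used here only to keep the sketch dependency-free.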