Fix data loss when multiple ledger directories are configured #3329
Conversation
@hangc0276 I found that when SyncThread does a flush, there exists a duplicate SyncThread flush:
eolivelli left a comment:
I think that you are on your way. Good work! I left some minor feedback.
Waiting for tests and for more reviews.
This change should be cherry-picked to all active branches; it is a serious bug.
I wonder if it would be better to have a constant for Long.MAX_VALUE
A change should also be made in Journal.readLog, where curMark should be set to the minimum mark across the ledger directories.
We should make a public constant here, otherwise magic numbers are easily forgotten.
Also, I may understand why we are changing from 0 to MAX_VALUE, but... what is the impact?
When creating a Journal instance, it initializes lastLogMark by reading each ledger directory's lastMark file and taking the minimum position as the replay start point. So we should initialize lastLogMark with MAX_VALUE.
The original logic initialized lastLogMark with 0 and took the maximum position across all the ledger directories' lastMark files as lastLogMark. IMO, that can lose data.
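For illustration, here is a minimal sketch of that min-fold (the Mark record and replayStart helper are assumptions for this example, not the actual Journal code): seeding the accumulator with Long.MAX_VALUE is what makes a minimum well defined, whereas a 0 seed only works for the old maximum-based logic.

```java
import java.util.List;

final class LastMarkExample {
    /** A (logId, offset) pair, mirroring what a lastMark file stores. */
    record Mark(long logId, long offset) {}

    /**
     * Sketch of the replay-start computation: seed with Long.MAX_VALUE and
     * fold with min, so the journal is replayed from the smallest persisted
     * mark. Seeding with 0 reproduces the old max-based behaviour, which can
     * skip data that was never flushed.
     */
    static Mark replayStart(List<Mark> marksFromEachLedgerDir) {
        Mark min = new Mark(Long.MAX_VALUE, Long.MAX_VALUE);
        for (Mark m : marksFromEachLedgerDir) {
            if (m.logId() < min.logId()
                    || (m.logId() == min.logId() && m.offset() < min.offset())) {
                min = m;
            }
        }
        return min;
    }
}
```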
Maybe @merlimat has more context about initializing lastLogMark to 0.
So null means "all"...
We should document this in the javadocs.
What about adding a comment here?
null means that the checkpoint is not started by a specific LedgerDirsManager.
I have added the javadoc on the checkpointComplete interface method.
We should add javadocs here and explain why we have this ledgerDirsManager and when it may be null.
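As an illustration only, such a javadoc could look like the sketch below; the parameter list and nested types are guesses for this example, not the actual CheckpointSource interface from this PR.

```java
import java.io.IOException;

// Illustrative sketch of the requested documentation, not the PR's real code.
interface CheckpointSourceSketch {
    /** Opaque marker for a journal position. */
    interface Checkpoint {}

    /** Stand-in for the manager of one set of ledger directories. */
    interface LedgerDirsManager {}

    /**
     * Marks {@code checkpoint} as complete.
     *
     * @param checkpoint        the checkpoint that has finished
     * @param compact           whether older journal files may be removed
     * @param ledgerDirsManager the manager that initiated the checkpoint;
     *                          {@code null} means the checkpoint was not
     *                          started by a specific LedgerDirsManager and
     *                          applies to all ledger directories
     */
    void checkpointComplete(Checkpoint checkpoint, boolean compact,
                            LedgerDirsManager ledgerDirsManager) throws IOException;
}
```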
nicoloboschi left a comment:
It's better to add a proper test case to avoid future regressions.
I also think the checkpoint is duplicated here.
A change should also be made in Journal.readLog, where curMark should be set to the minimum mark across the ledger directories.
+1, changes should also be made at ...
@nicoloboschi I have added the test to cover this change, please take a look, thanks.
@merlimat @eolivelli @nicoloboschi @gaozhangmin @lordcheng10 I have updated the code and added the test to cover this change, please take a look, thanks.
rerun failure checks
@hangc0276 This problem is not going to be resolved here, right?
@gaozhangmin Yes, we can use another PR to solve it.
rerun failure checks
I submitted PR #3353 to solve this issue. @hangc0276 PTAL
Since we are now replaying the journal from the smallest logMark, can we pass the journal log mark position of the current entry to the JournalScanner and compare it with the checkpoint position on the ledger disk where the entry to be restored is located, and only restore entries whose logMark position is greater than the checkpoint? That would avoid repeatedly writing data that has already been flushed to disk.
if (!isPaddingRecord) {
    scanner.process(journalVersion, offset, recBuff, journalId, recLog.fc.position());
}
It will introduce complex logic for this comparison.
- If the ledger directory expands or shrinks, the mapping from ledgerId to ledger directory (and thus to its logMark) also changes. An entry may be matched against the wrong logMark file, which will lead to skipping unflushed entries.
- There are many kinds of storage implementations, such as DbLedgerStorage, SortedLedgerStorage, and InterleavedLedgerStorage. We would have to get the storage instance for each ledgerId to check the logMark position per implementation, which introduces complex logic.
- The comparison only saves ledger write throughput; we still need to read the data out of the journal log file.

Based on the above reasons, I prefer to replay all entries in the journal log file from the min logMark position.
There are two places that can trigger a checkpoint. If we remove the flush-triggered one, only the scheduled task will update the lastMark position. WDYT @hangc0276
Force-pushed from fbf5c77 to e2f673e.
@aloyszhang Thanks for your suggestion. Yes, making the ...
This may throw exceptions when shutting down the SyncThread, which will call checkpoint on the ledgerStorage, since we have already shut down the ledgerStorage before.
Yes, you are right. I updated the code, please take a look, thanks.
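For context, a minimal sketch of the ordering concern (the class and method names are illustrative assumptions, not the actual Bookie shutdown code): the SyncThread must be stopped, and its final checkpoint run, before the ledger storage is shut down, otherwise the checkpoint hits an already-closed storage and throws.

```java
// Illustrative shutdown ordering only, not the actual Bookie code.
final class ShutdownOrderSketch {
    interface LedgerStorage {
        void checkpoint();   // flush caches and persist lastMark
        void shutdown();
    }

    static void shutdown(LedgerStorage storage, Thread syncThread) throws InterruptedException {
        // 1. Stop the SyncThread first; its last checkpoint still needs a live storage.
        syncThread.interrupt();
        syncThread.join();
        // 2. Only then shut down the ledger storage.
        storage.shutdown();
        // Reversing these two steps would make the SyncThread's final checkpoint
        // run against a storage that is already closed and throw exceptions.
    }
}
```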
@eolivelli @merlimat @dlg99 @zymap I updated the code and need your eyes on this PR, thanks.
Checkpoint cp = checkpointSource.newCheckpoint();
checkpoint(cp);
checkpointSource.checkpointComplete(cp, true);
if (singleLedgerDirs) {
Please add a small comment with a quick description of the motivation for this condition
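For illustration, the requested comment could read roughly like the sketch below; it reuses the identifiers from the snippet above and is a guess at the motivation, not necessarily the PR's final wording.

```java
if (singleLedgerDirs) {
    // With a single ledger directory, a flush-triggered checkpoint is safe:
    // the lastMark it persists only describes that directory's flushed data.
    // With multiple ledger directories, advancing lastMark here could mark
    // data as durable that another directory still holds only in its write
    // cache, so in that case only the SyncThread's periodic checkpoint
    // updates lastMark.
    Checkpoint cp = checkpointSource.newCheckpoint();
    checkpoint(cp);
    checkpointSource.checkpointComplete(cp, true);
}
```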
(cherry picked from commit 8a76703)

Motivation
We found one place where the bookie may lose data even though we turn on fsync for the journal.
Condition:
Assume we write 100MB of data into one bookie: 70MB is written into ledger1's write cache and 30MB into ledger2's write cache. Ledger1's write cache is full and triggers a flush. While flushing the write cache, it triggers a checkpoint that marks the journal's lastMark position (the 100MB offset) and writes that lastMark position into both ledger1's and ledger2's lastMark files.
At this point the bookie shuts down without flushing the write caches, for example via a kill -9 command, so ledger2's write cache (30MB) is not flushed to the ledger disk. But ledger2's lastMark position, persisted in its lastMark file, has already been updated to the 100MB offset.
When the bookie starts up, the journal replay position will be min(ledger1's lastMark, ledger2's lastMark), which is the 100MB offset. Ledger2's 30MB of data won't be replayed, and that data is lost.
Discussion thread:
https://lists.apache.org/thread/zz5vvv2yd80vqy22fv8wg5s2lqtkrzh9
Solutions
The root cause of this bug is that EntryLogger1 triggers a checkpoint when its write cache is full, updating both EntryLogger1's and EntryLogger2's lastMark positions. However, EntryLogger2's data may still be in its write cache, which can lead to data loss when the bookie is shut down with kill -9.
There are two solutions for this bug.

Solution 1: Update the lastMark position individually.
- When EntryLogger1's write cache is full and triggers a flush, it only updates its own lastMark position instead of also updating EntryLogger2's lastMark position.
- When expiring old journal files, the journal should take the minimum lastMark position among all the writeable EntryLoggers and delete only the journal files below that smallest lastMark position.
- When the bookie starts up, it should take the minimum lastMark position and replay the journal files from that position; otherwise, we will lose data.

However, there is one case that is hard to handle in the journal replay stage. When one ledger disk transfers from ReadOnly mode to Writeable mode, its lastMark position is an old value. Using the old position to replay the journal files will lead to a "target journal file not found" exception.

Solution 2: Only update the lastMark position in SyncThread.
There are two places that can trigger a checkpoint:
- the SyncThread's scheduled checkpoint, and
- the checkpoint triggered when a write cache is full and flushes.

The second way is the root cause of data loss if the ledger is configured with multiple directories. We can turn off the second way's lastMark update and only let SyncThread update the lastMark position in a checkpoint when the ledger is configured with multiple directories.
This is the simplest way to fix this bug, but it has two drawbacks:
- Journal file expiry depends on lastMark position updates, which in turn depend on the SyncThread's checkpoint interval. In Pulsar, the default interval is 60s, so journal files expire with at least a 60s delay.
- When the bookie starts up, it replays the journal from the lastMark position, which means it will replay at least 60s of journal data before start-up completes. This may slow down bookie start-up.

IMO, compared to data loss, the above two drawbacks are acceptable.
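To make the two trigger paths concrete, here is a minimal sketch (the class, field, and method names are illustrative assumptions, not BookKeeper's actual SyncThread or ledger-storage code): the scheduled task always completes the checkpoint, while the flush-triggered path only advances lastMark when a single ledger directory is configured.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the two checkpoint triggers described above.
final class CheckpointTriggersSketch {
    interface CheckpointSource {
        long newCheckpoint();                     // returns a journal position
        void checkpointComplete(long checkpoint); // persists the lastMark files
    }

    private final CheckpointSource source;
    private final boolean singleLedgerDirs;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    CheckpointTriggersSketch(CheckpointSource source, boolean singleLedgerDirs, long flushIntervalMs) {
        this.source = source;
        this.singleLedgerDirs = singleLedgerDirs;
        // Trigger 1: the SyncThread-style scheduled checkpoint. With the fix this is
        // the only place that advances lastMark when multiple ledger dirs are used.
        scheduler.scheduleAtFixedRate(
                () -> source.checkpointComplete(source.newCheckpoint()),
                flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
    }

    // Trigger 2: a write-cache-full flush. Advancing lastMark here is only safe
    // when a single ledger directory is configured.
    void onWriteCacheFull() {
        long cp = source.newCheckpoint();
        // ... flush the full write cache to the ledger disk ...
        if (singleLedgerDirs) {
            source.checkpointComplete(cp);
        }
    }
}
```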
Changes
I chose the second solution to fix this bug.