Fix data loss when multiple ledger directories are configured #3329
Conversation
@hangc0276 I found that when SyncThread does a flush, there exists a duplicate SyncThread flush:
eolivelli left a comment:
I think that you are on your way. Good work! I left some minor feedback.
Waiting for tests and for more reviews.
This change should be cherry-picked to all active branches; it is a serious bug.
I wonder if it would be better to have a constant for Long.MAX_VALUE
A change should also be made in Journal.readLog, where curMark should be set to the minimum mark across the ledger directories.
We should make a public constant here, otherwise magic numbers are easily forgotten.
Also, I may understand why we are changing from 0 to MAX_VALUE, but... what is the impact?
When creating a Journal instance, it initializes lastLogMark by reading each ledger directory's lastMark file and taking the minimum position as the replay start point. So we should initialize lastLogMark with MAX_VALUE.
The original logic initialized lastLogMark with 0 and took the maximum position across all the ledger directories' lastMark files as lastLogMark. IMO, that can lose data.
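For illustration, here is a minimal sketch of that min-fold (the Mark record and replayStart helper are assumptions for this example, not the actual Journal code): seeding the accumulator with Long.MAX_VALUE is what makes a minimum well defined, whereas a 0 seed only works for the old maximum-based logic.

```java
import java.util.List;

final class LastMarkExample {
    /** A (logId, offset) pair, mirroring what a lastMark file stores. */
    record Mark(long logId, long offset) {}

    /**
     * Sketch of the replay-start computation: seed with Long.MAX_VALUE and
     * fold with min, so the journal is replayed from the smallest persisted
     * mark. Seeding with 0 reproduces the old max-based behaviour, which can
     * skip data that was never flushed.
     */
    static Mark replayStart(List<Mark> marksFromEachLedgerDir) {
        Mark min = new Mark(Long.MAX_VALUE, Long.MAX_VALUE);
        for (Mark m : marksFromEachLedgerDir) {
            if (m.logId() < min.logId()
                    || (m.logId() == min.logId() && m.offset() < min.offset())) {
                min = m;
            }
        }
        return min;
    }
}
```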
Maybe @merlimat has more context about initializing lastLogMark to 0.
So null means "all"...
We should document this in the javadocs.
What about adding a comment here?
null means that the checkpoint is not started by a specific LedgerDirsManager.
I have added the javadoc on the checkpointComplete interface method.
We should add javadocs here and explain why we have this ledgerDirsManager and when it may be null.
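As an illustration only, such a javadoc could look like the sketch below; the parameter list and nested types are guesses for this example, not the actual CheckpointSource interface from this PR.

```java
import java.io.IOException;

// Illustrative sketch of the requested documentation, not the PR's real code.
interface CheckpointSourceSketch {
    /** Opaque marker for a journal position. */
    interface Checkpoint {}

    /** Stand-in for the manager of one set of ledger directories. */
    interface LedgerDirsManager {}

    /**
     * Marks {@code checkpoint} as complete.
     *
     * @param checkpoint        the checkpoint that has finished
     * @param compact           whether older journal files may be removed
     * @param ledgerDirsManager the manager that initiated the checkpoint;
     *                          {@code null} means the checkpoint was not
     *                          started by a specific LedgerDirsManager and
     *                          applies to all ledger directories
     */
    void checkpointComplete(Checkpoint checkpoint, boolean compact,
                            LedgerDirsManager ledgerDirsManager) throws IOException;
}
```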
nicoloboschi left a comment:
It's better to add a proper test case to avoid future regressions.
I also think the checkpoint is duplicated here.
A change should also be made in Journal.readLog, where curMark should be set to the minimum mark across the ledger directories.
+1, changes should also be made at ...
@nicoloboschi I have added the test to cover this change, please take a look, thanks.
@merlimat @eolivelli @nicoloboschi @gaozhangmin @lordcheng10 I have updated the code and added the test to cover this change, please take a look, thanks.
rerun failure checks
@hangc0276 This problem is not going to be resolved here, right?
@gaozhangmin Yes, we can use another PR to solve it.
rerun failure checks
I submitted PR #3353 to solve this issue. @hangc0276 PTAL
Since we are now replaying the journal from the smallest logMark, can we pass the journal log mark position of the current entry to the JournalScanner and compare it with the checkpoint position on the ledger disk where the entry to be restored is located, and only restore entries whose logMark position is greater than the checkpoint? That would avoid repeatedly writing data that has already been flushed to disk.
if (!isPaddingRecord) {
    scanner.process(journalVersion, offset, recBuff, journalId, recLog.fc.position());
}
It will introduce complex logic for this comparison.
- If the ledger directory expands or shrinks, the mapping from ledgerId to ledger directory (and thus to its logMark) also changes. An entry may be matched against the wrong logMark file, which will lead to skipping unflushed entries.
- There are many kinds of storage implementations, such as DbLedgerStorage, SortedLedgerStorage, and InterleavedLedgerStorage. We would have to get the storage instance for each ledgerId to check the logMark position per implementation, which introduces complex logic.
- The comparison only saves ledger write throughput; we still need to read the data out of the journal log file.

Based on the above reasons, I prefer to replay all entries in the journal log file from the min logMark position.
There are two places that can trigger a checkpoint. If we remove the flush-triggered one, only the scheduled task will update the lastMark position. WDYT @hangc0276
Force-pushed from fbf5c77 to e2f673e.
@aloyszhang Thanks for your suggestion. Yes, making the ...
This may throw exceptions when shutting down the SyncThread, which will call checkpoint on the ledgerStorage, since we have already shut down the ledgerStorage before.
Yes, you are right. I updated the code, please take a look, thanks.
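For context, a minimal sketch of the ordering concern (the class and method names are illustrative assumptions, not the actual Bookie shutdown code): the SyncThread must be stopped, and its final checkpoint run, before the ledger storage is shut down, otherwise the checkpoint hits an already-closed storage and throws.

```java
// Illustrative shutdown ordering only, not the actual Bookie code.
final class ShutdownOrderSketch {
    interface LedgerStorage {
        void checkpoint();   // flush caches and persist lastMark
        void shutdown();
    }

    static void shutdown(LedgerStorage storage, Thread syncThread) throws InterruptedException {
        // 1. Stop the SyncThread first; its last checkpoint still needs a live storage.
        syncThread.interrupt();
        syncThread.join();
        // 2. Only then shut down the ledger storage.
        storage.shutdown();
        // Reversing these two steps would make the SyncThread's final checkpoint
        // run against a storage that is already closed and throw exceptions.
    }
}
```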
@eolivelli @merlimat @dlg99 @zymap I updated the code and need your eyes on this PR, thanks.
Checkpoint cp = checkpointSource.newCheckpoint();
checkpoint(cp);
checkpointSource.checkpointComplete(cp, true);
if (singleLedgerDirs) {
Please add a small comment with a quick description of the motivation for this condition
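For illustration, the requested comment could read roughly like the sketch below; it reuses the identifiers from the snippet above and is a guess at the motivation, not necessarily the PR's final wording.

```java
if (singleLedgerDirs) {
    // With a single ledger directory, a flush-triggered checkpoint is safe:
    // the lastMark it persists only describes that directory's flushed data.
    // With multiple ledger directories, advancing lastMark here could mark
    // data as durable that another directory still holds only in its write
    // cache, so in that case only the SyncThread's periodic checkpoint
    // updates lastMark.
    Checkpoint cp = checkpointSource.newCheckpoint();
    checkpoint(cp);
    checkpointSource.checkpointComplete(cp, true);
}
```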
(cherry picked from commit 8a76703)

Motivation
We found one place where the bookie may lose data even though we turn on fsync for the journal.
Condition:
Assume we write 100MB of data into one bookie: 70MB is written into ledger1's write cache and 30MB into ledger2's write cache. Ledger1's write cache is full and triggers a flush. While flushing the write cache, it triggers a checkpoint that marks the journal's lastMark position (the 100MB offset) and writes that lastMark position into both ledger1's and ledger2's lastMark files.
At this point the bookie shuts down without flushing the write caches, for example via a kill -9 command, so ledger2's write cache (30MB) is not flushed to the ledger disk. But ledger2's lastMark position, persisted in its lastMark file, has already been updated to the 100MB offset.
When the bookie starts up, the journal replay position will be min(ledger1's lastMark, ledger2's lastMark), which is the 100MB offset. Ledger2's 30MB of data won't be replayed, and that data is lost.
Discussion thread:
https://lists.apache.org/thread/zz5vvv2yd80vqy22fv8wg5s2lqtkrzh9
Solutions
The root cause of this bug is that EntryLogger1 triggers a checkpoint when its write cache is full, updating both EntryLogger1's and EntryLogger2's lastMark positions. However, EntryLogger2's data may still be in its write cache, which can lead to data loss when the bookie is shut down with kill -9.
There are two solutions for this bug.

Solution 1: Update the lastMark position individually.
- When EntryLogger1's write cache is full and triggers a flush, it only updates its own lastMark position instead of also updating EntryLogger2's lastMark position.
- When expiring old journal files, the journal should take the minimum lastMark position among all the writeable EntryLoggers and delete only the journal files below that smallest lastMark position.
- When the bookie starts up, it should take the minimum lastMark position and replay the journal files from that position; otherwise, we will lose data.

However, there is one case that is hard to handle in the journal replay stage. When one ledger disk transfers from ReadOnly mode to Writeable mode, its lastMark position is an old value. Using the old position to replay the journal files will lead to a "target journal file not found" exception.

Solution 2: Only update the lastMark position in SyncThread.
There are two places that can trigger a checkpoint:
- the SyncThread's scheduled checkpoint, and
- the checkpoint triggered when a write cache is full and flushes.

The second way is the root cause of data loss if the ledger is configured with multiple directories. We can turn off the second way's lastMark update and only let SyncThread update the lastMark position in a checkpoint when the ledger is configured with multiple directories.
This is the simplest way to fix this bug, but it has two drawbacks:
- Journal file expiry depends on lastMark position updates, which in turn depend on the SyncThread's checkpoint interval. In Pulsar, the default interval is 60s, so journal files expire with at least a 60s delay.
- When the bookie starts up, it replays the journal from the lastMark position, which means it will replay at least 60s of journal data before start-up completes. This may slow down bookie start-up.

IMO, compared to data loss, the above two drawbacks are acceptable.
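To make the two trigger paths concrete, here is a minimal sketch (the class, field, and method names are illustrative assumptions, not BookKeeper's actual SyncThread or ledger-storage code): the scheduled task always completes the checkpoint, while the flush-triggered path only advances lastMark when a single ledger directory is configured.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the two checkpoint triggers described above.
final class CheckpointTriggersSketch {
    interface CheckpointSource {
        long newCheckpoint();                     // returns a journal position
        void checkpointComplete(long checkpoint); // persists the lastMark files
    }

    private final CheckpointSource source;
    private final boolean singleLedgerDirs;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    CheckpointTriggersSketch(CheckpointSource source, boolean singleLedgerDirs, long flushIntervalMs) {
        this.source = source;
        this.singleLedgerDirs = singleLedgerDirs;
        // Trigger 1: the SyncThread-style scheduled checkpoint. With the fix this is
        // the only place that advances lastMark when multiple ledger dirs are used.
        scheduler.scheduleAtFixedRate(
                () -> source.checkpointComplete(source.newCheckpoint()),
                flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
    }

    // Trigger 2: a write-cache-full flush. Advancing lastMark here is only safe
    // when a single ledger directory is configured.
    void onWriteCacheFull() {
        long cp = source.newCheckpoint();
        // ... flush the full write cache to the ledger disk ...
        if (singleLedgerDirs) {
            source.checkpointComplete(cp);
        }
    }
}
```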
Changes
I chose the second solution to fix this bug.