Skip to content

Fix ledger replicated failed blocks bookie decommission process#3917

Merged
zymap merged 5 commits intoapache:masterfrom
hangc0276:chenhang/fix_ledger_replicate_failed_bug
Jun 19, 2023
Merged

Fix ledger replicated failed blocks bookie decommission process#3917
zymap merged 5 commits intoapache:masterfrom
hangc0276:chenhang/fix_ledger_replicate_failed_bug

Conversation

@hangc0276
Copy link
Contributor

@hangc0276 hangc0276 commented Apr 13, 2023

Motivation

When I decommission one bookie (bk3), one ledger replicate failed and blocked decommission process.

This is the auto-recovery log:

2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.client.LedgerHandle - ReadEntries exception on ledgerId:904368 firstEntry:14 lastEntry:14 lastAddConfirmed:13
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Received error: -1 while trying to read entry: 14 of ledger: 904368 in ReplicationWorker
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Failed to read faulty entries, so giving up replicating ledgerFragment Fragment(LedgerID: 904368, FirstEntryID: 0[0], LastKnownEntryID: 14[14], Host: [betausc1-bk-10.betausc1-bk-headless.o-vaxkx.svc.cluster.local:3181], Closed: true)
2023-03-29T06:29:22,644+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker failed to replicate Ledger : 904368 for 6 number of times, so deferring the ledger lock release by 300000 msecs

The ledger's metadata:

ledgerID: 904368
2023-03-29T06:47:56,511+0000 [main] INFO  org.apache.bookkeeper.tools.cli.commands.
client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=3, writeQuorumSize=3, 
ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:, 
ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]},...}

The ledger (904368) has two ensembles, ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]}. However, the replication worker got the ledger's LAC is 13, but it got the replication fragment entry range is [0, 14]. When reading entry 14, it failed.

One question

Why the ensembles created a new ensemble starting with entryId = 15, but the ledger's lastAddConfirm is 13.

This question is related to two parts, one is how the new ensemble was created and the other is how the lastAddConfirm was generated.

1. How the new ensemble was created

The ensemble change is controlled on the bookie client side.

When one entry is ready to send to the bookie server, the bookie client will check whether need to do the ensemble change.

For the above case, when writing entry 15, one bookie is lost, it will trigger the ensemble change and generate the new ensemble: 15=[bk1:3181, bk2:3181, bk4:3181]. However, entry 15 write failed, such as timeout or bookie server rejected the write.

For now, entry 14 is written succeed.

2. How the lastAddConfirm was generated

Due to the ledger being in the OPEN state, the ledger handle will send a readLAC request according to the last ensemble to get the ledger's lastAddConfirm.

For the above case, the readLAC request will send to bk1, bk2, and bk4.

For the V2 protocol (Pulsar uses the V2 protocol to interact with the BookKeeper cluster), the bookie client put the lastAddConfirm EntryId in the next Entry's metadata.

When we use the V2 protocol to open an OPEN state ledger to read, it will send a readLastAddConfirm request to the bookie server, and the bookie server gets the last entry of this ledger and return to the client.

RecoveryData recoveryData = digestManager.verifyDigestAndReturnLastConfirmed(buffer);

However, the bookie client will parse the response entry and get the lastAddConfirm from the entry's metadata. Due to the entry just recording the previous EntryId as the lastAddConfirm, the LedgerHandle got the lastAddConfirm will be the penultimate EntryId of the ledger.

For the above case, the bk1 holds the max entry 14, bk2 holds the max entry 14, and bk4 returns NoSuchEntryException, LedgerHandle gets lastAddConfirm will be 14 - 1 = 13, not 14.

When the replicator tries to recover the first ensemble 0=[bk1:3181, bk2:3181, bk3:3181] with entry range [0, 14], reading entry 14 will throw a ReadEntryException due to the lastAddConfirm is 13.

if (lastEntry > lastAddConfirmed) {
LOG.error("ReadEntries exception on ledgerId:{} firstEntry:{} lastEntry:{} lastAddConfirmed:{}",
ledgerId, firstEntry, lastEntry, lastAddConfirmed);
cb.readComplete(BKException.Code.ReadException, this, null, ctx);
return;
}

Solution

When encountered that case that

  • The ledger is OPEN
  • The ledger has multiple ensembles
  • The ledger's last ensemble doesn't have any entries, which means lastAddConfirm < last ensemble key - 1

We should treat the penultimate segment/ensemble of the ledger as an OPEN state instead of a closed state.

this.isLedgerClosed = lh.getLedgerMetadata().isClosed()
|| !ensemble.equals(ensembles.get(ensembles.lastKey()));

After we treat the segment/ensemble as OPEN state, the replicator will close the ledger first and replicate it.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder of we risk to lose one entry in case of AQ < WQ.

I have to think more

@horizonzy
Copy link
Member

horizonzy commented Apr 13, 2023

How the lastAddConfirm was generated
Due to the ledger being in the OPEN state, the ledger handle will send a readLAC request to get the ledger's lastAddConfirm.
For the above case, if bk1 holds the max entry 14, bk2 holds the max entry 13 and bk3 holds the max entry 14 but it is lost, the LedgerHandle get lastAddConfirm will be 13, not 14.

I have a question about this.

public void checkLedger(final LedgerHandle lh,
                            final GenericCallback<Set<LedgerFragment>> cb,
                            long percentageOfLedgerFragmentToBeVerified)

There are two places to invoke LedgerChecker#checkLedger.

  1. localAdmin.asyncOpenLedgerNoRecovery(ledgerId, (rc, lh, ctx) -> {
    openLedgerNoRecoverySemaphore.release();
    if (BKException.Code.OK == rc) {
    // BookKeeperClientWorker-OrderedExecutor threads should not execute LedgerChecker#checkLedger
    // as this can lead to deadlocks
    ledgerCheckerExecutor.execute(() -> {
    checker.checkLedger(lh,
    // the ledger handle will be closed after checkLedger is done.
    new ProcessLostFragmentsCb(lh, callback),
    conf.getAuditorLedgerVerificationPercentage());
    // we collect the following stats to get a measure of the
    // distribution of a single ledger within the bk cluster
    // the higher the number of fragments/bookies, the more distributed it is
    auditorStats.getNumFragmentsPerLedger().registerSuccessfulValue(lh.getNumFragments());
    auditorStats.getNumBookiesPerLedger().registerSuccessfulValue(lh.getNumBookies());
    auditorStats.getNumLedgersChecked().inc();
    });

2.

try (LedgerHandle lh = admin.openLedgerNoRecovery(ledgerIdToReplicate)) {
Set<LedgerFragment> fragments = getUnderreplicatedFragments(lh,
conf.getAuditorLedgerVerificationPercentage());

Both places use the openLedgerNoRecovery to open the LedgerHandle.

It will use lh.asyncReadLastConfirmed() to get the LAC.

lh.asyncReadLastConfirmed(new ReadLastConfirmedCallback() {
@Override
public void readLastConfirmedComplete(int rc,
long lastConfirmed, Object ctx) {
if (rc == BKException.Code.TimeoutException) {
closeLedgerHandleAsync().whenComplete((r, ex) -> {
if (ex != null) {
LOG.error("Ledger {} close failed", ledgerId, ex);
}
openComplete(bk.getReturnRc(rc), null);
});
} else if (rc != BKException.Code.OK) {
closeLedgerHandleAsync().whenComplete((r, ex) -> {
if (ex != null) {
LOG.error("Ledger {} close failed", ledgerId, ex);
}
openComplete(bk.getReturnRc(BKException.Code.ReadException), null);
});
} else {
lh.lastAddConfirmed = lh.lastAddPushed = lastConfirmed;
openComplete(BKException.Code.OK, lh);
}
}
}, null);

And we use the V2 protocol, it invokes asyncReadPiggybackLastConfirmed.

public void asyncReadLastConfirmed(final ReadLastConfirmedCallback cb, final Object ctx) {
if (clientCtx.getConf().useV2WireProtocol) {
// in v2 protocol we don't support readLAC RPC
asyncReadPiggybackLastConfirmed(cb, ctx);
} else {
asyncReadExplicitLastConfirmed(cb, ctx);
}
}

In asyncReadPiggybackLastConfirmed, it invokes ReadLastConfirmedOp#initiate() to get the LAC.

new ReadLastConfirmedOp(clientCtx.getBookieClient(),
distributionSchedule,
macManager,
ledgerId,
getCurrentEnsemble(),
ledgerKey,
innercb).initiate();

In the ReadLastConfirmedOp, it will read all bookies in the currentEnsemble. The currentEnsemble is 15=[bk1:3181, bk2:3181, bk4:3181]. Not the {0=[bk1:3181, bk2:3181, bk3:3181].
So it will send readLac RPC to bk1, bk2, and bk4. Then get response from bk1, bk2, and bk4.
When ReadLastConfirmedOp#readEntryComplete, it will compare the response from different bookies, then pick the max lac response to override maxRecoveredData. Then invoke the callback using the maxRecoveredData.

So if the bk1 last entry is 14, the lac is 14-1 = 13

public synchronized void readEntryComplete(final int rc, final long ledgerId, final long entryId,
final ByteBuf buffer, final Object ctx) {
int bookieIndex = (Integer) ctx;
// add the response to coverage set
coverageSet.addBookie(bookieIndex, rc);
numResponsesPending--;
boolean heardValidResponse = false;
if (rc == BKException.Code.OK) {
try {
RecoveryData recoveryData = digestManager.verifyDigestAndReturnLastConfirmed(buffer);
if (recoveryData.getLastAddConfirmed() > maxRecoveredData.getLastAddConfirmed()) {
maxRecoveredData = recoveryData;
}
heardValidResponse = true;
} catch (BKDigestMatchException e) {
// Too bad, this bookie didn't give us a valid answer, we
// still might be able to recover though so continue
LOG.error("Mac mismatch for ledger: " + ledgerId + ", entry: " + entryId
+ " while reading last entry from bookie: "
+ currentEnsemble.get(bookieIndex));
}
}
if (rc == BKException.Code.NoSuchLedgerExistsException || rc == BKException.Code.NoSuchEntryException) {
// this still counts as a valid response, e.g., if the client crashed without writing any entry
heardValidResponse = true;
}
if (rc == BKException.Code.UnauthorizedAccessException && !completed) {
cb.readLastConfirmedDataComplete(rc, maxRecoveredData);
completed = true;
}
if (!heardValidResponse && BKException.Code.OK != rc) {
lastSeenError = rc;
}
// other return codes dont count as valid responses
if (heardValidResponse
&& coverageSet.checkCovered()
&& !completed) {
completed = true;
if (LOG.isDebugEnabled()) {
LOG.debug("Read Complete with enough validResponses for ledger: {}, entry: {}",
ledgerId, entryId);
}
cb.readLastConfirmedDataComplete(BKException.Code.OK, maxRecoveredData);
return;
}
if (numResponsesPending == 0 && !completed) {
LOG.error("While readLastConfirmed ledger: {} did not hear success responses from all quorums, {}",
ledgerId, coverageSet);
cb.readLastConfirmedDataComplete(lastSeenError, maxRecoveredData);
}

line_l08-line_112, it will pick the max lac to override maxRecoveredData.

line_137-line_148, use the maxRecoveredData to invoke callback.

@hangc0276
Copy link
Contributor Author

hangc0276 commented Apr 15, 2023

How the lastAddConfirm was generated
Due to the ledger being in the OPEN state, the ledger handle will send a readLAC request to get the ledger's lastAddConfirm.
For the above case, if bk1 holds the max entry 14, bk2 holds the max entry 13 and bk3 holds the max entry 14 but it is lost, the LedgerHandle get lastAddConfirm will be 13, not 14.

I have a question about this.

public void checkLedger(final LedgerHandle lh,
                            final GenericCallback<Set<LedgerFragment>> cb,
                            long percentageOfLedgerFragmentToBeVerified)

There are two places to invoke LedgerChecker#checkLedger.

  1. localAdmin.asyncOpenLedgerNoRecovery(ledgerId, (rc, lh, ctx) -> {
    openLedgerNoRecoverySemaphore.release();
    if (BKException.Code.OK == rc) {
    // BookKeeperClientWorker-OrderedExecutor threads should not execute LedgerChecker#checkLedger
    // as this can lead to deadlocks
    ledgerCheckerExecutor.execute(() -> {
    checker.checkLedger(lh,
    // the ledger handle will be closed after checkLedger is done.
    new ProcessLostFragmentsCb(lh, callback),
    conf.getAuditorLedgerVerificationPercentage());
    // we collect the following stats to get a measure of the
    // distribution of a single ledger within the bk cluster
    // the higher the number of fragments/bookies, the more distributed it is
    auditorStats.getNumFragmentsPerLedger().registerSuccessfulValue(lh.getNumFragments());
    auditorStats.getNumBookiesPerLedger().registerSuccessfulValue(lh.getNumBookies());
    auditorStats.getNumLedgersChecked().inc();
    });

try (LedgerHandle lh = admin.openLedgerNoRecovery(ledgerIdToReplicate)) {
Set<LedgerFragment> fragments = getUnderreplicatedFragments(lh,
conf.getAuditorLedgerVerificationPercentage());

Both places use the openLedgerNoRecovery to open the LedgerHandle.

It will use lh.asyncReadLastConfirmed() to get the LAC.

lh.asyncReadLastConfirmed(new ReadLastConfirmedCallback() {
@Override
public void readLastConfirmedComplete(int rc,
long lastConfirmed, Object ctx) {
if (rc == BKException.Code.TimeoutException) {
closeLedgerHandleAsync().whenComplete((r, ex) -> {
if (ex != null) {
LOG.error("Ledger {} close failed", ledgerId, ex);
}
openComplete(bk.getReturnRc(rc), null);
});
} else if (rc != BKException.Code.OK) {
closeLedgerHandleAsync().whenComplete((r, ex) -> {
if (ex != null) {
LOG.error("Ledger {} close failed", ledgerId, ex);
}
openComplete(bk.getReturnRc(BKException.Code.ReadException), null);
});
} else {
lh.lastAddConfirmed = lh.lastAddPushed = lastConfirmed;
openComplete(BKException.Code.OK, lh);
}
}
}, null);

And we use the V2 protocol, it invokes asyncReadPiggybackLastConfirmed.

public void asyncReadLastConfirmed(final ReadLastConfirmedCallback cb, final Object ctx) {
if (clientCtx.getConf().useV2WireProtocol) {
// in v2 protocol we don't support readLAC RPC
asyncReadPiggybackLastConfirmed(cb, ctx);
} else {
asyncReadExplicitLastConfirmed(cb, ctx);
}
}

In asyncReadPiggybackLastConfirmed, it invokes ReadLastConfirmedOp#initiate() to get the LAC.

new ReadLastConfirmedOp(clientCtx.getBookieClient(),
distributionSchedule,
macManager,
ledgerId,
getCurrentEnsemble(),
ledgerKey,
innercb).initiate();

In the ReadLastConfirmedOp, it will read all bookies in the currentEnsemble. The currentEnsemble is 15=[bk1:3181, bk2:3181, bk4:3181]. Not the {0=[bk1:3181, bk2:3181, bk3:3181]. So it will send readLac RPC to bk1, bk2, and bk4. Then get response from bk1, bk2, and bk4. When ReadLastConfirmedOp#readEntryComplete, it will compare the response from different bookies, then pick the max lac response to override maxRecoveredData. Then invoke the callback using the maxRecoveredData.

So if the bk1 last entry is 14, the lac is 14-1 = 13

public synchronized void readEntryComplete(final int rc, final long ledgerId, final long entryId,
final ByteBuf buffer, final Object ctx) {
int bookieIndex = (Integer) ctx;
// add the response to coverage set
coverageSet.addBookie(bookieIndex, rc);
numResponsesPending--;
boolean heardValidResponse = false;
if (rc == BKException.Code.OK) {
try {
RecoveryData recoveryData = digestManager.verifyDigestAndReturnLastConfirmed(buffer);
if (recoveryData.getLastAddConfirmed() > maxRecoveredData.getLastAddConfirmed()) {
maxRecoveredData = recoveryData;
}
heardValidResponse = true;
} catch (BKDigestMatchException e) {
// Too bad, this bookie didn't give us a valid answer, we
// still might be able to recover though so continue
LOG.error("Mac mismatch for ledger: " + ledgerId + ", entry: " + entryId
+ " while reading last entry from bookie: "
+ currentEnsemble.get(bookieIndex));
}
}
if (rc == BKException.Code.NoSuchLedgerExistsException || rc == BKException.Code.NoSuchEntryException) {
// this still counts as a valid response, e.g., if the client crashed without writing any entry
heardValidResponse = true;
}
if (rc == BKException.Code.UnauthorizedAccessException && !completed) {
cb.readLastConfirmedDataComplete(rc, maxRecoveredData);
completed = true;
}
if (!heardValidResponse && BKException.Code.OK != rc) {
lastSeenError = rc;
}
// other return codes dont count as valid responses
if (heardValidResponse
&& coverageSet.checkCovered()
&& !completed) {
completed = true;
if (LOG.isDebugEnabled()) {
LOG.debug("Read Complete with enough validResponses for ledger: {}, entry: {}",
ledgerId, entryId);
}
cb.readLastConfirmedDataComplete(BKException.Code.OK, maxRecoveredData);
return;
}
if (numResponsesPending == 0 && !completed) {
LOG.error("While readLastConfirmed ledger: {} did not hear success responses from all quorums, {}",
ledgerId, coverageSet);
cb.readLastConfirmedDataComplete(lastSeenError, maxRecoveredData);
}

line_l08-line_112, it will pick the max lac to override maxRecoveredData.

line_137-line_148, use the maxRecoveredData to invoke callback.

@horizonzy Yes, you are right. I updated the description, please help take a look, thanks.

@hangc0276
Copy link
Contributor Author

I wonder of we risk to lose one entry in case of AQ < WQ.

I have to think more

@eolivelli Sorry, I misunderstood the lastAddConfirm part. It doesn't have the risk of losing one entry. I updated the description, please help take a look, thanks.

this.isLedgerClosed = lh.getLedgerMetadata().isClosed()
|| !ensemble.equals(ensembles.get(ensembles.lastKey()));
|| (!ensemble.equals(ensembles.get(ensembles.lastKey()))
&& lh.getLastAddConfirmed() >= ensembles.lastKey() - 1);
Copy link
Member

@horizonzy horizonzy Apr 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here should be `lh.getLastAddConfirmed() == ensembles.lastKey() - 2.
The lac maybe 100. The last ensemble key maybe 10.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not. Take the following case for example:
0: [bk1, bk2, bk3]
10: [bk1, bk2, bk4]
LAC = 100
Ledger is OPEN

For the first ensemble: isLedgerClosed = (false || (true && 100 >= 9)) => true
For the last ensemble: isLedgerClosed = (false || (false && 100 >= 9)) => false
The result is expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right.

LedgerHandle lh = bkc.createLedger(3, 3, 2, BookKeeper.DigestType.CRC32, TESTPASSWD);
assertFalse(lh.getLedgerMetadata().isClosed());

List<BookieId> firstEnsemble = lh.getLedgerMetadata().getAllEnsembles().firstEntry().getValue();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

firstEnsemble is unused

* segment/ensemble has missing bookies.
*/
private boolean isLastSegmentOpenAndMissingBookies(LedgerHandle lh) throws BKException {
LedgerMetadata md = admin.getLedgerMetadata(lh);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

md is unused.

SortedMap<Long, ? extends List<BookieId>> ensembles = admin.getLedgerMetadata(lh).getAllEnsembles();
List<BookieId> finalEnsemble = ensembles.get(ensembles.lastKey());
if (ensembles.size() > 1 && lh.getLastAddConfirmed() < ensembles.lastKey() - 1) {
finalEnsemble = new ArrayList<>(finalEnsemble);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finalEnsemble = new ArrayList<>(finalEnsemble); may be meaningless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The finalEnsemble is immutable in the ledger metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, get it.

@hangc0276
Copy link
Contributor Author

@merlimat @eolivelli @dlg99 Please help take a look, thanks.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

Copy link
Member

@wenbingshen wenbingshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zymap zymap merged commit eff38e4 into apache:master Jun 19, 2023
zymap pushed a commit that referenced this pull request Jun 19, 2023
### Motivation
When I decommission one bookie (bk3), one ledger replicate failed and blocked decommission process.

This is the auto-recovery log:
```
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.client.LedgerHandle - ReadEntries exception on ledgerId:904368 firstEntry:14 lastEntry:14 lastAddConfirmed:13
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Received error: -1 while trying to read entry: 14 of ledger: 904368 in ReplicationWorker
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Failed to read faulty entries, so giving up replicating ledgerFragment Fragment(LedgerID: 904368, FirstEntryID: 0[0], LastKnownEntryID: 14[14], Host: [betausc1-bk-10.betausc1-bk-headless.o-vaxkx.svc.cluster.local:3181], Closed: true)
2023-03-29T06:29:22,644+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker failed to replicate Ledger : 904368 for 6 number of times, so deferring the ledger lock release by 300000 msecs
```
The ledger's metadata:
```
ledgerID: 904368
2023-03-29T06:47:56,511+0000 [main] INFO  org.apache.bookkeeper.tools.cli.commands.
client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=3, writeQuorumSize=3,
ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:,
ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]},...}
```

The ledger (904368) has two ensembles, `ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]}`. However, the replication worker got the ledger's LAC is 13, but it got the replication fragment entry range is [0, 14]. When reading entry 14, it failed.

### One question
**Why the ensembles created a new ensemble starting with entryId = 15, but the ledger's lastAddConfirm is 13.**

This question is related to two parts, one is how the new ensemble was created and the other is how the lastAddConfirm was generated.

#### 1. How the new ensemble was created
The ensemble change is controlled on the bookie client side.

When one entry is ready to send to the bookie server, the bookie client will check whether need to do the ensemble change.
https://github.com/apache/bookkeeper/blob/912896deb2e748389e15e74c37539b2ff36302c7/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L254

For the above case, when writing entry 15, one bookie is lost, it will trigger the ensemble change and generate the new ensemble: 15=[bk1:3181, bk2:3181, bk4:3181]. However, entry 15 write failed, such as timeout or bookie server rejected the write.

For now, entry 14 is written succeed.

#### 2. How the lastAddConfirm was generated
Due to the ledger being in the `OPEN` state, the ledger handle will send a readLAC request according to the last ensemble to get the ledger's lastAddConfirm.

For the above case, the readLAC request will send to bk1, bk2, and bk4.

For the `V2` protocol (Pulsar uses the V2 protocol to interact with the BookKeeper cluster), the bookie client put the lastAddConfirm EntryId in the next Entry's metadata.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L134

When we use the `V2` protocol to open an `OPEN` state ledger to read, it will send a readLastAddConfirm request to the bookie server, and the bookie server gets the last entry of this ledger and return to the client.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java#L108

However, the bookie client will parse the response entry and get the lastAddConfirm from the entry's metadata. Due to the entry just recording the previous EntryId as the lastAddConfirm, the LedgerHandle got the lastAddConfirm will be the penultimate EntryId of the ledger.

For the above case, the bk1 holds the max entry 14, bk2 holds the max entry 14, and bk4 returns NoSuchEntryException, LedgerHandle gets lastAddConfirm will be `14  - 1 = 13`, not 14.

When the replicator tries to recover the first ensemble 0=[bk1:3181, bk2:3181, bk3:3181] with entry range [0, 14],  reading entry 14 will throw a ReadEntryException due to the lastAddConfirm is 13.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java#L685-L690

### Solution
When encountered that case that
- The ledger is `OPEN`
- The ledger has multiple ensembles
- The ledger's last ensemble doesn't have any entries, which means `lastAddConfirm < last ensemble key - 1`

We should treat the penultimate segment/ensemble of the ledger as an `OPEN` state instead of a closed state.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerFragment.java#L56-L57

After we treat the segment/ensemble as `OPEN` state, the replicator will close the ledger first and replicate it.

(cherry picked from commit eff38e4)
hangc0276 added a commit to hangc0276/bookkeeper that referenced this pull request Jun 26, 2023
…he#3917)

When I decommission one bookie (bk3), one ledger replicate failed and blocked decommission process.

This is the auto-recovery log:
```
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.client.LedgerHandle - ReadEntries exception on ledgerId:904368 firstEntry:14 lastEntry:14 lastAddConfirmed:13
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Received error: -1 while trying to read entry: 14 of ledger: 904368 in ReplicationWorker
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Failed to read faulty entries, so giving up replicating ledgerFragment Fragment(LedgerID: 904368, FirstEntryID: 0[0], LastKnownEntryID: 14[14], Host: [betausc1-bk-10.betausc1-bk-headless.o-vaxkx.svc.cluster.local:3181], Closed: true)
2023-03-29T06:29:22,644+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker failed to replicate Ledger : 904368 for 6 number of times, so deferring the ledger lock release by 300000 msecs
```
The ledger's metadata:
```
ledgerID: 904368
2023-03-29T06:47:56,511+0000 [main] INFO  org.apache.bookkeeper.tools.cli.commands.
client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=3, writeQuorumSize=3,
ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:,
ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]},...}
```

The ledger (904368) has two ensembles, `ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]}`. However, the replication worker got the ledger's LAC is 13, but it got the replication fragment entry range is [0, 14]. When reading entry 14, it failed.

**Why the ensembles created a new ensemble starting with entryId = 15, but the ledger's lastAddConfirm is 13.**

This question is related to two parts, one is how the new ensemble was created and the other is how the lastAddConfirm was generated.

The ensemble change is controlled on the bookie client side.

When one entry is ready to send to the bookie server, the bookie client will check whether need to do the ensemble change.
https://github.com/apache/bookkeeper/blob/912896deb2e748389e15e74c37539b2ff36302c7/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L254

For the above case, when writing entry 15, one bookie is lost, it will trigger the ensemble change and generate the new ensemble: 15=[bk1:3181, bk2:3181, bk4:3181]. However, entry 15 write failed, such as timeout or bookie server rejected the write.

For now, entry 14 is written succeed.

Due to the ledger being in the `OPEN` state, the ledger handle will send a readLAC request according to the last ensemble to get the ledger's lastAddConfirm.

For the above case, the readLAC request will send to bk1, bk2, and bk4.

For the `V2` protocol (Pulsar uses the V2 protocol to interact with the BookKeeper cluster), the bookie client put the lastAddConfirm EntryId in the next Entry's metadata.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L134

When we use the `V2` protocol to open an `OPEN` state ledger to read, it will send a readLastAddConfirm request to the bookie server, and the bookie server gets the last entry of this ledger and return to the client.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java#L108

However, the bookie client will parse the response entry and get the lastAddConfirm from the entry's metadata. Due to the entry just recording the previous EntryId as the lastAddConfirm, the LedgerHandle got the lastAddConfirm will be the penultimate EntryId of the ledger.

For the above case, the bk1 holds the max entry 14, bk2 holds the max entry 14, and bk4 returns NoSuchEntryException, LedgerHandle gets lastAddConfirm will be `14  - 1 = 13`, not 14.

When the replicator tries to recover the first ensemble 0=[bk1:3181, bk2:3181, bk3:3181] with entry range [0, 14],  reading entry 14 will throw a ReadEntryException due to the lastAddConfirm is 13.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java#L685-L690

When encountered that case that
- The ledger is `OPEN`
- The ledger has multiple ensembles
- The ledger's last ensemble doesn't have any entries, which means `lastAddConfirm < last ensemble key - 1`

We should treat the penultimate segment/ensemble of the ledger as an `OPEN` state instead of a closed state.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerFragment.java#L56-L57

After we treat the segment/ensemble as `OPEN` state, the replicator will close the ledger first and replicate it.

(cherry picked from commit eff38e4)
zymap pushed a commit that referenced this pull request Dec 6, 2023
### Motivation
When I decommission one bookie (bk3), one ledger replicate failed and blocked decommission process.

This is the auto-recovery log:
```
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.client.LedgerHandle - ReadEntries exception on ledgerId:904368 firstEntry:14 lastEntry:14 lastAddConfirmed:13
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Received error: -1 while trying to read entry: 14 of ledger: 904368 in ReplicationWorker
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Failed to read faulty entries, so giving up replicating ledgerFragment Fragment(LedgerID: 904368, FirstEntryID: 0[0], LastKnownEntryID: 14[14], Host: [betausc1-bk-10.betausc1-bk-headless.o-vaxkx.svc.cluster.local:3181], Closed: true)
2023-03-29T06:29:22,644+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker failed to replicate Ledger : 904368 for 6 number of times, so deferring the ledger lock release by 300000 msecs
```
The ledger's metadata:
```
ledgerID: 904368
2023-03-29T06:47:56,511+0000 [main] INFO  org.apache.bookkeeper.tools.cli.commands.
client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=3, writeQuorumSize=3,
ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:,
ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]},...}
```

The ledger (904368) has two ensembles, `ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]}`. However, the replication worker got the ledger's LAC is 13, but it got the replication fragment entry range is [0, 14]. When reading entry 14, it failed.

### One question
**Why the ensembles created a new ensemble starting with entryId = 15, but the ledger's lastAddConfirm is 13.**

This question is related to two parts, one is how the new ensemble was created and the other is how the lastAddConfirm was generated.

#### 1. How the new ensemble was created
The ensemble change is controlled on the bookie client side.

When one entry is ready to send to the bookie server, the bookie client will check whether need to do the ensemble change.
https://github.com/apache/bookkeeper/blob/912896deb2e748389e15e74c37539b2ff36302c7/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L254

For the above case, when writing entry 15, one bookie is lost, it will trigger the ensemble change and generate the new ensemble: 15=[bk1:3181, bk2:3181, bk4:3181]. However, entry 15 write failed, such as timeout or bookie server rejected the write.

For now, entry 14 is written succeed.

#### 2. How the lastAddConfirm was generated
Due to the ledger being in the `OPEN` state, the ledger handle will send a readLAC request according to the last ensemble to get the ledger's lastAddConfirm.

For the above case, the readLAC request will send to bk1, bk2, and bk4.

For the `V2` protocol (Pulsar uses the V2 protocol to interact with the BookKeeper cluster), the bookie client put the lastAddConfirm EntryId in the next Entry's metadata.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L134

When we use the `V2` protocol to open an `OPEN` state ledger to read, it will send a readLastAddConfirm request to the bookie server, and the bookie server gets the last entry of this ledger and return to the client.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java#L108

However, the bookie client will parse the response entry and get the lastAddConfirm from the entry's metadata. Due to the entry just recording the previous EntryId as the lastAddConfirm, the LedgerHandle got the lastAddConfirm will be the penultimate EntryId of the ledger.

For the above case, the bk1 holds the max entry 14, bk2 holds the max entry 14, and bk4 returns NoSuchEntryException, LedgerHandle gets lastAddConfirm will be `14  - 1 = 13`, not 14.

When the replicator tries to recover the first ensemble 0=[bk1:3181, bk2:3181, bk3:3181] with entry range [0, 14],  reading entry 14 will throw a ReadEntryException due to the lastAddConfirm is 13.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java#L685-L690

### Solution
When encountered that case that
- The ledger is `OPEN`
- The ledger has multiple ensembles
- The ledger's last ensemble doesn't have any entries, which means `lastAddConfirm < last ensemble key - 1`

We should treat the penultimate segment/ensemble of the ledger as an `OPEN` state instead of a closed state.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerFragment.java#L56-L57

After we treat the segment/ensemble as `OPEN` state, the replicator will close the ledger first and replicate it.

(cherry picked from commit eff38e4)
Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
…he#3917)

### Motivation
When I decommission one bookie (bk3), one ledger replicate failed and blocked decommission process.

This is the auto-recovery log:
```
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.client.LedgerHandle - ReadEntries exception on ledgerId:904368 firstEntry:14 lastEntry:14 lastAddConfirmed:13
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Received error: -1 while trying to read entry: 14 of ledger: 904368 in ReplicationWorker
2023-03-29T06:29:22,642+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - Failed to read faulty entries, so giving up replicating ledgerFragment Fragment(LedgerID: 904368, FirstEntryID: 0[0], LastKnownEntryID: 14[14], Host: [betausc1-bk-10.betausc1-bk-headless.o-vaxkx.svc.cluster.local:3181], Closed: true)
2023-03-29T06:29:22,644+0000 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker failed to replicate Ledger : 904368 for 6 number of times, so deferring the ledger lock release by 300000 msecs
```
The ledger's metadata:
```
ledgerID: 904368
2023-03-29T06:47:56,511+0000 [main] INFO  org.apache.bookkeeper.tools.cli.commands.
client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=3, writeQuorumSize=3, 
ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:, 
ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]},...}
```

The ledger (904368) has two ensembles, `ensembles={0=[bk1:3181, bk2:3181, bk3:3181], 15=[bk1:3181, bk2:3181, bk4:3181]}`. However, the replication worker got the ledger's LAC is 13, but it got the replication fragment entry range is [0, 14]. When reading entry 14, it failed.


### One question
**Why the ensembles created a new ensemble starting with entryId = 15, but the ledger's lastAddConfirm is 13.**

This question is related to two parts, one is how the new ensemble was created and the other is how the lastAddConfirm was generated.

#### 1. How the new ensemble was created
The ensemble change is controlled on the bookie client side. 

When one entry is ready to send to the bookie server, the bookie client will check whether need to do the ensemble change.
https://github.com/apache/bookkeeper/blob/912896deb2e748389e15e74c37539b2ff36302c7/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L254

For the above case, when writing entry 15, one bookie is lost, it will trigger the ensemble change and generate the new ensemble: 15=[bk1:3181, bk2:3181, bk4:3181]. However, entry 15 write failed, such as timeout or bookie server rejected the write.

For now, entry 14 is written succeed.

#### 2. How the lastAddConfirm was generated
Due to the ledger being in the `OPEN` state, the ledger handle will send a readLAC request according to the last ensemble to get the ledger's lastAddConfirm. 

For the above case, the readLAC request will send to bk1, bk2, and bk4.

For the `V2` protocol (Pulsar uses the V2 protocol to interact with the BookKeeper cluster), the bookie client put the lastAddConfirm EntryId in the next Entry's metadata. 
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L134

When we use the `V2` protocol to open an `OPEN` state ledger to read, it will send a readLastAddConfirm request to the bookie server, and the bookie server gets the last entry of this ledger and return to the client.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java#L108

However, the bookie client will parse the response entry and get the lastAddConfirm from the entry's metadata. Due to the entry just recording the previous EntryId as the lastAddConfirm, the LedgerHandle got the lastAddConfirm will be the penultimate EntryId of the ledger.

For the above case, the bk1 holds the max entry 14, bk2 holds the max entry 14, and bk4 returns NoSuchEntryException, LedgerHandle gets lastAddConfirm will be `14  - 1 = 13`, not 14.

When the replicator tries to recover the first ensemble 0=[bk1:3181, bk2:3181, bk3:3181] with entry range [0, 14],  reading entry 14 will throw a ReadEntryException due to the lastAddConfirm is 13.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java#L685-L690

### Solution
When encountered that case that 
- The ledger is `OPEN`
- The ledger has multiple ensembles
- The ledger's last ensemble doesn't have any entries, which means `lastAddConfirm < last ensemble key - 1`

We should treat the penultimate segment/ensemble of the ledger as an `OPEN` state instead of a closed state.
https://github.com/apache/bookkeeper/blob/df4492012cc03682534cbc8dd68dd81163b0c947/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerFragment.java#L56-L57

After we treat the segment/ensemble as `OPEN` state, the replicator will close the ledger first and replicate it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

Comments