KAFKA-20058: Fix race condition on backoffDeadlineMs on RPCProducerIdManager causing premature retries #21279
Conversation
@chia7712 @jolshan Hi! While investigating a flaky test, I identified a race condition in the transaction code path (around `RPCProducerIdManager`). Since you’re closest to the current context in this area, I’d really appreciate it if you could take a look and share any feedback when you have bandwidth 🙇♂️
A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers.
Thanks for fixing the bug!
I also found a second race which can cause premature retries, where maybeRequestNextBlock reads a stale backoffDeadlineMs and then the in-flight request fails.
| `maybeRequestNextBlock` | `handleUnsuccessfulResponse` |
|---|---|
| `var retryTimestamp = backoffDeadlineMs.get();` | |
| `if (retryTimestamp == NO_RETRY \|\| time.milliseconds() >= retryTimestamp) {` | |
| | `backoffDeadlineMs.set(time.milliseconds() + RETRY_BACKOFF_MS);` |
| | `requestInFlight.set(false);` |
| `requestInFlight.compareAndSet(false, true)` | |
Maybe we can fix this second race in a separate PR.
```diff
         sendRequest();
         // Reset backoff after a successful send.
-        backoffDeadlineMs.set(NO_RETRY);
+        backoffDeadlineMs.compareAndSet(retryTimestamp, NO_RETRY);
```
Thank you for fixing the bug!
Could we consider only updating `backoffDeadlineMs` together with the clearing of `requestInFlight`? That way we don't have to think about the race when setting `backoffDeadlineMs` at all, since it would only be set at the end of the in-flight request.
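A minimal sketch of that idea, using the field and constant names already quoted in this thread (`backoffDeadlineMs`, `requestInFlight`, `RETRY_BACKOFF_MS`, `NO_RETRY`); the stand-in class, method signatures, and constant values here are assumptions for illustration, not the PR's actual diff:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Stand-in class, not RPCProducerIdManager itself: backoffDeadlineMs is only
// written at the end of an in-flight request, right before requestInFlight is
// cleared, so the request path never has to write it at all.
class ResponsePathOwnsBackoff {
    static final long NO_RETRY = -1L;          // placeholder value
    static final long RETRY_BACKOFF_MS = 50L;  // placeholder value

    final AtomicLong backoffDeadlineMs = new AtomicLong(NO_RETRY);
    final AtomicBoolean requestInFlight = new AtomicBoolean(false);

    void handleUnsuccessfulResponse(long nowMs) {
        // Set the backoff first, then release the in-flight flag, so any
        // caller that wins the requestInFlight CAS afterwards is guaranteed
        // to observe the fresh deadline.
        backoffDeadlineMs.set(nowMs + RETRY_BACKOFF_MS);
        requestInFlight.set(false);
    }

    void handleSuccessfulResponse() {
        // Success clears the backoff here rather than in
        // maybeRequestNextBlock(), which removes the clobbering race.
        backoffDeadlineMs.set(NO_RETRY);
        requestInFlight.set(false);
    }
}
```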
Thanks for your comments! Good idea 👍
I made a commit based on your comment.
To preserve the existing semantics, I added code to set `backoffDeadlineMs` to `NO_RETRY` on the timeout path.
A more conservative approach could be to call `handleUnsuccessfulResponse()` on TIMEOUT as well, so that we apply the same retry backoff. However, since the previous code path did not update `backoffDeadlineMs` in `onTimeout()`, I kept that behavior here to minimize any behavioral change in this PR.
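For reference, the timeout behaviour described above could look roughly like the following, continuing the stand-in class from the earlier sketch (the method name and shape are assumptions, not the actual timeout handler in the PR):

```java
// Continuing the stand-in class above: the timeout path keeps the existing
// semantics of applying no retry backoff, and simply releases the in-flight
// flag so the next generateProducerId() call can send again.
void handleTimeout() {
    backoffDeadlineMs.set(NO_RETRY); // preserve old behaviour: no backoff on timeout
    requestInFlight.set(false);
}
```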
When you have bandwidth, please take another look. 🙇♂️
Oh, I just left a similar comment (#21279 (comment)).
@squah-confluent If you’re okay with it, I can file an issue and follow up with a separate PR for this. If you were already planning to address it yourself, please let me know and I’ll hold off! Also, regarding the fix, I was thinking that reordering the operations as follows might address the issue, but we can discuss this further in the next PR.

```java
private void maybeRequestNextBlock() {
    if (nextProducerIdBlock.get() != null)
        return;

    // Claim the in-flight slot before reading the deadline, so a concurrently
    // failing request cannot invalidate the value we read.
    if (!requestInFlight.compareAndSet(false, true))
        return;

    final long retryTimestamp = backoffDeadlineMs.get();
    final long now = time.milliseconds();
    if (retryTimestamp != NO_RETRY && now < retryTimestamp) {
        // Still backing off: release the slot without sending.
        requestInFlight.set(false);
        return;
    }

    sendRequest();
}
```
squah-confluent left a comment
I'm happy with the fix, thank you!
Thanks for the fix. Please go ahead and file the issue and PR! I can think of two ways to fix it:
Hey folks -- sorry, my email notifications don't work very well. I can take a look.

@chickenchickenlove Could you update the PR description?

@chickenchickenlove thanks for this fix. It makes sense to me. I have triggered the CI, and I will take a closer look shortly.

Ah ok -- will wait for Chia-Ping before merging.

@squah-confluent @jolshan @chia7712 The linter failure was caused by the missing "Reviewers" field in the description. Fixed.
```java
if (nextProducerIdBlock.get() == null &&
        requestInFlight.compareAndSet(false, true)) {
    sendRequest();
    // Reset backoff after a successful send.
```
I was thinking about using `compareAndSet` to resolve the race condition, but your approach is much cleaner.
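For comparison, the `compareAndSet` idea mentioned here (visible in the earlier diff context, where `set(NO_RETRY)` becomes `compareAndSet(retryTimestamp, NO_RETRY)`) would look roughly like this against the same stand-in fields as the earlier sketches; `sendRequest()` is an assumed helper. It avoids clobbering a fresh backoff, but the request path still writes `backoffDeadlineMs`, which is why the response-path-only approach reads cleaner:

```java
// Sketch of the compareAndSet alternative: the request path remembers the
// deadline it read and only resets the backoff if nobody wrote a newer one
// in the meantime.
void maybeRequestNextBlockWithCas(long nowMs) {
    long retryTimestamp = backoffDeadlineMs.get();
    if (retryTimestamp != NO_RETRY && nowMs < retryTimestamp)
        return; // still backing off

    if (!requestInFlight.compareAndSet(false, true))
        return; // a request is already in flight

    sendRequest(); // assumed helper that issues the async controller request

    // If handleUnsuccessfulResponse() already replaced retryTimestamp with a
    // new deadline, this CAS fails and the fresh backoff is preserved.
    backoffDeadlineMs.compareAndSet(retryTimestamp, NO_RETRY);
}
```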
Description
This PR fixes a race condition in `RPCProducerIdManager.maybeRequestNextBlock()` that can clobber a newly-set retry backoff and cause premature retries.
The Problem
`maybeRequestNextBlock()` sends the controller request asynchronously and then unconditionally resets `backoffDeadlineMs` to `NO_RETRY`. On the response path, `handleUnsuccessfulResponse()` sets `backoffDeadlineMs = now + RETRY_BACKOFF_MS`. Because the send is asynchronous, the unconditional reset in the request path can execute after the failure handler has already set the backoff. This overwrites the valid backoff with `NO_RETRY`. Consequently, a subsequent `generateProducerId()` call can re-send immediately, leading to unnecessary controller traffic and flaky test behavior.
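For context, a rough reconstruction of the pre-fix request path, pieced together from the snippets quoted in the review comments above (the real method may differ in details):

```java
// Pre-fix shape (reconstruction). The unconditional reset on the last line is
// the write that can land after handleUnsuccessfulResponse() has already set
// a new backoff deadline, wiping it out.
private void maybeRequestNextBlock() {
    long retryTimestamp = backoffDeadlineMs.get();
    if (retryTimestamp == NO_RETRY || time.milliseconds() >= retryTimestamp) {
        if (nextProducerIdBlock.get() == null &&
                requestInFlight.compareAndSet(false, true)) {
            sendRequest();
            // Reset backoff after a successful send.
            backoffDeadlineMs.set(NO_RETRY);
        }
    }
}
```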
Fix
To avoid this race entirely, `backoffDeadlineMs` is now only updated in the response-handler path:

- The reset of `backoffDeadlineMs` is removed from `maybeRequestNextBlock()`.
- On a successful response, `backoffDeadlineMs` is set to `NO_RETRY`.
- On timeout, `backoffDeadlineMs` is also set to `NO_RETRY` (no retry backoff is applied on timeout in this code path).

This keeps backoff state changes localized to the response-handling thread and prevents request-path updates from clobbering a concurrent backoff update.
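Put together, the fixed layout is roughly the following (again a reconstruction from the snippets and description in this PR, not a verbatim copy of the merged diff):

```java
// Post-fix shape (reconstruction): the request path only reads
// backoffDeadlineMs; every write happens on the response-handling side
// (success, failure, timeout).
private void maybeRequestNextBlock() {
    long retryTimestamp = backoffDeadlineMs.get();
    if (retryTimestamp != NO_RETRY && time.milliseconds() < retryTimestamp)
        return; // still backing off

    if (nextProducerIdBlock.get() == null &&
            requestInFlight.compareAndSet(false, true)) {
        sendRequest(); // no backoff reset here any more
    }
}
```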
Flaky tests fixed by this change
https://develocity.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.coordinator.transaction.ProducerIdManagerTest&tests.sortField=FLAKY
- `ProducerIdManagerTest#testRetryBackoffOnNoResponse`
- `ProducerIdManagerTest#testRetryBackoffOnAuthException`
- `ProducerIdManagerTest#testRetryBackoffOnVersionMismatch`

Sequence diagram of the flaky test cases that trigger the race condition.
Reviewers: Justine Olshan [email protected], Sean Quah [email protected], Chia-Ping Tsai [email protected]