
Conversation

@chickenchickenlove (Contributor) commented Jan 10, 2026

Description

This PR fixes a race condition in
RPCProducerIdManager.maybeRequestNextBlock() that can clobber a
newly-set retry backoff and cause premature retries.

The Problem

maybeRequestNextBlock() sends the controller request asynchronously and then
unconditionally resets backoffDeadlineMs to NO_RETRY. On the response path,
handleUnsuccessfulResponse() sets backoffDeadlineMs = now + RETRY_BACKOFF_MS.

Because the send is asynchronous, the unconditional reset in the request
path can execute after the failure handler has already set the backoff.
This overwrites the valid backoff with NO_RETRY. Consequently, a
subsequent generateProducerId() call can re-send immediately, leading
to unnecessary controller traffic and flaky test behavior.

Fix

To avoid this race entirely, backoffDeadlineMs is now only updated in
the response handler path:

  • Remove the request-path reset of backoffDeadlineMs from
    maybeRequestNextBlock().
  • On a successful response, reset backoffDeadlineMs to NO_RETRY.
  • On timeout, keep the existing semantics by setting backoffDeadlineMs
    to NO_RETRY (no retry backoff is applied on timeout in this code
    path).

This keeps backoff state changes localized to the response-handling
thread and prevents request-path updates from clobbering a concurrent
backoff update.
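
To make the new flow concrete, here is a minimal, self-contained sketch of the state
handling after this change. The field and method names mirror the ones discussed above,
but the class itself is a simplified illustration, not the actual RPCProducerIdManager
source:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Simplified illustration only; not the actual RPCProducerIdManager.
class ProducerIdBackoffSketch {
    static final long NO_RETRY = -1L;
    static final long RETRY_BACKOFF_MS = 50L;

    final AtomicLong backoffDeadlineMs = new AtomicLong(NO_RETRY);
    final AtomicBoolean requestInFlight = new AtomicBoolean(false);

    void maybeRequestNextBlock(long nowMs) {
        long retryTimestamp = backoffDeadlineMs.get();
        if (retryTimestamp == NO_RETRY || nowMs >= retryTimestamp) {
            if (requestInFlight.compareAndSet(false, true)) {
                sendRequest();
                // Before this change, backoffDeadlineMs was reset to NO_RETRY here,
                // racing with handleUnsuccessfulResponse(). That reset is now removed.
            }
        }
    }

    void handleSuccessfulResponse() {
        backoffDeadlineMs.set(NO_RETRY);   // backoff is cleared only on the response path
        requestInFlight.set(false);
    }

    void handleUnsuccessfulResponse(long nowMs) {
        backoffDeadlineMs.set(nowMs + RETRY_BACKOFF_MS);
        requestInFlight.set(false);
    }

    void sendRequest() {
        // asynchronous controller request elided
    }
}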

Flaky tests fixed by this change:

https://develocity.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.coordinator.transaction.ProducerIdManagerTest&tests.sortField=FLAKY

  • ProducerIdManagerTest#testRetryBackoffOnNoResponse
  • ProducerIdManagerTest#testRetryBackoffOnAuthException
  • ProducerIdManagerTest#testRetryBackoffOnVersionMismatch

Sequence diagram of the flaky test cases that trigger the race condition:

[sequence diagram image]

Reviewers: Justine Olshan [email protected], Sean Quah
[email protected], Chia-Ping Tsai [email protected]

@github-actions bot added the triage (PRs from the community), transactions (Transactions and EOS), and small (Small PRs) labels on Jan 10, 2026
@chickenchickenlove (Contributor Author) commented:

@chia7712 @jolshan Hi!
Sorry for the sudden mention.

While investigating a flaky test, I identified a race condition in the transaction code path (around RPCProducerIdManager) and opened a PR with a fix.

Since you’re closest to the current context in this area, I’d really appreciate it if you could take a look and share any feedback when you have bandwidth 🙇‍♂️

@github-actions (bot) commented:

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@chickenchickenlove (Contributor Author) commented:

@jolshan @chia7712
Sorry to bother you, just a gentle ping!
When you have bandwidth, please take a look 🙇‍♂️


@squah-confluent (Contributor) left a review:


Thanks for fixing the bug!

I also found a second race which can cause premature retries, where maybeRequestNextBlock reads a stale backoffDeadlineMs and then the in-flight request fails.

  1. maybeRequestNextBlock: var retryTimestamp = backoffDeadlineMs.get();
  2. maybeRequestNextBlock: if (retryTimestamp == NO_RETRY || time.milliseconds() >= retryTimestamp) {
  3. handleUnsuccessfulResponse: backoffDeadlineMs.set(time.milliseconds() + RETRY_BACKOFF_MS);
  4. handleUnsuccessfulResponse: requestInFlight.set(false);
  5. maybeRequestNextBlock: requestInFlight.compareAndSet(false, true)

Maybe we can fix this second race in a separate PR.

Referenced code:

    sendRequest();
    // Reset backoff after a successful send.
    backoffDeadlineMs.set(NO_RETRY);
    backoffDeadlineMs.compareAndSet(retryTimestamp, NO_RETRY);

A Contributor commented:

Thank you for fixing the bug!

Could we consider only updating backoffDeadlineMs together with the clearing of requestInFlight? That way we don't have to think about the race when setting backoffDeadlineMs at all, since it would be only set at the end of the in-flight request.

@chickenchickenlove (Contributor Author) commented:

Thanks for your comments! Good idea 👍
I made a commit based on your comment.

To preserve the existing semantics, I added code to set backoffDeadlineMs to NO_RETRY on the timeout path.

A more conservative approach could be to call handleUnsuccessfulResponse() on TIMEOUT as well, so that we apply the same retry backoff. However, since the previous code path did not update backoffDeadlineMs on onTimeout(), I kept that behavior here to minimize any behavioral change in this PR.
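
For reference, a minimal sketch of the timeout path under this choice (simplified; the
requestInFlight reset here is an assumption about the surrounding handler, not a quote
of the Kafka code):

private void onTimeout() {
    // Keep the pre-existing semantics: no retry backoff is applied on timeout.
    backoffDeadlineMs.set(NO_RETRY);
    // Assumed: the in-flight flag is cleared so a new request can be issued.
    requestInFlight.set(false);
}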

When you have bandwidth, please take another look. 🙇‍♂️

A Member commented:

oh, I just left a similar comment (#21279 (comment))

@github-actions bot removed the needs-attention and triage (PRs from the community) labels on Feb 2, 2026
@chickenchickenlove (Contributor Author) commented:

I also found a second race which can cause premature retries, where maybeRequestNextBlock reads a stale backoffDeadlineMs and then the in-flight request fails.

@squah-confluent
Thanks a lot for the careful review and for pointing this out. 🙇‍♂️
You’re right — there’s still another race here that I missed.

If you’re okay with it, I can file an issue and follow up with a separate PR for this. If you were already planning to address it yourself, please let me know and I’ll hold off!

Also, regarding the fix, I was thinking that reordering the operations as follows might address the issue, but we can discuss this further in the next PR.

private void maybeRequestNextBlock() {
    // A block is already available; nothing to request.
    if (nextProducerIdBlock.get() != null)
        return;

    // Claim the in-flight slot first. If the previous request already failed,
    // handleUnsuccessfulResponse() has set the backoff before clearing
    // requestInFlight, so the deadline read below cannot be stale.
    if (!requestInFlight.compareAndSet(false, true))
        return;

    final long retryTimestamp = backoffDeadlineMs.get();
    final long now = time.milliseconds();

    // Still inside the backoff window: release the slot and try again later.
    if (retryTimestamp != NO_RETRY && now < retryTimestamp) {
        requestInFlight.set(false);
        return;
    }

    sendRequest();
}

@squah-confluent (Contributor) left a review:


I'm happy with the fix, thank you!

@squah-confluent (Contributor) commented:

I also found a second race which can cause premature retries, where maybeRequestNextBlock reads a stale backoffDeadlineMs and then the in-flight request fails.

If you’re okay with it, I can file an issue and follow up with a separate PR for this. If you were already planning to address it yourself, please let me know and I’ll hold off!

Also, regarding the fix, I was thinking that reordering the operations as follows might address the issue, but we can discuss this further in the next PR.

Thanks for the fix. Please go ahead and file the issue and PR!

I can think of two ways to fix it:

  1. Pack backoffDeadlineMs and requestInFlight into the same atomic, by creating a record class to hold them (a rough sketch of this option follows below).
  2. Set requestInFlight and then unset it if backoffDeadlineMs is not satisfied, which is your fix. I think this option could be nicer.
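
For illustration, a rough sketch of option 1, assuming both fields are packed into an
immutable record behind a single AtomicReference (the names below are hypothetical and
not taken from the Kafka source):

import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of option 1; not the actual Kafka code.
class RequestStateSketch {
    static final long NO_RETRY = -1L;
    static final long RETRY_BACKOFF_MS = 50L;

    // Both pieces of state live in one immutable record, so they are read and swapped together.
    record RequestState(boolean requestInFlight, long backoffDeadlineMs) {}

    final AtomicReference<RequestState> state =
            new AtomicReference<>(new RequestState(false, NO_RETRY));

    // Returns true if the caller won the right to send a request now.
    boolean tryMarkRequestInFlight(long nowMs) {
        RequestState current = state.get();
        boolean backoffSatisfied =
                current.backoffDeadlineMs() == NO_RETRY || nowMs >= current.backoffDeadlineMs();
        if (current.requestInFlight() || !backoffSatisfied) {
            return false;
        }
        // The CAS covers both fields at once, so a concurrent failure handler cannot
        // interleave between the backoff check and the in-flight claim.
        return state.compareAndSet(current, new RequestState(true, current.backoffDeadlineMs()));
    }

    void completeWithFailure(long nowMs) {
        state.set(new RequestState(false, nowMs + RETRY_BACKOFF_MS));
    }

    void completeWithSuccess() {
        state.set(new RequestState(false, NO_RETRY));
    }
}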

@jolshan (Member) commented Feb 2, 2026

Hey folks -- sorry my email notifications don't work very well. I can take a look.

@squah-confluent (Contributor) commented Feb 2, 2026

@chickenchickenlove Could you update the PR description?

@chia7712 (Member) commented Feb 2, 2026

@chickenchickenlove thanks for this fix. It makes sense to me. I have triggered the CI, and I will take a closer look shortly

@jolshan (Member) commented Feb 2, 2026

Ah ok -- will wait for Chia-Ping before merging.

@chickenchickenlove (Contributor Author) commented Feb 3, 2026

@squah-confluent
Thank you for taking a look.
I've opened KAFKA-20114 for the other race condition.
Let's discuss it there before starting on a fix!

@jolshan , @chia7712
Thank you for the review!
I noticed the PR Linter failed.
I checked the logs and noticed the error saying 'no Reviewers found in PR body'.
Should I update the PR message to include the 'Reviewers:' field myself?

@chia7712 (Member) commented Feb 3, 2026

I noticed the PR Linter failed.

The linter failure was caused by the missing "Reviewers" field in the description. Fixed.

Referenced code:

    if (nextProducerIdBlock.get() == null &&
            requestInFlight.compareAndSet(false, true)) {
        sendRequest();
        // Reset backoff after a successful send.

A Member commented:

I was thinking about using compareAndSet to resolve the race condition, but your approach is much cleaner.

@chia7712 merged commit a739b05 into apache:trunk on Feb 3, 2026. 50 of 53 checks passed.

Labels: ci-approved, small (Small PRs), transactions (Transactions and EOS)
