
Conversation

@chickenchickenlove (Contributor) commented Jan 10, 2026

Description

This PR fixes a race condition in
RPCProducerIdManager.maybeRequestNextBlock() that can clobber a
newly-set retry backoff and cause premature retries.

The Problem

maybeRequestNextBlock() sends the controller request asynchronously and then
unconditionally resets backoffDeadlineMs to NO_RETRY. On the response path,
handleUnsuccessfulResponse() sets backoffDeadlineMs = now + RETRY_BACKOFF_MS.

Because the send is asynchronous, the unconditional reset in the request
path can execute after the failure handler has already set the backoff.
This overwrites the valid backoff with NO_RETRY. Consequently, a
subsequent generateProducerId() call can re-send immediately, leading
to unnecessary controller traffic and flaky test behavior.

Fix

To avoid this race entirely, backoffDeadlineMs is now only updated in
the response handler path:

  • Remove the request-path reset of backoffDeadlineMs from
    maybeRequestNextBlock().
  • On a successful response, reset backoffDeadlineMs to NO_RETRY.
  • On timeout, keep the existing semantics by setting backoffDeadlineMs
    to NO_RETRY (no retry backoff is applied on timeout in this code
    path).

This keeps backoff state changes localized to the response-handling
thread and prevents request-path updates from clobbering a concurrent
backoff update.
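
To make the new flow concrete, here is a minimal, self-contained sketch of the state
handling after this change. The field and method names mirror the ones discussed above,
but the class itself is a simplified illustration, not the actual RPCProducerIdManager
source:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Simplified illustration only; not the actual RPCProducerIdManager.
class ProducerIdBackoffSketch {
    static final long NO_RETRY = -1L;
    static final long RETRY_BACKOFF_MS = 50L;

    final AtomicLong backoffDeadlineMs = new AtomicLong(NO_RETRY);
    final AtomicBoolean requestInFlight = new AtomicBoolean(false);

    void maybeRequestNextBlock(long nowMs) {
        long retryTimestamp = backoffDeadlineMs.get();
        if (retryTimestamp == NO_RETRY || nowMs >= retryTimestamp) {
            if (requestInFlight.compareAndSet(false, true)) {
                sendRequest();
                // Before this change, backoffDeadlineMs was reset to NO_RETRY here,
                // racing with handleUnsuccessfulResponse(). That reset is now removed.
            }
        }
    }

    void handleSuccessfulResponse() {
        backoffDeadlineMs.set(NO_RETRY);   // backoff is cleared only on the response path
        requestInFlight.set(false);
    }

    void handleUnsuccessfulResponse(long nowMs) {
        backoffDeadlineMs.set(nowMs + RETRY_BACKOFF_MS);
        requestInFlight.set(false);
    }

    void sendRequest() {
        // asynchronous controller request elided
    }
}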

Flaky tests fixed by this change:

https://develocity.apache.org/scans/tests?search.rootProjectNames=kafka&search.timeZoneId=Asia%2FTaipei&tests.container=org.apache.kafka.coordinator.transaction.ProducerIdManagerTest&tests.sortField=FLAKY

  • ProducerIdManagerTest#testRetryBackoffOnNoResponse
  • ProducerIdManagerTest#testRetryBackoffOnAuthException
  • ProducerIdManagerTest#testRetryBackoffOnVersionMismatch

Sequence diagram of the flaky test cases that trigger the race condition:

[sequence diagram image]

Reviewers: Justine Olshan [email protected], Sean Quah
[email protected], Chia-Ping Tsai [email protected]

@github-actions bot added the triage (PRs from the community), transactions (Transactions and EOS), and small (Small PRs) labels on Jan 10, 2026
@chickenchickenlove (Contributor Author) commented:

@chia7712 @jolshan Hi!
Sorry for the sudden mention.

While investigating a flaky test, I identified a race condition in the transaction code path (around RPCProducerIdManager) and opened a PR with a fix.

Since you’re closest to the current context in this area, I’d really appreciate it if you could take a look and share any feedback when you have bandwidth 🙇‍♂️

@github-actions (bot) commented:

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@chickenchickenlove (Contributor Author) commented:

@jolshan @chia7712
Sorry to bother you, just a gentle ping!
When you have bandwidth, please take a look 🙇‍♂️


@squah-confluent (Contributor) left a review:


Thanks for fixing the bug!

I also found a second race which can cause premature retries, where maybeRequestNextBlock reads a stale backoffDeadlineMs and then the in-flight request fails.

  1. maybeRequestNextBlock: var retryTimestamp = backoffDeadlineMs.get();
  2. maybeRequestNextBlock: if (retryTimestamp == NO_RETRY || time.milliseconds() >= retryTimestamp) {
  3. handleUnsuccessfulResponse: backoffDeadlineMs.set(time.milliseconds() + RETRY_BACKOFF_MS);
  4. handleUnsuccessfulResponse: requestInFlight.set(false);
  5. maybeRequestNextBlock: requestInFlight.compareAndSet(false, true)

Maybe we can fix this second race in a separate PR.

Referenced code:

    sendRequest();
    // Reset backoff after a successful send.
    backoffDeadlineMs.set(NO_RETRY);
    backoffDeadlineMs.compareAndSet(retryTimestamp, NO_RETRY);

A Contributor commented:

Thank you for fixing the bug!

Could we consider only updating backoffDeadlineMs together with the clearing of requestInFlight? That way we don't have to think about the race when setting backoffDeadlineMs at all, since it would be only set at the end of the in-flight request.

@chickenchickenlove (Contributor Author) commented:

Thanks for your comments! Good idea 👍
I made a commit based on your comment.

To preserve the existing semantics, I added code to set backoffDeadlineMs to NO_RETRY on the timeout path.

A more conservative approach could be to call handleUnsuccessfulResponse() on TIMEOUT as well, so that we apply the same retry backoff. However, since the previous code path did not update backoffDeadlineMs on onTimeout(), I kept that behavior here to minimize any behavioral change in this PR.
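
For reference, a minimal sketch of the timeout path under this choice (simplified; the
requestInFlight reset here is an assumption about the surrounding handler, not a quote
of the Kafka code):

private void onTimeout() {
    // Keep the pre-existing semantics: no retry backoff is applied on timeout.
    backoffDeadlineMs.set(NO_RETRY);
    // Assumed: the in-flight flag is cleared so a new request can be issued.
    requestInFlight.set(false);
}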

When you have bandwidth, please take another look. 🙇‍♂️

A Member commented:

oh, I just left a similar comment (#21279 (comment))

@github-actions bot removed the needs-attention and triage (PRs from the community) labels on Feb 2, 2026
@chickenchickenlove (Contributor Author) commented:

I also found a second race which can cause premature retries, where maybeRequestNextBlock reads a stale backoffDeadlineMs and then the in-flight request fails.

@squah-confluent
Thanks a lot for the careful review and for pointing this out. 🙇‍♂️
You’re right — there’s still another race here that I missed.

If you’re okay with it, I can file an issue and follow up with a separate PR for this. If you were already planning to address it yourself, please let me know and I’ll hold off!

Also, regarding the fix, I was thinking that reordering the operations as follows might address the issue, but we can discuss this further in the next PR.

private void maybeRequestNextBlock() {
    // A block is already available; nothing to request.
    if (nextProducerIdBlock.get() != null)
        return;

    // Claim the in-flight slot first. If the previous request already failed,
    // handleUnsuccessfulResponse() has set the backoff before clearing
    // requestInFlight, so the deadline read below cannot be stale.
    if (!requestInFlight.compareAndSet(false, true))
        return;

    final long retryTimestamp = backoffDeadlineMs.get();
    final long now = time.milliseconds();

    // Still inside the backoff window: release the slot and try again later.
    if (retryTimestamp != NO_RETRY && now < retryTimestamp) {
        requestInFlight.set(false);
        return;
    }

    sendRequest();
}

@squah-confluent (Contributor) left a review:


I'm happy with the fix, thank you!

@squah-confluent (Contributor) commented:

I also found a second race which can cause premature retries, where maybeRequestNextBlock reads a stale backoffDeadlineMs and then the in-flight request fails.

If you’re okay with it, I can file an issue and follow up with a separate PR for this. If you were already planning to address it yourself, please let me know and I’ll hold off!

Also, regarding the fix, I was thinking that reordering the operations as follows might address the issue, but we can discuss this further in the next PR.

Thanks for the fix. Please go ahead and file the issue and PR!

I can think of two ways to fix it:

  1. Pack backoffDeadlineMs and requestInFlight into the same atomic, by creating a record class to hold them (a rough sketch of this option follows below).
  2. Set requestInFlight and then unset it if backoffDeadlineMs is not satisfied, which is your fix. I think this option could be nicer.
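
For illustration, a rough sketch of option 1, assuming both fields are packed into an
immutable record behind a single AtomicReference (the names below are hypothetical and
not taken from the Kafka source):

import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of option 1; not the actual Kafka code.
class RequestStateSketch {
    static final long NO_RETRY = -1L;
    static final long RETRY_BACKOFF_MS = 50L;

    // Both pieces of state live in one immutable record, so they are read and swapped together.
    record RequestState(boolean requestInFlight, long backoffDeadlineMs) {}

    final AtomicReference<RequestState> state =
            new AtomicReference<>(new RequestState(false, NO_RETRY));

    // Returns true if the caller won the right to send a request now.
    boolean tryMarkRequestInFlight(long nowMs) {
        RequestState current = state.get();
        boolean backoffSatisfied =
                current.backoffDeadlineMs() == NO_RETRY || nowMs >= current.backoffDeadlineMs();
        if (current.requestInFlight() || !backoffSatisfied) {
            return false;
        }
        // The CAS covers both fields at once, so a concurrent failure handler cannot
        // interleave between the backoff check and the in-flight claim.
        return state.compareAndSet(current, new RequestState(true, current.backoffDeadlineMs()));
    }

    void completeWithFailure(long nowMs) {
        state.set(new RequestState(false, nowMs + RETRY_BACKOFF_MS));
    }

    void completeWithSuccess() {
        state.set(new RequestState(false, NO_RETRY));
    }
}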

@jolshan (Member) commented Feb 2, 2026

Hey folks -- sorry my email notifications don't work very well. I can take a look.

@squah-confluent (Contributor) commented Feb 2, 2026

@chickenchickenlove Could you update the PR description?

@chia7712 (Member) commented Feb 2, 2026

@chickenchickenlove thanks for this fix. It makes sense to me. I have triggered the CI, and I will take a closer look shortly

@jolshan (Member) commented Feb 2, 2026

Ah ok -- will wait for Chia-Ping before merging.

@chickenchickenlove (Contributor Author) commented Feb 3, 2026

@squah-confluent
Thank you for taking a look.
I've opened KAFKA-20114 for the other race condition.
Let's discuss it there before starting on a fix!

@jolshan , @chia7712
Thank you for the review!
I noticed the PR Linter failed.
I checked the logs and noticed the error saying 'no Reviewers found in PR body'.
Should I update the PR message to include the 'Reviewers:' field myself?

@chia7712 (Member) commented Feb 3, 2026

I noticed the PR Linter failed.

The linter failure was caused by the missing "Reviewers" field in the description. Fixed.

Referenced code:

    if (nextProducerIdBlock.get() == null &&
            requestInFlight.compareAndSet(false, true)) {
        sendRequest();
        // Reset backoff after a successful send.

A Member commented:

I was thinking about using compareAndSet to resolve the race condition, but your approach is much cleaner.

@chia7712 merged commit a739b05 into apache:trunk on Feb 3, 2026. 50 of 53 checks passed.

Labels: ci-approved, small (Small PRs), transactions (Transactions and EOS)
