@corylanou
Collaborator

@corylanou corylanou commented Feb 7, 2026

Summary

  • Fix data race in Compactor.client field detected by -race in FuzzRestoreWithMissingCompactedFile
  • Root cause: ensureCompactorClient() lazily syncs the compactor's client on every operation, creating a race when multiple goroutines call it concurrently from Store.Open() monitors
  • Fix: set the compactor client once in DB.Open() (where Replica is already assigned), eliminating ensureCompactorClient(), SetClient(), Client(), and the mutex entirely
  • Tests that previously overwrote db.Replica after Open() are restructured to set Replica before Open()

Race trace (on main): Goroutine reading Compactor.Client() via compaction level monitor races with goroutine writing Compactor.SetClient() via snapshot level monitor — both enter through DB.ensureCompactorClient().
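
A minimal sketch of the set-once pattern described in the summary, not the actual litestream code. The types here (`DB`, `Compactor`, `ReplicaClient`, `fileClient`) are trimmed-down stand-ins and `runMonitors` is a hypothetical helper. The point it illustrates: a field written before the `go` statement is safely visible to the spawned goroutine, so no mutex is needed as long as nothing writes the field afterwards.

```go
package main

import (
	"fmt"
	"sync"
)

// ReplicaClient is a stand-in for the real interface; only one method is sketched.
type ReplicaClient interface {
	Type() string
}

// fileClient is a hypothetical client used only for this example.
type fileClient struct{}

func (fileClient) Type() string { return "file" }

// Compactor holds a client that is written exactly once, before any reader starts.
type Compactor struct {
	client ReplicaClient
}

// DB is a hypothetical, simplified stand-in for the real type.
type DB struct {
	Client    ReplicaClient
	compactor *Compactor
	wg        sync.WaitGroup
}

// Open assigns the compactor client first and only then spawns the monitors.
// The write happens before each `go` statement, so the reads below are race-free.
func (db *DB) Open(monitors int, seen chan<- string) {
	db.compactor = &Compactor{client: db.Client}
	for i := 0; i < monitors; i++ {
		db.wg.Add(1)
		go func() {
			defer db.wg.Done()
			seen <- db.compactor.client.Type() // read-only access
		}()
	}
}

// Close waits for all monitor goroutines to finish.
func (db *DB) Close() { db.wg.Wait() }

// runMonitors opens a DB, lets n monitors read the client, and returns what they saw.
func runMonitors(n int) []string {
	seen := make(chan string, n) // buffered so monitors never block on send
	db := &DB{Client: fileClient{}}
	db.Open(n, seen)
	db.Close()
	close(seen)
	var out []string
	for s := range seen {
		out = append(out, s)
	}
	return out
}

func main() {
	fmt.Println(runMonitors(2)) // prints "[file file]"
}
```

Running this sketch under `go run -race` should report nothing, because the only write to `client` completes before any monitor goroutine exists.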

Fixes #1085

Test plan

  • go test -race -count=5 -run FuzzRestoreWithMissingCompactedFile -v . — Compactor.client race eliminated
  • go test -race ./... — full test suite passes
  • Pre-commit hooks pass (go-imports, go-vet, staticcheck)

🤖 Generated with Claude Code

@corylanou corylanou force-pushed the issue-1085-fix-fuzzrestorewithmissingcompactedfile-race-condition-detected-with-race branch from 2e28e37 to 9779cf2 Compare February 8, 2026 00:18
@corylanou corylanou requested a review from benbjohnson February 8, 2026 00:18
@corylanou corylanou force-pushed the issue-1085-fix-fuzzrestorewithmissingcompactedfile-race-condition-detected-with-race branch 2 times, most recently from dd37d2f to 3ea579e Compare February 8, 2026 00:44
…nt access

The Compactor.client field (a ReplicaClient interface) was read and
written concurrently by goroutines spawned in Store.Open() without
synchronization. One goroutine calls Compactor.Client() via the
compaction level monitor while another calls Compactor.SetClient() via
the snapshot level monitor, both through DB.ensureCompactorClient().

Add a sync.RWMutex to the Compactor struct. Methods that mutate the
replica (Compact, EnforceSnapshotRetention, EnforceRetentionByTXID,
EnforceL0Retention) hold a write lock for their full duration. Read-only
methods (MaxLTXFileInfo, VerifyLevelConsistency, Client) hold a read
lock. SetClient holds a write lock, blocking until all in-flight
operations complete. Internal helpers (maxLTXFileInfo,
verifyLevelConsistency) assume the caller already holds the lock.
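
For illustration, the locking discipline described above might look roughly like the following. This is a sketch under assumptions: the method bodies are placeholders, `namedClient` and `clientType` are hypothetical names, and only `SetClient`, `Client`, and `Compact` are shown. Because `sync.RWMutex` is not reentrant, the internal helper must not lock again, which is why it relies on the caller holding the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// ReplicaClient is a stand-in for the real interface.
type ReplicaClient interface{ Type() string }

// namedClient is a hypothetical client used only for this example.
type namedClient string

func (c namedClient) Type() string { return string(c) }

// Compactor guards its client field with an RWMutex.
type Compactor struct {
	mu     sync.RWMutex
	client ReplicaClient
}

// SetClient takes the write lock, so it blocks until in-flight operations finish.
func (c *Compactor) SetClient(client ReplicaClient) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.client = client
}

// Client takes a read lock; concurrent readers do not block each other.
func (c *Compactor) Client() ReplicaClient {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.client
}

// Compact mutates the replica, so it holds the write lock for its full duration.
// The body is a placeholder for the real compaction work.
func (c *Compactor) Compact() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return "compacted via " + c.clientType()
}

// clientType is an internal helper: the caller must already hold c.mu.
func (c *Compactor) clientType() string {
	if c.client == nil {
		return "none"
	}
	return c.client.Type()
}

func main() {
	c := &Compactor{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); c.SetClient(namedClient("file")) }()
	go func() { defer wg.Done(); _ = c.Client() }()
	wg.Wait()
	fmt.Println(c.Compact()) // prints "compacted via file"
}
```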

Fixes #1085

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@corylanou corylanou force-pushed the issue-1085-fix-fuzzrestorewithmissingcompactedfile-race-condition-detected-with-race branch from 3ea579e to 2cf4ebd Compare February 8, 2026 01:03
Owner

@benbjohnson benbjohnson left a comment

@corylanou This one seems weird. Do you know why we're updating the Compactor's client field at all? It seems like we should only be setting its client once immediately after the Replica is set. Or we could move compactor to the Replica instead of having it on DB and then it can be initialized at the same time.

@corylanou
Collaborator Author

> @corylanou This one seems weird. Do you know why we're updating the Compactor's client field at all? It seems like we should only be setting its client once immediately after the Replica is set. Or we could move compactor to the Replica instead of having it on DB and then it can be initialized at the same time.

That would be a better fix for sure. I wasn't sure if you needed the ability to set the client once it was created. It's possible that the "SetClient" is only for testing? If so, we should be able to hopefully change the way we test. This came up as a fuzz test failing a race condition. I'll take another look if we don't think we need the SetClient method.

corylanou and others added 4 commits February 8, 2026 09:15
…ctorClient

Replace the RWMutex-based approach with a simpler design per review feedback:
the compactor client is now set once in DB.Open() when Replica is already
assigned, eliminating the need for ensureCompactorClient(), SetClient(),
Client(), and the mutex entirely.

Tests that previously overwrote db.Replica after Open() are restructured
to set Replica before Open(), ensuring the compactor gets the correct
client during initialization.

Fixes #1085

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move Replica nil check to the validation block at the top of Open(),
failing fast before any side effects. Remove the defensive nil guard
around compactor client assignment since Replica is now guaranteed set.
Fix createTestSQLiteDB to use database/sql directly instead of
litestream.DB which requires a Replica.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move compactor.client assignment before monitor goroutine starts to
  eliminate potential race window
- Add Replica.Client nil validation in Open() to fail fast
- Remove redundant db.Replica overwrites in TestDB_Snapshot and
  TestDB_EnforceRetention (MustOpenDBs already sets a file replica)
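
A rough sketch of the ordering these commits describe: validate `Replica` and `Replica.Client` at the top of `Open()`, then assign the compactor client, and only after that start any monitors. The struct fields, the `opened` flag, and the error values `errReplicaRequired` and `errClientRequired` are hypothetical names for illustration, not the real litestream definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaClient is a stand-in for the real interface.
type ReplicaClient interface{ Type() string }

// fileClient is a hypothetical client used only for this example.
type fileClient struct{}

func (fileClient) Type() string { return "file" }

// Replica is a simplified stand-in holding only the client.
type Replica struct{ Client ReplicaClient }

// DB is a simplified stand-in; compactorClient models Compactor.client.
type DB struct {
	Replica         *Replica
	compactorClient ReplicaClient
	opened          bool
}

// Hypothetical error values; the real messages may differ.
var (
	errReplicaRequired = errors.New("replica required")
	errClientRequired  = errors.New("replica client required")
)

// Open validates its inputs before causing any side effects.
func (db *DB) Open() error {
	// Validation block: fail fast while nothing has been changed yet.
	if db.Replica == nil {
		return errReplicaRequired
	}
	if db.Replica.Client == nil {
		return errClientRequired
	}

	// Replica and its client are guaranteed set here, so no nil guard is needed.
	db.compactorClient = db.Replica.Client

	// Monitor goroutines would be started only after this point.
	db.opened = true
	return nil
}

func main() {
	fmt.Println((&DB{}).Open())                                        // replica required
	fmt.Println((&DB{Replica: &Replica{}}).Open())                     // replica client required
	fmt.Println((&DB{Replica: &Replica{Client: fileClient{}}}).Open()) // <nil>
}
```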

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@corylanou
Collaborator Author

@benbjohnson turns out the real fix was to change how we tested it. Removed all of the race conditions, etc. Much cleaner now.

@corylanou corylanou requested a review from benbjohnson February 8, 2026 17:20
Owner

@benbjohnson benbjohnson left a comment

Yeah, that seems way better 👍

@corylanou corylanou merged commit c32f6c8 into main Feb 9, 2026
19 checks passed
@corylanou corylanou deleted the issue-1085-fix-fuzzrestorewithmissingcompactedfile-race-condition-detected-with-race branch February 9, 2026 18:16

Development

Successfully merging this pull request may close these issues.

fix: FuzzRestoreWithMissingCompactedFile race condition detected with -race

2 participants