Fix duplicate CTE race condition in ephemeral model compilation #12602

Open
colin-rogers-dbt wants to merge 7 commits into main from duplicate-cte-compile
Conversation

@colin-rogers-dbt

Summary

  • Adds a per-node threading.Lock to CompiledResource to eliminate TOCTOU race conditions when multiple threads compile nodes that ref() the same ephemeral model
  • Wraps the two critical sections in _recursively_prepend_ctes (ephemeral compilation and CTE injection) with the ephemeral/consuming node's lock
  • Makes set_cte thread-safe to prevent duplicate CTE appends
  • Per-node lock granularity preserves parallelism — threads only block when competing for the exact same ephemeral node
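
The per-node locking idea in the summary can be sketched roughly as follows. This is a minimal illustration, not dbt's code: `NodeSketch`, `compile_once`, and `run_concurrently` are hypothetical names, and the "compilation" is reduced to a counter so the check-then-act step is easy to see.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class NodeSketch:
    # Hypothetical stand-in for a compiled node; not dbt's CompiledResource.
    name: str
    compiled: bool = False
    compile_count: int = 0
    # One lock per node, so unrelated nodes never block each other.
    _lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )


def compile_once(node: NodeSketch) -> None:
    # The check and the act happen under the same lock, so only one
    # thread can ever observe compiled == False for a given node.
    with node._lock:
        if not node.compiled:
            node.compile_count += 1
            node.compiled = True


def run_concurrently(node: NodeSketch, n_threads: int = 20) -> int:
    # Start several threads that all try to compile the same node.
    threads = [
        threading.Thread(target=compile_once, args=(node,))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return node.compile_count
```

Calling `run_concurrently(NodeSketch("ephemeral_a"))` should report a single compilation regardless of thread count, because threads only contend on that one node's lock.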

Test plan

  • Added tests/unit/test_compilation_threading.py with 4 regression tests:
    • Concurrent set_cte deduplication (20 threads, assert single CTE)
    • Concurrent ephemeral compilation (assert compiled exactly once)
    • Lock excluded from serialization
    • Lock restored after deserialization round-trip
  • All 61 existing tests/unit/contracts/graph/test_nodes.py tests pass
  • All code quality checks pass (hatch run pre-commit run --all-files)

🤖 Generated with Claude Code

Add per-node threading.Lock to CompiledResource to prevent duplicate CTEs
when multiple threads compile nodes that ref() the same ephemeral model.
The existing check-then-act patterns in _recursively_prepend_ctes and
set_cte were classic TOCTOU races causing duplicate CTEs ~1/50 runs.

Per-node lock granularity preserves parallelism — threads only block when
competing for the exact same ephemeral node. No deadlock risk since
ephemeral deps form a DAG and lock acquisition follows dependency order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
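
As a rough illustration of the duplicate-append race this commit describes, the sketch below guards a find-or-append step with a lock. The classes are simplified stand-ins with hypothetical fields, not dbt's actual `InjectedCTE` or `set_cte` implementation, and a later commit in this PR removes the lock from `set_cte` again.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InjectedCTESketch:
    # Simplified stand-in: an identifier plus optional compiled SQL.
    id: str
    sql: Optional[str] = None


@dataclass
class NodeWithCTEs:
    extra_ctes: List[InjectedCTESketch] = field(default_factory=list)
    _lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )

    def set_cte(self, cte_id: str, sql: Optional[str]) -> None:
        # Without the lock, two threads could both scan the list, both
        # find nothing, and both append: the check-then-act (TOCTOU) race.
        with self._lock:
            for cte in self.extra_ctes:
                if cte.id == cte_id:
                    cte.sql = sql
                    return
            self.extra_ctes.append(InjectedCTESketch(id=cte_id, sql=sql))


def hammer_set_cte(node: NodeWithCTEs, cte_id: str, n_threads: int = 20) -> int:
    # Call set_cte from many threads and return how many CTEs ended up stored.
    threads = [
        threading.Thread(target=node.set_cte, args=(cte_id, "select 1"))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(node.extra_ctes)
```

With the lock in place, `hammer_set_cte(NodeWithCTEs(), "model.a")` leaves exactly one entry in `extra_ctes`.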
@colin-rogers-dbt colin-rogers-dbt requested a review from a team as a code owner March 5, 2026 20:26
@cla-bot cla-bot bot added the cla:yes label Mar 5, 2026
@github-actions

github-actions bot commented Mar 5, 2026

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@github-actions

github-actions bot commented Mar 5, 2026

Additional Artifact Review Required

Changes to artifact directory files require at least 2 approvals from core team members.

@codecov

codecov bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.32%. Comparing base (4b12914) to head (21187a7).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12602      +/-   ##
==========================================
- Coverage   91.39%   91.32%   -0.07%     
==========================================
  Files         203      203              
  Lines       25596    25644      +48     
==========================================
+ Hits        23394    23420      +26     
- Misses       2202     2224      +22     
Flag Coverage Δ
integration 88.09% <96.15%> (-0.14%) ⬇️
unit 65.36% <38.46%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Components Coverage Δ
Unit Tests 65.36% <38.46%> (+<0.01%) ⬆️
Integration Tests 88.09% <96.15%> (-0.14%) ⬇️

colin-rogers-dbt and others added 4 commits March 5, 2026 13:32
Add __getstate__/__setstate__ to CompiledResource so the _lock field
is excluded during pickling and recreated on unpickle, fixing
TypeError: cannot pickle '_thread.lock' object.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
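
The pickling fix described in this commit follows a standard Python pattern, sketched below on a hypothetical `LockedResource` class (not dbt's `CompiledResource`): drop the lock in `__getstate__` and create a fresh one in `__setstate__`.

```python
import pickle
import threading


class LockedResource:
    # Hypothetical class used only to illustrate the pattern.
    def __init__(self, name: str) -> None:
        self.name = name
        self._lock = threading.Lock()

    def __getstate__(self) -> dict:
        # Copy the instance dict and leave the unpicklable lock out of it.
        state = self.__dict__.copy()
        state.pop("_lock", None)
        return state

    def __setstate__(self, state: dict) -> None:
        # Restore the saved attributes, then give the copy its own new lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


def round_trip(obj: LockedResource) -> LockedResource:
    # Serialize and deserialize with the standard pickle module.
    return pickle.loads(pickle.dumps(obj))
```

The restored object gets a new, unlocked `threading.Lock` rather than sharing the original's, which is usually what is wanted since lock state is not meaningful across processes.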

@QMalcolm QMalcolm left a comment

Two comments, but logically this all looks good!

extra_ctes_injected: bool = False
extra_ctes: List[InjectedCTE] = field(default_factory=list)
_pre_injected_sql: Optional[str] = None
_lock: threading.Lock = field(

😅 I think there is now a better way to do this. If you instead define _lock on the runtime-instantiated version of CompiledResource, i.e. CompiledNode, you don't have to worry about the serialization aspect of it, because extra attributes will be dropped when converting back to CompiledResource during serialization.

I only know this because we discovered this during the development of microbatch. Here are some attributes we did it for (link)

colin-rogers-dbt and others added 2 commits March 6, 2026 15:22
Address PR feedback: define _lock on the runtime CompiledNode class
instead of the serialized CompiledResource. This avoids all mashumaro
serialization concerns since extra attributes on CompiledNode are
automatically dropped during serialization. Also restores the dropped
comment explaining the check-then-act pattern, and removes the
unnecessary lock from set_cte (only called with sql=None during parsing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
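
The idea of keeping the lock on the runtime class only can be sketched with plain dataclasses. This is an assumption-laden illustration: `ResourceSketch`, `RuntimeNodeSketch`, `to_resource`, and `serialize` are hypothetical names, and the field-copying conversion merely stands in for whatever dbt and mashumaro actually do when converting a node back to its serialized resource form.

```python
import threading
from dataclasses import asdict, dataclass, field, fields


@dataclass
class ResourceSketch:
    # Stand-in for the serialized form: only plain, serializable fields.
    name: str
    compiled: bool = False


@dataclass
class RuntimeNodeSketch(ResourceSketch):
    # Runtime-only attribute; it has no counterpart on ResourceSketch.
    _lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )

    def to_resource(self) -> ResourceSketch:
        # Copy only the fields the serialized form declares, so the lock
        # is dropped without any custom pickling hooks.
        kwargs = {f.name: getattr(self, f.name) for f in fields(ResourceSketch)}
        return ResourceSketch(**kwargs)


def serialize(node: RuntimeNodeSketch) -> dict:
    # Serialization always goes through the resource form.
    return asdict(node.to_resource())
```

Under these assumptions the serialized dictionary never contains `_lock`, so no `__getstate__`/`__setstate__` pair is needed on the resource class.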
Move the recursive _recursively_prepend_ctes call outside the
cte_model._lock scope. The old structure would deadlock with
threading.Lock (non-reentrant) because the recursive call re-acquires
the same node's lock for CTE injection. Now the lock only covers the
compile-or-skip decision, and recursion happens after release.

Add test that validates the lock acquisition pattern doesn't deadlock
by simulating the exact sequence from _recursively_prepend_ctes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
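
The lock-scoping change in this commit can be sketched as follows. The names (`EphemeralSketch`, `prepend_ctes`, `finishes`) are hypothetical and the real `_recursively_prepend_ctes` does far more; the point shown is only that the dependency's lock covers the compile-or-skip decision and is released before the recursive call, which later re-acquires that same non-reentrant lock for injection.

```python
import threading
from dataclasses import dataclass, field
from typing import List


@dataclass
class EphemeralSketch:
    # Hypothetical node with ephemeral dependencies forming a DAG.
    name: str
    deps: List["EphemeralSketch"] = field(default_factory=list)
    compiled: bool = False
    injected: bool = False
    _lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )


def prepend_ctes(node: EphemeralSketch) -> None:
    for dep in node.deps:
        # Critical section 1: compile-or-skip, under the dependency's lock.
        with dep._lock:
            if not dep.compiled:
                dep.compiled = True
        # The lock is released here. Recursing inside the `with` block would
        # deadlock, because the call below takes dep._lock again for injection.
        prepend_ctes(dep)
    # Critical section 2: CTE injection, under the consuming node's own lock.
    with node._lock:
        if not node.injected:
            node.injected = True


def finishes(root: EphemeralSketch, timeout: float = 2.0) -> bool:
    # Run in a daemon thread so a deadlock shows up as a timeout, not a hang.
    worker = threading.Thread(target=prepend_ctes, args=(root,), daemon=True)
    worker.start()
    worker.join(timeout)
    return not worker.is_alive()
```

Using `threading.RLock` would also avoid the self-deadlock, but narrowing the lock scope keeps critical sections short, which is the approach the commit message describes.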