Skip to content

Fix hook message ordering race in while loops#1696

Draft
blasphemetheus wants to merge 1 commit intoelixir-nx:mainfrom
blasphemetheus:fix/hook-message-ordering
Draft

Fix hook message ordering race in while loops#1696
blasphemetheus wants to merge 1 commit intoelixir-nx:mainfrom
blasphemetheus:fix/hook-message-ordering

Conversation

@blasphemetheus
Copy link
Contributor

@blasphemetheus blasphemetheus commented Mar 15, 2026

Execute hook callback before sending completion signal to outfeed loop, ensuring hook side-effects complete before the next iteration's hook process is spawned. Uses try/after to guarantee the parent is unblocked even if the hook raises.

Fixes #1686

Tradeoff

this serializes hook execution. so the outfeed loop now waits for run.(...) to complete before proceeding. (before it could run concurrently with the next from_outfeed NIF call).
outfeed loop already blocked on receive do ^ref before looping and from_outfeed (mostly waiting for XLA to produce the next value.

if the concurrency is worth keeping, you could instead do a single long-lived hook executor process (ordering guaranteed via BEAM's single-sender message guarantee), but that seems like a bigger change

Execute hook callback before sending completion signal to outfeed loop,
ensuring hook side-effects complete before the next iteration's hook
process is spawned. Uses try/after to guarantee the parent is unblocked
even if the hook raises.

Fixes elixir-nx#1686

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@blasphemetheus blasphemetheus force-pushed the fix/hook-message-ordering branch from c45f707 to 6513259 Compare March 15, 2026 02:09
@blasphemetheus
Copy link
Contributor Author

(added a regression test for the race), should reliably fail on main rn

fun.(EXLA.Defn.Buffers.to_nx!(buffers, template))
after
send(parent, ref)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The after here is not guaranteed to be delivered (for example, in case of broken links). If we want to guarantee, then the best way is to replace spawn(fn -> ... end) above by spawn_monitor(fn -> ... end). But the issue with this code is that we will now block future actions (such as infeeds and upcoming runtime callbacks) until the outfeed is done. The correct solution here is rather to have a queue of outfeeds (but then that introduces other complications).

On the other hand, we have a plan to rewrite outfeeds into runtime calls, so perhaps we revisit once that is done. The test is extremely useful though, thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem! I'll make the PR a draft and can update later

@blasphemetheus blasphemetheus marked this pull request as draft March 15, 2026 23:23
@blasphemetheus
Copy link
Contributor Author

right, waiting on #1672 to resolve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hook messages within 'while' arrive out of order on OTP 28.3

2 participants