Fix hook message ordering race in while loops#1696
Fix hook message ordering race in while loops#1696blasphemetheus wants to merge 1 commit intoelixir-nx:mainfrom
Conversation
Execute hook callback before sending completion signal to outfeed loop, ensuring hook side-effects complete before the next iteration's hook process is spawned. Uses try/after to guarantee the parent is unblocked even if the hook raises. Fixes elixir-nx#1686 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
c45f707 to
6513259
Compare
|
(added a regression test for the race), should reliably fail on main rn |
| fun.(EXLA.Defn.Buffers.to_nx!(buffers, template)) | ||
| after | ||
| send(parent, ref) | ||
| end |
There was a problem hiding this comment.
The after here is not guaranteed to be delivered (for example, in case of broken links). If we want to guarantee, then the best way is to replace spawn(fn -> ... end) above by spawn_monitor(fn -> ... end). But the issue with this code is that we will now block future actions (such as infeeds and upcoming runtime callbacks) until the outfeed is done. The correct solution here is rather to have a queue of outfeeds (but then that introduces other complications).
On the other hand, we have a plan to rewrite outfeeds into runtime calls, so perhaps we revisit once that is done. The test is extremely useful though, thank you!
There was a problem hiding this comment.
no problem! I'll make the PR a draft and can update later
|
right, waiting on #1672 to resolve |
Execute hook callback before sending completion signal to outfeed loop, ensuring hook side-effects complete before the next iteration's hook process is spawned. Uses try/after to guarantee the parent is unblocked even if the hook raises.
Fixes #1686
Tradeoff
this serializes hook execution. so the outfeed loop now waits for run.(...) to complete before proceeding. (before it could run concurrently with the next from_outfeed NIF call).
outfeed loop already blocked on
receive do ^refbefore looping andfrom_outfeed(mostly waiting for XLA to produce the next value.if the concurrency is worth keeping, you could instead do a single long-lived hook executor process (ordering guaranteed via BEAM's single-sender message guarantee), but that seems like a bigger change