Skip to content

Fix thread safety bug in split sender notify#1361

Merged
ericniebler merged 1 commit intoNVIDIA:mainfrom
ccotter:fix-split
Jul 6, 2024
Merged

Fix thread safety bug in split sender notify#1361
ericniebler merged 1 commit intoNVIDIA:mainfrom
ccotter:fix-split

Conversation

@ccotter
Copy link
Contributor

@ccotter ccotter commented Jul 5, 2024

Notifying the local_state might cause it to be destructed, whiich means the instrusive list's increment might be accessing uninitialized memory.

Fixes #1354

Test plan: I was able to fairly reliably reproduce the TSAN and MSAN diagnostics by defining local_state_base's destructor to set __next_=nullptr and adding a sleep(25ms) as the first line in __notify_waiters. With the fix, I could run the TSAN and MSAN executables for a few minutes and never saw any errors.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Jul 5, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


STDEXEC_ASSERT(__waiters_copy.front() != __get_tombstone());
for (__local_state_base* __item: __waiters_copy) {
for (auto __itr = __waiters_copy.begin(); __itr != __waiters_copy.end(); ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find this fix all that satisfying (it seems a little too "cute"). Open to suggestions here. I considered changing the intrusive slist to handle this better, since the intrusive nature of the data structure leaves the iterator open to such bugs, but that didn't seem like the right thing to do.

@ccotter
Copy link
Contributor Author

ccotter commented Jul 5, 2024

If we find any more bugs with the "split is thread-safe" test, we'll need to consider renaming the test!

Notifying the `local_state` might cause it to be destructed,
whiich means the instrusive list's increment might be accessing
uninitialized memory.

Fixes NVIDIA#1354

Test plan: I was able to fairly reliably reproduce the TSAN and MSAN
diagnostics by defining `local_state_base`'s destructor to set
`__next_=nullptr` and adding a `sleep(25ms)` as the first line
in `__notify_waiters`. With the fix, I could run the TSAN and MSAN
executables for a few minutes and never saw any errors.
@ericniebler
Copy link
Collaborator

/ok to test

@ccotter
Copy link
Contributor Author

ccotter commented Jul 5, 2024

The CI failure seems unrelated?

633/739 Test #633: compare two schedulers obtained from different system_context objects .........................................................   Passed    0.02 sec
test 634
        Start 634: system scheduler can produce a sender

634: Test command: /home/coder/stdexec/build/test/test.stdexec "system scheduler can produce a sender"
634: Working Directory: /home/coder/stdexec/build/test
634: Test timeout computed to be: 60
634: Filters: system scheduler can produce a sender
634: ===============================================================================
634: All tests passed (3 assertions in 1 test case)
634: 
634: pure virtual method called
634: terminate called without an active exception
634/739 Test #634: system scheduler can produce a sender .........................................................................................Subprocess aborted***Exception:   0.30 sec
Filters: system scheduler can produce a sender
===============================================================================

@ericniebler ericniebler merged commit 7c657de into NVIDIA:main Jul 6, 2024
@ericniebler
Copy link
Collaborator

THANK YOU so much for finding an fixing this bug. 🎉

The CI failure seems unrelated?

yes, unrelated. it only happens with nvc++. i haven't gotten to the bottom of it.

@ccotter ccotter deleted the fix-split branch July 6, 2024 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TSAN failure in split

2 participants