gh-149101: Implement PEP 788 (#149116)
47 files changed
encukou left a comment
Thanks for adding these!
I'll send notes for Doc/ now; code review coming up.
Co-authored-by: Petr Viktorin <encukou@gmail.com>
🤖 New build scheduled with the buildbot fleet by @ZeroIntensity for commit bc78c10 🤖
Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F149116%2Fmerge
If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
For buildbots: the RHEL8 failures aren't relevant. Refleaks are worrying though.
Never mind; main currently leaks (#149179).
Co-authored-by: Petr Viktorin <encukou@gmail.com>
For some reason, clangd thought it would be a good idea to automatically apply its ugly formatting style on save. It has since been uninstalled.
But this time, I didn't accidentally reformat all of pylifecycle.c!
```c
// it is not necessarily guaranteed that countdown is zero). We use it
// to simply prevent the finalization from constantly spinning and
// atomically reading the countdown.
_PyEvent_Reset(&interp->finalization_guards.done);
```
I'd prefer if a threading expert checked that _PyEvent_Reset is a safe primitive.
But before pinging one: why is countdown necessary? Could this do with only the R/W lock, with guards as readers and finalization (which sets the Finalized flag) as a writer?
Resetting an event isn't a super common thing to do, but it is safe as long as there aren't any waiters (for reference, we do have Event.clear in the threading module).
It's likely possible to do this with only the RW lock, but it would require a lot of refactoring. During finalization, our current setup for "pre-finalization" tasks (guards, non-daemon threads, atexit callbacks, etc) is as follows:
1. First, we try to run all of them sequentially. So, `join` is called for each thread, atexit callbacks are executed, and in this case, we wait on the `PyEvent` for guards.
2. Then, we make a stop-the-world pause to do an atomic check for the number of each remaining. We know other threads can't create more things during this period.
3. If any of them is non-zero (meaning that more tasks were created while the world was stopping), then the world is started and the cycle repeats.
4. Otherwise, the world is kept stopped, and then all non-main threads are deleted and thread state attachment is disallowed before the world is started again.

We need the countdown for step 2. If we don't track the number of guards, we can't know whether they're all cleaned up, so we don't know whether to run another cycle. With the current setup, the alternative without the countdown would involve disallowing the creation of new guards prematurely (before step 1), but that would mean threads can't protect against finalization in perfectly valid scenarios. We can't do it after step 4, because the world is stopped, so threads wouldn't be able to release their guards.
Do we need a _PyRWMutex_TryLock that fails if the lock is held?
Well, guess that's for after the beta :)
> Do we need a _PyRWMutex_TryLock that fails if the lock is held?
What for? Interrupting the lock acquisition?
With that, this could be:
- as before
- We stop the world and try acquiring the write lock. We know other threads can't create more things during this period.
- If acquiring failed (meaning that more tasks were created while the world was stopping), then the world is started and the cycle repeats.
- as before
This doesn't seem safe to me. Isn't there a race here with make_pre_finalization_calls and try_acquire_interp_guard?
- T1 checks `finalization_guards.countdown > 0`
- T2 decrements, notifies, and resets
- T1 waits on `finalization_guards.done`
What would be the race there, exactly? It's supposed to wait on finalization_guards.done multiple times.
Oh, I misunderstood your comment; I was looking at the finalization_guards.countdown read during the STW pause. I think I see the race now. We need to grab the lock before waiting. Edit: actually, we can just check finalization_guards.countdown > 0 while spinning.
```c
    PyThreadState_Swap(to_restore);
}
/* ... */
PyThreadState_Swap(to_restore);
```
PyThreadState_Swap doesn't need to happen twice, right?
Also, could you assert the result?
Good catch, we don't need the PyThreadState_Swap in the branch above.
> Also, could you assert the result?
What result am I asserting?
PyThreadState_Swap should return tstate.
```c
if (tstate->ensure.owned_guard != NULL) {
    PyInterpreterGuard_Close(tstate->ensure.owned_guard);
    tstate->ensure.owned_guard = NULL;
}
```
This looks fishy: tstate shouldn't be used after PyThreadState_Clear wipes it, and AFAIK owned_guard isn't locked here.
Technically, it's safe to access owned_guard as long as the thread state is only cleared, not deleted (because the memory is still valid). It's not really safe to release the guard before we've cleared the thread state because the guard needs to be there to protect finalizers.
Also, what do you mean by "locked"?
Going by the docs, PyThreadState_Clear will “Reset all information in a thread state object.”
I realize we can rely on undocumented behavior here, but still, it would be nice to “take over” owned_guard & delete_on_release (stash them in locals and unset the tstate's owned_guard) while everything is still attached.
> what do you mean by "locked"?
I don't see what's preventing another thread's EnsureFromView from relying on the owned_guard just before it's set to NULL.
Ah, okay, I see where you're coming from now.
> I realize we can rely on undocumented behavior here, but still, it would be nice to “take over” owned_guard & delete_on_release (stash them in locals and unset the tstate's owned_guard) while everything is still attached.
This sounds reasonable, though I'm a little uneasy about setting owned_guard to NULL in PyThreadState_Clear. (I can't think of a practical example where it would be an issue, but something feels a little off about it.)
Would you be comfortable with stashing owned_guard in locals before calling PyThreadState_Clear, but not actually clearing the owned_guard field in PyThreadState_Clear (for now)? We can always add it later (presumably after b1).
> I don't see what's preventing another thread's EnsureFromView from relying on the owned_guard just before it's set to NULL.
Another thread shouldn't be touching this owned_guard at all (the field is local to the thread, not to the interpreter). The other thread will have its own guard associated with its own EnsureFromView call.
> Would you be comfortable with stashing owned_guard in locals before calling PyThreadState_Clear, but not actually clearing the owned_guard field in PyThreadState_Clear (for now)? We can always add it later (presumably after b1).
I got over my fear -- done in 25687af.
Ah, OK.
> Would you be comfortable with stashing owned_guard in locals before calling PyThreadState_Clear, but not actually clearing the owned_guard field in PyThreadState_Clear (for now)?
Definitely.
Ideally, close the guard before the PyThreadState_Clear, though :)
Failing CI is because Ubuntu is down.
colesbury left a comment
I've only briefly looked through the PR, but my thoughts so far are:
- This is a big change with plenty of opportunities for subtle bugs and deadlocks
- It doesn't seem wise to me to rush this into 3.15 at the end of the feature development cycle
```c
static const PyThreadState _no_tstate_sentinel = {0};
#define NO_TSTATE_SENTINEL ((PyThreadState *)&_no_tstate_sentinel)
```
PyThreadState objects are large. We shouldn't be creating one here just for a sentinel return value.
```c
if (attached_tstate != NULL && attached_tstate->interp == interp) {
    /* Yay! We already have an attached thread state that matches. */
    ++attached_tstate->ensure.counter;
    return NO_TSTATE_SENTINEL;
}
```
I feel like I'm missing something here. How is this supposed to be used by callers? Sometimes PyThreadState_Ensure will return a completely unusable (but not NULL) value? Like how are callers supposed to know if the PyThreadState* returned from this function is useful?
I didn't see any discussions about NO_TSTATE_SENTINEL in the PEP. I also didn't see it in the discussion forums, but that can be hard to follow/search so I might have missed the post.
> Like how are callers supposed to know if the PyThreadState* returned from this function is useful?
They aren't supposed to, and that's by design. The PyThreadState * returned by Ensure/EnsureFromView is a handle to the previous thread state, which isn't relevant to the user. If they want to see the thread state that was just attached, they have to use PyThreadState_Get().
> I also didn't see it in the discussion forums, but that can be hard to follow/search so I might have missed the post.
I think it was brought up at some point in the C API WG discussion. I do agree that it's a little hacky, but it doesn't seem like a huge deal to me either. Switching to int PyThreadState_Ensure(PyInterpreterGuard *guard, PyThreadState **out_ptr) would complicate API usage, so it seemed like a fair tradeoff.
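For illustration, intended usage under this design might look like the following. This is a hedged sketch based only on the PEP text and this thread, not compilable against current CPython; `guard` is assumed to be a valid `PyInterpreterGuard *` obtained elsewhere.

```c
PyThreadState *prev = PyThreadState_Ensure(guard);
if (prev == NULL) {
    /* memory allocation failure */
    return -1;
}
/* The attached state is reached via PyThreadState_Get(), never via
 * `prev`, which may be the sentinel rather than a real state. */
PyThreadState *tstate = PyThreadState_Get();
/* ... call into the interpreter ... */
PyThreadState_Release(prev);   /* the only valid use of `prev` */
```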
So if I understand correctly, the API has a public return value that is by design useless?
It's in the PEP:
> This function will return NULL to indicate a memory allocation failure, and otherwise return a pointer to the thread state that was previously attached (which might have been NULL, in which case an non-NULL sentinel value is returned instead to differentiate between failure – this means that this function will sometimes return an invalid PyThreadState pointer).
> This function will return NULL to indicate a memory allocation failure, and otherwise return a pointer to the thread state that was previously attached.
I find that text very confusing. This isn't a pointer to a previously attached thread state!
We have a return value that is:
- Useless to the caller
- An existing type used in multiple other APIs! So callers should reasonably expect that they interoperate!
Doesn't this stand out as a giant API design red flag to you?
Yes, it's not ideal. The result is something you can only pass to PyThreadState_Release -- or compare with another PyThreadState*.
Wait, why can't we just use void * instead of PyThreadState * for Ensure/Release? That makes it clear that it might not be a valid thread state.
Hugo granted an exception to backport this to 3.15 if we don't make the b1 deadline (though I did tell him that if we do go over the deadline, it won't be by that long). Are you suggesting that we defer this to 3.16?
Yes
You should take that up with the Steering Council, as the PEP was approved for 3.15.
I'm taking it up with the PEP author, who can choose to disagree with my opinion.
I'd like to aim for 3.15, at least for the time being. I think the majority of thread-safety bugs here are going to be rare and won't require large changes to the implementation. That said, if it turns out that we do find some fundamental flaws here, I am okay with deferring the PEP to 3.16.
Could you switch to
(And it'd be better to have this in b1 than not, even if the internals and that type maybe aren't perfect yet.)
Hugo has graciously given me permission to backport this if we don't make the May 5th deadline, but let's try to get this done in time!
I will write a full tutorial and migration guide once this is merged; I want to first make sure that this lands before the beta freeze.