@janvorli janvorli commented Dec 9, 2022

Tests that rely on the SIGCHLD signal intermittently hang when run with GC stress C enabled. Investigation showed that although dtrace reported SIGCHLD as delivered to a thread, the signal handler for it was never invoked. Tests waiting for child completion therefore waited forever, which is why they hung.
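
To illustrate why a missed handler invocation turns into a hang, here is a minimal POSIX sketch of a waiter that depends entirely on its SIGCHLD handler. It is not the runtime's or the tests' actual code; `wait_for_child_via_sigchld`, `on_sigchld`, and `g_child_exited` are hypothetical names, and the sketch assumes a single-threaded process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set only by the SIGCHLD handler; the waiter has no other wake-up source. */
static volatile sig_atomic_t g_child_exited = 0;

static void on_sigchld(int signo)
{
    (void)signo;
    g_child_exited = 1;
}

/* Forks a child that exits with exit_code, waits for SIGCHLD, and returns
   the child's exit status, or -1 on error. Hypothetical helper. */
int wait_for_child_via_sigchld(int exit_code)
{
    assert(exit_code >= 0 && exit_code <= 255);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) != 0)
        return -1;

    /* Block SIGCHLD so it cannot arrive between the flag check and sigsuspend. */
    sigset_t block_mask, old_mask;
    sigemptyset(&block_mask);
    sigaddset(&block_mask, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block_mask, &old_mask) != 0)
        return -1;

    g_child_exited = 0;
    pid_t pid = fork();
    if (pid < 0) {
        sigprocmask(SIG_SETMASK, &old_mask, NULL);
        return -1;
    }
    if (pid == 0)
        _exit(exit_code);

    /* If the handler is never invoked, this loop never terminates. */
    sigset_t wait_mask = old_mask;
    sigdelset(&wait_mask, SIGCHLD);
    while (!g_child_exited)
        sigsuspend(&wait_mask);

    sigprocmask(SIG_SETMASK, &old_mask, NULL);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `wait_for_child_via_sigchld(7)` returns the child's exit status once the handler has run. If the signal is recorded as delivered but the handler is skipped, the `sigsuspend` loop has nothing left to wake it.
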
Further investigation uncovered the root cause. When we return from the coreclr hardware exception handler to the PAL after processing an invalid instruction used by the GC stress C machinery, we use MachSetThreadContext to update the context of the thread and resume its execution. MachSetThreadContext follows a suspend thread / set thread state / resume thread pattern, which is problematic when async signals can be delivered to the thread: the kernel can update the thread state to point to the signal handler even while the thread is suspended. In our case, the kernel had set the state to execute the signal handler, but when the race went the wrong way, we overwrote that state with our context, effectively going back to managed code and preventing the signal handler from executing.
We use MachSetThreadContext instead of RtlRestoreContext at that point because we need all registers to be restored, and user mode code cannot restore every register and also jump to the target. At least one register is left containing a different value (the target address).
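
The lost-signal sequence can be walked through with a small deterministic model. This is only a toy simulation of the ordering described above, not Mach or PAL code: `ModelThread`, the `model_*` functions, and the program counter constants are all hypothetical, and the "kernel" is just a function that rewrites the modeled program counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative program counter values; not real addresses. */
#define MODEL_PC_INTERRUPTED    ((uintptr_t)0x1000)
#define MODEL_PC_MANAGED_CODE   ((uintptr_t)0x2000)
#define MODEL_PC_SIGNAL_HANDLER ((uintptr_t)0x3000)

typedef struct {
    uintptr_t pc;       /* where the thread continues when resumed */
    bool suspended;
    bool handler_ran;
} ModelThread;

/* Kernel stand-in: redirects the thread to the signal handler. In the
   scenario described above this can happen while the thread is suspended. */
static void model_kernel_deliver_signal(ModelThread *t)
{
    t->pc = MODEL_PC_SIGNAL_HANDLER;
}

/* Stand-in for the "set thread state" step; only valid while suspended. */
static void model_set_state(ModelThread *t, uintptr_t pc)
{
    assert(t->suspended);
    t->pc = pc;
}

static void model_resume(ModelThread *t)
{
    t->suspended = false;
    if (t->pc == MODEL_PC_SIGNAL_HANDLER)
        t->handler_ran = true;
}

/* Returns true when a signal was delivered but its handler never ran. */
bool model_signal_lost(bool signal_arrives_while_suspended, bool overwrite_state)
{
    ModelThread t = { MODEL_PC_INTERRUPTED, false, false };

    t.suspended = true;                              /* suspend thread */
    if (signal_arrives_while_suspended)
        model_kernel_deliver_signal(&t);             /* kernel rewrites state */
    if (overwrite_state)
        model_set_state(&t, MODEL_PC_MANAGED_CODE);  /* set thread state */
    model_resume(&t);                                /* resume thread */

    return signal_arrives_while_suspended && !t.handler_ran;
}
```

In this model, `model_signal_lost(true, true)` reports a lost signal because the set-state step replaces the handler's program counter, while `model_signal_lost(true, false)` does not.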

This change uses a trick to achieve full thread state restoration. It invokes a new helper, RestoreCompleteContext, which consists of just a single invalid instruction. Executing it triggers the hardware exception handling; we detect that the fault address is the RestoreCompleteContext address and set the context of the faulting thread to the desired context. Signal handling cannot interfere with this process, so there is no race.
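
The fault-address check can be sketched as a small model. This is an assumption-laden illustration, not the code in this change: `ModelContext`, `model_handle_fault`, and `MODEL_RESTORE_HELPER_ADDRESS` are hypothetical, and passing the desired context as a parameter is a simplification, since the description above does not say how the real handler obtains it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the address of the helper's invalid instruction. */
#define MODEL_RESTORE_HELPER_ADDRESS ((uintptr_t)0x4000)

/* Simplified register file; a real context holds many more registers. */
typedef struct {
    uintptr_t pc;
    uintptr_t sp;
    uintptr_t gpr[4];
} ModelContext;

/* Models the exception handler's decision. If the fault came from the
   restore helper, the faulting thread's whole context is replaced with the
   desired one and true is returned. Any other fault is left untouched so
   that ordinary exception processing can continue. */
bool model_handle_fault(ModelContext *faulting, const ModelContext *desired)
{
    assert(faulting != NULL && desired != NULL);

    if (faulting->pc != MODEL_RESTORE_HELPER_ADDRESS)
        return false;

    /* Every register, including pc, is taken from the desired context, so
       no register is left holding the jump target. */
    *faulting = *desired;
    return true;
}
```

Because the replacement happens while the fault is being handled, the model needs no separate suspend and resume steps.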

I originally modified RtlRestoreContext instead of creating the new RestoreCompleteContext, but that turned out to be too expensive for the other uses of RtlRestoreContext, which do not require full-fidelity restoration.

Close #69092

@janvorli janvorli added this to the 8.0.0 milestone Dec 9, 2022
@janvorli janvorli requested a review from jkotas December 9, 2022 01:39
@janvorli janvorli self-assigned this Dec 9, 2022
@ghost ghost added the area-PAL-coreclr label Dec 9, 2022
@runfoapp runfoapp bot mentioned this pull request Dec 9, 2022
@janvorli janvorli merged commit f8cc773 into dotnet:main Dec 12, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jan 11, 2023
@janvorli janvorli deleted the fix-macos-arm64-gcstress-c branch January 26, 2023 13:49
Development

Successfully merging this pull request may close these issues.

Test failure JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
