Fix GC stress C on macOS arm64 #79426
Merged
Tests relying on the `SIGCHLD` signal are intermittently hanging when running with GC stress C enabled. Investigation has shown that, although according to dtrace the `SIGCHLD` was delivered to a thread, the signal handler for it was never invoked. As a result, tests waiting for child completion waited forever and hung.

Further investigation uncovered the real issue. When we return from the coreclr hardware exception handler to PAL after processing an invalid instruction used by the GC stress C machinery, we use `MachSetThreadContext` to update the context of the thread and to resume its execution. `MachSetThreadContext` uses the pattern of suspend thread / set thread state / resume thread. This pattern is problematic when async signals can be delivered to the thread, because the kernel can update the thread state to point to the signal handler even while the thread is suspended. So in our case, the kernel had set the state to execute the signal handler, but when the race hit, we overwrote that state with our context, effectively going back to the managed code and preventing the signal handler from executing.

We are using `MachSetThreadContext` instead of `RtlRestoreContext` at that place because we need all registers to be restored, and it is not possible to restore all the registers and jump to the target in user mode code. We are always left with at least one register containing a different value (the target address).

This change uses a trick to achieve full thread state restoration. It invokes a new helper, `RestoreCompleteContext`, which contains just a single invalid instruction. That instruction triggers the hardware exception handling; we detect that the fault address is the `RestoreCompleteContext` address and set the context of the faulting thread to the desired context. In this case, signal handling cannot interfere with the process, so there is no race.

I originally modified `RtlRestoreContext` instead of creating the new `RestoreCompleteContext`, but it turned out to be too expensive for the other usages of `RtlRestoreContext`, and those usages don't require full-fidelity restoration.

Close #69092
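
To make the ordering problem concrete, here is a minimal, deterministic toy model of the two flows described above. It is only a sketch: the `ToyThread` type, the `toy_*` functions and the `TOY_PC_*` constants are hypothetical names invented for illustration and are not CoreCLR, PAL, or Mach APIs. The model also simply assumes, as the description states, that no signal can land between the fault and the context update in the new flow.

```c
#include <stdbool.h>

/* Toy model only: these types and functions are hypothetical and are not
   CoreCLR, PAL, or Mach APIs. */
enum {
    TOY_PC_MANAGED = 0x1000,        /* context we want to resume at */
    TOY_PC_SIGNAL_HANDLER = 0x2000, /* entry point of the signal handler */
    TOY_PC_INVALID_INSTR = 0x3000   /* the faulting invalid instruction */
};

typedef struct {
    unsigned long pc;               /* where the thread continues when it runs */
    unsigned long pc_after_handler; /* saved by the "kernel" on signal delivery */
    bool handler_ran;               /* did the signal handler ever execute? */
} ToyThread;

/* Kernel side: redirect the thread to the handler and remember where to
   return. Per the description, this can happen while the thread is suspended. */
static void toy_deliver_signal(ToyThread *t) {
    t->pc_after_handler = t->pc;
    t->pc = TOY_PC_SIGNAL_HANDLER;
}

/* Let the thread run: if it was redirected, the handler executes and returns. */
static void toy_run(ToyThread *t) {
    if (t->pc == TOY_PC_SIGNAL_HANDLER) {
        t->handler_ran = true;
        t->pc = t->pc_after_handler;
    }
}

/* Old flow: suspend / set state / resume, with a signal landing in between. */
static ToyThread toy_suspend_set_resume(void) {
    ToyThread t = { TOY_PC_INVALID_INSTR, 0, false };
    /* (thread is suspended at this point) */
    toy_deliver_signal(&t); /* kernel points the state at the handler */
    t.pc = TOY_PC_MANAGED;  /* set state: overwrites the kernel's redirect */
    toy_run(&t);            /* resume: the handler is never entered */
    return t;
}

/* New flow: the context is replaced as the outcome of handling the fault. */
static ToyThread toy_restore_via_fault(void) {
    ToyThread t = { TOY_PC_INVALID_INSTR, 0, false };
    t.pc = TOY_PC_MANAGED;  /* exception handler installs the desired context */
    toy_deliver_signal(&t); /* a signal arriving now sees the new context */
    toy_run(&t);            /* handler runs, then returns to managed code */
    return t;
}
```

In this model, `toy_suspend_set_resume()` returns a thread whose `handler_ran` is false (the signal was lost), while `toy_restore_via_fault()` returns one whose `handler_ran` is true; both end with `pc` equal to `TOY_PC_MANAGED`.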