-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Unwind: separate the "set frame reg" and "define new frame" operations to fix metadata bug related to stack checks. #3864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1374eb0 to
a30f092
Compare
|
Hmm, that didn't work on Windows-2022; a generic "stack overrun" crash. This will have to go to my back burner for a bit until I have a workable Windows machine to debug on; in the meantime if anyone else wants to poke at this, please feel free! |
|
I was poking around at this again today. For a trivially recursive function where the prefixed hex number is the offset of that instruction within the The The diff from the unwind info from @@ -11,7 +11,7 @@
reg: 5,
},
SetFPReg {
- instruction_offset: 21,
+ instruction_offset: 4,
},
],
}which is to be expected. So the only other bit of unwind info we could really modify here seems to be @peterhuene you may know more about windows unwinding here, does windows disallow unwinding from within the prologue itself? |
|
Unwinding from within the prologue is permitted on windows. It is why the unwind codes are expected to be sorted in descending order from prologue offset: the unwinder simply walks the codes, skipping over any where the prologue offset is greater than the function offset of the exception address. So if we were originally emitting the frame pointer establishment unwind code with a prologue offset equal to the size of the prologue, then any exception from within the prologue would have skipped undoing that operation. |
|
Hm ok so the unwinding logs I have from this PR look like: where I believe the first wasm pc is which is indeed the The "second wasm frame", however is bogus This looks like the unwinder did not actually unwind the I'm not sure why the |
|
Thanks @alexcrichton for digging into this more! For my own potential debugging later: are you able to test this locally on a desktop Windows install (and which version if so)? I do have a Win10 VM I can dig up, which I suspect corresponds to |
|
Nah I'm just throwing things at CI and reading output from github actions, basically vicariously printf-debugging through github heh |
…s to fix metadata bug related to stack checks. In bytecodealliance#3859 we discovered that the unwind metadata info on Windows is placed slightly incorrectly in the presence of explicit stack checks in function prologues. In particular, if the stack check fails, then we start the process of backtracing from a state where we have actually updated rbp (the frame pointer) but we have not emitted the metadata saying we have done so. The fix is to emit the `SetFPReg` Win32 unwind op exactly at the offset of the `mov rbp, rsp` instruction, not after the stack check. However, actually achieving this output given our current unwind design (which abstracts over SysV and Win32) is slightly tricky, because: - We previously had a single `DefineNewFrame` op that both indicated where the FP was set, and gave the size of the clobbers; - We don't know the size of the clobbers until we generate the clobber-save sequence; and - The clobber-save sequence is generated after any explicit stack check. To resolve the problem, this PR splits the `DefineNewFrame` op into two pieces: `SetFrameReg` and `DefineNewFrame`. On x86-64 and aarch64 this is emitted in the appropriate place. (`s390x` does not use a frame register, so has no need for this op.) This should resolve the issue we were seeing with Windows-2022 and unwinding. (Thanks to @alexcrichton, @iximeow and @peterhuene for help debugging this!)
|
(Pushed rebase onto current I'm at a bit of a loss here: I re-read all of the documentation on fastcall prologues and unwind info, and nothing seems amiss. I will list all of the assumptions that we're making to see if any might be wrong:
Any others? Does all of this sound reasonable? The most dubious is IMHO the last. However (i) if the unwinder claims to work by only reading the array of unwind ops, this shouldn't be a problem; and (ii) we need to make it so, because if there were clobber-saves, they would come after the stack limit check, and these are part of the prologue. I suppose a high-level alternative design branch would be to do the stack limit check outside of the prologue. However this requires a bunch of re-engineering, and divergence from the current shared ABI code:
So that doesn't seem terribly tenable to me either. It's really unfortunate that I don't have a way of reproducing this locally, and that the unwinder is so opaque. I'm not sure what to do currently and have no remaining strategies for solving this bug other than "ask Microsoft what changed in the Windows-2022 unwinder"; does anyone else have better ideas? |
|
I did a "final" poking around at this last night as I was inspired again. I double-checked everything down to the binary encoding and API alls we're calling into Windows. Everything checked out and made sense except for the fact that this doesn't work. It definitely feels odd that the exact same code works on 2019 and doesn't work on 2022. At this point I think we'd need to enlist a Windows expert to figure this out. |
|
I'm cleaning up old PRs and closing this now: since we no longer rely on unwind metadata for stack walking on Windows (now that #4431 is merged), this is no longer an issue. |
|
Isn't this still necessary for debuggers? |
|
Yes, I suppose it would still be an issue for debuggers on Windows versions with the bug. However we weren't able to find the issue after extensive debugging (see whole thread above!) and this doesn't seem to solve the problem, so for that case I would say I'm closing the PR simply because it doesn't work... At this point to get Windows' native unwinder to fully work in windows-2022 I think we need someone who is a deep expert in SEH and/or is willing to single-instruction-step through Windows internals. That's definitely not me! |
In #3859 we discovered that the unwind metadata info on Windows is
placed slightly incorrectly in the presence of explicit stack checks in
function prologues. In particular, if the stack check fails, then we
start the process of backtracing from a state where we have actually
updated rbp (the frame pointer) but we have not emitted the metadata
saying we have done so.
The fix is to emit the
SetFPRegWin32 unwind op exactly at the offsetof the
mov rbp, rspinstruction, not after the stack check. However,actually achieving this output given our current unwind design (which
abstracts over SysV and Win32) is slightly tricky, because:
DefineNewFrameop that both indicatedwhere the FP was set, and gave the size of the clobbers;
clobber-save sequence; and
To resolve the problem, this PR splits the
DefineNewFrameop into twopieces:
SetFrameRegandDefineNewFrame. On x86-64 and aarch64 thisis emitted in the appropriate place. (
s390xdoes not use a frameregister, so has no need for this op.) This should resolve the issue we
were seeing with Windows-2022 and unwinding.
(Thanks to @alexcrichton, @iximeow and @peterhuene for help debugging
this!)
I've switched the main Windows CI jobs back to Windows-2022 (naming this explicitly rather than
windows-latestto pin our CI on a known version -- happy to switch that to-latestthough if we think that's better). I've also added a job that explicitly tests Windows-2019 to ensure we don't have any other regressions. If the overhead for this is too high then maybe we can remove it after seeing a clean pass at least on this fix.Fixes #3859.