-
Notifications
You must be signed in to change notification settings - Fork 128
Don't catch all exceptions; propagate unexpected errors to trigger Watson dump #2825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sbomer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want the call to FailFast to happen inside of an exception filter (a when clause on the catch), so that it terminates the process before the exception is caught. My understanding is that this will trigger the creation of a crash dump which includes the stack frames up to the throwing operation, whereas failing from a catch would only give us a dump up to the catch.
src/linker/Linker/Driver.cs
Outdated
| } catch (Exception e) { | ||
| // Unhandled exceptions are usually linker bugs. Ask the user to report it. | ||
| Context.LogError (null, DiagnosticId.LinkerUnexpectedError); | ||
| Environment.FailFast ("Unexpected Error", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure we want to take down the whole process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of this change is to make it possible to diagnose hard to repro failures (like the ones we see in runtime CI currently). What this is trying to do is to crash the process in such a way that dump will be captured... and we can then investigate the dump.
We shortly discussed the possibility to use non fatal Watson dump on Windows, but for now decided that the hard crash would be better (it works everywhere, not just Windows).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right just ensure that all the products that integrate linker (VS, VS4M, VSCode, etc.) via msbuild are prepared for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of those should do it through msbuild - and in that case linker should run as an external process. If it crashes msbuild will produce an error - it doesn't look exactly pretty, but that's basically the tradeoff here - between pretty failures and diagnosable ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to catch these exceptions at all?
If we don't have a catch anywhere, it will just go into the runtime-default unhandled exception paths that also trigger Watson.
For example here we could do the equivalent of:
try
{
...
}
catch when (PrintMessageToContactLinkerDevsAndReturnFalse()) { /* unreachable */ }The difference is basically that if you have FailFast, the stack trace that the runtime prints to stderr is the one of the FailFast. If you don't catch the exception, the stack trace printed to stderr is the one for when the exception was thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not marking this as still in progress, after discussing with Sven and Andy I realized we want to fail in the exception filter like Michal suggested to get the Watson dumps. I've updated the code to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question was more about "why do we catch the exception in the first place when we don't handle it?". If we catch unexpected things, we need to add extra code to do the FailFast - but we could just be more mindful of what we catch instead.
Like in this specific spot I'm commenting on: we catch only to print something into the logger (and then we retrhow previously, or FailFast with this change). But if we instead print into the logger from a filter clause (a filter that returns false to not actually trigger a catch) we don't need to worry about rethrow or FailFast - the exceptions will fly past this spot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can think of one potential reason to FailFast, which is that it makes it hard to accidentally introduce a change which catches the exception that we wanted to propagate. But in general I would prefer to be more careful about what we catch, so I like your suggestion @MichalStrehovsky.
I think @agocke suggested FailFast as a way to tear down the process as quickly and violently as possible to ensure we get a crash dump with the most context. But it seems like an uncaught exception would include all of the same context. Is there any advantage to using FailFast over just letting the exception propagate @agocke?
src/linker/Linker/Driver.cs
Outdated
| } | ||
|
|
||
| return Context.ErrorsCount > 0 ? 1 : 0; | ||
| Context.FlushCachedWarnings (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit nervous about this. Ideally we would do very little in this method since we can't really trust data structures anymore. I don't see why we would need to print out all the warnings we got so far on a fatal crash. I know that there's some chance that it might help, but in the worst case if we get a dump the warnings will be in the dump.
Same goes about the tracer - the tracer produces the large XML files which are rarely used and only to figure out "why was this kept". Not having that info complete when the linker hard crashes is absolutely fine.
Definitely worth a comment that we intentionally don't do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll probably want to move them out of the finally block too then, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like if we can't trust the data structures and it doesn't make sense to flush the warnings, there's not anything meaningful to do in the exception filter and it might make more sense to just not have a try/catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking closer I realize messages and errors are not cached, so we would be able to log some extra information without flushing.
sbomer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for seeing this through, LGTM!
Co-authored-by: Sven Boemer <[email protected]>
agocke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| switch (e) { | ||
| case LinkerFatalErrorException lex: | ||
| Context.LogMessage (lex.MessageContainer); | ||
| Console.Error.WriteLine (lex.ToString ()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why delete this line? Is it just a duplicate for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that when the exception causes the program to crash, it will print out all the same information as lex.ToString(), so it isn't necessary to do it in the exception filter too.
…tson dump (dotnet/linker#2825) Commit migrated from dotnet/linker@93de720
This allows us to collect a dump when we encounter a failure that we don't expect and handle with our own exception.