
Conversation

@jtschuster
Member

This allows us to collect a dump when we encounter a failure that we don't expect and don't already handle with our own exception type.

Member

@sbomer sbomer left a comment

I think we want the call to FailFast to happen inside of an exception filter (a when clause on the catch), so that it terminates the process before the exception is caught. My understanding is that this will trigger the creation of a crash dump which includes the stack frames up to the throwing operation, whereas failing from a catch would only give us a dump up to the catch.
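Roughly this shape, as a sketch only (the method names here are made up for illustration, not the actual linker code):

try {
    RunSteps ();
} catch (Exception e) when (FailFastOnUnexpectedError (e)) {
    // Unreachable: the filter tears the process down before the catch runs.
}

static bool FailFastOnUnexpectedError (Exception e)
{
    // The filter executes before the stack unwinds, so the crash dump
    // should still contain the frames of the original throw.
    Environment.FailFast ("Unexpected error", e);
    return true; // never reached
}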

} catch (Exception e) {
	// Unhandled exceptions are usually linker bugs. Ask the user to report it.
	Context.LogError (null, DiagnosticId.LinkerUnexpectedError);
	Environment.FailFast ("Unexpected Error", e);
Contributor

Are we sure we want to take down the whole process?

Member

The goal of this change is to make it possible to diagnose hard-to-repro failures (like the ones we currently see in runtime CI). What this is trying to do is crash the process in such a way that a dump will be captured... and we can then investigate the dump.

We briefly discussed the possibility of using a non-fatal Watson dump on Windows, but for now decided that the hard crash would be better (it works everywhere, not just Windows).

Contributor

Right, just ensure that all the products that integrate the linker via MSBuild (VS, VS4M, VSCode, etc.) are prepared for that.

Member

All of those should do it through MSBuild - and in that case the linker runs as an external process. If it crashes, MSBuild will produce an error - it doesn't look exactly pretty, but that's basically the tradeoff here - between pretty failures and diagnosable ones.

Member

Why do we need to catch these exceptions at all?

If we don't have a catch anywhere, the exception will just go into the runtime's default unhandled-exception path, which also triggers Watson.

For example here we could do the equivalent of:

try
{
    ...
}
catch when (PrintMessageToContactLinkerDevsAndReturnFalse()) { /* unreachable */ }

The difference is basically that if you have FailFast, the stack trace that the runtime prints to stderr is the one at the FailFast call. If you don't catch the exception, the stack trace printed to stderr is the one from where the exception was thrown.

Member Author

Sorry for not marking this as still in progress. After discussing with Sven and Andy, I realized we want to fail in the exception filter like Michal suggested, to get the Watson dumps. I've updated the code to do that.

Member

My question was more about "why do we catch the exception in the first place when we don't handle it?". If we catch unexpected things, we need to add extra code to do the FailFast - but we could just be more mindful of what we catch instead.

Like in this specific spot I'm commenting on: we catch only to print something into the logger (and then we rethrew previously, or FailFast with this change). But if we instead print into the logger from a filter clause (a filter that returns false so it doesn't actually trigger the catch), we don't need to worry about rethrow or FailFast - the exceptions will fly past this spot.
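As a sketch of that pattern (illustrative only - the call site is made up, the logging call is the one from the diff earlier in this thread):

try {
    step.Process (Context);
} catch (Exception) when (LogUnexpectedError ()) {
    // Unreachable: the filter always returns false, so nothing is caught here
    // and the exception keeps propagating with its original stack intact.
}

bool LogUnexpectedError ()
{
    // Log-only side effect; returning false means the catch doesn't run.
    Context.LogError (null, DiagnosticId.LinkerUnexpectedError);
    return false;
}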

Member

@sbomer sbomer Jun 8, 2022

I can think of one potential reason to FailFast, which is that it makes it hard to accidentally introduce a change which catches the exception that we wanted to propagate. But in general I would prefer to be more careful about what we catch, so I like your suggestion @MichalStrehovsky.

I think @agocke suggested FailFast as a way to tear down the process as quickly and violently as possible to ensure we get a crash dump with the most context. But it seems like an uncaught exception would include all of the same context. Is there any advantage to using FailFast over just letting the exception propagate @agocke?

}

return Context.ErrorsCount > 0 ? 1 : 0;
Context.FlushCachedWarnings ();
Member

I'm a bit nervous about this. Ideally we would do very little in this method, since we can't really trust the data structures anymore. I don't see why we would need to print out all the warnings we got so far on a fatal crash. I know that there's some chance it might help, but in the worst case, if we get a dump, the warnings will be in the dump.

Same goes for the tracer - the tracer produces large XML files which are rarely used, and only to figure out "why was this kept". Not having that info complete when the linker hard crashes is absolutely fine.

Definitely worth a comment that we intentionally don't do it.

Member Author

We'll probably want to move them out of the finally block too then, right?

Member Author

It looks like, if we can't trust the data structures and it doesn't make sense to flush the warnings, there's nothing meaningful to do in the exception filter, and it might make more sense to just not have a try/catch.

Member Author

Looking closer I realize messages and errors are not cached, so we would be able to log some extra information without flushing.

@jtschuster jtschuster changed the title from "Fail fast when we encounter an unexpected error" to "Don't catch all exceptions; propagate unexpected errors to trigger Watson dump" on Jun 9, 2022
Member

@sbomer sbomer left a comment

Thanks for seeing this through, LGTM!

@jtschuster jtschuster requested review from agocke and vitek-karas June 9, 2022 19:04
Member

@agocke agocke left a comment

LGTM

switch (e) {
case LinkerFatalErrorException lex:
	Context.LogMessage (lex.MessageContainer);
	Console.Error.WriteLine (lex.ToString ());
Member

Why delete this line? Is it just a duplicate for some reason?

Member Author

@jtschuster jtschuster Jun 14, 2022

My understanding is that when the exception causes the program to crash, it will print out all the same information as lex.ToString(), so it isn't necessary to do it in the exception filter too.
