
Conversation

@jtschuster
Member

This allows us to collect a dump when we encounter a failure that we don't expect and don't already handle with our own exception type.

Member

@sbomer sbomer left a comment

I think we want the call to FailFast to happen inside of an exception filter (a when clause on the catch), so that it terminates the process before the exception is caught. My understanding is that this will trigger the creation of a crash dump which includes the stack frames up to the throwing operation, whereas failing from a catch would only give us a dump up to the catch.
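Roughly this shape, as a sketch only (the method names here are made up for illustration, not the actual linker code):

try {
    RunSteps ();
} catch (Exception e) when (FailFastOnUnexpectedError (e)) {
    // Unreachable: the filter tears the process down before the catch runs.
}

static bool FailFastOnUnexpectedError (Exception e)
{
    // The filter executes before the stack unwinds, so the crash dump
    // should still contain the frames of the original throw.
    Environment.FailFast ("Unexpected error", e);
    return true; // never reached
}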

} catch (Exception e) {
	// Unhandled exceptions are usually linker bugs. Ask the user to report it.
	Context.LogError (null, DiagnosticId.LinkerUnexpectedError);
	Environment.FailFast ("Unexpected Error", e);
Contributor

Are we sure we want to take down the whole process?

Member

The goal of this change is to make it possible to diagnose hard-to-repro failures (like the ones we currently see in runtime CI). What this is trying to do is crash the process in such a way that a dump will be captured... and we can then investigate the dump.

We briefly discussed the possibility of using a non-fatal Watson dump on Windows, but for now decided that the hard crash would be better (it works everywhere, not just Windows).

Contributor

Right, just ensure that all the products that integrate the linker via MSBuild (VS, VS4M, VSCode, etc.) are prepared for that.

Member

All of those should do it through MSBuild - and in that case the linker runs as an external process. If it crashes, MSBuild will produce an error - it doesn't look exactly pretty, but that's basically the tradeoff here - between pretty failures and diagnosable ones.

Member

Why do we need to catch these exceptions at all?

If we don't have a catch anywhere, the exception will just go into the runtime's default unhandled-exception path, which also triggers Watson.

For example here we could do the equivalent of:

try
{
    ...
}
catch when (PrintMessageToContactLinkerDevsAndReturnFalse()) { /* unreachable */ }

The difference is basically that if you have FailFast, the stack trace that the runtime prints to stderr is the one at the FailFast call. If you don't catch the exception, the stack trace printed to stderr is the one from where the exception was thrown.

Member Author

Sorry for not marking this as still in progress. After discussing with Sven and Andy, I realized we want to fail in the exception filter like Michal suggested, to get the Watson dumps. I've updated the code to do that.

Member

My question was more about "why do we catch the exception in the first place when we don't handle it?". If we catch unexpected things, we need to add extra code to do the FailFast - but we could just be more mindful of what we catch instead.

Like in this specific spot I'm commenting on: we catch only to print something into the logger (and then we rethrew previously, or FailFast with this change). But if we instead print into the logger from a filter clause (a filter that returns false so it doesn't actually trigger the catch), we don't need to worry about rethrow or FailFast - the exceptions will fly past this spot.
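As a sketch of that pattern (illustrative only - the call site is made up, the logging call is the one from the diff earlier in this thread):

try {
    step.Process (Context);
} catch (Exception) when (LogUnexpectedError ()) {
    // Unreachable: the filter always returns false, so nothing is caught here
    // and the exception keeps propagating with its original stack intact.
}

bool LogUnexpectedError ()
{
    // Log-only side effect; returning false means the catch doesn't run.
    Context.LogError (null, DiagnosticId.LinkerUnexpectedError);
    return false;
}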

Member

@sbomer sbomer Jun 8, 2022

I can think of one potential reason to FailFast, which is that it makes it hard to accidentally introduce a change which catches the exception that we wanted to propagate. But in general I would prefer to be more careful about what we catch, so I like your suggestion @MichalStrehovsky.

I think @agocke suggested FailFast as a way to tear down the process as quickly and violently as possible to ensure we get a crash dump with the most context. But it seems like an uncaught exception would include all of the same context. Is there any advantage to using FailFast over just letting the exception propagate @agocke?

}

return Context.ErrorsCount > 0 ? 1 : 0;
Context.FlushCachedWarnings ();
Member

I'm a bit nervous about this. Ideally we would do very little in this method, since we can't really trust the data structures anymore. I don't see why we would need to print out all the warnings we got so far on a fatal crash. I know that there's some chance it might help, but in the worst case, if we get a dump, the warnings will be in the dump.

Same goes for the tracer - the tracer produces large XML files which are rarely used, and only to figure out "why was this kept". Not having that info complete when the linker hard crashes is absolutely fine.

Definitely worth a comment that we intentionally don't do it.

Member Author

We'll probably want to move them out of the finally block too then, right?

Member Author

It looks like, if we can't trust the data structures and it doesn't make sense to flush the warnings, there's nothing meaningful to do in the exception filter, and it might make more sense to just not have a try/catch.

Member Author

Looking closer I realize messages and errors are not cached, so we would be able to log some extra information without flushing.

@jtschuster jtschuster changed the title from "Fail fast when we encounter an unexpected error" to "Don't catch all exceptions; propagate unexpected errors to trigger Watson dump" on Jun 9, 2022
Member

@sbomer sbomer left a comment

Thanks for seeing this through, LGTM!

@jtschuster jtschuster requested review from agocke and vitek-karas June 9, 2022 19:04
Member

@agocke agocke left a comment

LGTM

switch (e) {
case LinkerFatalErrorException lex:
	Context.LogMessage (lex.MessageContainer);
	Console.Error.WriteLine (lex.ToString ());
Member

Why delete this line? Is it just a duplicate for some reason?

Member Author

@jtschuster jtschuster Jun 14, 2022

My understanding is that when the exception causes the program to crash, it will print out all the same information as lex.ToString(), so it isn't necessary to do it in the exception filter too.
