-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Factor out and improve the vectorization of RegexInterpreter.FindFirstChar #61490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 1 commit
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
fdbffd2
Factor out and improve the vectorization of RegexInterpreter.FindFirs…
stephentoub 61e3cef
Fix missing condition in RegexCompiler
stephentoub 6679536
Try to fix mono failures and address comment feedback
stephentoub 502b505
Delete more now dead code
stephentoub 51fcda3
Fix dead code emitting after refactoring
stephentoub 6ce524f
Remove some now dead code
stephentoub File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next
Next commit
Factor out and improve the vectorization of RegexInterpreter.FindFirs…
…tChar
This change started with the "simple" goal of factoring out the FindFirstChar logic from RegexInterpreter and consuming it in SymbolicRegexMatcher. The existing engines use FindFirstChar to quickly skip ahead to the next location that might possibly match, at which point they fall back to analyzing the whole pattern at that location. SymbolicRegexMatcher (used by RegexOptions.NonBacktracking) had its own implementation for this, which it used any time it entered a start state. This required non-trivial additional code to maintain, and there's no good reason it should be separate from the other engines.
However, what started out as a simple change grew due to regressions that resulted from differences in the implementations. In particular, SymbolicRegexMatcher already works off of precomputed equivalence tables for casing, which gives it very different characteristics in this regard from the existing engines. For example, SymbolicRegexMatcher's existing "skip ahead to the next possible match start location" logic already evaluated all the characters that could possibly start a match, which included variations of the same character when using IgnoreCase, but the existing RegexInterpreter logic didn't. That discrepancy then results in a significant IgnoreCase regression for NonBacktracking due to losing the ability to use a vectorized search for the next starting location. We already plan to shift the existing engines over to a plan where all of these equivalences are computed at construction time rather than using ToLower at both construction time and match time, so this PR takes some steps in that direction, doing so for most of ASCII. This has added some temporary cruft, which we'll be able to delete once we fully shift the implementations over (which we should do in the near future).
Another difference was SymbolicRegexMatcher was enabling use of IndexOfAny for up to 5 characters, whereas RegexOptions.Compiled was only doing up to 3 characters, and RegexInterpreter wasn't doing for any number. The PR now uses 5 everywhere.
However, the more characters involved, the more overhead there is to IndexOfAny, and for some inputs, the higher the chances are that IndexOfAny will find a match sooner, which means its overhead compounds more. To help with that, we now not only compute the possible characters that might match at the beginning of the pattern, but also characters that might match at a fixed offset from the beginning of the pattern (e.g. in \d{3}-\d{2}-\d{4}, it will find the '-' at offset 3 and be able to vectorize a search for that and then back off by the relevant distance. That then also means we might end up with multiple sets to choose to search for, and this PR borrows an idea from Rust, which is to use some rough frequency analysis to determine which set should be targeted. It's not perfect, and we can update the texts use to seed the analysis (right now I based it primarily on *.cs files in dotnet/runtime and some Project Gutenberg texts), but it's good enough for these purposes for now.
We'd previously switched to using IndexOf for a case-sensitive prefix string, but still were using Boyer-Moore for case-insensitive. Now that we're able to also vectorize a search for case-insensitive values (right now just ASCII letter, but that'll be fixed soon), we can just get rid of Boyer-Moore entirely. This saves all the costs to do with constructing the Boyer-Moore tables and also avoids having to generate the Boyer-Moore implementations in RegexOptions.Compiled and the source generator.
The casing change also defeated some other optimizations already present. For example, in .NET 5 we added an optimization whereby an alternation like `abcef|abcgh` would be transformed into `abc(?:ef|gh)`, and that would apply whether case-sensitive or case-insensitive. But by transforming the expression at construction now for case-insensitive into `[Aa][Bb][Cc][Ee][Ff]|[Aa][Bb][Cc][Gg][Hh]`, that optimization was defeated. I've added a new optimization pass for alternations that will detect common prefixes even if they're sets.
The casing change also revealed some cosmetic issues. As part of the change, when we encounter a "multi" (a multi-character string in the pattern), we convert that single case-insensitive RegexNode to instead be one case-sensitive RegexNode per character, with a set for all the equivalent characters that can match. This then defeats some of the nice formatting we had for multis in the source generator, so as part of this change, the source generator has been augmented to output nicer code for concatenations. And because sets like [Ee] are now way more common (since e.g. a case-insensitive 'e' will be transformed into such a set), we also special-case that in both the source generator and RegexOptions.Compiled, to spit out the equivalent of `(c | 0x20) == 'e'` rather than `(c == 'E'| c == 'e')`.
Along the way, I cleaned up a few things as well, such as passing around a CultureInfo more rather than repeatedly calling CultureInfo.CurrentCulture, using CollectionsMarshal.GetValueRefOrAddDefault on a hot path to do with interning strings in a lookup table, tweaking SymbolicRegexRunnerFactory's Runner to itself be generic to avoid an extra layer of virtual dispatch per operation, and cleaning up code / comments in SymbolicRegexMatcher along the way.
For the most part the purpose of the change wasn't to improve perf, and in fact I was willing to accept some regressions in the name of consolidation. There are a few regressions here, mostly small, and mostly for cases where we're simply paying an overhead for vectorization, e.g. where the current location is fine to match, or where the target character being searched for is very frequent. Overall, though, there are some substantial improvements.- Loading branch information
commit fdbffd26a4dd599059e4c44d1c053a1db65d73de
There are no files selected for viewing
668 changes: 328 additions & 340 deletions
668
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.