Add possessive quantifiers to avoid catastrophic backtracking (#258)
Fixes the crash in #245 by
prohibiting the regex engine from backtracking catastrophically via
[possessive
quantifiers](https://www.regular-expressions.info/possessive.html).
<img width="400" alt="image"
src="https://github.com/openai/tiktoken/assets/1841944/ed341153-4cf4-4c1c-93d6-3f5e32133569">
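As a rough illustration of what a possessive quantifier changes (not the actual tiktoken pattern), here is a minimal Python sketch. It assumes only the standard `re` module and emulates `a*+` with the lookahead-plus-backreference idiom `(?=(a*))\1`, so it also runs on Python versions that predate native possessive syntax. The patterns and the `fully_matches` helper are hypothetical and exist only for this example.

```python
import re

# Greedy: `a*` first grabs every "a", then backtracks by one character
# so that the trailing literal `a` can still match.
greedy = re.compile(r"a*a")

# Possessive-like: the lookahead captures the longest run of "a" and the
# backreference consumes it as a single unit. Lookaheads are atomic, so the
# engine cannot give characters back. This behaves like `a*+a`.
possessive_like = re.compile(r"(?=(a*))\1a")

# Same idiom where the next token is not part of the run: behaves like `a*+b`.
possessive_then_b = re.compile(r"(?=(a*))\1b")


def fully_matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    print(fully_matches(greedy, "aaa"))              # greedy backtracks and succeeds
    print(fully_matches(possessive_like, "aaa"))     # no backtracking, so no match
    print(fully_matches(possessive_then_b, "aaab"))  # nothing needs to be given back
```

The trade-off is that a possessive quantifier never retries shorter matches, which removes the exponential retry paths but also means the pattern must be written so that giving characters back is never required.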
Interestingly, these possessives make the encoding a lot faster again in
`fancy-regex`.
Before this change (but with large byte pair merge PR cherry-picked):
```
num_threads: 1, num_bytes: 98379553
tiktoken 11,946,036 bytes / s
tiktoken 11,961,343 bytes / s
tiktoken 11,995,846 bytes / s
tiktoken 11,951,263 bytes / s
tiktoken 11,983,405 bytes / s
```
Same, with these changes applied:
```
num_threads: 1, num_bytes: 98379553
tiktoken 14,511,827 bytes / s
tiktoken 14,638,134 bytes / s
tiktoken 14,644,029 bytes / s
tiktoken 14,729,030 bytes / s
tiktoken 14,666,903 bytes / s
```
Updating the regex libs makes it a tiny bit faster still:
```
num_threads: 1, num_bytes: 98379553
tiktoken 14,485,590 bytes / s
tiktoken 14,854,049 bytes / s
tiktoken 14,891,086 bytes / s
tiktoken 14,843,007 bytes / s
tiktoken 14,874,520 bytes / s
```
This is almost 2x faster than [before any of the
optimizations](#234).
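For a sense of the gain from this change alone, the three runs above can be averaged and compared. This is only a sketch over the numbers quoted in this message; the baseline from #234 is not listed here, so the "almost 2x" figure cannot be reproduced from these runs. The variable and function names are made up for the example.

```python
from statistics import mean

# Throughput in bytes / s, copied from the three benchmark runs above.
before = [11_946_036, 11_961_343, 11_995_846, 11_951_263, 11_983_405]
with_possessives = [14_511_827, 14_638_134, 14_644_029, 14_729_030, 14_666_903]
with_updated_regex = [14_485_590, 14_854_049, 14_891_086, 14_843_007, 14_874_520]


def speedup(baseline: list[int], candidate: list[int]) -> float:
    """Ratio of mean throughput: values above 1.0 mean `candidate` is faster."""
    return mean(candidate) / mean(baseline)


if __name__ == "__main__":
    # Roughly 1.22x from the possessive quantifiers alone.
    print(f"possessives:       {speedup(before, with_possessives):.2f}x")
    # Roughly 1.24x once the regex libraries are updated as well.
    print(f"+ updated regex:   {speedup(before, with_updated_regex):.2f}x")
```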
-------
Opened an issue for increasing the [default backtrack
limit](https://github.com/fancy-regex/fancy-regex/blob/bf2c807447f72ee20ae839e0f8cb3a06fc79982c/src/lib.rs#L407),
see: fancy-regex/fancy-regex#134, but it
shouldn't be necessary here anymore.
---------
Co-authored-by: Lőrinc <[email protected]>