Add possessive quantifiers to avoid catastrophic backtracking #258

hauntsaninja merged 5 commits into openai:main
Conversation
Force-pushed from 30c92dc to ccd8702
```python
big_value = "^" * 1000000
assert big_value == enc.decode(enc.encode(big_value))

big_value = " " + big_value
```
A space is often optional at the beginning of a group, so the backtracking can reach the space - let's test that as well.
Force-pushed from 2d616bd to 21c5688
```python
big_value = " " + big_value
assert big_value == enc.decode(enc.encode(big_value))

big_value = big_value + "\n"
```
Some groups require a newline at the end; stress those paths as well.
src/lib.rs (outdated)

```diff
     pattern: &str,
 ) -> PyResult<Self> {
-    let regex = Regex::new(pattern)
+    let regex = RegexBuilder::new(pattern).backtrack_limit(100_000).build()
```
This doesn't work for values bigger than a million - see fancy-regex/fancy-regex#134. I've set it lower for now, hoping we'll be able to fix the whitespace problem.
Force-pushed from 019de85 to 51c8a8a
```diff
 return {
     "name": "cl100k_base",
-    "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
+    "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s""",
```
It seems cl100k also had some backtracking problems; these possessives improve the situation considerably (e.g. in Java they aren't necessary, see knuddelsgmbh/jtokkit#87).
Can we collapse the `\s+(?!\S)` too? It should be equivalent to `\s+$`, no?
It sounds reasonable, but check the tests to understand why they're not the same
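One way to see why they differ is an ASCII simplification of the two whitespace branches (hypothetical patterns, not the real cl100k one): `\s+(?!\S)` can back off so the final space stays attached to the following word, which `\s+$` never does.

```python
import re

# simplified word branch plus each whitespace variant:
lookahead = re.compile(r" ?[A-Za-z]+|\s+(?!\S)|\s+")
anchored = re.compile(r" ?[A-Za-z]+|\s+$|\s+")

# \s+(?!\S) leaves the last space for " ?[A-Za-z]+", \s+$ only fires at the end:
assert lookahead.findall("a   b") == ["a", "  ", " b"]
assert anchored.findall("a   b") == ["a", "   ", "b"]
```

The lookahead version keeps one space glued to the next word, which matters for how the BPE splits tokens.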
```python
@pytest.mark.parametrize("make_enc", ENCODING_FACTORIES)
def test_extremely_big_encoding(make_enc: Callable[[], tiktoken.Encoding]):
    enc = make_enc()
    for c in ["^", "0", "a", "'s", " ", "\n"]:
```
Stressing different parts of the regex, making sure none of them backtrack catastrophically.
```python
# The pattern in the original GPT-2 release is:
# r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
# This is equivalent, but executes faster:
_legacy_splitter_regex = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}++| ?\p{N}++| ?[^\s\p{L}\p{N}]++|\s++$|\s+(?!\S)|\s"""
```
The whitespace branches can't be possessive (the engine needs to step back when it encounters a non-whitespace character), but we can rule out the offending backtracking case by adding a possessive trailing-whitespace check.
```toml
fancy-regex = "0.13.0"
regex = "1.10.3"
```
Not absolutely necessary, but the version bump adds a tiny speed increase.
```diff
     pattern: &str,
 ) -> PyResult<Self> {
-    let regex = Regex::new(pattern)
+    let regex = RegexBuilder::new(pattern).backtrack_limit(10_000).build()
```
After this change we should never backtrack catastrophically - and if we do, this limit will warn us early.
```python
big_value = big_value + "\n"
assert big_value == enc.decode(enc.encode(big_value))
```
`big_value = big_value + "x"` would still fail for whitespace, i.e. `" x"`. That seems less typical than the other cases fixed here; I'm not yet sure how to fix it, though - fancy-regex seems pretty basic in this regard...
Was the backtrack limit reverted intentionally in 05e66e8? (05e66e8#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759L421-R438) Was there a regression?

Yes, it was reverted intentionally. There are OpenAI internal encodings where setting the limit caused issues.

Thanks for checking @tmm1, @hauntsaninja.
Fixes the crash in #245 by prohibiting the regex engine from backtracking catastrophically via possessive quantifiers.

Interestingly, these possessives make the encoding a lot faster again in fancy-regex.

Before this change (but with the large byte pair merge PR cherry-picked):

Same, with these changes applied:

Updating the regex libs makes it a tiny bit faster still:

This is almost 2x faster than before any of the optimizations.
Opened an issue for increasing the default backtrack limit, see: fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore.