Fix case-insensitive matching with uppercase literals#850
Open
lczyk wants to merge 5 commits intomicrosoft:mainfrom
Open
Fix case-insensitive matching with uppercase literals#850lczyk wants to merge 5 commits intomicrosoft:mainfrom
lczyk wants to merge 5 commits intomicrosoft:mainfrom
Conversation
`inlined_memicmp` requires the `Condition::PrefixInsensitive` needle to be ascii-lowercase: it lowercases the haystack byte and compares it raw against the needle. `emit_literal` currently interns the literal verbatim, so any uppercase byte makes the prefix check mismatch unconditionally and the `if` branch silently never fires. three failing tests cover the cases: - single-literal `(?i:FOO)` against uppercase input - single-literal `(?i:FOO)` against lowercase input (sanity check that case-insensitivity itself works) - alternation `(?i:FROM|RUN|CMD)` against mixed-case inputs these will pass once `emit_literal` lowercases the needle when `case_insensitive` is set. failing on purpose -- hence `test!:`.
`inlined_memicmp` requires the `Condition::PrefixInsensitive` needle to be ascii-lowercase: it lowercases the haystack byte and compares it raw against the needle. emit_literal interned the literal verbatim, so any uppercase byte in `(?i:...)` made the prefix check mismatch unconditionally and the `if` branch silently never fired. lowercase the needle when the case-insensitive flag is set and any byte is uppercase. zero-cost otherwise (no allocation when the literal is already lowercase). makes the regression tests added in the previous commit pass. bug latent in upstream: no current grammar happens to use uppercase in `(?i:...)` literals, so nothing visibly broke. dockerfile.lsh in a downstream fork tripped it (every dockerfile instruction is uppercase by convention).
integration tests in `tests/` already see `[dependencies]` of the crate-under-test; no need to re-list `stdext` under `[dev-dependencies]`. mistakenly added in 8021fee.
dropped the `tests/regex_case_insensitive.rs` integration file in favour of a `#[cfg(test)] mod tests` block at the bottom of `regex.rs`. shrinks the PR diff and keeps the test next to the bug site (`emit_literal`). same coverage: single literal + alternation, mixed-case inputs.
Author
|
@microsoft-github-policy-service agree |
Member
|
Thank you for the fix! I've decided to reject the unit test for now, as I consider the way the integration between Compiler and Runtime works to be complete 💩. lol Edit: To clarify, I meant my code. |
lhecker
approved these changes
May 4, 2026
Author
😆 dw. glad fix is useful 👍 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
when working on my divergent fork, came across this and, since i didn't tamper with lsh, i thought i'd PR it :)
the issue is tha t
(?i:...)literals containing uppercase bytes silently never match.inlined_memicmplowercases the haystack and compares raw against the needle butemit_literalinterns the literal verbatim. uppercase needle byte means unconditional mismatch with no compile error and no panic, just a silent dead branch. this is not an issue for any of the current grammars since every literal happens to already be lowercase. lowercase the needle inemit_literalto fix.