Skip to content

Fix case-insensitive matching with uppercase literals#850

Open
lczyk wants to merge 5 commits intomicrosoft:mainfrom
lczyk:lowercase-lsh-bug
Open

Fix case-insensitive matching with uppercase literals#850
lczyk wants to merge 5 commits intomicrosoft:mainfrom
lczyk:lowercase-lsh-bug

Conversation

@lczyk
Copy link
Copy Markdown

@lczyk lczyk commented May 4, 2026

when working on my divergent fork, came across this and, since i didn't tamper with lsh, i thought i'd PR it :)

the issue is tha t(?i:...) literals containing uppercase bytes silently never match. inlined_memicmp lowercases the haystack and compares raw against the needle but emit_literal interns the literal verbatim. uppercase needle byte means unconditional mismatch with no compile error and no panic, just a silent dead branch. this is not an issue for any of the current grammars since every literal happens to already be lowercase. lowercase the needle in emit_literal to fix.

lczyk added 4 commits May 4, 2026 16:10
`inlined_memicmp` requires the `Condition::PrefixInsensitive` needle to be
ascii-lowercase: it lowercases the haystack byte and compares it raw against
the needle. `emit_literal` currently interns the literal verbatim, so any
uppercase byte makes the prefix check mismatch unconditionally and the `if`
branch silently never fires.

three failing tests cover the cases:
- single-literal `(?i:FOO)` against uppercase input
- single-literal `(?i:FOO)` against lowercase input (sanity check that
  case-insensitivity itself works)
- alternation `(?i:FROM|RUN|CMD)` against mixed-case inputs

these will pass once `emit_literal` lowercases the needle when
`case_insensitive` is set. failing on purpose -- hence `test!:`.
`inlined_memicmp` requires the `Condition::PrefixInsensitive` needle to be
ascii-lowercase: it lowercases the haystack byte and compares it raw against
the needle. emit_literal interned the literal verbatim, so any uppercase byte
in `(?i:...)` made the prefix check mismatch unconditionally and the `if`
branch silently never fired.

lowercase the needle when the case-insensitive flag is set and any byte is
uppercase. zero-cost otherwise (no allocation when the literal is already
lowercase). makes the regression tests added in the previous commit pass.

bug latent in upstream: no current grammar happens to use uppercase in
`(?i:...)` literals, so nothing visibly broke. dockerfile.lsh in a downstream
fork tripped it (every dockerfile instruction is uppercase by convention).
integration tests in `tests/` already see `[dependencies]` of the
crate-under-test; no need to re-list `stdext` under `[dev-dependencies]`.
mistakenly added in 8021fee.
dropped the `tests/regex_case_insensitive.rs` integration file in favour of a
`#[cfg(test)] mod tests` block at the bottom of `regex.rs`. shrinks the PR
diff and keeps the test next to the bug site (`emit_literal`). same coverage:
single literal + alternation, mixed-case inputs.
@lczyk
Copy link
Copy Markdown
Author

lczyk commented May 4, 2026

@microsoft-github-policy-service agree

@lhecker
Copy link
Copy Markdown
Member

lhecker commented May 4, 2026

Thank you for the fix! I've decided to reject the unit test for now, as I consider the way the integration between Compiler and Runtime works to be complete 💩. lol

Edit: To clarify, I meant my code.

@lhecker lhecker changed the title fix: lowercase lsh bug Fix case-insensitive matching with uppercase literals May 4, 2026
@lhecker lhecker enabled auto-merge (squash) May 4, 2026 18:17
@lczyk
Copy link
Copy Markdown
Author

lczyk commented May 4, 2026

Edit: To clarify, I meant my code.

😆 dw. glad fix is useful 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants