feat: case-insensitive regex for lowercasing tokenizers#6277
congx4 wants to merge 2 commits into quickwit-oss:main
Conversation
When a field's tokenizer lowercases indexed terms (e.g. `default`, `raw_lowercase`, `lowercase`), regex queries now automatically prepend `(?i)` to match case-insensitively. Without this, patterns like `.*ECONNREFUSED.*` would never match, because the inverted index only contains lowercase tokens.

Changes:
- `to_field_and_regex` now returns the tokenizer name as a 4th element
- `build_tantivy_ast_impl` and warmup `visit_regex` prepend `(?i)` when the tokenizer does lowercasing and the regex doesn't already have it
- `TokenizerManager::tokenizer_does_lowercasing` public helper added
- Unit tests for case-insensitive behavior, tokenizer detection, and edge cases (an existing `(?i)`, the raw tokenizer, JSON fields)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the 4-element tuple return from `to_field_and_regex` with a named `ResolvedRegex` struct to satisfy `clippy::type_complexity`, which is promoted to a hard error via `-D warnings` in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
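The guard the commits describe (prepend `(?i)` only when the field's tokenizer lowercases and the flag is not already present) can be sketched as follows. This is a simplified, self-contained illustration, not Quickwit's actual code: `ResolvedRegex` and the function names only mirror identifiers mentioned in the PR, and the lowercasing check is a hard-coded table instead of inspecting real tokenizer pipelines.

```rust
// Simplified sketch (not Quickwit's real implementation) of the PR's core rule.

/// Stand-in for the PR's `ResolvedRegex` struct: the resolved pattern
/// plus the name of the tokenizer configured for the field.
struct ResolvedRegex {
    regex: String,
    tokenizer_name: String,
}

/// Stand-in for `TokenizerManager::tokenizer_does_lowercasing`:
/// a hard-coded table for illustration only.
fn tokenizer_does_lowercasing(name: &str) -> bool {
    matches!(name, "default" | "raw_lowercase" | "lowercase")
}

/// Prepend "(?i)" only when the tokenizer lowercases and the pattern
/// does not already start with the flag.
fn apply_case_insensitivity(resolved: &ResolvedRegex) -> String {
    let does_lowercasing = tokenizer_does_lowercasing(&resolved.tokenizer_name);
    if does_lowercasing && !resolved.regex.starts_with("(?i)") {
        format!("(?i){}", resolved.regex)
    } else {
        resolved.regex.clone()
    }
}

fn main() {
    let lowercased = ResolvedRegex {
        regex: ".*ECONNREFUSED.*".to_string(),
        tokenizer_name: "default".to_string(),
    };
    let raw = ResolvedRegex {
        regex: ".*ECONNREFUSED.*".to_string(),
        tokenizer_name: "raw".to_string(),
    };
    assert_eq!(apply_case_insensitivity(&lowercased), "(?i).*ECONNREFUSED.*");
    assert_eq!(apply_case_insensitivity(&raw), ".*ECONNREFUSED.*");
}
```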
trinity-1686a
left a comment
i think the logic for adding `(?i)` should exist only once, probably in `to_field_and_regex` (or whatever its name ends up being)
having that logic in two places makes it more error-prone, and i think harder to test too (at least one test doesn't do what it claims because of that, and other tests could check more if the logic were moved)
```rust
}

/// Returns true if the given tokenizer lowercases its output.
pub fn tokenizer_does_lowercasing(&self, tokenizer_name: &str) -> bool {
```
can you use this function to simplify `get_normalizer()`?
also, it should probably return `None` when given an unknown tokenizer
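The reviewer's suggestion, returning `None` for unknown tokenizers, can be sketched with a minimal stand-in. Everything here is hypothetical: the struct, the lookup table, and the `Option<bool>` signature are illustrations of the suggested API shape, not Quickwit's actual `TokenizerManager`.

```rust
// Hypothetical sketch of the reviewer's suggestion: an unknown tokenizer
// yields None instead of silently defaulting to "does not lowercase".
use std::collections::HashMap;

struct TokenizerManager {
    // tokenizer name -> does it lowercase? (toy stand-in for real pipelines)
    lowercases: HashMap<String, bool>,
}

impl TokenizerManager {
    fn tokenizer_does_lowercasing(&self, name: &str) -> Option<bool> {
        self.lowercases.get(name).copied()
    }
}

fn main() {
    let mut lowercases = HashMap::new();
    lowercases.insert("default".to_string(), true);
    lowercases.insert("raw".to_string(), false);
    let tm = TokenizerManager { lowercases };

    assert_eq!(tm.tokenizer_does_lowercasing("default"), Some(true));
    assert_eq!(tm.tokenizer_does_lowercasing("raw"), Some(false));
    // Unknown tokenizer: the caller must decide how to handle it.
    assert_eq!(tm.tokenizer_does_lowercasing("nonexistent"), None);
}
```

The `Option` return forces callers to handle the "tokenizer not registered" case explicitly rather than treating it the same as a non-lowercasing tokenizer.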
```rust
impl RegexQuery {
    /// Resolves this regex query against the given schema.
    pub fn to_field_and_regex(
```
Suggested change:

```diff
-    pub fn to_field_and_regex(
+    pub fn to_resolved(
```
```rust
let regex = if !resolved.regex.starts_with("(?i)") && does_lowercasing {
    format!("(?i){}", resolved.regex)
} else {
```
i wonder if this is always correct, or if there are other `(?X)` modifiers that could be used and cause us to miss a case-insensitive modifier
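The reviewer's concern is easy to demonstrate: a literal `starts_with("(?i)")` check only recognizes one spelling of the case-insensitive flag. Regex flag syntax also allows combined flags like `(?is)` and scoped flag groups like `(?i:...)`, both of which slip past the prefix check. A minimal sketch (the helper name `has_ci_prefix` is illustrative, not from the PR):

```rust
// Illustrates the gap in a naive "(?i)" prefix check.
fn has_ci_prefix(pattern: &str) -> bool {
    pattern.starts_with("(?i)")
}

fn main() {
    // The one spelling the check catches:
    assert!(has_ci_prefix("(?i)foo"));

    // Also case-insensitive, but missed by the naive check:
    assert!(!has_ci_prefix("(?is)foo")); // combined flags: i + s
    assert!(!has_ci_prefix("(?i:foo)")); // flag scoped to a group
}
```

With the PR's logic, these patterns would get a second, redundant `(?i)` prepended; whether that is harmless depends on the regex engine accepting repeated flag groups.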
```rust
#[test]
fn test_tokenizer_does_lowercasing() {
    let tokenizer_manager = crate::tokenizers::create_default_quickwit_tokenizer_manager();

    assert!(tokenizer_manager.tokenizer_does_lowercasing("raw_lowercase"));
    assert!(tokenizer_manager.tokenizer_does_lowercasing("default"));
    assert!(tokenizer_manager.tokenizer_does_lowercasing("lowercase"));
    assert!(!tokenizer_manager.tokenizer_does_lowercasing("raw"));
    assert!(!tokenizer_manager.tokenizer_does_lowercasing("nonexistent"));
}
```
wrong module, but this test should exist
```rust
let result = query.build_tantivy_ast_impl(&context);
assert!(
    result.is_ok(),
    "regex query with existing (?i) should build successfully"
```
would a query with `(?i)(?i)something` fail with this test? (is it testing anything, or would the test pass anyway without the branch that checks for an already case-insensitive query?)
```rust
// The regex should NOT have (?i) since raw doesn't lowercase
assert_eq!(resolved.regex, "abc.*xyz");
```
adding `(?i)` is done in `build_tantivy_ast_impl`, not in `to_field_and_regex`, so this test doesn't do what it claims
Summary

- Regex queries now automatically prepend `(?i)` when the field's tokenizer lowercases indexed terms (e.g. `default`, `raw_lowercase`, `lowercase`), so patterns like `.*ECONNREFUSED.*` match correctly against lowercase-indexed data
- `to_field_and_regex` returns a `ResolvedRegex` struct (with the tokenizer name), and both `build_tantivy_ast_impl` and warmup `visit_regex` use `TokenizerManager::tokenizer_does_lowercasing` to decide whether to add the `(?i)` flag

Motivation

When Trino's Elasticsearch connector translates a SQL `LIKE` predicate on a text field, it sends a wildcard query to the ES-compatible API. For example, in our deployment the `extra_fts` field uses the `datadog` tokenizer, which lowercases all indexed terms. This means the inverted index only contains lowercase tokens (e.g. `econnrefused`), even if the original log message contained `ECONNREFUSED`. The ES connector translates this to:

```json
{"query": {"wildcard": {"extra_fts": {"value": "*ECONNREFUSED*"}}}}
```

The `WildcardQuery` in Quickwit already handles case-insensitivity correctly: it normalizes literal text through the field's tokenizer, so `ECONNREFUSED` becomes `econnrefused` before matching.

However, regex queries do not have this normalization. A `case_insensitive: true` ES term query gets converted to a `RegexQuery`:

```json
{"query": {"term": {"extra_fts": {"value": "ECONNREFUSED", "case_insensitive": true}}}}
```

This becomes `RegexQuery { regex: "(?i)ECONNREFUSED" }`, which works because of the explicit `(?i)` flag.

But when a bare regex query is used without `case_insensitive`:

```json
{"query": {"regexp": {"extra_fts": ".*ECONNREFUSED.*"}}}
```

the regex `.*ECONNREFUSED.*` is matched against the inverted index, which only contains lowercase tokens (because the `datadog` tokenizer, like any other lowercasing tokenizer such as `default`, `raw_lowercase`, or `lowercase`, lowercases during indexing). The uppercase pattern never matches.

Unlike `WildcardQuery`, `RegexQuery` cannot normalize literal parts through the tokenizer, because regex metacharacters are interleaved with literal text: you can't reliably separate them for normalization.

The fix: when the field's tokenizer lowercases its output, automatically prepend `(?i)` to make the regex match case-insensitively. This aligns regex behavior with how wildcard queries already work (just via a different mechanism).

Why the warmup must also change

Quickwit pre-warms search data by walking the query AST and loading relevant term dictionary entries. The warmup builds a regex automaton to scan the term dictionary. If the warmup uses `.*ECONNREFUSED.*` (case-sensitive) but the actual query uses `(?i).*ECONNREFUSED.*` (case-insensitive), the warmup misses the matching terms and the query returns no results. Both must apply the same `(?i)` logic.

Test plan

- Unit test: a lowercasing tokenizer gets the `(?i)` prefix (`test_regex_case_insensitive_with_lowercasing_tokenizer`)
- Unit test: an existing `(?i)` regex is not doubled (`test_regex_already_case_insensitive_not_doubled`)
- Unit test: the `raw` tokenizer does NOT get `(?i)` (`test_regex_no_case_insensitive_with_raw_tokenizer`)
- Unit test: `to_field_and_regex` returns the correct tokenizer name for text and JSON fields
- Unit test: `tokenizer_does_lowercasing` correctly identifies lowercasing vs non-lowercasing tokenizers

🤖 Generated with Claude Code
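The warmup/query consistency requirement described in the PR can be illustrated with a toy term-dictionary scan. This is plain Rust with no Quickwit or tantivy code, and `scan` is a hypothetical stand-in for the regex-automaton dictionary scan: the dictionary holds only lowercase tokens, so a case-sensitive scan for the uppercase pattern finds nothing, while a case-insensitive one succeeds.

```rust
// Toy illustration of why warmup and query must agree on case-sensitivity.
// `scan` stands in for scanning the term dictionary with a regex automaton;
// here it just does exact vs case-insensitive equality.
fn scan<'a>(terms: &[&'a str], needle: &str, case_insensitive: bool) -> Vec<&'a str> {
    terms
        .iter()
        .copied()
        .filter(|t| {
            if case_insensitive {
                t.eq_ignore_ascii_case(needle)
            } else {
                *t == needle
            }
        })
        .collect()
}

fn main() {
    // The inverted index only contains lowercase tokens.
    let term_dict = ["econnrefused", "error", "timeout"];

    // Case-sensitive warmup scan: misses the term entirely.
    assert!(scan(&term_dict, "ECONNREFUSED", false).is_empty());

    // Case-insensitive scan (the "(?i)" behavior): finds it.
    assert_eq!(scan(&term_dict, "ECONNREFUSED", true), vec!["econnrefused"]);
}
```

If the warmup ran the first scan while the query ran the second, the matching term would never be loaded and the query would return no results, which is exactly the failure mode the PR guards against by applying the same `(?i)` logic in both places.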