
feat: case-insensitive regex for lowercasing tokenizers #6277

Open

congx4 wants to merge 2 commits into quickwit-oss:main from congx4:cong.xie/case-insensitive-regex

Conversation

Contributor

@congx4 congx4 commented Apr 7, 2026

Summary

  • Regex queries now automatically prepend (?i) when the field's tokenizer lowercases indexed terms (e.g. default, raw_lowercase, lowercase), so patterns like .*ECONNREFUSED.* match correctly against lowercase-indexed data
  • to_field_and_regex returns a ResolvedRegex struct (with tokenizer name), and both build_tantivy_ast_impl and warmup visit_regex use TokenizerManager::tokenizer_does_lowercasing to decide whether to add the flag
  • Skips the prefix if the regex already starts with (?i)

Motivation

When Trino's Elasticsearch connector translates a SQL LIKE predicate on a text field, it sends a wildcard query to the ES-compatible API. For example:

-- Trino SQL
SELECT * FROM logs WHERE extra_fts LIKE '%ECONNREFUSED%';

In our deployment, the extra_fts field uses the datadog tokenizer, which lowercases all indexed terms. This means the inverted index only contains lowercase tokens (e.g. econnrefused), even if the original log message contained ECONNREFUSED.

The ES connector translates this to:

{"query": {"wildcard": {"extra_fts": {"value": "*ECONNREFUSED*"}}}}

The WildcardQuery in Quickwit already handles case-insensitivity correctly — it normalizes literal text through the field's tokenizer, so ECONNREFUSED becomes econnrefused before matching.

However, regex queries do not have this normalization. A case_insensitive: true ES term query gets converted to a RegexQuery:

{"query": {"term": {"extra_fts": {"value": "ECONNREFUSED", "case_insensitive": true}}}}

This becomes RegexQuery { regex: "(?i)ECONNREFUSED" } — which works because of the explicit (?i) flag.

But when a bare regex query is used without case_insensitive:

{"query": {"regexp": {"extra_fts": ".*ECONNREFUSED.*"}}}

The regex .*ECONNREFUSED.* is matched against the inverted index, which contains only lowercase tokens (because the datadog tokenizer — or any other lowercasing tokenizer like default, raw_lowercase, or lowercase — lowercases during indexing). The uppercase pattern never matches.

Unlike WildcardQuery, RegexQuery cannot normalize literal parts through the tokenizer because regex metacharacters are interleaved with literal text — you can't reliably separate them for normalization.

The fix: when the field's tokenizer lowercases its output, automatically prepend (?i) to make the regex match case-insensitively. This aligns regex behavior with how wildcard queries already work (just via a different mechanism).
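A minimal sketch of that decision, assuming illustrative names (`ResolvedRegex`, `apply_case_insensitivity`) and a hard-coded stand-in for `TokenizerManager::tokenizer_does_lowercasing` rather than the PR's exact code:

```rust
// Illustrative sketch of the (?i) prepending rule; names are assumptions,
// not the PR's actual API.
struct ResolvedRegex {
    regex: String,
    tokenizer_name: String,
}

// Stand-in for TokenizerManager::tokenizer_does_lowercasing; here we simply
// hard-code the lowercasing tokenizers mentioned in this PR.
fn tokenizer_does_lowercasing(name: &str) -> bool {
    matches!(name, "default" | "raw_lowercase" | "lowercase" | "datadog")
}

// Prepend (?i) only when the tokenizer lowercases and the pattern does not
// already carry the flag.
fn apply_case_insensitivity(resolved: &ResolvedRegex) -> String {
    if tokenizer_does_lowercasing(&resolved.tokenizer_name)
        && !resolved.regex.starts_with("(?i)")
    {
        format!("(?i){}", resolved.regex)
    } else {
        resolved.regex.clone()
    }
}

fn main() {
    let lowered = ResolvedRegex {
        regex: ".*ECONNREFUSED.*".to_string(),
        tokenizer_name: "datadog".to_string(),
    };
    // Lowercasing tokenizer: the flag is prepended.
    assert_eq!(apply_case_insensitivity(&lowered), "(?i).*ECONNREFUSED.*");

    let raw = ResolvedRegex {
        regex: ".*ECONNREFUSED.*".to_string(),
        tokenizer_name: "raw".to_string(),
    };
    // Non-lowercasing tokenizer: the pattern is left untouched.
    assert_eq!(apply_case_insensitivity(&raw), ".*ECONNREFUSED.*");
}
```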

Why the warmup must also change

Quickwit pre-warms search data by walking the query AST and loading relevant term dictionary entries. The warmup builds a regex automaton to scan the term dictionary. If the warmup uses .*ECONNREFUSED.* (case-sensitive) but the actual query uses (?i).*ECONNREFUSED.* (case-insensitive), the warmup misses the matching terms and the query returns no results. Both must apply the same (?i) logic.

Test plan

  • Unit test: lowercasing tokenizer causes (?i) prefix (test_regex_case_insensitive_with_lowercasing_tokenizer)
  • Unit test: already-(?i) regex is not doubled (test_regex_already_case_insensitive_not_doubled)
  • Unit test: raw tokenizer does NOT get (?i) (test_regex_no_case_insensitive_with_raw_tokenizer)
  • Unit test: to_field_and_regex returns correct tokenizer name for text and JSON fields
  • Unit test: tokenizer_does_lowercasing correctly identifies lowercasing vs non-lowercasing tokenizers

🤖 Generated with Claude Code

When a field's tokenizer lowercases indexed terms (e.g. "default",
"raw_lowercase", "lowercase"), regex queries now automatically prepend
(?i) to match case-insensitively. Without this, patterns like
`.*ECONNREFUSED.*` would never match because the inverted index only
contains lowercase tokens.

Changes:
- `to_field_and_regex` now returns the tokenizer name as a 4th element
- `build_tantivy_ast_impl` and warmup `visit_regex` prepend (?i) when
  the tokenizer does lowercasing and the regex doesn't already have it
- `TokenizerManager::tokenizer_does_lowercasing` public helper added
- Unit tests for case-insensitive behavior, tokenizer detection, and
  edge cases (already-(?i), raw tokenizer, JSON fields)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@guilload guilload requested a review from trinity-1686a April 7, 2026 18:58
Replace the 4-element tuple return from `to_field_and_regex` with a
named `ResolvedRegex` struct to satisfy clippy::type_complexity which
is promoted to a hard error via -D warnings in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@trinity-1686a trinity-1686a left a comment


i think the logic for adding (?i) should exist only once, probably in to_field_and_regex (or whatever its name ends up being).
having that logic in two places makes it more error-prone, and i think harder to test too (at least one test doesn't do what it claims because of that, and other tests could check more if the logic were moved)

}

/// Returns true if the given tokenizer lowercases its output.
pub fn tokenizer_does_lowercasing(&self, tokenizer_name: &str) -> bool {
Contributor

can you use this function to simplify get_normalizer() ?
also it should probably return None when using an unknown tokenizer


impl RegexQuery {
/// Resolves this regex query against the given schema.
pub fn to_field_and_regex(
Contributor

Suggested change
pub fn to_field_and_regex(
pub fn to_resolved(

Comment on lines +140 to +142
let regex = if !resolved.regex.starts_with("(?i)") && does_lowercasing {
format!("(?i){}", resolved.regex)
} else {
Contributor

i wonder if this is always correct, or if there are other (?X) modifiers that could be present and cause us to miss a case-insensitive modifier
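A few concrete patterns illustrate the concern: in Rust regex syntax, combined or scoped flag groups also enable case-insensitivity, yet all of them slip past a plain `(?i)` prefix check. (Prepending a redundant `(?i)` to them is generally harmless, but the check clearly does not detect every case-insensitive spelling.)

```rust
// The naive detection used in the diff above: only an exact (?i) prefix
// counts as "already case-insensitive".
fn already_case_insensitive_naive(pattern: &str) -> bool {
    pattern.starts_with("(?i)")
}

fn main() {
    // Detected by the prefix check.
    assert!(already_case_insensitive_naive("(?i).*foo.*"));

    // These spellings also turn on case-insensitivity in Rust regex syntax
    // (combined flags, reordered flags, scoped flag group), but the prefix
    // check does not see them.
    assert!(!already_case_insensitive_naive("(?si).*foo.*"));
    assert!(!already_case_insensitive_naive("(?is).*foo.*"));
    assert!(!already_case_insensitive_naive("(?i:foo)"));
}
```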

Comment on lines +337 to +346
#[test]
fn test_tokenizer_does_lowercasing() {
let tokenizer_manager = crate::tokenizers::create_default_quickwit_tokenizer_manager();

assert!(tokenizer_manager.tokenizer_does_lowercasing("raw_lowercase"));
assert!(tokenizer_manager.tokenizer_does_lowercasing("default"));
assert!(tokenizer_manager.tokenizer_does_lowercasing("lowercase"));
assert!(!tokenizer_manager.tokenizer_does_lowercasing("raw"));
assert!(!tokenizer_manager.tokenizer_does_lowercasing("nonexistent"));
}
Contributor

wrong module, but this test should exist

let result = query.build_tantivy_ast_impl(&context);
assert!(
result.is_ok(),
"regex query with existing (?i) should build successfully"
Contributor

would a query with (?i)(?i)something fail with this test? (is it testing something, or would the test pass anyway without the branch to check for already case-insensitive query?)
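A sketch of what removing the skip branch would produce (pure string manipulation; whether the doubled pattern then compiles depends on the regex dialect, though most — including Rust's regex crate — accept repeated inline flags, in which case the test would pass either way and would not exercise the branch):

```rust
// Hypothetical variant with the "already case-insensitive" skip removed:
// the flag is always prepended.
fn prepend_without_skip(pattern: &str) -> String {
    format!("(?i){}", pattern)
}

fn main() {
    // An already-(?i) pattern just picks up a second flag group. If
    // "(?i)(?i)foo" is a valid pattern in the target dialect, a test that
    // only checks "the query builds successfully" cannot distinguish this
    // from the skipping implementation.
    assert_eq!(prepend_without_skip("(?i)foo"), "(?i)(?i)foo");
}
```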

Comment on lines +426 to +427
// The regex should NOT have (?i) since raw doesn't lowercase
assert_eq!(resolved.regex, "abc.*xyz");
Contributor

adding (?i) is done in build_tantivy_ast_impl, not in to_field_and_regex, so this test doesn't do what it claims
