feat: case-insensitive regex for lowercasing tokenizers#6277
congx4 wants to merge 2 commits into quickwit-oss:main
Conversation
When a field's tokenizer lowercases indexed terms (e.g. `default`, `raw_lowercase`, `lowercase`), regex queries now automatically prepend `(?i)` to match case-insensitively. Without this, patterns like `.*ECONNREFUSED.*` would never match, because the inverted index only contains lowercase tokens.

Changes:
- `to_field_and_regex` now returns the tokenizer name as a 4th element
- `build_tantivy_ast_impl` and warmup `visit_regex` prepend `(?i)` when the tokenizer does lowercasing and the regex doesn't already have it
- `TokenizerManager::tokenizer_does_lowercasing` public helper added
- Unit tests for case-insensitive behavior, tokenizer detection, and edge cases (an existing `(?i)`, the raw tokenizer, JSON fields)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the 4-element tuple return from `to_field_and_regex` with a named `ResolvedRegex` struct to satisfy `clippy::type_complexity`, which is promoted to a hard error via `-D warnings` in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
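The guard the commits describe (prepend `(?i)` only when the field's tokenizer lowercases and the flag is not already present) can be sketched as follows. This is a simplified, self-contained illustration, not Quickwit's actual code: `ResolvedRegex` and the function names only mirror identifiers mentioned in the PR, and the lowercasing check is a hard-coded table instead of inspecting real tokenizer pipelines.

```rust
// Simplified sketch (not Quickwit's real implementation) of the PR's core rule.

/// Stand-in for the PR's `ResolvedRegex` struct: the resolved pattern
/// plus the name of the tokenizer configured for the field.
struct ResolvedRegex {
    regex: String,
    tokenizer_name: String,
}

/// Stand-in for `TokenizerManager::tokenizer_does_lowercasing`:
/// a hard-coded table for illustration only.
fn tokenizer_does_lowercasing(name: &str) -> bool {
    matches!(name, "default" | "raw_lowercase" | "lowercase")
}

/// Prepend "(?i)" only when the tokenizer lowercases and the pattern
/// does not already start with the flag.
fn apply_case_insensitivity(resolved: &ResolvedRegex) -> String {
    let does_lowercasing = tokenizer_does_lowercasing(&resolved.tokenizer_name);
    if does_lowercasing && !resolved.regex.starts_with("(?i)") {
        format!("(?i){}", resolved.regex)
    } else {
        resolved.regex.clone()
    }
}

fn main() {
    let lowercased = ResolvedRegex {
        regex: ".*ECONNREFUSED.*".to_string(),
        tokenizer_name: "default".to_string(),
    };
    let raw = ResolvedRegex {
        regex: ".*ECONNREFUSED.*".to_string(),
        tokenizer_name: "raw".to_string(),
    };
    assert_eq!(apply_case_insensitivity(&lowercased), "(?i).*ECONNREFUSED.*");
    assert_eq!(apply_case_insensitivity(&raw), ".*ECONNREFUSED.*");
}
```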
trinity-1686a
left a comment
i think the logic for adding `(?i)` should exist only once, probably in `to_field_and_regex` (or whatever its name ends up being)
having that logic in two places makes it more error-prone, and i think harder to test too (at least one test doesn't do what it claims because of that, and other tests could check more if the logic were moved)
```rust
}

/// Returns true if the given tokenizer lowercases its output.
pub fn tokenizer_does_lowercasing(&self, tokenizer_name: &str) -> bool {
```
can you use this function to simplify `get_normalizer()`?
also, it should probably return `None` when given an unknown tokenizer
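The reviewer's suggestion, returning `None` for unknown tokenizers, can be sketched with a minimal stand-in. Everything here is hypothetical: the struct, the lookup table, and the `Option<bool>` signature are illustrations of the suggested API shape, not Quickwit's actual `TokenizerManager`.

```rust
// Hypothetical sketch of the reviewer's suggestion: an unknown tokenizer
// yields None instead of silently defaulting to "does not lowercase".
use std::collections::HashMap;

struct TokenizerManager {
    // tokenizer name -> does it lowercase? (toy stand-in for real pipelines)
    lowercases: HashMap<String, bool>,
}

impl TokenizerManager {
    fn tokenizer_does_lowercasing(&self, name: &str) -> Option<bool> {
        self.lowercases.get(name).copied()
    }
}

fn main() {
    let mut lowercases = HashMap::new();
    lowercases.insert("default".to_string(), true);
    lowercases.insert("raw".to_string(), false);
    let tm = TokenizerManager { lowercases };

    assert_eq!(tm.tokenizer_does_lowercasing("default"), Some(true));
    assert_eq!(tm.tokenizer_does_lowercasing("raw"), Some(false));
    // Unknown tokenizer: the caller must decide how to handle it.
    assert_eq!(tm.tokenizer_does_lowercasing("nonexistent"), None);
}
```

The `Option` return forces callers to handle the "tokenizer not registered" case explicitly rather than treating it the same as a non-lowercasing tokenizer.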
```rust
impl RegexQuery {
    /// Resolves this regex query against the given schema.
    pub fn to_field_and_regex(
```
Suggested change:

```diff
-    pub fn to_field_and_regex(
+    pub fn to_resolved(
```
```rust
let regex = if !resolved.regex.starts_with("(?i)") && does_lowercasing {
    format!("(?i){}", resolved.regex)
} else {
```
i wonder if this is always correct, or if there are other `(?X)` modifiers that could be used and cause us to miss a case-insensitive modifier
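The reviewer's concern is easy to demonstrate: a literal `starts_with("(?i)")` check only recognizes one spelling of the case-insensitive flag. Regex flag syntax also allows combined flags like `(?is)` and scoped flag groups like `(?i:...)`, both of which slip past the prefix check. A minimal sketch (the helper name `has_ci_prefix` is illustrative, not from the PR):

```rust
// Illustrates the gap in a naive "(?i)" prefix check.
fn has_ci_prefix(pattern: &str) -> bool {
    pattern.starts_with("(?i)")
}

fn main() {
    // The one spelling the check catches:
    assert!(has_ci_prefix("(?i)foo"));

    // Also case-insensitive, but missed by the naive check:
    assert!(!has_ci_prefix("(?is)foo")); // combined flags: i + s
    assert!(!has_ci_prefix("(?i:foo)")); // flag scoped to a group
}
```

With the PR's logic, these patterns would get a second, redundant `(?i)` prepended; whether that is harmless depends on the regex engine accepting repeated flag groups.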
```rust
#[test]
fn test_tokenizer_does_lowercasing() {
    let tokenizer_manager = crate::tokenizers::create_default_quickwit_tokenizer_manager();

    assert!(tokenizer_manager.tokenizer_does_lowercasing("raw_lowercase"));
    assert!(tokenizer_manager.tokenizer_does_lowercasing("default"));
    assert!(tokenizer_manager.tokenizer_does_lowercasing("lowercase"));
    assert!(!tokenizer_manager.tokenizer_does_lowercasing("raw"));
    assert!(!tokenizer_manager.tokenizer_does_lowercasing("nonexistent"));
}
```
wrong module, but this test should exist
```rust
let result = query.build_tantivy_ast_impl(&context);
assert!(
    result.is_ok(),
    "regex query with existing (?i) should build successfully"
```
would a query with `(?i)(?i)something` fail with this test? (is it testing anything, or would the test pass anyway without the branch that checks for an already case-insensitive query?)
```rust
// The regex should NOT have (?i) since raw doesn't lowercase
assert_eq!(resolved.regex, "abc.*xyz");
```
adding `(?i)` is done in `build_tantivy_ast_impl`, not in `to_field_and_regex`, so this test doesn't do what it claims
Summary

- Regex queries now automatically prepend `(?i)` when the field's tokenizer lowercases indexed terms (e.g. `default`, `raw_lowercase`, `lowercase`), so patterns like `.*ECONNREFUSED.*` match correctly against lowercase-indexed data
- `to_field_and_regex` returns a `ResolvedRegex` struct (with the tokenizer name), and both `build_tantivy_ast_impl` and warmup `visit_regex` use `TokenizerManager::tokenizer_does_lowercasing` to decide whether to add the `(?i)` flag

Motivation

When Trino's Elasticsearch connector translates a SQL `LIKE` predicate on a text field, it sends a wildcard query to the ES-compatible API. For example, in our deployment the `extra_fts` field uses the `datadog` tokenizer, which lowercases all indexed terms. This means the inverted index only contains lowercase tokens (e.g. `econnrefused`), even if the original log message contained `ECONNREFUSED`. The ES connector translates this to:

```json
{"query": {"wildcard": {"extra_fts": {"value": "*ECONNREFUSED*"}}}}
```

The `WildcardQuery` in Quickwit already handles case-insensitivity correctly: it normalizes literal text through the field's tokenizer, so `ECONNREFUSED` becomes `econnrefused` before matching.

However, regex queries do not have this normalization. A `case_insensitive: true` ES term query gets converted to a `RegexQuery`:

```json
{"query": {"term": {"extra_fts": {"value": "ECONNREFUSED", "case_insensitive": true}}}}
```

This becomes `RegexQuery { regex: "(?i)ECONNREFUSED" }`, which works because of the explicit `(?i)` flag.

But when a bare regex query is used without `case_insensitive`:

```json
{"query": {"regexp": {"extra_fts": ".*ECONNREFUSED.*"}}}
```

the regex `.*ECONNREFUSED.*` is matched against the inverted index, which only contains lowercase tokens (because the `datadog` tokenizer, like any other lowercasing tokenizer such as `default`, `raw_lowercase`, or `lowercase`, lowercases during indexing). The uppercase pattern never matches.

Unlike `WildcardQuery`, `RegexQuery` cannot normalize literal parts through the tokenizer, because regex metacharacters are interleaved with literal text: you can't reliably separate them for normalization.

The fix: when the field's tokenizer lowercases its output, automatically prepend `(?i)` to make the regex match case-insensitively. This aligns regex behavior with how wildcard queries already work (just via a different mechanism).

Why the warmup must also change

Quickwit pre-warms search data by walking the query AST and loading relevant term dictionary entries. The warmup builds a regex automaton to scan the term dictionary. If the warmup uses `.*ECONNREFUSED.*` (case-sensitive) but the actual query uses `(?i).*ECONNREFUSED.*` (case-insensitive), the warmup misses the matching terms and the query returns no results. Both must apply the same `(?i)` logic.

Test plan

- Unit test: a lowercasing tokenizer gets the `(?i)` prefix (`test_regex_case_insensitive_with_lowercasing_tokenizer`)
- Unit test: an existing `(?i)` regex is not doubled (`test_regex_already_case_insensitive_not_doubled`)
- Unit test: the `raw` tokenizer does NOT get `(?i)` (`test_regex_no_case_insensitive_with_raw_tokenizer`)
- Unit test: `to_field_and_regex` returns the correct tokenizer name for text and JSON fields
- Unit test: `tokenizer_does_lowercasing` correctly identifies lowercasing vs non-lowercasing tokenizers

🤖 Generated with Claude Code
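The warmup/query consistency requirement described in the PR can be illustrated with a toy term-dictionary scan. This is plain Rust with no Quickwit or tantivy code, and `scan` is a hypothetical stand-in for the regex-automaton dictionary scan: the dictionary holds only lowercase tokens, so a case-sensitive scan for the uppercase pattern finds nothing, while a case-insensitive one succeeds.

```rust
// Toy illustration of why warmup and query must agree on case-sensitivity.
// `scan` stands in for scanning the term dictionary with a regex automaton;
// here it just does exact vs case-insensitive equality.
fn scan<'a>(terms: &[&'a str], needle: &str, case_insensitive: bool) -> Vec<&'a str> {
    terms
        .iter()
        .copied()
        .filter(|t| {
            if case_insensitive {
                t.eq_ignore_ascii_case(needle)
            } else {
                *t == needle
            }
        })
        .collect()
}

fn main() {
    // The inverted index only contains lowercase tokens.
    let term_dict = ["econnrefused", "error", "timeout"];

    // Case-sensitive warmup scan: misses the term entirely.
    assert!(scan(&term_dict, "ECONNREFUSED", false).is_empty());

    // Case-insensitive scan (the "(?i)" behavior): finds it.
    assert_eq!(scan(&term_dict, "ECONNREFUSED", true), vec!["econnrefused"]);
}
```

If the warmup ran the first scan while the query ran the second, the matching term would never be loaded and the query would return no results, which is exactly the failure mode the PR guards against by applying the same `(?i)` logic in both places.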