Extract more literal n-grams from regex queries for tighter index pruning #3

Merged: platypii merged 3 commits into master from grep-refinement on May 25, 2026

@platypii
Contributor

  • Recurse into groups when expanding alternation into DNF branches (cross-product up to 32 branches; drop conservatively beyond that)
  • Fold small literal character classes (e.g. [ei]) into the surrounding literal run instead of treating them as wildcards
  • Fold groups whose branches are all single literals into the run, so patterns like /(eigen|petri)(value|chor)/ extract the full words
  • Net effect: Wikipedia transfer for /(eigen|petri)(value|chor)/ drops from 247 MB to 51 MB
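
To illustrate why longer extracted literals prune more, here is a minimal sketch of n-gram filtering. It assumes trigrams (the PR does not state the n-gram size) and an in-memory index; `trigrams`, `may_match`, and the sample documents are hypothetical names made up for this example, not the project's API.

```python
def trigrams(text: str) -> set[str]:
    """All overlapping 3-character substrings of text (n=3 is an assumption)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def may_match(doc_grams: set[str], branches: list[list[str]]) -> bool:
    """True if the document cannot be ruled out.

    branches is an OR of ANDs: the document survives if, for at least one
    branch, every trigram of every literal in that branch is present.
    """
    return any(
        all(trigrams(lit) <= doc_grams for lit in branch)
        for branch in branches
    )


# Hypothetical two-document corpus.
docs = {
    "a": "a powerless state",
    "b": "serverless functions",
}
index = {name: trigrams(text) for name, text in docs.items()}

loose = [["rless"]]                        # only the trailing literal
tight = [["serverless"], ["servirless"]]   # one branch per class member

loose_hits = sorted(n for n, g in index.items() if may_match(g, loose))
tight_hits = sorted(n for n, g in index.items() if may_match(g, tight))
print(loose_hits)  # ['a', 'b']
print(tight_hits)  # ['b']
```

With only `rless` both documents remain candidates; with the full words, document `a` is ruled out before any content is fetched.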

platypii added 3 commits May 25, 2026 00:35
Cross-product nested alternation up to MAX_DNF=32 branches; beyond
that, drop the offending group conservatively (treat as wildcard).
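
The capped cross-product could look roughly like the sketch below. The tree shape (a literal is a `str`, a group is a list of alternatives, each alternative a list of nodes) and the `expand` function are assumptions made for illustration; only the `MAX_DNF = 32` cap comes from the commit message.

```python
MAX_DNF = 32      # cap taken from the commit message
WILDCARD = None   # marks a gap that contributes no literal


def expand(seq: list) -> list[list]:
    """Expand a sequence of nodes into DNF branches.

    Each branch is a list of literal strings and WILDCARD markers.
    """
    branches = [[]]
    for node in seq:
        if isinstance(node, str):
            options = [[node]]
        else:
            # Recurse into each alternative of the group and pool the results.
            options = [b for alt in node for b in expand(alt)]
        if len(branches) * len(options) > MAX_DNF:
            # Too many combinations: give up on this group only.
            options = [[WILDCARD]]
        branches = [b + o for b in branches for o in options]
    return branches


# (eigen|petri)(value|chor) as a hand-built tree
pattern = [[["eigen"], ["petri"]], [["value"], ["chor"]]]
print(expand(pattern))

# Six two-way groups would need 64 branches, so the sixth is dropped.
big = [[["a"], ["b"]]] * 6
print(len(expand(big)))  # 32
```

Replacing a group with a wildcard only removes constraints, so the candidate set can grow but a true match is never excluded.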
Negated, ranged, special-escape, and multi-char-quantifier classes
still fall back to wildcard treatment. /serv[ei]rless/ now prunes to
'serverless' or 'servirless' instead of just 'rless'.
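
A rough sketch of the class-folding rule follows. The helper names, the `("class", body)` tuple shape, and the `MAX_CLASS` limit are hypothetical (the source only says "small"), and "multi-char-quantifier" is read here as "followed by any quantifier", which is an interpretation rather than something the commit spells out.

```python
from itertools import product
from typing import Optional

MAX_CLASS = 4  # hypothetical size limit; the source only says "small"


def class_members(body: str, quantifier: str = "") -> Optional[list[str]]:
    """Return the literal members of a class body, or None for a wildcard."""
    if quantifier:                         # e.g. [ei]+ can match several chars
        return None
    if not body or body.startswith("^"):   # empty or negated
        return None
    if "\\" in body or "-" in body:        # escapes such as \d, ranges such as a-z
        return None                        # (a literal '-' is also rejected: conservative)
    if len(set(body)) > MAX_CLASS:
        return None
    return sorted(set(body))


def fold(parts: list) -> Optional[list[str]]:
    """parts mixes literal strings and ("class", body) tuples.

    Returns every literal the parts can spell, or None if any class must be
    treated as a wildcard (a fuller version would split the run there).
    """
    choices = []
    for part in parts:
        if isinstance(part, str):
            choices.append([part])
        else:
            members = class_members(part[1])
            if members is None:
                return None
            choices.append(members)
    return ["".join(combo) for combo in product(*choices)]


print(fold(["serv", ("class", "ei"), "rless"]))   # ['serverless', 'servirless']
print(fold(["serv", ("class", "^ei"), "rless"]))  # None
```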
When every inner branch of a group is a single contiguous literal, the
group concats into the surrounding run instead of standing alone.
/(eigen|petri)(value|chor)/ now extracts the four full words; Wikipedia
transfer drops from 247 MB to 51 MB.
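
The folding of all-literal groups can be sketched as below, reusing the same hypothetical tree shape (a group is a list of alternatives, each a list of nodes). `is_literal_group` and `literal_runs` are illustrative names, and no branch cap is applied in this simplified version.

```python
from itertools import product


def is_literal_group(node) -> bool:
    """True when every alternative of the group is exactly one literal."""
    return (
        isinstance(node, list)
        and len(node) > 0
        and all(len(alt) == 1 and isinstance(alt[0], str) for alt in node)
    )


def literal_runs(seq: list) -> list[list[str]]:
    """Split seq into runs; each run lists the full strings it can spell."""
    runs, current = [], []
    for node in seq:
        if isinstance(node, str):
            current.append([node])
        elif is_literal_group(node):
            # Foldable group: its alternatives extend the current run.
            current.append([alt[0] for alt in node])
        else:
            # Anything else (here: None as a wildcard) ends the run.
            if current:
                runs.append(["".join(c) for c in product(*current)])
            current = []
    if current:
        runs.append(["".join(c) for c in product(*current)])
    return runs


# (eigen|petri)(value|chor)
pattern = [[["eigen"], ["petri"]], [["value"], ["chor"]]]
print(literal_runs(pattern))
# [['eigenvalue', 'eigenchor', 'petrivalue', 'petrichor']]
```

Because both groups join one run, each candidate is a full word whose n-grams span the group boundary, which is what tightens the pruning.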
@platypii platypii merged commit 7c03b5f into master May 25, 2026
6 checks passed
@platypii platypii deleted the grep-refinement branch May 25, 2026 07:45