research(seprag): BET 2⊗4 filtered-ANN region-pruning — qualified NO-GO (ADR-201) [finding, not a feature]#536
Open
shaal wants to merge 8 commits into
… (issue ruvnet#534)
Region-pruned filtered ANN vs tuned ACORN. New self-contained crate ruvector-filtered-bench, depending only on ruvector-acorn (incumbent + oracle) and ruvector-rairs (IVF); independent of ruvector-seprag/PR ruvnet#535.
Pre-registration (docs/plans/bet2-filtered-ann/PRE-REGISTRATION.md) freezes a selectivity-shaped win/kill gate before any contender runs:
- at correlation rho>=0.7, contender A within 2% filtered-recall@10 of tuned ACORN at >=5x fewer distance-evals/query at sel<=1% (>=2x at sel=5%), monotonic in selectivity;
- graceful-degradation and wall-clock honesty guards;
- rho=0 recall-collapse kill control.
M0 (plumbing, pre-freeze-safe):
- data.rs: aligned ogbn-arxiv feat/label/year loader.
- predicate.rs: rho-correlation knob holding selectivity exactly constant across rho, plus natural label/year predicate families.
- tests/oracle_gate.rs: exact_filtered_knn cross-checked against an independent brute force on a real arxiv slice (sel x rho grid).
5 tests green, clippy clean.
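The gate above can be read as a small predicate over one (selectivity, rho) cell. The following is a minimal sketch, not the code in ruvector-filtered-bench: `filtered_recall_at_k`, `Cell` and `passes_gate` are hypothetical names, "within 2%" is taken to mean 2 percentage points of recall, and the required eval ratio is passed in by the caller because it depends on selectivity (5x at sel<=1%, 2x at sel=5%).

```rust
use std::collections::HashSet;

/// Filtered recall@k: fraction of the exact filtered top-k ids (`truth`)
/// that the contender returned (`found`). Empty truth counts as recall 1.0.
pub fn filtered_recall_at_k(truth: &[u32], found: &[u32]) -> f64 {
    if truth.is_empty() {
        return 1.0;
    }
    let truth_set: HashSet<u32> = truth.iter().copied().collect();
    let found_set: HashSet<u32> = found.iter().copied().collect();
    truth_set.intersection(&found_set).count() as f64 / truth_set.len() as f64
}

/// Measurements for one method in one (selectivity, rho) cell.
/// Hypothetical type, introduced only for this sketch.
pub struct Cell {
    pub recall_pct: f64,      // mean filtered recall@10, in percent
    pub evals_per_query: f64, // mean distance evaluations per query
}

/// True when the contender is within `recall_slack_pts` percentage points of
/// the incumbent's recall AND uses at least `min_eval_ratio` times fewer evals.
pub fn passes_gate(
    incumbent: &Cell,
    contender: &Cell,
    recall_slack_pts: f64,
    min_eval_ratio: f64,
) -> bool {
    let recall_ok = contender.recall_pct >= incumbent.recall_pct - recall_slack_pts;
    let eval_ratio = incumbent.evals_per_query / contender.evals_per_query;
    recall_ok && eval_ratio >= min_eval_ratio
}
```

Calling `passes_gate(&acorn, &a, 2.0, 5.0)` would check a sel<=1% cell; the monotonicity and wall-clock guards are separate checks that this sketch does not cover.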
… baseline
Instrument ruvector-acorn with additive, result-preserving counted-search variants (acorn_search_counted, flat_filtered_search_counted) so that distance-evals, the pre-registered primary cost metric, are measured exactly on ACORN-as-shipped. 13 acorn tests pass, incl. a counted==uncounted + flat-evals==#matches invariant.
filtered-bench contenders (src/contenders.rs):
- B: ACORN predicate-agnostic search (the incumbent), exact eval counts.
- C: classic post-filter (retrieve top-pool unfiltered, then filter), the floor.
M1 findings (n=20k arxiv, ρ=1, k=10):
- TEETH (examples/teeth.rs): at the gate-relevant low selectivity, post-filter collapses while ACORN holds (recall, ACORN vs post-filter): sel=0.1%: 73.7% vs 22.7%; sel=0.5%: 90.4% vs 59.7%; sel=1%: 92.6% vs 79.3%. At sel>=5% post-filter is fine (as theory predicts). The benchmark is demonstrably sensitive (50+ pt recall swing); this is the negative control.
- TUNED ACORN (examples/acorn_tune.rs): ACORN reaches ~92.6% recall at sel=1% with gamma=2, ef=512, at ~1622 evals/query; evals are ~flat in ef (early-termination bound), so "tuned" = crank ef for recall at near-constant cost. This is the fair incumbent baseline for the M3 gate, and it validates the >=5x bar: contender A must reach >=90.6% recall (92.6 - 2) at <=~324 evals/query (1622 / 5) to win.
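Why the post-filter floor collapses at low selectivity can be illustrated with a brute-force stand-in. This is a sketch only: `post_filter_search` is a hypothetical name, and it ranks the whole dataset exactly instead of using an ANN index as contender C presumably does. The point it shows is that an unfiltered pool of fixed size contains few or no matches when matches are rare.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Post-filter search: take the `pool` nearest points ignoring the predicate,
/// then keep the first `k` of those that satisfy it.
/// Brute-force stand-in for illustration; a real contender would fetch the
/// pool from an ANN index.
pub fn post_filter_search(
    data: &[Vec<f32>],
    query: &[f32],
    pool: usize,
    k: usize,
    pred: impl Fn(usize) -> bool,
) -> Vec<usize> {
    let mut scored: Vec<(f32, usize)> = data
        .iter()
        .enumerate()
        .map(|(i, v)| (dist2(v, query), i))
        .collect();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored
        .into_iter()
        .take(pool) // unfiltered candidate pool
        .filter(|&(_, i)| pred(i)) // predicate applied only afterwards
        .take(k)
        .map(|(_, i)| i)
        .collect()
}
```

With a pool that is small relative to 1/selectivity, the result can come back short or empty even though matching points exist, which is the collapse the TEETH run measures.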
src/prune.rs: RegionPruneIvf, built on ruvector-rairs k-means (ADR-193 substrate).
Two stacked prunings realizing the salvaged SepRAG kernel on the treewidth-immune
IVF hierarchy:
1. predicate pruning — skip clusters with zero matching members (the BET-2 win).
2. branch-and-bound distance pruning — triangle-inequality lower bound
(dist(q,centroid) - radius); once the top-k heap is full, clusters whose LB
exceeds the worst result are skipped. Probe in LB order so the bound lets us
break, not just skip — a strict improvement over the M2-sketch's match-count
ordering, and it yields EXACT filtered top-k.
Cost metric = nclusters (routing) + matching members scanned; the O(1) predicate
gates the expensive distance, so non-matching points cost nothing (the asymmetry
vs ACORN, which evaluates a distance per expanded node regardless of predicate).
max_probe knob: None = exact B&B (recall 1.0); Some(p) caps match-clusters probed
(trades recall for fewer evals, mirroring ACORN's ef) for equal-recall comparison.
Tests: exact_bb_matches_oracle (recall 1.0 vs exact_filtered_knn on 20 queries) and
zero_match_clusters_are_skipped (1% selectivity → <1000 evals vs 4000 full scan).
8 unit + 1 integration green, clippy clean.
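The two stacked prunings can be sketched as follows. This is an illustration under stated assumptions, not the code in src/prune.rs: `Cluster` and `region_prune_search` are hypothetical names, clusters are given pre-built, the top-k is kept in a small sorted Vec instead of a heap, and the zero-match test scans members where a real index would presumably use precomputed per-cluster metadata.

```rust
/// Euclidean distance between two equal-length vectors.
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// One IVF cell (hypothetical layout for this sketch).
pub struct Cluster {
    pub centroid: Vec<f32>,
    pub radius: f32, // max distance from the centroid to any member
    pub members: Vec<usize>,
}

/// Returns the filtered top-k as (distance, id), ascending, plus the eval count
/// (nclusters for routing + one per matching member scanned).
pub fn region_prune_search(
    data: &[Vec<f32>],
    clusters: &[Cluster],
    query: &[f32],
    k: usize,
    pred: impl Fn(usize) -> bool,
    max_probe: Option<usize>,
) -> (Vec<(f32, usize)>, usize) {
    if k == 0 {
        return (Vec::new(), 0);
    }
    let mut evals = clusters.len(); // routing cost, charged per centroid
    // Pruning 1: drop clusters with zero matching members.
    // Lower bound per cluster: dist(q, centroid) - radius (triangle inequality).
    let mut order: Vec<(f32, &Cluster)> = clusters
        .iter()
        .filter(|c| c.members.iter().any(|&m| pred(m)))
        .map(|c| ((dist(query, &c.centroid) - c.radius).max(0.0), c))
        .collect();
    order.sort_by(|a, b| a.0.total_cmp(&b.0)); // probe in LB order

    let mut top: Vec<(f32, usize)> = Vec::with_capacity(k + 1);
    for (probed, (lb, cluster)) in order.into_iter().enumerate() {
        if max_probe.map_or(false, |p| probed >= p) {
            break; // probe cap: trades recall for fewer evals
        }
        // Pruning 2: every later cluster has an LB at least this large,
        // so once the LB exceeds the worst kept result we can stop.
        if top.len() == k && lb > top[k - 1].0 {
            break;
        }
        for &m in cluster.members.iter().filter(|&&m| pred(m)) {
            evals += 1; // only matching members pay for a distance
            let d = dist(query, &data[m]);
            let pos = top.partition_point(|&(x, _)| x <= d);
            top.insert(pos, (d, m));
            top.truncate(k);
        }
    }
    (top, evals)
}
```

With `max_probe = None` the break is safe because no member of a cluster can be closer to the query than that cluster's lower bound, so the result is the exact filtered top-k; `Some(p)` gives up that guarantee in exchange for fewer evals.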
… sel<=1%)
examples/sweep.rs: full selectivity x rho grid, cost-at-matched-recall comparison (tune A's probe cap to ACORN's recall, then compare distance-evals), with the wall-clock honesty guard and the rho=0 kill control.
VERDICT vs the frozen gate (n=20k, ACORN gamma2 ef=512, IVF nclusters=64):
- WIN at sel<=1%, rho>=0.7: region-pruned IVF beats tuned ACORN by 6.1-48x evals and 4.7-26x wall-clock at equal-or-better recall (A's exact B&B recall >= ACORN). e.g. rho=1 sel=1%: ACORN 92.6%@1622 evals vs A 99.9%@264 evals = 6.1x (4.7x wall).
- MISS at sel=5%: best 1.5x (gate wanted >=2x). The win is a low-selectivity (<=1%) phenomenon: the dominant production metadata-filter regime, but a real boundary, not the full pre-registered claim.
- Mechanism partly refuted: A also wins at rho=0 (low sel), so the eval advantage is selectivity-driven (few matches -> cheap exact B&B) more than correlation-driven; correlation governs recall, not cost. Reported, not buried.
- rho=0 kill control: A does NOT collapse (recall-safe); at high sel (>=10%) A loses as expected (ACORN's regime).
Wall-clock guard: PASS (win survives the clock).
nclusters is A's tuning knob (parallel to ACORN's ef): 64 beats 128 in the win regime (cheaper routing); both confirm the same boundary.
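The cost-at-matched-recall comparison amounts to picking, from the contender's knob sweep, the cheapest operating point that reaches the incumbent's recall and then taking the eval ratio. A minimal sketch, assuming the sweep has already been measured; `OperatingPoint`, `cheapest_at_recall` and `eval_ratio_at_matched_recall` are hypothetical names, not functions from examples/sweep.rs.

```rust
/// One measured setting of a method's tuning knob (probe cap, ef, ...).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct OperatingPoint {
    pub knob: usize,
    pub recall: f64, // filtered recall@k, in [0, 1]
    pub evals: f64,  // mean distance evaluations per query
}

/// Cheapest point in `sweep` whose recall is at least `target_recall`,
/// or None if no setting reaches the target.
pub fn cheapest_at_recall(sweep: &[OperatingPoint], target_recall: f64) -> Option<OperatingPoint> {
    sweep
        .iter()
        .copied()
        .filter(|p| p.recall >= target_recall)
        .min_by(|a, b| a.evals.total_cmp(&b.evals))
}

/// Incumbent evals divided by contender evals at matched (or better) recall.
pub fn eval_ratio_at_matched_recall(
    incumbent: OperatingPoint,
    contender_sweep: &[OperatingPoint],
) -> Option<f64> {
    cheapest_at_recall(contender_sweep, incumbent.recall).map(|p| incumbent.evals / p.evals)
}
```

Returning `None` when the contender never reaches the incumbent's recall keeps a recall shortfall from being reported as a cost win.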
…y fails the gate
Adds predicate-aware-entry ACORN (the rule-ruvnet#5 "tune harder" adversary):
- ruvector-acorn: acorn_search_seeded_counted (beam starts from caller seeds instead of multi-probe entry); acorn_search_impl refactored to take Option<seeds>, existing fns pass None. 13 acorn tests still green (behavior preserved).
- contenders.rs: Acorn::search_predicate_entry: stride-sample probes, predicate-test free, distance-eval only matching probes, seed the beam from the nearest matches.
- examples/adversarial.rs: A vs best-of(vanilla-B, predicate-entry-D) at matched recall.
FINDING (rule ruvnet#5 changed the verdict): predicate-aware entry slashes ACORN's cost at HIGH correlation (rho=1 sel=0.1%: 3753 -> 203 evals), collapsing A's advantage from 44.7x (vs vanilla) to 2.4x, BELOW the pre-registered 5x bar.
A vs best ACORN:
- rho=1.0: 2.4x / 2.3x / 1.9x (sel .001/.005/.01): MISS at the 5x bar.
- rho=0.7: 38.8x / 14.6x / 6.5x: WIN (D's seeding is weak at moderate correlation, where matches are scattered so a seeded walk still wanders).
So A and predicate-entry-ACORN exploit the SAME structure and converge (~2x) at high correlation; A's clean win is NOT robust to a properly-tuned ACORN.
Honest verdict: largely a KILL at the pre-registered bar, with a narrower conditional edge at rho~0.7.
Caveat favoring A: D's seeding leans on ~16k "free" predicate tests (the eval metric ignores the O(1) predicate scan); at scale that scan isn't free, restoring some edge.
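The seeding step of the predicate-aware entry can be sketched in isolation. This is an assumption-laden illustration, not Acorn::search_predicate_entry: `predicate_entry_seeds` is a hypothetical name, the stride sampling is a plain `step_by` over ids, and the graph beam search that would consume the seeds is left out.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Pick beam-search entry points that already satisfy the predicate.
/// Returns (seed ids nearest-first, number of distance evals spent).
pub fn predicate_entry_seeds(
    data: &[Vec<f32>],
    query: &[f32],
    stride: usize,
    n_seeds: usize,
    pred: impl Fn(usize) -> bool,
) -> (Vec<usize>, usize) {
    let mut matches: Vec<(f32, usize)> = (0..data.len())
        .step_by(stride.max(1)) // stride-sampled probes
        .filter(|&i| pred(i)) // predicate test: not counted as an eval
        .map(|i| (dist2(&data[i], query), i)) // distance only for matches
        .collect();
    let evals = matches.len();
    matches.sort_by(|a, b| a.0.total_cmp(&b.0));
    matches.truncate(n_seeds); // keep the nearest matching probes
    (matches.into_iter().map(|(_, i)| i).collect(), evals)
}
```

The returned eval count covers only matching probes, which is exactly the accounting the caveat above questions: the predicate tests on non-matching probes are treated as free.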
…O-GO (M4)
Writes up the BET 2 ⊗ BET 4 outcome with ADR-199/200 honesty.
Verdict: region-pruned IVF beats VANILLA ACORN 6-48x evals (4.7-26x wall-clock) at sel<=1%, but the pre-registered >=5x WIN does NOT survive the rule-ruvnet#5 adversarial check: giving ACORN a predicate-aware entry collapses the gap to ~2x at high correlation (rho=1), below the bar. A retains a narrow conditional edge at moderate correlation (rho~0.7, 6-39x) plus an at-scale caveat (D's seeding leans on a ~full predicate scan the eval metric treats as free).
Net: the bet does not cleanly pay; the clean win was an artifact of an under-equipped incumbent.
Central lesson: a filtered-ANN cost claim is meaningless without a predicate-aware-entry baseline.
Also strips a stray tag from the pre-registration doc (non-semantic).
The experiment's own evidence points to two flip conditions (conjunctions where ACORN's predicate-seeding degrades but cluster-skip composes; large-n where the predicate scan stops being free) and the open BET 4 standalone baseline.
… hold)
A conjunction is a single O(1) boolean predicate of selectivity = product; in the distance-eval metric it reduces to (selectivity, scatter), both already swept. The 'exponentially-unlikely seed' reasoning was wrong (testing a conjunction is O(1)).
Residual leads downgraded to narrow/speculative (predicate-eval cost, large-n).
Recommend closing BET 2 ⊗ BET 4; thread value is BET 1 productionization + BET 3.
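The conjunction argument can be made concrete with a small sketch: combining two predicates yields one more predicate that still costs a single boolean test per point, and its measured selectivity is simply another value on the selectivity axis. `and` and `selectivity` are hypothetical helpers for illustration; note that the product rule holds exactly only when the two conjuncts are independent over the dataset.

```rust
/// Combine two predicates into one. Evaluating the result is still a single
/// O(1) boolean test per point.
pub fn and<A, B>(a: A, b: B) -> impl Fn(usize) -> bool
where
    A: Fn(usize) -> bool,
    B: Fn(usize) -> bool,
{
    move |i| a(i) && b(i)
}

/// Empirical selectivity: fraction of ids in 0..n that satisfy `pred`.
/// Returns 0.0 for an empty id range.
pub fn selectivity(n: usize, pred: impl Fn(usize) -> bool) -> f64 {
    if n == 0 {
        return 0.0;
    }
    (0..n).filter(|&i| pred(i)).count() as f64 / n as f64
}
```

For example, over ids 0..100 the predicates "even" (selectivity 0.5) and "multiple of 5" (0.2) combine into "multiple of 10" (0.1), which a search sees as just another sel=10% filter.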
TL;DR
BET 2 ⊗ BET 4 of the SepRAG research line (#534): does region-pruned IVF search beat the ruvector-acorn incumbent on correlated filtered queries? Pre-registered ≥5× distance-eval gate.
Verdict: qualified NO-GO. Region-pruning beats vanilla ACORN 6–48× at selectivity ≤ 1%, but the win does not survive the mandatory adversarial check: giving ACORN a predicate-aware entry (a simple, standard enhancement) collapses the gap to ~2× at high correlation, below the bar. A real but narrow edge remains at moderate correlation (ρ≈0.7). Full reasoning: docs/adr/ADR-201.
What's in this PR (independent of #535)
- ruvector-filtered-bench: depends only on ruvector-acorn + ruvector-rairs; zero dependency on ruvector-seprag/PR #535 (SepRAG: CCH-inspired retrieval exploration + customizable re-weighting (ADRs 196-200, ruvector-seprag)).
- ruvector-acorn: acorn_search_counted, flat_filtered_search_counted, acorn_search_seeded_counted. Existing functions delegate unchanged; 13 acorn tests prove behavior is preserved. Useful tooling for anyone measuring ACORN's distance-eval cost.
Why it may still be worth landing despite the NO-GO
Honest verdict (cost at matched recall, n=20k arxiv)
Central lesson: a filtered-ANN cost claim is meaningless without a predicate-aware-entry baseline.
Notes for review
- main; does not affect #535 (BET 1, a separate WIN).
- ruvector-acorn is touched only additively (measurement); the core algorithm is unchanged.
Resolves the BET 2 ⊗ BET 4 item of #534 (qualified NO-GO). Follow-ups (conjunctions, large-n, BET 4 standalone) noted in ADR-201 and #534.