SepRAG BET 3 — curated-KG treewidth probe → NO-GO (finding, ADR-203)#538

Open
shaal wants to merge 18 commits into ruvnet:main from shaal:docs/seprag-bet3-kg-treewidth

@shaal shaal commented Jun 5, 2026

SepRAG BET 3 — curated-KG treewidth probe → NO-GO (a finding, not a feature)

This is a reportable finding, not a merge request — no urgency, framed like #536.
It records the third KILL in the SepRAG thread (issue #534) and closes the
CCH-full-contraction line.

TL;DR

The salvaged separator-tree branch-and-bound k-NN kernel is validated and
treewidth-independent at query time (recall stays exact), but CCH contraction/build
blew up on high-treewidth embedding/citation graphs (ADR-199). BET 3 tested the last
untested backbone: a curated, bounded-degree knowledge graph (Wikidata-style). The backbone
was probed first with a cheap treewidth go/no-go gate, before building any multi-hop-QA harness.

Verdict: NO-GO. All three curated KGs — WN18RR (WordNet), FB15k-237 (Freebase),
CoDEx-L (genuine Wikidata) — are high-treewidth. Scoreboard: 1 WIN, 3 KILLS.

Backbone            role       p (exp)  blow@ref(2k)  elim_h/n  tw lower bound  verdict
roadNet-PA          control    0.613    7.2×          0.04      tw ≥ 4          calibrated ✓
ogbn-arxiv          reference  1.259    17.4×         0.28      tw ≥ 44         NO-GO (ADR-199)
WN18RR              KG         0.508    59.9×         0.21      tw ≥ 28         NO-GO
FB15k-237           KG                  42.3×         0.46      tw ≥ 42         NO-GO
CoDEx-L (Wikidata)  KG         1.826    5.1×          0.17      tw ≥ 28         NO-GO

Why it's a robust NO-GO

  • Structurally certain, not a heuristic artifact. The minor-min-width treewidth
    lower bounds (28–44) are 7–11× the road control's (4) and on par with the
    citation reference — a lower bound guarantees treewidth is at least that large. This
    is the adversarial upgrade over ADR-199 (which used the separator upper bound only).
  • Three distinct failure signatures; the conjunction gate caught each. WN18RR: low
    exponent but huge absolute blowup (the "judge absolute height" lesson). FB15k-237:
    near-linear elim_h. CoDEx: tree-like periphery (5.1×@2k) whose hub-dense core
    collapses treewidth (p=1.83) — a single-n blowup snapshot would have been a false GO;
    the scaling exponent (the pre-registered primary metric) caught it.
  • Root cause: KGs are small-world with hubs (max degree 520/1999/4999 vs road's 9).
    Hubs kill balanced separators regardless of average degree. Road-like low treewidth
    needs near-planar geometric locality, which no semantic graph has.
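
For readers unfamiliar with the lower bound cited above, here is a minimal sketch of the minor-min-width heuristic: repeatedly take a minimum-degree vertex, record its degree, and contract it into one neighbour. The largest degree recorded is a valid treewidth lower bound, because every intermediate graph is a minor of the input. The function name, the edge-list input, and the "fewest common neighbours" contraction rule are assumptions made for illustration; this is not the code in kg_treewidth_probe.rs.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Minor-min-width lower bound on treewidth (illustrative sketch).
/// `edges` are undirected pairs; self-loops and duplicate edges are ignored.
pub fn minor_min_width(edges: &[(usize, usize)]) -> usize {
    let mut adj: BTreeMap<usize, BTreeSet<usize>> = BTreeMap::new();
    for &(u, v) in edges {
        if u != v {
            adj.entry(u).or_default().insert(v);
            adj.entry(v).or_default().insert(u);
        }
    }
    let mut lower_bound = 0usize;
    while !adj.is_empty() {
        // Pick a vertex of minimum degree in the current minor.
        let v = adj
            .iter()
            .min_by_key(|(_, nbrs)| nbrs.len())
            .map(|(&v, _)| v)
            .unwrap();
        let nbrs = adj.remove(&v).unwrap();
        lower_bound = lower_bound.max(nbrs.len());
        // Contract v into the neighbour sharing the fewest neighbours with it
        // (assumed tie-break rule; any neighbour keeps the bound valid).
        let u = match nbrs
            .iter()
            .copied()
            .min_by_key(|w| adj[w].intersection(&nbrs).count())
        {
            Some(u) => u,
            None => continue, // isolated vertex: nothing to contract
        };
        for &w in &nbrs {
            let set = adj.get_mut(&w).unwrap();
            set.remove(&v);
            if w != u {
                set.insert(u);
            }
        }
        for &w in &nbrs {
            if w != u {
                adj.get_mut(&u).unwrap().insert(w);
            }
        }
    }
    lower_bound
}
```

On a 4-cycle this returns 2 and on a complete graph K4 it returns 3, both equal to the true treewidth; on real graphs the bound is usually below the true value, which is why a large bound is strong evidence.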

What's here (review only the 3 bet3 commits)

⚠️ Stacked on #535 (the BET-1/CCH branch where ruvector-seprag lives). The first
15 commits are #535's and are already under review there. Only these 3 are new:

  • docs(bet3): frozen pre-registration (gate fixed before any data)
  • feat(bet3): kg_treewidth_probe.rs — reuses the validated kernel + road control
  • docs(bet3): ADR-203 (the NO-GO writeup)

Will rebase onto main if/when #535 merges. No new production code: the probe is an
example, and KG data stays under target/ (gitignored).

Prove-not-hype protocol honored: frozen pre-registered gate, calibrated road control,
treewidth lower bound as adversarial check, 3 honest deviations recorded (none changed a
verdict). The multi-hop QA benchmark was correctly never built — the cheap probe
killed the backbone first.

Refs #534. Writeup: docs/adr/ADR-203-curated-kg-treewidth-probe.md.

shaal added 18 commits June 4, 2026 02:11
Add design ADRs and milestone plans for adapting Customizable Contraction
Hierarchies (nested dissection, separators, contraction shortcuts, elimination
trees, separator-tree k-NN) to RuVector's hybrid vector + knowledge-graph memory.

ADRs:
- ADR-196: SepRAG keystone (separator-tree retrieval; complements HNSW/DiskANN)
- ADR-197: navigation-graph construction + metric-independent ND ordering
- ADR-198: customizable metric layer (CCH customization <-> GNN self-learning loop)
- ADR-199: public-corpus benchmark & evaluation harness

Plans (docs/plans/seprag-cch-retrieval/): M0 correctness gate -> M1 blowup
go/no-go on ogbn-arxiv -> M2 customization -> M3 full hybrid -> M4 integration.
Maps decisions onto existing crates (ruvector-mincut/jtree, solver/bmssp,
sparsifier, diskann, gnn, attn-mincut).
…aphs

New crate ruvector-seprag implementing the SepRAG M0 milestone
(docs/plans/seprag-cch-retrieval/M0-correctness-gate.md):

- graph: undirected weighted graph + Dijkstra brute-force k-NN oracle
- order: nested-dissection ordering via BFS-layer separators + separator tree
- contraction: metric-free symbolic contraction -> chordal upward graph + elim tree
- customize: bottom-up triangle-sweep shortcut weighting (re-runnable per metric)
- query: upward search, exhaustive CCH k-NN, bucket-based branch-and-bound k-NN
  with admissible early-stop and search-space accounting
- gen: deterministic SBM/grid/path/clique generators (SplitMix64)
- examples/blowup_report: M0->M1 diagnostic (blowup ratio, elim height, pruning)
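
As a rough illustration of the metric-free symbolic contraction and the two diagnostics it feeds (blowup ratio and elimination-tree height), the sketch below eliminates vertices in a given order and propagates each vertex's higher-ranked neighbours to the lowest of them, which yields the same chordal upward graph as forming the full clique. All names are hypothetical, and defining blowup as upward edges divided by original edges is an assumption; the crate's actual types and definitions may differ.

```rust
use std::collections::BTreeSet;

/// Result of a symbolic contraction (hypothetical struct for illustration).
pub struct Contraction {
    pub upward_edges: usize,
    pub blowup: f64,
    pub elim_height: usize,
}

/// `order[i]` is the vertex eliminated at step i; it must be a permutation of 0..n.
pub fn contract(n: usize, edges: &[(usize, usize)], order: &[usize]) -> Contraction {
    let mut rank = vec![0usize; n];
    for (i, &v) in order.iter().enumerate() {
        rank[v] = i;
    }
    // up[r] holds the higher-ranked neighbours of the vertex with rank r.
    let mut up: Vec<BTreeSet<usize>> = vec![BTreeSet::new(); n];
    let mut original = 0usize;
    for &(u, v) in edges {
        if u == v {
            continue;
        }
        let (lo, hi) = if rank[u] < rank[v] { (rank[u], rank[v]) } else { (rank[v], rank[u]) };
        if up[lo].insert(hi) {
            original += 1; // count each distinct edge once
        }
    }
    let mut depth = vec![1usize; n]; // subtree height (in nodes) in the elimination tree
    let mut upward_edges = 0usize;
    let mut elim_height = 0usize;
    for r in 0..n {
        let nbrs: Vec<usize> = up[r].iter().copied().collect();
        upward_edges += nbrs.len();
        if let Some(&parent) = nbrs.first() {
            // Pushing the remaining neighbours to the parent creates the fill edges.
            for &w in &nbrs[1..] {
                up[parent].insert(w);
            }
            depth[parent] = depth[parent].max(depth[r] + 1);
        }
        elim_height = elim_height.max(depth[r]);
    }
    let blowup = if original == 0 { 1.0 } else { upward_edges as f64 / original as f64 };
    Contraction { upward_edges, blowup, elim_height }
}
```

On a 4-vertex star, eliminating the hub first doubles the edge count (3 to 6) and gives height 4, while eliminating it last adds no fill and gives height 2. That is the ordering sensitivity the blowup report measures.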

Tests (8 + doctest, all green): SepRAG k-NN == Dijkstra oracle on SBM/grid/
path/clique; pruned == unpruned (pruning sound); pruning reduces search space;
determinism; bounded blowup. cargo clippy clean.
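
The oracle these tests compare against can be pictured as a plain Dijkstra run that stops after settling k nodes. The sketch below assumes an adjacency list with integer weights and treats every graph node as a candidate neighbour; the function name and signature are illustrative, not the crate's API.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Brute-force k-NN by shortest-path distance (illustrative oracle).
/// `adj[u]` lists `(neighbour, weight)` pairs; returns up to k `(node, distance)`
/// pairs in non-decreasing distance order, excluding the source itself.
pub fn dijkstra_knn(adj: &[Vec<(usize, u64)>], source: usize, k: usize) -> Vec<(usize, u64)> {
    let mut out = Vec::new();
    if k == 0 || source >= adj.len() {
        return out;
    }
    let mut dist = vec![u64::MAX; adj.len()];
    let mut heap = BinaryHeap::new();
    dist[source] = 0;
    heap.push(Reverse((0u64, source)));
    while let Some(Reverse((d, u))) = heap.pop() {
        if d > dist[u] {
            continue; // stale heap entry
        }
        if u != source {
            out.push((u, d));
            if out.len() == k {
                break;
            }
        }
        for &(v, w) in &adj[u] {
            let nd = d + w;
            if nd < dist[v] {
                dist[v] = nd;
                heap.push(Reverse((nd, v)));
            }
        }
    }
    out
}
```

Because nodes are settled in distance order, the first k settled nodes are exactly the k nearest, which is what makes this a usable ground truth for the pruned search.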

Finding: query pruning eliminates 95-100% of scans regardless of structure;
blowup/elim-height are dominated by separator quality, and the naive BFS
separator degenerates on low-diameter dense graphs (SBM 18.6x) — motivating the
ruvector-mincut separator swap planned for M1 (ADR-197).
…ion subgraph

examples/m1_arxiv: ingest the real ogbn-arxiv edge list (169K nodes / 1.16M
edges), induce a connected BFS-ball subgraph, build the SepRAG hierarchy, and
report the ADR-199 go/no-go metrics (blowup ratio, elim-tree height, build
time) plus a sampled Dijkstra-oracle recall check.

First-pass result (N=1500 citation-only subgraph, M0 BFS separator):
- recall 50/50 vs Dijkstra oracle  -> CORRECTNESS HOLDS on real data
- query pruning saves ~100% of scans (364 vs 418555 bucket scans/query)
- BUT blowup 56.9x and elim height ~= n  -> the BFS separator degenerates
  completely on small-world citation graphs (picks a giant BFS layer as the
  separator), and the raw citation ball is dense (avg degree ~24).

Verdict (ADR-199 fallback ladder): NO-GO for the naive separator + dense
backbone. Trustworthy precisely because M0 proved the algorithm correct, so 57x
is a separator-quality/backbone artifact, not a SepRAG refutation. Next:
ruvector-mincut balanced separators + alpha-pruned sparse backbone (ADR-197).
…tion

order.rs: add SeparatorKind::Balanced (grow half-size region, take only its
boundary) alongside the M0 BfsLayer strategy; make Balanced the default.
lib.rs: SepRag::build_with(graph, kind). m1_arxiv: add max_degree backbone
sparsification + separator-kind args for A/B attribution.
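
One way to read "grow half-size region, take only its boundary" is sketched below: a BFS from a seed stops once the region holds about half the vertices, and the separator is the set of region vertices that still touch the outside. The function name, the seed parameter, and the choice of the inner boundary are assumptions; the real SeparatorKind::Balanced in order.rs may differ in all three.

```rust
use std::collections::VecDeque;

/// Grow a BFS region to roughly n/2 vertices and return its inner boundary
/// (illustrative sketch of a balanced separator).
pub fn balanced_separator(adj: &[Vec<usize>], seed: usize) -> Vec<usize> {
    let n = adj.len();
    if n == 0 || seed >= n {
        return Vec::new();
    }
    let target = n / 2;
    let mut in_region = vec![false; n];
    let mut queue = VecDeque::new();
    in_region[seed] = true;
    let mut size = 1usize;
    queue.push_back(seed);
    while let Some(u) = queue.pop_front() {
        for &v in &adj[u] {
            if size >= target {
                break;
            }
            if !in_region[v] {
                in_region[v] = true;
                size += 1;
                queue.push_back(v);
            }
        }
        if size >= target {
            break;
        }
    }
    // Region vertices with at least one neighbour outside the region.
    (0..n)
        .filter(|&u| in_region[u] && adj[u].iter().any(|&v| !in_region[v]))
        .collect()
}
```

On a 6-vertex path seeded at one end this returns the single vertex 2, splitting the rest into pieces of size 2 and 3. A whole BFS layer, by contrast, can be very large on small-world graphs.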

M1 attribution (ogbn-arxiv N=1500 citation BFS-ball):
  raw + layer      blowup 56.9x  elim_h 1443  build 56s
  raw + balanced   blowup 23.8x  elim_h  941  build 13s   <- best
  deg<=10 + layer  blowup 89.6x  elim_h 1295  build 38s
  deg<=10 + bal    blowup 60.1x  elim_h 1035  build 18s
  (recall 50/50 and ~100% query pruning in all configs)

Findings: (1) balanced separator is a real win (2.4x less fill, 4x faster);
(2) hub-dampening degree-bound BACKFIRES (shrinks denominator faster than
|G+|, destroys good cuts) — discard it; (3) even best config leaves 23.8x /
elim_h~0.6n: the dense small-world citation ball is intrinsically high-treewidth
(ADR-197 expander risk, measured). Next: feature-manifold/hyperbolic backbone,
not the citation topology.

m1_arxiv: robust edge reader (skip # comments, comma/tab/space separators) so
SNAP road networks load through the same harness.
m1_manifold: alpha-pruned kNN backbone over real ogbn-arxiv 128-d node features
(Vamana RobustPrune) — the decisive ADR-197 thesis test.

Results (N=1500, balanced separator, recall 50/50 everywhere):
  roadNet-PA          blowup  7.6x  elim_h 136 (~3.5 sqrt n)   <- CCH works
  citation (arxiv)    blowup 23.8x  elim_h 941 (~0.6 n)
  feature-manifold k10 blowup 42.4x elim_h 837 (~0.56 n)       <- worse
  feature-manifold k6  blowup 45.1x elim_h 699 (~0.47 n)

Conclusion: the road control proves the implementation is sound (planar sqrt(n)
separators -> 7.6x, instant build). Both embedding-derived backbones (citation
small-world AND Euclidean feature kNN) are intrinsically high-treewidth
(elim_h ~ 0.5n), so CCH contraction blows up regardless of separator quality or
degree. The expander risk (ADR-197) is confirmed across two independent
backbones. Query pruning stays ~100% effective; the cost is preprocessing.
Last untested rung: hyperbolic backbone (needs real hyperbolic embeddings).

The M1 go/no-go gate ran on real public data and returned NO-GO for CCH full
contraction on embedding/citation retrieval graphs. Recorded the measured
evidence in ADR-199 (Empirical Outcome section), updated ADR-196/197 status, and
marked the milestone tracker (M0 done, M1 NO-GO, M2-M4 not pursued).

Evidence (N=1500, recall 50/50 everywhere):
  roadNet-PA control       blowup  7.6x  elim_h ~3.5 sqrt n   (CCH works)
  ogbn-arxiv citation      blowup 23.8x  elim_h ~0.6 n
  ogbn-arxiv feature kNN   blowup 42.4x  elim_h ~0.56 n

Implementation is sound (road control + exact recall); embedding backbones are
intrinsically high-treewidth. Query pruning works (~100%); preprocessing
fill-in is the blocker. ruvector-seprag retained as a validated reference.
… on linear drift)

Salvage ADR-198's customizable-metric idea, decoupled from CCH, as a rigorous
pre-registered head-to-head: does a fixed ANN topology + recomputed distances
absorb metric drift as well as a full rebuild?

examples/reweight_vs_rebuild.rs: self-contained Vamana-lite (RobustPrune +
greedy beam search); drift modelled as a vector-space transform M=A^T A; sweeps
diagonal AND adversarial dense-Mahalanobis (rotational) drift; A=reuse topology,
B=rebuild, C=stale control.
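
The drift model can be made concrete with a small sketch: under a metric M=A^T A, the squared distance (x-y)^T M (x-y) equals the squared Euclidean distance between Ax and Ay, so "re-weighting" amounts to searching the old topology with transformed vectors. The helper names and the row-major Vec<Vec<f64>> layout are assumptions for illustration, not the harness code.

```rust
/// Apply a dense row-major d x d transform `a` to vector `v`.
pub fn apply(a: &[Vec<f64>], v: &[f64]) -> Vec<f64> {
    a.iter()
        .map(|row| row.iter().zip(v).map(|(r, x)| r * x).sum::<f64>())
        .collect()
}

/// Squared Euclidean distance between two equal-length vectors.
pub fn sq_dist(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(a, b)| (a - b) * (a - b)).sum()
}

/// Squared distance under the drifted metric M = A^T A,
/// computed as the Euclidean distance between A x and A y.
pub fn drifted_sq_dist(a: &[Vec<f64>], x: &[f64], y: &[f64]) -> f64 {
    sq_dist(&apply(a, x), &apply(a, y))
}
```

With a diagonal A only per-axis scales change; a dense A mixes coordinates, which is the harder case for a topology built under the original metric.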

Result (n=2000 ogbn-arxiv embeddings, recall@10, pre-registered gate):
  A (re-weight, 0 rebuild) within 0.2% of B (full rebuild) up to 36% relevant-set
  churn, under both drift modes. C (stale) loses up to 29 points -> benchmark has
  teeth, A's parity is genuine adaptation. WIN.

Honest claim: COST win at equal quality (rebuilds become free under LINEAR drift).
Boundaries (ADR-200): non-linear/region-local drift + scale untested. Next:
non-linear learned metric (decisive adversarial test).

Also adds docs/plans/.../FUTURE-DIRECTIONS.md (4-bet backlog + prove-not-hype
protocol) and ADR-200.
…otal WIN)

Extend the re-weight-vs-rebuild harness with the decisive adversarial cases and
close the proof:
- non-linear drift mode (residual tanh warp v + s*tanh(Wv)) — removes the
  'linear only' caveat; A still matches B within 0.2% up to 35% churn.
- per-query distance-eval columns — A and B match within ~1%, disproving any
  hidden query-cost trade. Reuse is equal recall AND equal query cost.
- fix display bug (C/stale was double-divided by query count; control now
  correctly reads 90% at t=0, validating the negative control).
- drift modelled via transform closure (diag/rot/nonlin share one code path).
- clippy: idiomatic char-class split in m1_arxiv reader.
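
The non-linear drift mode above, v + s*tanh(Wv), can be sketched in a few lines. The function name and the assumption that W is a square row-major matrix are illustrative; how the harness draws W and schedules s is not shown in this log.

```rust
/// Residual tanh warp: returns v + s * tanh(W v), applied element-wise.
/// `w` is assumed to be a square d x d row-major matrix matching `v`.
pub fn tanh_warp(w: &[Vec<f64>], v: &[f64], s: f64) -> Vec<f64> {
    w.iter()
        .zip(v)
        .map(|(row, &vi)| {
            let wv: f64 = row.iter().zip(v).map(|(a, b)| a * b).sum();
            vi + s * wv.tanh()
        })
        .collect()
}
```

At s = 0 the warp is the identity, and growing s bends the space by a bounded amount per coordinate since tanh saturates at ±1.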

ADR-200 + FUTURE-DIRECTIONS updated: WIN across diagonal/rotational/non-linear
drift; only open caveats are scale (n>=1e5, decisive next), region-local drift,
incremental-rebuild baseline.

Extract the Vamana-lite ANN + metric-drift helpers into a reusable lib module
(src/ann.rs) with an efficient two-heap greedy beam search (replaces the O(L)
linear-scan beam, ~2x faster, needed for n>=1e5). Thin reweight_vs_rebuild to
use it (regression-checked: identical recall/evals at n=2000). Add scale_drift
example: sweeps N (5k..100k), measures recall(reuse) vs recall(rebuild) at the
adversarial rotational drift point plus the rebuild-cost curve.
… widens

scale_drift sweep (5k->100k, rotational drift ~40% churn, recall@10):
  N        A reuse  B rebuild  gap     rebuild   update    ratio
  5000     90.2%    90.0%      +0.2%   3.6s      0.001s    ~3600x
  10000    89.5%    90.3%      -0.8%   10.2s     0.004s    ~2500x
  25000    88.5%    89.2%      -0.7%   21.4s     0.009s    ~2400x
  50000    87.7%    88.6%      -0.9%   47.1s     0.043s    ~1100x
  100000   85.0%    86.7%      -1.7%   141.8s    0.035s    ~4000x

Verdict: WIN within the 2% gate through 100k at ~1000-4000x lower update cost,
BUT the recall gap widens with N (+0.2%->-1.7%) => defer/batch rebuilds, not
never-rebuild. Honest caveats: both A&B recall fall with N (fixed beam); 100
queries => ~+-1% noise, confirm trend with more queries.

Also: rename Rng::next->next_u64 (clippy). ADR-200 + FUTURE-DIRECTIONS updated
with scale evidence, widening-gap caveat, and a hybrid re-weight+periodic-rebuild
policy as a next step.
…, gate PASS

examples/region_drift.rs: warp only a 15% local cluster, grade recall separately
for queries INSIDE vs OUTSIDE the drifted region (a global average would hide a
local failure).

Result (n=20k, recall@10):
  t      churnIn  A_in   B_in  | churnOut A_out  B_out
  0.25   44%      89.8%  81.4% | 21%      87.9%  89.0%
  0.50   53%      89.3%  90.0% | 21%      87.9%  89.0%
  1.00   45%      89.5%  90.0% | 21%      87.9%  89.0%

Gate PASS: reuse holds inside the drifted region (A_in within 0.7% of B_in, and
ABOVE it at t=0.25) even at 53% in-region churn; out-region ~unchanged. Region-
local drift did NOT break reuse.

Honest caveat: the t=0.25 B_in dip to 81% (reuse beats rebuild by 8pts) is a
build-variance artifact of the simplified single-pass Vamana baseline, not a
smooth effect — strongest argument to port the baseline to production
ruvector-diskann. ADR-200 + FUTURE-DIRECTIONS + status updated.
…onfirmed

examples/diskann_drift.rs: re-run the re-weight-vs-rebuild test on the shipping
ruvector-diskann VamanaGraph (added as dev-dependency). The reuse trick is native
- the graph stores only topology and greedy_search takes vectors externally, so
drift = search a graph-built-on-original with the transformed vectors.

Result (n=20k, recall@10):
  GLOBAL rotational: A reuse vs B rebuild within 2% (95.6 vs 97.1 worst, t=0.5)
  REGION-LOCAL in-region: A_in within ~1.5% of B_in; A_in 98.6 vs B_in 94.5 at t=0.25
  absolute recall 96-99% (stronger/fairer baseline than lite-Vamana ~90%)

Confirms BET 1 on the production index. The t=0.25 reuse-beats-rebuild dip
REPRODUCED on diskann => it is a real property (fresh Vamana build on a
half-warped region underperforms reuse), not lite-Vamana noise. Baseline-variance
caveat RESOLVED. Remaining caveat: gap widens with scale/churn (defer/batch
rebuilds). ADR-200 + FUTURE-DIRECTIONS + status updated.
…ippable

examples/hybrid_policy.rs: simulate a compounding random-walk metric-drift
trajectory and compare operating policies on the production diskann Vamana —
always / never / periodic-K / drift-triggered (Frobenius monitor).

Result (n=10k, 24 steps, aggressive random-walk drift, recall@10):
  always       99.1% mean  98.4% min   24 rebuilds
  never        94.4% mean  89.7% min    1 rebuild   <- decays under heavy drift
  periodic-4   98.8% mean  97.9% min    6 rebuilds  <- ~always at 25% cost
  periodic-8   98.4% mean  96.5% min    3 rebuilds  <- at 12.5% cost

Shippable operating point: re-weight every step + rebuild every ~4 steps recovers
near-full recall at a fraction of the cost. Honest sub-finding: the drift-TRIGGERED
monitor (Frobenius of cumulative transform) underperformed simple periodic —
periodic-K is the recommended knob; a sampled-recall probe trigger is future work.
Under gentle single-direction drift (n=5k) never did NOT decay, so the hybrid only
matters under large/compounding drift.
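
The rebuild counts in the table follow directly from the schedule. A minimal sketch, assuming steps are numbered from 0 and that every policy performs the initial build at step 0 (the enum and function names are hypothetical, not from hybrid_policy.rs):

```rust
/// Operating policies compared in the simulation (illustrative enum).
#[derive(Clone, Copy, Debug)]
pub enum Policy {
    Always,
    Never,
    Periodic(usize),
}

/// Whether the index is rebuilt at `step` (step 0 is the initial build).
pub fn should_rebuild(policy: Policy, step: usize) -> bool {
    match policy {
        Policy::Always => true,
        Policy::Never => step == 0,
        Policy::Periodic(k) => step == 0 || (k > 0 && step % k == 0),
    }
}

/// Number of rebuilds over `steps` drift steps; re-weighting happens every step regardless.
pub fn rebuild_count(policy: Policy, steps: usize) -> usize {
    (0..steps).filter(|&s| should_rebuild(policy, s)).count()
}
```

Over 24 steps this gives 24, 1, 6 and 3 rebuilds for always, never, periodic-4 and periodic-8, matching the table's 25% and 12.5% cost figures.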

ADR-200 status/boundaries/next-steps + FUTURE-DIRECTIONS updated; stale
'n=2000/scale-unconfirmed' caveats removed (now resolved).

BET 3 go/no-go gate, frozen before any data: probe whether a curated
bounded-degree KG (WN18RR / FB15k-237 / 2WikiMultiHop, the backbone under
HotpotQA/MuSiQue) is low-treewidth enough for CCH contraction/build to stay
cheap — the last untested backbone for the salvaged separator-tree kernel
(treewidth-independent at query, but build blew up on high-treewidth embedding/
citation graphs, ADR-199).

Primary metric: elim-tree-height exponent p (elim_h ~ n^p). GO p<=0.6 AND
blowup<=10x; KILL p>=0.8 OR blowup>=23x; inconclusive 0.6<p<0.8; VOID if the
roadNet-PA control fails to reproduce ~sqrt(n)/7.6x. Adversarial upgrade over
ADR-199: add a minor-min-width treewidth LOWER bound + run both separators so a
NO-GO is structurally certain, not heuristic-limited. QA harness NOT built
unless GO. Branch off PR ruvnet#535 (where ruvector-seprag lives); re-home if GO.
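
The frozen gate can be written down as a small decision function. This is a sketch under stated assumptions: the names are hypothetical, the control check is reduced to a boolean, and the case the pre-registration does not spell out (p<=0.6 with blowup between 10x and 23x) is treated as inconclusive.

```rust
/// Outcome of the pre-registered gate (illustrative enum).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Verdict {
    Go,
    Kill,
    Inconclusive,
    Void,
}

/// Apply the frozen thresholds to a fitted exponent `p` and a blowup ratio.
/// `control_ok` says whether the roadNet-PA control reproduced.
pub fn frozen_gate(p: f64, blowup: f64, control_ok: bool) -> Verdict {
    if !control_ok {
        return Verdict::Void; // an uncalibrated run proves nothing
    }
    if p >= 0.8 || blowup >= 23.0 {
        return Verdict::Kill; // either signal alone is enough
    }
    if p <= 0.6 && blowup <= 10.0 {
        return Verdict::Go; // both signals must pass
    }
    // Assumption: anything between the GO and KILL regions is inconclusive.
    Verdict::Inconclusive
}
```

KILL is a disjunction and GO a conjunction, so a backbone must pass both signals to proceed but fails on either one. A low exponent with a huge blowup, or a small blowup with a steep exponent, is still a KILL.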

Refs ruvnet#534
…control

New example kg_treewidth_probe.rs (reuses SepRag::build_with, blowup_ratio,
elim_depth, Dijkstra oracle from the validated kernel). Adds only what the
frozen gate needs: generic KG/triple loader (WordNet/Freebase/Wikidata),
scale sweep with OLS log-log exponent fit (elim_h ~ n^p, the primary metric),
both separators, a minor-min-width treewidth LOWER bound (the adversarial
upgrade over ADR-199), and an adaptive build-time budget so high-treewidth
backbones reveal themselves at moderate n without runaway contraction.
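
The exponent fit named above is ordinary least squares on the log-log points: if elim_h ~ c*n^p, then ln(elim_h) = ln(c) + p*ln(n), and p is the slope. A minimal sketch, with a hypothetical function name and input shape:

```rust
/// Fit p in h ~ c * n^p by ordinary least squares on (ln n, ln h).
/// `samples` are `(n, elim_h)` pairs; non-positive values are skipped.
/// Returns None when fewer than two usable points remain or all n are equal.
pub fn fit_exponent(samples: &[(f64, f64)]) -> Option<f64> {
    let pts: Vec<(f64, f64)> = samples
        .iter()
        .filter(|&&(n, h)| n > 0.0 && h > 0.0)
        .map(|&(n, h)| (n.ln(), h.ln()))
        .collect();
    if pts.len() < 2 {
        return None;
    }
    let m = pts.len() as f64;
    let mean_x = pts.iter().map(|p| p.0).sum::<f64>() / m;
    let mean_y = pts.iter().map(|p| p.1).sum::<f64>() / m;
    let sxx: f64 = pts.iter().map(|p| (p.0 - mean_x).powi(2)).sum();
    let sxy: f64 = pts.iter().map(|p| (p.0 - mean_x) * (p.1 - mean_y)).sum();
    if sxx == 0.0 {
        return None;
    }
    Some(sxy / sxx) // slope of the regression line = exponent p
}
```

A road-like backbone with height proportional to sqrt(n) fits p close to 0.5, and a height linear in n fits p close to 1. The gate thresholds of 0.6 and 0.8 sit between those two regimes.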

Run prints every KG bracketed between the two calibrated controls and applies
the frozen gate automatically. Refs ruvnet#534

All three curated KGs (WN18RR/FB15k-237/CoDEx-L = WordNet/Freebase/Wikidata)
fail the frozen treewidth gate, each with a distinct signature the conjunction
gate caught: WN18RR blowup 59.9x (low exponent but huge absolute height),
FB15k-237 elim_h~0.46n + tw>=42, CoDEx-L exponent p=1.83 (tree-like periphery,
hub-dense core collapses treewidth — a false-GO the scaling metric caught).
Minor-min-width LOWER bounds (28-44) are 7-11x the road control's (4) =>
structurally certain, not heuristic. Cause: KGs are small-world WITH hubs
(max deg 520/1999/4999 vs road 9). Combined with ADR-199 this closes the
CCH-contraction line. Query kernel stays exact (recall 30/30) + treewidth-
independent; QA harness correctly never built. Records 3 honest deviations
from the frozen pre-reg (none changed a verdict). Refs ruvnet#534