From 35756c94ca68549bd9acb817eec3a31d1d002d0e Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 02:11:41 -0400 Subject: [PATCH 01/15] docs: SepRAG (CCH-inspired retrieval) ADRs 196-199 + milestone plans Add design ADRs and milestone plans for adapting Customizable Contraction Hierarchies (nested dissection, separators, contraction shortcuts, elimination trees, separator-tree k-NN) to RuVector's hybrid vector + knowledge-graph memory. ADRs: - ADR-196: SepRAG keystone (separator-tree retrieval; complements HNSW/DiskANN) - ADR-197: navigation-graph construction + metric-independent ND ordering - ADR-198: customizable metric layer (CCH customization <-> GNN self-learning loop) - ADR-199: public-corpus benchmark & evaluation harness Plans (docs/plans/seprag-cch-retrieval/): M0 correctness gate -> M1 blowup go/no-go on ogbn-arxiv -> M2 customization -> M3 full hybrid -> M4 integration. Maps decisions onto existing crates (ruvector-mincut/jtree, solver/bmssp, sparsifier, diskann, gnn, attn-mincut). 
--- ...R-196-seprag-cch-hierarchical-retrieval.md | 164 ++++++++++++++++++ ...ation-graph-metric-independent-ordering.md | 136 +++++++++++++++ ...customizable-metric-layer-self-learning.md | 131 ++++++++++++++ ...ADR-199-public-corpus-benchmark-harness.md | 123 +++++++++++++ .../M0-correctness-gate.md | 75 ++++++++ .../M1-blowup-measurement.md | 79 +++++++++ .../M2-customization-loop.md | 45 +++++ .../seprag-cch-retrieval/M3-full-hybrid.md | 56 ++++++ .../seprag-cch-retrieval/M4-integration.md | 34 ++++ docs/plans/seprag-cch-retrieval/README.md | 52 ++++++ 10 files changed, 895 insertions(+) create mode 100644 docs/adr/ADR-196-seprag-cch-hierarchical-retrieval.md create mode 100644 docs/adr/ADR-197-navigation-graph-metric-independent-ordering.md create mode 100644 docs/adr/ADR-198-customizable-metric-layer-self-learning.md create mode 100644 docs/adr/ADR-199-public-corpus-benchmark-harness.md create mode 100644 docs/plans/seprag-cch-retrieval/M0-correctness-gate.md create mode 100644 docs/plans/seprag-cch-retrieval/M1-blowup-measurement.md create mode 100644 docs/plans/seprag-cch-retrieval/M2-customization-loop.md create mode 100644 docs/plans/seprag-cch-retrieval/M3-full-hybrid.md create mode 100644 docs/plans/seprag-cch-retrieval/M4-integration.md create mode 100644 docs/plans/seprag-cch-retrieval/README.md diff --git a/docs/adr/ADR-196-seprag-cch-hierarchical-retrieval.md b/docs/adr/ADR-196-seprag-cch-hierarchical-retrieval.md new file mode 100644 index 0000000000..30bebacf83 --- /dev/null +++ b/docs/adr/ADR-196-seprag-cch-hierarchical-retrieval.md @@ -0,0 +1,164 @@ +--- +adr: 196 +title: "SepRAG — CCH-Inspired Separator-Tree Retrieval for Hybrid Vector + Graph Memory" +status: proposed +date: 2026-06-04 +authors: [ofershaal, claude-flow] +related: [ADR-002, ADR-193, ADR-194, ADR-195, ADR-197, ADR-198, ADR-199] +tags: [ruvector, retrieval, cch, contraction-hierarchies, graph-rag, mincut, jtree, hnsw, diskann, knn] +--- + +# ADR-196 — SepRAG: CCH-Inspired 
Separator-Tree Retrieval + +## Status + +**Proposed.** Keystone design ADR. Depends on the navigation-graph and ordering +decisions in [ADR-197], the metric layer in [ADR-198], and is validated by the +benchmark harness in [ADR-199]. No code yet; prototype lands behind a feature gate. + +## Context + +Customizable Contraction Hierarchies (CCH) achieve ~35,000x speedups over Dijkstra +on continental road networks by reducing a million-vertex search to ~1,200–1,450 +vertices. The speedup comes from three ideas: + +1. **Nested-dissection ordering** — recursively split the graph by *balanced + separators*; rank separator vertices highest. +2. **Contraction + shortcuts** — eliminating a vertex makes its higher-ranked + neighbors a clique (fill-in). The shortcut *set* depends only on the order and + topology, **not on edge weights**. +3. **Elimination-tree query** — the search space from any vertex is exactly its + ancestors in the elimination tree (a contiguous array walk), so point-to-point + and k-NN queries touch a tiny, bounded set of vertices. + +Two grounding papers: Buchhold & Wagner (2021), *Nearest-Neighbor Queries in +Customizable Contraction Hierarchies* (the separator-tree k-NN algorithm); and +Bläsius et al. (arXiv:2502.10519), the modern CCH implementation survey. 
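The third idea above (the search space of a vertex is exactly its elimination-tree ancestors) can be illustrated with a minimal sketch. It assumes vertices are already relabeled to rank order and the tree is stored as a flat parent array in which every parent has a higher rank than its child. `ROOT_SENTINEL` and `search_space` are illustrative names, not existing RuVector APIs.

```rust
/// Hypothetical marker for "no parent" (the root of the elimination tree).
pub const ROOT_SENTINEL: u32 = u32::MAX;

/// Returns `v` followed by all of its ancestors in the elimination tree.
/// These are the only vertices an upward search from `v` can visit.
/// Assumes `elim_parent[x] > x` for every non-root `x`, so the walk terminates.
pub fn search_space(elim_parent: &[u32], v: u32) -> Vec<u32> {
    let mut out = vec![v];
    let mut cur = v;
    while elim_parent[cur as usize] != ROOT_SENTINEL {
        cur = elim_parent[cur as usize];
        out.push(cur);
    }
    out
}
```

The length of the returned vector is the "vertices touched" figure for an upward search, which is why a shallow elimination tree translates directly into a small search space.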
+ +**Why this matters for RuVector.** A repository audit found that RuVector already +ships ~70% of the required machinery — it has simply never been composed as a +retrieval hierarchy: + +| CCH primitive | Existing crate / module | +|---|---| +| Balanced separators / nested dissection | `ruvector-mincut::{expander, cluster, sparsify, algorithm}` | +| Elimination / junction tree, leveled hierarchy | `ruvector-mincut::jtree` (`hierarchy`, `level`, `coordinator`) | +| Dynamic tree operations (incremental updates) | `ruvector-mincut::linkcut` | +| Path/cut duality, customization-style sweeps | `ruvector-solver::{bmssp, forward_push, backward_push, simd}` | +| Effective-resistance edge importance | `ruvector-sparsifier` | +| Hybrid vector+graph retrieval surface | `ruvector-graph::hybrid::{semantic_search, rag_integration, graph_neural, vector_index}` | +| Entry-point search | `ruvector-diskann` (Vamana), `ruvector-hyperbolic-hnsw` | +| Quantized distance evaluation | `ruvector-rabitq` | +| Learned metric + rerank | `ruvector-gnn`, `ruvector-attn-mincut` | + +**The problem being solved.** Flat ANN (HNSW/DiskANN) is excellent at pure-semantic +top-k cosine, but degrades on the queries RuVector actually wants to be great at: +*constrained, relational, multi-hop, and re-weightable* retrieval (Graph-RAG). +Post-filtering a flat index is wasteful, results are semantically scattered (no +topic coherence), and any change to the relevance metric forces an index rebuild. + +## Decision + +**Adopt a CCH-inspired retrieval layer ("SepRAG") that complements — does not +replace — HNSW/DiskANN.** Division of labor: + +- **HNSW/DiskANN answers "where am I?"** — find entry leaves in O(log n). Do not + reinvent this; flat ANN is the right tool for landing near the query. 
+- **SepRAG answers "what is structurally near here, under these constraints, by + this learned metric?"** — separator-tree branch-and-bound that prunes entire + semantic regions in O(1) per cell, supports subtree filter predicates, and uses a + re-customizable cost. + +### Architecture (three phases, mirroring CCH) + +``` +Phase 1 ORDER (metric-independent) → vertex order + shortcut SET + elim tree +Phase 2 CUSTOMIZE (metric-dependent, ~s) → shortcut WEIGHTS for current metric [ADR-198] +Phase 3 QUERY (metric-fixed, ~µs–ms) → separator-tree k-NN with pruning +``` + +Phase 1 + the navigation graph it runs on are specified in [ADR-197]. Phase 2 (the +self-learning metric loop) is [ADR-198]. + +### Phase-3 query algorithm (the workhorse) + +``` +fn knn(s, k, w, sep_tree, poi_buckets) -> TopK: + d_anc = upward(s, w, elim_parent) # d(s -> x) for ancestors/separators + topk = BoundedHeap(k) # exposes delta_k = current k-th best + PQ = MinHeap() # ordered by admissible lower bound + PQ.push(lb=0, node=sep_tree.root) + while PQ not empty: + (lb, node) = PQ.pop() + if lb > topk.delta_k(): break # global prune; nothing better remains + for p in poi_buckets[node]: # POIs attached at this separator node + for x in node.separator_vertices: + topk.offer(p, d_anc[x] + downdist[x][p]) + for child in node.children: + lb_child = min_{x in child.boundary} d_anc[x] # separator sits ABOVE cell + if lb_child <= topk.delta_k() and child.may_satisfy(filter): + PQ.push(lb_child, child) # else: whole subtree pruned in O(1) + return topk +``` + +Admissibility of the bound rests on the separator-above-cell property: any path from +`s` into a cell must pass through that cell's separator, so `d(s -> cell) >= +d(s -> its separator)`. Region-level pruning is where the search-space reduction comes +from. + +### Three techniques layered on the same topology + +1. **SepRAG k-NN** (above) — hybrid graph-distance top-k with region pruning. +2. 
**Hierarchical hybrid filtering** — query constraints (tenant, recency, relation + type, entity reachability) become *subtree predicates*. Semantic lower-bound + pruning and constraint pruning run in the same branch-and-bound, so structured + + semantic + filtered retrieval is a single traversal. This is SepRAG's decisive + advantage over flat-index post-filtering. +3. **Multi-metric quiver** — one topology, several cheap customizations + (`semantic`, `recency`, `trust`, `task`, on-the-fly blends). Per-query lens + selection at near-zero marginal cost. Detailed in [ADR-198]. + +### Composition pipeline + +``` +query ─► HNSW/DiskANN top-m (entry leaves) + ─► SepRAG separator-tree branch & bound (metric from ADR-198, filters as predicates) + ─► ruvector-attn-mincut rerank (cut-as-attention gating) + ─► top-k memories + elimination-tree path (free provenance / explanation) +``` + +## Consequences + +**Positive.** +- Reuses existing, tested crates; this is composition, not green-field. +- Region pruning + subtree filters target the exact queries flat ANN handles poorly. +- The elimination-tree path is a free provenance trail ("why was this retrieved"). +- Metric updates do not rebuild topology (the self-learning payoff — [ADR-198]). + +**Negative / risk.** +- **Expander risk (decisive).** Dense kNN graphs have good expansion → large + separators → shortcut blowup → CCH collapses. Mitigation is the navigation-graph + design in [ADR-197]; the risk is *measured* (not argued) in [ADR-199] via the + shortcut-blowup ratio `|G+| / |G_nav|`. +- Preprocessing (Phase 1 ordering) is superlinear; viable only because it is + metric-independent and amortized over all future customizations. +- Graph-distance k-NN is not Euclidean k-NN — recall must be defined against a + hybrid-distance oracle, not cosine top-k (see [ADR-199]). + +**Neutral.** +- SepRAG is additive behind a feature gate; HNSW/DiskANN paths are untouched. 
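The global prune in the Phase-3 query (`lb > topk.delta_k()`) depends on a bounded heap that reports the current k-th best distance. A minimal sketch of such a structure, using only the standard library, might look as follows. The names `BoundedHeap`, `offer`, and `delta_k` mirror the pseudocode; the implementation itself is an assumption, not existing RuVector code, and it does not deduplicate ids, so a caller should offer each candidate once with its best distance.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Debug, Clone, Copy)]
struct Entry { dist: f64, id: u32 }

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives a total order on f64, so the heap never panics.
        self.dist.total_cmp(&other.dist).then(self.id.cmp(&other.id))
    }
}
impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}
impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool { self.cmp(other) == Ordering::Equal }
}
impl Eq for Entry {}

/// Keeps the k smallest distances seen so far (max-heap on distance).
pub struct BoundedHeap { k: usize, heap: BinaryHeap<Entry> }

impl BoundedHeap {
    pub fn new(k: usize) -> Self { Self { k, heap: BinaryHeap::new() } }

    /// Current k-th best distance; +inf until k candidates have been seen.
    pub fn delta_k(&self) -> f64 {
        if self.heap.len() < self.k { return f64::INFINITY; }
        self.heap.peek().map_or(f64::INFINITY, |e| e.dist)
    }

    /// Offers a candidate; it is kept only if it improves the top-k set.
    pub fn offer(&mut self, id: u32, dist: f64) {
        if self.k == 0 { return; }
        if self.heap.len() < self.k {
            self.heap.push(Entry { dist, id });
        } else if dist < self.delta_k() {
            self.heap.pop();
            self.heap.push(Entry { dist, id });
        }
    }

    /// Results in ascending distance order.
    pub fn into_sorted(self) -> Vec<(u32, f64)> {
        self.heap.into_sorted_vec().into_iter().map(|e| (e.id, e.dist)).collect()
    }
}
```

Because `delta_k` stays at infinity until the heap is full, no subtree is pruned before k candidates exist, which keeps the branch-and-bound sound during warm-up.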
+ +## Alternatives considered + +- **Replace HNSW with CCH for plain top-k.** Rejected — embedding graphs are too + expander-like; HNSW wins on pure cosine. CCH's edge is constrained/relational. +- **Plain CH (weights baked into order).** Rejected — every relevance update would + re-run the expensive ordering. Metric-independence ([ADR-198]) is the whole point. +- **HNSW post-filtering for constraints.** Rejected as the *primary* path — wasteful + and incoherent; kept only as a baseline in [ADR-199]. + +## Open questions + +Carried into [ADR-199], to be answered empirically by the benchmark corpus rather +than guessed: dominant query shape, real graph sparsity, GNN metric-update cadence, +whether `jtree` already runs on data graphs, and exact vs approximate recall target. diff --git a/docs/adr/ADR-197-navigation-graph-metric-independent-ordering.md b/docs/adr/ADR-197-navigation-graph-metric-independent-ordering.md new file mode 100644 index 0000000000..edb5ad6a16 --- /dev/null +++ b/docs/adr/ADR-197-navigation-graph-metric-independent-ordering.md @@ -0,0 +1,136 @@ +--- +adr: 197 +title: "Navigation-Graph Construction & Metric-Independent Nested-Dissection Ordering" +status: proposed +date: 2026-06-04 +authors: [ofershaal, claude-flow] +related: [ADR-196, ADR-198, ADR-199] +tags: [ruvector, cch, nested-dissection, separators, mincut, jtree, diskann, hyperbolic, treewidth] +--- + +# ADR-197 — Navigation-Graph Construction & Metric-Independent Ordering + +## Status + +**Proposed.** Implements Phase 1 of [ADR-196]. This is the **make-or-break** +technical decision: if the navigation graph has large separators, the whole SepRAG +approach fails, so the choices here directly determine viability. + +## Context + +CCH's speedup depends on **small balanced separators** (low treewidth). 
Road +networks have them naturally; arbitrary embedding kNN graphs generally do not — a +dense, well-connected kNN graph is *expander-like*, which is precisely the worst case +(good expansion ⇒ large separators ⇒ massive fill-in ⇒ no pruning). + +So the central engineering question is **what graph to build the hierarchy on**. The +dense kNN graph is the wrong answer. The decision below selects a sparse, structured +"navigation graph" `G_nav` whose separator structure is favorable, then orders it +metric-independently using `ruvector-mincut`. + +## Decision + +### 1 — Navigation graph `G_nav`: sparse, structured, road-like + +Build `G_nav` as the union of: + +- **Knowledge-graph relation edges** (`ruvector-graph::Edge`, `RelationType`) — + sparse, hierarchical, low-treewidth, the most "road-like" component. +- **A degree-bounded, diversified kNN graph** — RNG / α-pruned, exactly what + `ruvector-diskann`'s Vamana already produces. Pruning is what *creates* separators; + it removes the expander-inducing density. +- **(Optional) HNSW upper-layer skeleton** as a coarse long-range backbone. + +Explicitly **do not** run nested dissection on the full dense kNN graph. + +**Hyperbolic option (high-synergy).** `ruvector-hyperbolic-hnsw` embeds in hyperbolic +space, whose tree-like geometry yields naturally small separators. Building `G_nav` +in hyperbolic space is a first-class alternative backbone and is benchmarked head-to-head +in [ADR-199]. 
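To show why α-pruning removes expander-inducing density, here is a generic sketch of Vamana-style robust pruning over plain `Vec<f32>` points. It is a textbook rendering of the pruning rule, written under the assumption of Euclidean distance. It is not the `ruvector-diskann` API, and `robust_prune` and its parameters are hypothetical names.

```rust
/// Euclidean distance between two points of equal dimension.
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

/// Selects at most `max_degree` diverse neighbors of `p` from `candidates`.
/// After keeping the closest remaining candidate `best`, every candidate `c`
/// with `alpha * d(best, c) <= d(p, c)` is dropped: it is already reachable
/// through `best`, so a direct edge would only add density.
pub fn robust_prune(
    points: &[Vec<f32>],
    p: usize,
    candidates: &[usize],
    alpha: f32,
    max_degree: usize,
) -> Vec<usize> {
    let mut pool: Vec<usize> = candidates.iter().copied().filter(|&c| c != p).collect();
    pool.sort_unstable();
    pool.dedup();
    // Closest candidates first.
    pool.sort_by(|&a, &b| {
        dist(&points[p], &points[a]).total_cmp(&dist(&points[p], &points[b]))
    });

    let mut kept = Vec::new();
    while kept.len() < max_degree && !pool.is_empty() {
        let best = pool.remove(0);
        kept.push(best);
        pool.retain(|&c| alpha * dist(&points[best], &points[c]) > dist(&points[p], &points[c]));
    }
    kept
}
```

On four collinear points at 0, 1, 2, and 10 with `alpha = 1.2`, pruning the neighbors of the first point keeps the points at 1 and 10 and drops the point at 2, because 2 is already covered by the edge to 1.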
+ +### 2 — Metric-independent ordering via nested dissection + +Reuse `ruvector-mincut`: + +``` +fn nested_dissection_order(G_nav) -> (order, sep_tree): + fn recurse(cell, parent_sep_node): + if |cell| <= LEAF: assign ranks, attach leaf; return + S = balanced_separator(G_nav, cell) # ruvector-mincut expander/cluster/sparsify + (A, B) = components(cell \ S) + node = sep_tree.add_separator(parent_sep_node, S) + recurse(A, node); recurse(B, node) + assign ranks to S AFTER A,B # separators ranked highest + recurse(all_vertices, ROOT) +``` + +`ruvector-mincut::jtree` already produces a leveled hierarchical decomposition with +BMSSP integration and O(n^ε) updates — adapt it to emit a CCH vertex order and the +separator decomposition tree, rather than re-implementing nested dissection. + +### 3 — Symbolic contraction → chordal graph + elimination tree (metric-free) + +``` +fn build_chordal(G_nav, order) -> (G_plus, elim_parent): + for v in order (low -> high rank): + Hi = { higher-ranked neighbors of v in G_plus } + elim_parent[v] = argmin_rank(Hi) # lowest higher neighbor + make Hi a clique in G_plus # fill-in / shortcuts +``` + +The shortcut set is fixed here and reused across every metric customization +([ADR-198]). + +### 4 — Cache-friendly layout + +- **Relabel vertices to rank order** so the upward CSR rows are ascending and + SIMD-comparable (`ruvector-solver::simd` can vectorize triangle relaxations). +- **Store the elimination tree in DFS post-order** so a vertex's ancestors occupy a + near-contiguous band — the ancestor walk is a cache-friendly stride, not pointer + chasing. Mirrors `jtree::level` bucketing. 
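The symbolic contraction in section 3 can be sketched in runnable form as follows. This is a minimal illustration that assumes vertex ids already equal ranks (the relabeling described in section 4) and uses `BTreeSet` rows for clarity instead of the CSR layout. `build_chordal`, `blowup_ratio`, and `NO_PARENT` are illustrative names, not existing `ruvector-mincut` APIs.

```rust
use std::collections::BTreeSet;

/// Hypothetical sentinel for "no parent" in the elimination tree.
pub const NO_PARENT: u32 = u32::MAX;

/// Symbolic contraction: no weights are involved anywhere.
/// `edges` are undirected navigation-graph edges over ranks `0..n`.
/// Returns the upward chordal adjacency (`up[v]` holds the higher-ranked
/// neighbors of `v` in G+) and the elimination-tree parent array.
pub fn build_chordal(n: usize, edges: &[(u32, u32)]) -> (Vec<BTreeSet<u32>>, Vec<u32>) {
    let mut up: Vec<BTreeSet<u32>> = vec![BTreeSet::new(); n];
    for &(a, b) in edges {
        if a != b {
            up[a.min(b) as usize].insert(a.max(b));
        }
    }

    let mut elim_parent = vec![NO_PARENT; n];
    for v in 0..n {
        // Higher-ranked neighbors of v, in ascending rank order.
        let hi: Vec<u32> = up[v].iter().copied().collect();
        if let Some(&parent) = hi.first() {
            elim_parent[v] = parent; // lowest higher-ranked neighbor
        }
        // Make the higher-ranked neighbors a clique (fill-in / shortcuts).
        for (i, &a) in hi.iter().enumerate() {
            for &b in &hi[i + 1..] {
                up[a as usize].insert(b);
            }
        }
    }
    (up, elim_parent)
}

/// Shortcut-blowup ratio |G+| / |G_nav|, counting undirected edges.
pub fn blowup_ratio(up: &[BTreeSet<u32>], nav_edge_count: usize) -> f64 {
    let plus: usize = up.iter().map(|row| row.len()).sum();
    plus as f64 / nav_edge_count as f64
}
```

On a 4-cycle with ranks equal to ids, contraction adds a single shortcut between ranks 1 and 3, giving a ratio of 1.25. A star whose center is eliminated first instead turns its leaves into a clique, which is the blowup pattern a poor order produces.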
+ +```rust +pub struct CchTopology { // metric-INDEPENDENT, built once + n: u32, + up_offsets: Vec<u32>, // CSR, len n+1 (upward chordal graph) + up_targets: Vec<u32>, // ranks ascending within a row → SIMD-friendly + elim_parent: Vec<u32>, // rank-indexed; parent[root] = SENTINEL + dfs_order: Vec<u32>, // elim tree in post-order (contiguous ancestors) + sep_tree: SepTree, // separator decomposition (cells = vertex ranges) +} +``` + +### 5 — Incremental updates + +New memories arrive continuously; rebuilding topology per insert is infeasible. +Use `ruvector-mincut::{jtree (O(n^ε) updates), linkcut}` for incremental contraction +on insert; re-order in periodic batches. Weight-only changes never touch topology — +they are pure re-customization ([ADR-198]). + +## Consequences + +**Positive.** Reuses `ruvector-mincut`/`jtree`/`linkcut` wholesale. The expander risk +is contained by construction (sparse backbone) and measured in [ADR-199]. + +**Negative.** +- Ordering is the heavy, superlinear step; only justified because it is amortized. +- Quality of separators (hence everything downstream) depends on the balanced-cut + heuristics in `ruvector-mincut` — must be validated on real data, not assumed. +- Bounding contraction degree to cap fill-in trades exactness for blowup control and + must be recall-tested. + +**Neutral.** The hyperbolic vs Euclidean backbone choice is deferred to measurement. + +## Decision drivers / metrics gating success + +The gating metrics are the **shortcut-blowup ratio** `|G+| / |G_nav|` and the +**separator-size distribution**, both reported in [ADR-199]. If blowup is small and +separators are sublinear, proceed; if not, fall back to the hyperbolic backbone, then +to GNN-learned ordering (ADR-196 extension E1), before abandoning the approach. + +## Alternatives considered + +- **Full dense kNN graph as backbone.** Rejected — expander-like, the worst case. 
+- **METIS/standard ND library.** Viable, but `ruvector-mincut` already implements + dynamic, sparsifier-backed balanced cuts with incremental updates — reusing it keeps + the dynamic-insert story coherent. diff --git a/docs/adr/ADR-198-customizable-metric-layer-self-learning.md b/docs/adr/ADR-198-customizable-metric-layer-self-learning.md new file mode 100644 index 0000000000..0bf4e900eb --- /dev/null +++ b/docs/adr/ADR-198-customizable-metric-layer-self-learning.md @@ -0,0 +1,131 @@ +--- +adr: 198 +title: "Customizable Metric Layer for Self-Learning Retrieval (CCH Customization ↔ GNN Loop)" +status: proposed +date: 2026-06-04 +authors: [ofershaal, claude-flow] +related: [ADR-196, ADR-197, ADR-199] +tags: [ruvector, cch, customization, gnn, self-learning, ewc, multi-metric, solver, bmssp, sparsifier] +--- + +# ADR-198 — Customizable Metric Layer for Self-Learning Retrieval + +## Status + +**Proposed.** Implements Phase 2 of [ADR-196] on the topology from [ADR-197]. +Validated by [ADR-199]. + +## Context + +RuVector is a *self-learning* memory system: its GNN continuously re-estimates +relevance from feedback (`ruvector-gnn`, with `ewc` for catastrophic-forgetting +control). In a flat-index world, changing the relevance metric means **rebuilding the +index** — prohibitively expensive to do on every learning step. + +CCH's defining feature is the separation of **metric-independent topology** (order + +shortcut set, [ADR-197]) from **metric-dependent weights** (customization). A new +metric is absorbed by re-running customization only — seconds on continental graphs, +topology untouched. + +**The key mapping:** RuVector's self-learning loop *is a stream of new metrics*. The +GNN plays the role that live traffic plays in road routing; CCH customization is the +engine that re-absorbs it without rebuilding. This is the single strongest argument +for the whole SepRAG program. 
+ +## Decision + +### 1 — Edge cost is supplied by the GNN; the metric is non-negative and additive + +``` +w(u, v) = f_θ(h_u, h_v, edge_feats) # GNN edge head, >= 0 +``` + +Cost definitions used (all additive along paths, valid for triangle relaxation): + +- **Manifold-semantic:** `1 - cos(u,v)` (or angular `sqrt(2 - 2cos)`, a true metric) + defined only on `G_nav` edges → path cost follows the data manifold's geodesics, not + flat cosine across empty embedding space. +- **Relational:** `-log strength(e)` on KG edges → multiplicative confidence becomes + additive path cost (max-product ↔ shortest path; trust decays along a chain). +- **Learned:** the GNN edge head above — the metric that customization re-absorbs on + every `θ` update. + +Cosine *similarity* itself is never used as a path cost (not additive, not a metric). + +### 2 — Customization as a sparse triangular sweep (reuse the solver) + +``` +fn customize(G_plus, order, w_init) -> w: # weights over all G_plus edges + w = w_init # orig edges = metric; shortcuts = +inf + for level in elim_tree_levels_bottom_up: # PARALLEL within a level + for v in level: + for (u, x) in lower_triangles(v): # u,x = higher-ranked nbrs of v + w[u,x] = min(w[u,x], w[u,v] + w[v,x]) + return w # re-run ONLY this on metric change +``` + +This is structurally a DP sweep over a chordal graph by elimination-tree level — +the same shape as `ruvector-solver`'s `bmssp` multigrid V-cycle and `forward_push` / +`backward_push`. **Decision: implement customization as a specialization of +`ruvector-solver`**, not a fresh kernel; vectorize triangle relaxations with +`ruvector-solver::simd`. Effective-resistance importance from `ruvector-sparsifier` +can prioritize which shortcuts to refresh first under a time budget. 
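A sequential sketch of the triangle sweep described above may help make the data flow concrete. It processes vertices in ascending rank, which is the serial equivalent of the bottom-up level order. It stores weights in a `HashMap` keyed by `(low_rank, high_rank)` purely for readability, and it is not a specialization of `ruvector-solver`. The function names are illustrative assumptions.

```rust
use std::collections::HashMap;

/// Weight of the undirected G+ edge {a, b}; missing edges count as +inf.
fn weight(w: &HashMap<(u32, u32), f64>, a: u32, b: u32) -> f64 {
    w.get(&(a.min(b), a.max(b))).copied().unwrap_or(f64::INFINITY)
}

/// Basic customization over a fixed chordal topology.
/// `up[v]` lists the higher-ranked neighbors of `v` in G+, ascending.
/// `w` starts with the metric on original edges; shortcuts may be absent
/// (treated as +inf). On return, every upward edge holds its customized weight.
pub fn customize(up: &[Vec<u32>], w: &mut HashMap<(u32, u32), f64>) {
    for (v, hi) in up.iter().enumerate() {
        let v = v as u32;
        // Every pair (u, x) of higher-ranked neighbors forms a lower triangle
        // with apex v: relax edge {u, x} through v.
        for (i, &u) in hi.iter().enumerate() {
            for &x in &hi[i + 1..] {
                let via_v = weight(w, v, u) + weight(w, v, x);
                let cur = w.entry((u.min(x), u.max(x))).or_insert(f64::INFINITY);
                if via_v < *cur {
                    *cur = via_v;
                }
            }
        }
    }
}
```

Ascending rank order is sufficient because every lower triangle of an edge has an apex ranked below both endpoints, so the two edges used in a relaxation are already final when they are read. To absorb a new metric, the caller rebuilds `w` from the new edge costs and calls `customize` again; `up` is never modified.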
+ +### 3 — Multi-metric quiver: one topology, many cheap customizations + +```rust +pub struct CchMetric { // one per metric; many coexist cheaply + up_weight: Vec<f32>, // parallel to CchTopology.up_targets + metric_id: MetricId, // semantic | recency | trust | task | blend +} +``` + +Maintain a bank: `w_semantic`, `w_recency`, `w_trust`, `w_task`, and on-the-fly +blends `Σ λ_i w_i`. A query selects or blends a **lens** at near-zero marginal cost — +the retrieval analogue of CCH's car/truck/bike profiles, and infeasible with +per-metric HNSW rebuilds. Natural fit for multi-tenant and task-conditioned retrieval. + +### 4 — Update cadence and forgetting + +- **Weight change (frequent):** re-customize only. No topology touch. +- **Topology change (batched):** incremental insert via `jtree`/`linkcut` ([ADR-197]); + periodic re-order. +- **Catastrophic forgetting:** `ruvector-gnn::ewc` constrains `θ` drift so the metric + evolves without collapsing prior structure; customization then reflects the + EWC-regularized metric. + +### 5 — Rerank with cut-as-attention + +Survivors are reranked by `ruvector-attn-mincut`, reusing the *same separator cuts* +as attention masks (only attend across separators "open" for this query). Keeps +retrieval and attention on one shared structure. + +## Consequences + +**Positive.** +- Relevance updates cost a customization pass, not an index rebuild — this is the + quantifiable self-learning benefit ([ADR-199] measures customization time vs HNSW + rebuild and "adaptation lag" in queries). +- Multi-lens retrieval is essentially free. +- Reuses `ruvector-solver`, `ruvector-sparsifier`, `ruvector-gnn`, `attn-mincut`. + +**Negative.** +- Requires a non-negative additive metric; rich learned scores must be mapped into + that form (mitigated by the cost definitions above). +- Customization parallelism is bounded by elimination-tree level structure (deep + narrow trees parallelize poorly) — another reason separator quality ([ADR-197]) + matters. 
+- A metric that violates the triangle structure (e.g. negative learned costs) breaks + relaxation; the GNN edge head must be constrained to `>= 0`. + +**Neutral.** The quiver's memory cost is `O(#metrics × |G+|)` floats — cheap, but +capped by a configured lens budget. + +## Alternatives considered + +- **Bake the metric into the order (plain CH).** Rejected — defeats the entire + self-learning premise. +- **Rebuild HNSW on metric change.** Rejected — the cost this ADR exists to avoid; + retained only as the [ADR-199] baseline to quantify the win. +- **Recompute all shortcut weights from scratch each query.** Rejected — customization + amortizes this across queries between metric updates. diff --git a/docs/adr/ADR-199-public-corpus-benchmark-harness.md b/docs/adr/ADR-199-public-corpus-benchmark-harness.md new file mode 100644 index 0000000000..208cd31ee4 --- /dev/null +++ b/docs/adr/ADR-199-public-corpus-benchmark-harness.md @@ -0,0 +1,123 @@ +--- +adr: 199 +title: "Public-Corpus Benchmark & Evaluation Harness for SepRAG" +status: proposed +date: 2026-06-04 +authors: [ofershaal, claude-flow] +related: [ADR-194, ADR-195, ADR-196, ADR-197, ADR-198] +tags: [ruvector, benchmark, evaluation, wikipedia, wikidata, beir, hotpotqa, ogb, recall, graph-rag] +--- + +# ADR-199 — Public-Corpus Benchmark & Evaluation Harness for SepRAG + +## Status + +**Proposed.** This is the experimental backbone for [ADR-196]–[ADR-198]. It exists to +*answer empirically* the design questions the other ADRs leave open, using large public +datasets rather than synthetic graphs or a priori reasoning. + +## Context + +SepRAG ([ADR-196]) rests on one decisive, unprovable-by-argument assumption: that a +real hybrid memory graph has small enough separators that contraction does not blow up. +The only way to settle it is to load a large, *natively hybrid* public corpus (text + +explicit graph) into RuVector and measure. 
The benchmark also reveals the **crossover** +between flat ANN and SepRAG by testing two query shapes, so we learn *where* each wins +instead of cherry-picking. + +This ADR also records the standing guidance that "huge" is a trap for the superlinear +Phase-1 ordering ([ADR-197]): start at 10⁵–10⁶ nodes, prove the win, then scale. + +## Decision + +### 1 — Corpus (natively hybrid backbone) + +Primary: **Wikipedia + Wikidata + the Wikipedia hyperlink graph**, all aligned on the +same entities — text to embed, a real KG (`RelationType` edges), and a real link graph +in one corpus. Licensing is clean (Wikipedia CC-BY-SA; Wikidata CC0). + +**Use precomputed embeddings** (e.g. Cohere `wikipedia-22-12` on HuggingFace, or BEIR's) +so embedding cost does not dominate and we are not bottlenecked on the embedder work in +[ADR-194]/[ADR-195]. Hash-embedding fallback is explicitly disallowed for scored runs +(per the silent-fallback lesson in [ADR-194]). + +### 2 — Query workloads (test the crossover with two shapes) + +- **Pure-semantic IR** — BEIR subsets (NQ, FEVER, MS MARCO-style passage retrieval). + Ground truth: qrels. **Expectation: HNSW/DiskANN wins or ties.** This is the parity / + no-regression guard for SepRAG. +- **Multi-hop / relational** — HotpotQA, 2WikiMultiHopQA, MuSiQue. Ground truth: + supporting-fact passages. **Expectation: SepRAG wins** on supporting-passage coverage + and topic coherence — the queries flat post-filtering handles poorly. + +### 3 — Graph-structure datasets (separator quality / blowup, [ADR-197]) + +OGB graphs with node features as embeddings: **ogbn-arxiv (~170K, start here)**, +ogbn-products, and a capped ogbn-papers100M subset later. Real citation / co-purchase +structure with standard splits — ideal for measuring separator size and fill-in on +graphs of known character. 
+ +### 4 — Metrics (report all together; no metric in isolation) + +| Metric | Why it matters | +|---|---| +| **Shortcut-blowup ratio `\|G+\|/\|G_nav\|`** | The gating viability metric ([ADR-197]). Decides go/no-go. | +| **Separator-size distribution** | Diagnoses "road-like vs expander-like" on real data. | +| Recall@k vs **hybrid-distance oracle** | Correctness of SepRAG's graph-distance k-NN. | +| Recall@k / nDCG vs **qrels** | End-task retrieval quality vs baselines. | +| Latency p50 / p95 | The headline performance claim. | +| Search-space size (vertices touched) | Mechanistic proof of region pruning. | +| **Customization time vs HNSW rebuild** | Quantifies the self-learning payoff ([ADR-198]). | +| **Adaptation lag** (queries until new feedback reflected) | Self-learning responsiveness. | +| Multi-hop supporting-passage coverage | Where SepRAG is expected to win. | + +Baselines for every run: plain HNSW top-k, DiskANN beam search, and (for constrained +queries) HNSW + post-filtering. + +### 5 — Milestones (incremental, start small) + +``` +M0 Toy validation SBM + grid graphs; prove recall == brute force, + search space shrinks with separator size. (~2 wks) +M1 SepRAG MVP ogbn-arxiv + a Wikipedia category subgraph, + static metric. MEASURE BLOWUP RATIO (go/no-go). +M2 Customization loop Wire GNN edge head → customize; time re-customize + vs HNSW rebuild; measure adaptation lag. [ADR-198] +M3 Full hybrid HNSW entry → SepRAG → filters as subtree preds → + attn-mincut rerank; multi-hop QA eval. +M4 Integration ruvector-postgres fn seprag_knn(query,k,lens,filter); + ruvector-node bindings; topology in ruvector-snapshot. +``` + +### 6 — Harness location + +Extend the existing `ruvector-bench` crate and `benches/` with a SepRAG suite. Every +run emits the blowup ratio and separator-size distribution alongside latency/recall, so +the expander risk is always visible. 
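Two of the metrics in the table above are simple enough to pin down in code, which avoids ambiguity when runs are compared. The sketch below assumes results and oracle answers are lists of `u32` ids and that latency percentiles use the nearest-rank definition. Both function names are hypothetical and not part of `ruvector-bench`.

```rust
use std::collections::HashSet;

/// Recall@k: fraction of the oracle's top-k ids that appear in the
/// system's top-k. Returns 1.0 when the oracle has no results.
pub fn recall_at_k(retrieved: &[u32], oracle: &[u32], k: usize) -> f64 {
    let truth: HashSet<u32> = oracle.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 1.0;
    }
    let got: HashSet<u32> = retrieved.iter().take(k).copied().collect();
    got.intersection(&truth).count() as f64 / truth.len() as f64
}

/// Nearest-rank percentile (p in 0..=100) of latency samples.
/// Sorts the slice in place; returns None for an empty slice.
pub fn percentile(samples: &mut [f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_by(|a, b| a.total_cmp(b));
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    Some(samples[rank.clamp(1, samples.len()) - 1])
}
```

The same `recall_at_k` serves both recall rows of the table: pass the hybrid-distance oracle's ids for the correctness check, or the qrels-relevant ids for end-task quality.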
+ +## Consequences + +**Positive.** +- The benchmark *answers* the open questions from [ADR-196] (query shape, sparsity, + metric cadence, exact-vs-approximate recall) instead of requiring up-front answers. +- Two query shapes reveal the HNSW↔SepRAG crossover rather than a biased single number. +- M1's blowup measurement is an early, cheap go/no-go gate before heavy investment. + +**Negative.** +- Wikipedia/Wikidata ingestion + KG alignment is non-trivial data engineering. +- Building hybrid-distance ground-truth oracles is expensive (brute-force graph + distance) — budget for it, restrict to sampled query sets. +- Scaling the Phase-1 ordering beyond ~10⁶ nodes may need the GNN-learned ordering + extension (ADR-196 E1) before ogbn-papers100M is feasible. + +**Neutral.** +- All datasets are public and appropriately licensed; no secrets or PII involved. + +## Alternatives considered + +- **Synthetic graphs only.** Rejected as the *primary* corpus — they cannot settle the + expander question for real embeddings; kept only for M0 sanity. +- **A single dataset.** Rejected — would hide the crossover; the two-shape design is + the point. +- **Embed everything ourselves.** Rejected for v1 — precomputed embeddings de-risk the + experiment and isolate retrieval performance from embedder throughput ([ADR-194]). 
diff --git a/docs/plans/seprag-cch-retrieval/M0-correctness-gate.md b/docs/plans/seprag-cch-retrieval/M0-correctness-gate.md new file mode 100644 index 0000000000..3f054f311d --- /dev/null +++ b/docs/plans/seprag-cch-retrieval/M0-correctness-gate.md @@ -0,0 +1,75 @@ +# M0 — Correctness Gate (toy graphs) + +**Status:** Planned · **Est:** 2–3 days · **Depends on:** none · +**Feeds:** [M1](M1-blowup-measurement.md) (reuses this exact implementation) +**ADRs:** [196](../../adr/ADR-196-seprag-cch-hierarchical-retrieval.md), +[197](../../adr/ADR-197-navigation-graph-metric-independent-ordering.md) + +## Purpose + +Prove the SepRAG algorithm is **correct** before pointing it at real data. M0 is a +*test harness*, not a milestone to polish. Its sole job: make M1's blowup go/no-go +signal trustworthy by eliminating "is it a bug or is it the data?" ambiguity. + +**Critical caveat:** M0 success does **not** validate the thesis — synthetic SBM +graphs have small separators by construction, so M0 will pass almost regardless of +real-world viability. M0 validates *code*, M1 validates *the idea*. + +## Scope + +Build the core, metric-independent SepRAG pipeline + static query path on synthetic +graphs only. No embeddings, no real data, no GNN, no customization loop. + +## Where it lives + +New module `crates/ruvector-mincut/src/cch/` (reuses in-crate separator machinery), or +a thin new crate `ruvector-seprag` depending on `ruvector-mincut`. Decision recorded in +M0 task 1. Tests in-crate; toy benchmark wired into `ruvector-bench`. + +## Tasks + +1. **Module scaffold** — decide `ruvector-mincut::cch` vs new `ruvector-seprag` crate; + define `CchTopology`, `SepTree`, `SepNode`, `CchMetric` structs (per ADR-197 §4). +2. **Nested-dissection order** — adapt `ruvector-mincut::jtree`/`expander`/`cluster` to + emit a vertex order (separators ranked highest) + separator decomposition tree. +3. **Symbolic contraction** — build chordal `G+` + `elim_parent` (fill-in), metric-free. +4. 
**Cache-friendly layout** — relabel vertices to rank order; upward CSR with ascending + rows; elimination tree in DFS post-order. +5. **Static customization** — single fixed metric (edge weight = graph edge cost); the + bottom-up triangle sweep producing shortcut weights. +6. **Upward search + separator-tree k-NN** — the branch-and-bound with lower-bound + pruning (ADR-196 Phase-3 algorithm). +7. **Brute-force oracle** — exhaustive graph-distance k-NN (Dijkstra/BFS per query) for + exact comparison. +8. **Toy benchmark** — wire into `ruvector-bench`; emit search-space size + (for sanity) + blowup ratio even on toy graphs. + +## Test graphs + +- **Stochastic Block Model** (clear communities → clean separators) — primary. +- **2-D / 3-D grid** (known √n separators) — separator-size sanity. +- **Path / cycle / clique** — degenerate edge cases (clique = worst-case fill-in). +- Sizes 100 → 10,000 vertices (small enough for exhaustive oracle). + +## Exit criteria (all must hold) + +- [ ] SepRAG k-NN output **exactly equals** the brute-force oracle for k ∈ {1,5,10,50} + across all toy graphs and ≥100 random queries each (this is the gate). +- [ ] Search-space size (vertices touched) **shrinks** as separator size decreases — + demonstrated by varying SBM inter-block density. +- [ ] Pruning is sound: no pruned subtree ever contained a true top-k result (assert in + a debug "no-prune" oracle mode). +- [ ] Determinism: identical results across runs (fixed seeds; tie-break rule defined). +- [ ] `cargo test` + `cargo clippy` clean; the module builds in the workspace. + +## Explicit non-goals + +Real data · embeddings · GNN metric · dynamic updates · multi-metric quiver · +performance tuning · WASM. All deferred to M1+. + +## Risks + +- **Over-investment** — the main risk. Cap at 2–3 days; if the implementation is + fighting you, that itself is signal to simplify before M1. 
+- Separator heuristic quality in `ruvector-mincut` may need tuning even on toy graphs; + if so, note it — it foreshadows M1 difficulty. diff --git a/docs/plans/seprag-cch-retrieval/M1-blowup-measurement.md b/docs/plans/seprag-cch-retrieval/M1-blowup-measurement.md new file mode 100644 index 0000000000..bcd59bd4a3 --- /dev/null +++ b/docs/plans/seprag-cch-retrieval/M1-blowup-measurement.md @@ -0,0 +1,79 @@ +# M1 — Blowup Measurement on Real Data (the decisive go/no-go) + +**Status:** Planned · **Est:** 1–2 weeks · **Depends on:** [M0](M0-correctness-gate.md) +(reuses its implementation) · **Feeds:** [M2](M2-customization-loop.md) +**ADRs:** [196](../../adr/ADR-196-seprag-cch-hierarchical-retrieval.md), +[197](../../adr/ADR-197-navigation-graph-metric-independent-ordering.md), +[199](../../adr/ADR-199-public-corpus-benchmark-harness.md) + +## Purpose + +Answer the one question that can kill SepRAG: **do real embedding/graph data have small +enough separators that contraction does not blow up?** Measured, not argued. Because M1 +runs the M0-validated code, a bad result means the *data* is expander-like — a clean, +trustworthy go/no-go signal. + +## Dataset — ogbn-arxiv first (deliberately) + +Start on **ogbn-arxiv** (~170K nodes, ~1.2M citation edges, 128-d node features): +- Ingest is near-free — ships as a downloadable graph with features and standard splits. +- Real citation structure of known character (good separator-quality probe). +- Avoids Wikipedia/Wikidata KG alignment, which is real data engineering deferred to M3. + +Node features serve as embeddings; the citation graph is the relational backbone. + +## Navigation graph `G_nav` (ADR-197) + +Build `G_nav` as: citation edges ∪ a **degree-bounded / α-pruned kNN graph** over node +features (RNG-style; reuse `ruvector-diskann` Vamana pruning). **Do not** use the dense +kNN graph — pruning is what creates separators. Static metric = `1 − cos` on `G_nav` +edges (no GNN yet). + +## Tasks + +1. 
**Ingest** ogbn-arxiv → `ruvector-graph` (nodes + citation edges) + features into a + vector store (`ruvector-diskann` / `ruvector-rabitq` quantized). +2. **Build `G_nav`** — citation ∪ α-pruned kNN; record degree distribution. +3. **Phase 1** — nested-dissection order + symbolic contraction (M0 code, real scale). +4. **Phase 2** — static customization (single `1 − cos` metric). +5. **Instrument blowup** — `|G+|`, `|G_nav|`, ratio; separator-size distribution by + tree level; elimination-tree height. +6. **Query path** — separator-tree k-NN; build a **sampled** hybrid-distance oracle + (brute-force is expensive at 170K — sample ~1–5K queries). +7. **Baselines** — plain HNSW top-k and DiskANN beam search on the same vectors. +8. **Report** — into `ruvector-bench`, emitting the full metric table below. + +## Metrics (report together — ADR-199 §4) + +| Metric | Decision role | +|---|---| +| **Shortcut-blowup ratio `\|G+\|/\|G_nav\|`** | **Primary gate.** | +| Separator-size distribution; elim-tree height | Diagnoses road-like vs expander-like. | +| Recall@k vs sampled hybrid-distance oracle | SepRAG correctness at scale. | +| Latency p50/p95 vs HNSW/DiskANN | Performance claim. | +| Search-space size (vertices touched) | Mechanistic proof of pruning. | + +## Exit criteria / decision + +**GO** if: +- [ ] Blowup ratio is "small" — target **≤ ~3–5×** `|G_nav|` (tune; >10× is a red flag). +- [ ] Separator sizes are clearly **sublinear** (not Θ(n)); elim-tree height manageable. +- [ ] Recall@k vs the sampled oracle is high (≥0.95) — confirms M0 correctness holds at + scale on real (non-toy) structure. +- [ ] Search space is a small fraction of n (region pruning demonstrably works). + +**NO-GO / fallback** (descend the ladder in [README](README.md)): hyperbolic backbone → +GNN-learned order → bounded-degree contraction → abandon. Record which rung is reached +and the blowup numbers that triggered it. 
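The decision rule above can be read as a small classifier over the M1 report. The sketch below is one hedged encoding of it: the 5× and 10× blowup bounds and the 0.95 recall floor come from the criteria above, while `MAX_SEP_FRACTION`, `MAX_TOUCHED_FRACTION`, the `M1Report` struct, and the `Verdict` enum are placeholders invented for illustration and would need tuning against real measurements.

```rust
/// Outcome of the M1 gate. Names are hypothetical, not part of any crate.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// All GO criteria hold.
    Go,
    /// Blowup sits between the target and the red-flag bound: tune separators first.
    Tune,
    /// Blowup or separator size is out of range: descend the fallback ladder.
    NoGo,
    /// Recall is low: suspect a scale-dependent bug before blaming the data.
    SuspectBug,
}

/// The handful of numbers the M1 report is assumed to provide.
struct M1Report {
    blowup: f64,           // |G+| / |G_nav|
    max_separator: usize,  // largest separator in the tree
    n: usize,              // vertex count of G_nav
    recall_at_k: f64,      // vs the sampled oracle
    touched_fraction: f64, // vertices touched per query / n
}

// ASSUMPTION: illustrative thresholds only; the plan does not fix these values.
const MAX_SEP_FRACTION: f64 = 0.05;
const MAX_TOUCHED_FRACTION: f64 = 0.10;

fn decide(r: &M1Report) -> Verdict {
    // Low recall means the measurement itself is not trustworthy yet.
    if r.recall_at_k < 0.95 {
        return Verdict::SuspectBug;
    }
    let sep_fraction = r.max_separator as f64 / r.n.max(1) as f64;
    if r.blowup > 10.0 || sep_fraction > MAX_SEP_FRACTION {
        return Verdict::NoGo;
    }
    if r.blowup <= 5.0 && r.touched_fraction <= MAX_TOUCHED_FRACTION {
        return Verdict::Go;
    }
    Verdict::Tune
}
```

Checking recall first mirrors the confounding risk noted for this milestone: a blowup number is only meaningful once the implementation is known to return correct results at scale.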
+ +**Either outcome is a successful M1** — the point is a trustworthy signal, not a +predetermined yes. + +## Risks + +- **Confounding** — mitigated by M0 (validated code) and the sampled oracle recall check; + if recall is low here, suspect a scale-dependent bug before blaming the data. +- **Oracle cost** — brute-force graph distance at 170K is heavy; restrict to a sampled, + fixed query set. +- **Separator heuristic at scale** — `ruvector-mincut` balanced-cut quality directly + drives blowup; budget time to tune it before declaring NO-GO. diff --git a/docs/plans/seprag-cch-retrieval/M2-customization-loop.md b/docs/plans/seprag-cch-retrieval/M2-customization-loop.md new file mode 100644 index 0000000000..b803f6b502 --- /dev/null +++ b/docs/plans/seprag-cch-retrieval/M2-customization-loop.md @@ -0,0 +1,45 @@ +# M2 — Customization Loop & Self-Learning Payoff + +**Status:** Planned (gated on M1 = GO) · **Est:** 1–2 weeks · +**Depends on:** [M1](M1-blowup-measurement.md) · **Feeds:** [M3](M3-full-hybrid.md) +**ADRs:** [198](../../adr/ADR-198-customizable-metric-layer-self-learning.md) + +## Purpose + +Demonstrate the core self-learning thesis: a relevance-metric change costs a +**customization pass (seconds)**, not an index rebuild. Quantify the gap vs HNSW. + +## Tasks + +1. **GNN edge head** — `w(u,v) = f_θ(h_u, h_v, edge_feats)`, constrained `≥ 0` + (`ruvector-gnn`). Forward pass over `G+` edges → metric vector. +2. **Customization as solver specialization** — implement the bottom-up triangle sweep + over elimination-tree levels as a `ruvector-solver` kernel (`bmssp`/push family), + vectorized via `ruvector-solver::simd`. +3. **Re-customize-on-update** — drive `θ` updates (synthetic feedback first), re-run + customization, time it. Topology untouched. +4. **Multi-metric quiver** — maintain `w_semantic`, `w_recency`, `w_trust`, `w_task` + + on-the-fly blends `Σ λ_i w_i`; per-query lens selection (ADR-198 §3). +5. 
**EWC guard** — apply `ruvector-gnn::ewc` so metric drift doesn't collapse structure. + +## Metrics + +| Metric | Target | +|---|---| +| Customization time vs HNSW rebuild (same metric change) | orders-of-magnitude faster | +| Adaptation lag (queries until new feedback reflected in results) | low / bounded | +| Recall stability across re-weights | no collapse | +| Quiver memory cost `O(#metrics × \|G+\|)` | within configured lens budget | + +## Exit criteria + +- [ ] Re-customization is measurably ≪ HNSW rebuild for an equivalent metric change. +- [ ] A lens switch changes ranking correctly at near-zero marginal query cost. +- [ ] Recall does not degrade across a sequence of metric updates (EWC working). + +## Risks + +- Non-negativity / additivity constraint on the learned metric may limit expressiveness + — mitigated by the cost-mapping forms in ADR-198 §1. +- Deep narrow elimination trees parallelize customization poorly — foreshadowed by M1 + separator quality. diff --git a/docs/plans/seprag-cch-retrieval/M3-full-hybrid.md b/docs/plans/seprag-cch-retrieval/M3-full-hybrid.md new file mode 100644 index 0000000000..726e3b3098 --- /dev/null +++ b/docs/plans/seprag-cch-retrieval/M3-full-hybrid.md @@ -0,0 +1,56 @@ +# M3 — Full Hybrid Pipeline & Multi-Hop Evaluation + +**Status:** Planned (gated on M2) · **Est:** 2–3 weeks · +**Depends on:** [M2](M2-customization-loop.md) · **Feeds:** [M4](M4-integration.md) +**ADRs:** [196](../../adr/ADR-196-seprag-cch-hierarchical-retrieval.md), +[199](../../adr/ADR-199-public-corpus-benchmark-harness.md) + +## Purpose + +Assemble the end-to-end pipeline and prove the **crossover**: SepRAG wins on +constrained / multi-hop queries; parity (no regression) on pure-semantic IR. 
+ +## Pipeline (ADR-196) + +``` +query ─► HNSW/DiskANN top-m (entry leaves) + ─► SepRAG separator-tree branch & bound (metric from M2, filters as subtree predicates) + ─► ruvector-attn-mincut rerank (cut-as-attention) + ─► top-k + elimination-tree path (provenance) +``` + +## Tasks + +1. **Entry-point bridge** — HNSW/DiskANN result → seed vertices in `G_nav`. +2. **Hierarchical filtering** — query constraints (tenant, recency, relation type, + entity reachability) as `SepNode.may_satisfy(filter)` subtree predicates; combine + with semantic LB pruning in one traversal. +3. **Rerank** — `ruvector-attn-mincut`, reusing separator cuts as attention masks. +4. **Provenance** — surface the elimination-tree path as a retrieval explanation. +5. **Upgrade corpus** — ingest Wikipedia + Wikidata + hyperlink graph (the deferred + data engineering); use precomputed embeddings (Cohere `wikipedia-22-12` / BEIR). +6. **Evaluation harness** — two query shapes: + - Pure-semantic IR: BEIR (NQ, FEVER, MS MARCO-style). Baseline: HNSW/DiskANN. + - Multi-hop: HotpotQA, 2WikiMultiHopQA, MuSiQue. Ground truth: supporting facts. + +## Metrics + +| Query shape | Metric | Expectation | +|---|---|---| +| Semantic | Recall@k / nDCG vs qrels | parity with HNSW (no regression) | +| Multi-hop | Supporting-passage coverage, topic coherence | SepRAG wins | +| Constrained | Latency + recall vs HNSW + post-filter | SepRAG wins | + +## Exit criteria + +- [ ] No regression vs HNSW on pure-semantic IR. +- [ ] Measurable win on multi-hop supporting-passage coverage and/or constrained-query + latency-at-recall. +- [ ] Provenance path is coherent (retrieved set stays on-topic / crosses gateways + deliberately). + +## Risks + +- Wikipedia/Wikidata ingestion + entity alignment is the heaviest data engineering in + the program — scope it as its own sub-task. +- Multi-hop ground-truth mapping (supporting facts → corpus passages) needs care. 
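Task 2 (hierarchical filtering) combines two subtree tests in one traversal: a filter test and a lower-bound test. The sketch below shows that shape under loud assumptions: filters are modelled as tag bitmasks, each node carries a precomputed admissible lower bound, and item distances are given rather than computed. Only the name `may_satisfy` comes from the plan; the struct layout, `filtered_knn`, and `demo_tree` are hypothetical.

```rust
/// Hypothetical separator-tree node for filtered retrieval.
struct SepNode {
    /// OR of the tag bits of every item anywhere in this subtree.
    tag_mask: u64,
    /// Admissible lower bound on the distance to any item in this subtree.
    lower_bound: f64,
    /// Items stored at this node: (id, tag bits, distance to the query).
    items: Vec<(u32, u64, f64)>,
    /// Indices of child nodes in the tree slice.
    children: Vec<usize>,
}

impl SepNode {
    /// False only if no item below can carry all required tags.
    fn may_satisfy(&self, required: u64) -> bool {
        self.tag_mask & required == required
    }
}

/// Returns the top-k matching items and the number of nodes actually expanded.
fn filtered_knn(tree: &[SepNode], root: usize, required: u64, k: usize) -> (Vec<(u32, f64)>, usize) {
    if k == 0 {
        return (Vec::new(), 0);
    }
    let mut best: Vec<(u32, f64)> = Vec::new();
    let mut expanded = 0;
    let mut stack = vec![root];
    while let Some(i) = stack.pop() {
        let node = &tree[i];
        // Current k-th best distance (infinite until k results are held).
        let kth = if best.len() == k { best[k - 1].1 } else { f64::INFINITY };
        // One check per subtree: filter predicate and semantic lower bound.
        if !node.may_satisfy(required) || node.lower_bound > kth {
            continue;
        }
        expanded += 1;
        for &(id, tags, d) in &node.items {
            if tags & required == required {
                best.push((id, d));
            }
        }
        // Deterministic order: distance ascending, then id ascending.
        best.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
        best.truncate(k);
        stack.extend(node.children.iter().copied());
    }
    (best, expanded)
}

/// Three-node toy tree: a root with two leaf children.
fn demo_tree() -> Vec<SepNode> {
    vec![
        SepNode { tag_mask: 0b11, lower_bound: 0.0, items: vec![(0, 0b01, 1.0)], children: vec![1, 2] },
        SepNode { tag_mask: 0b01, lower_bound: 2.0, items: vec![(1, 0b01, 2.0)], children: vec![] },
        SepNode { tag_mask: 0b10, lower_bound: 0.5, items: vec![(2, 0b10, 0.5)], children: vec![] },
    ]
}
```

On the toy tree, a query requiring tag `0b10` skips the first leaf through the filter test, while a query requiring `0b01` with `k = 1` skips it through the lower bound. Pruning uses a strict `>` so that ties at the k-th distance are never discarded.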
diff --git a/docs/plans/seprag-cch-retrieval/M4-integration.md b/docs/plans/seprag-cch-retrieval/M4-integration.md new file mode 100644 index 0000000000..ef115c5a11 --- /dev/null +++ b/docs/plans/seprag-cch-retrieval/M4-integration.md @@ -0,0 +1,34 @@ +# M4 — Integration (Postgres, Node, Snapshot) + +**Status:** Planned (gated on M3) · **Est:** 1–2 weeks · +**Depends on:** [M3](M3-full-hybrid.md) +**ADRs:** [196](../../adr/ADR-196-seprag-cch-hierarchical-retrieval.md) + +## Purpose + +Expose SepRAG through RuVector's existing surfaces and make topology/customizations +durable, hot-swappable artifacts. + +## Tasks + +1. **Postgres extension** — `ruvector-postgres` function + `seprag_knn(query vector, k int, lens text, filter jsonb) → setof (id, score, path)`. +2. **Node bindings** — `ruvector-node` API mirroring the Postgres signature. +3. **Persistence** — store `CchTopology` via `ruvector-snapshot`; treat each + `CchMetric` (lens) as a hot-swappable artifact loadable without rebuild. +4. **Incremental updates** — wire `jtree`/`linkcut` incremental insert; periodic + batch re-order job; weight-only updates trigger re-customization only. +5. **Operability** — metrics/telemetry (blowup ratio, customization time, query + latency) surfaced for monitoring. + +## Exit criteria + +- [ ] `seprag_knn()` callable end-to-end from SQL and Node, returning ranked results + with provenance. +- [ ] Topology survives restart via snapshot; lenses load without rebuild. +- [ ] Incremental insert + periodic re-order validated under a sustained write load. + +## Risks + +- Topology persistence format must version cleanly (order + shortcut set + sep tree). +- Re-order cadence vs query-time staleness is an operational tuning knob to document. 
diff --git a/docs/plans/seprag-cch-retrieval/README.md b/docs/plans/seprag-cch-retrieval/README.md new file mode 100644 index 0000000000..fa3047aa03 --- /dev/null +++ b/docs/plans/seprag-cch-retrieval/README.md @@ -0,0 +1,52 @@ +# SepRAG — CCH-Inspired Retrieval: Milestone Plans + +Implementation roadmap for the SepRAG retrieval layer described in +[ADR-196](../../adr/ADR-196-seprag-cch-hierarchical-retrieval.md) – +[ADR-199](../../adr/ADR-199-public-corpus-benchmark-harness.md). + +SepRAG adapts Customizable Contraction Hierarchies (CCH) — nested dissection, +balanced separators, contraction shortcuts, elimination trees, and the +separator-tree k-NN algorithm — to hybrid vector + knowledge-graph memory. It +**complements** HNSW/DiskANN (entry-point search) rather than replacing it, +targeting constrained, relational, multi-hop, and re-weightable retrieval. + +## The central bet + +CCH's speedup requires **small balanced separators** (low treewidth). Road networks +have them; dense embedding kNN graphs (expander-like) do not. The entire program +lives or dies on one measured number — the **shortcut-blowup ratio** `|G+| / |G_nav|` +on real data. The milestones are sequenced to surface that number as cheaply and as +early as possible, on a *correctness-validated* implementation (so the signal is not +confounded by bugs). 
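To make the shortcut-blowup ratio concrete, the sketch below runs metric-free symbolic contraction on a small edge list and divides the number of upward arcs in `G+` by the number of original edges. The function names (`fill_in_arc_count`, `blowup_ratio`) and the plain edge-list input are assumptions for this example, not an existing crate API, and the input is assumed to contain no duplicate edges.

```rust
use std::collections::BTreeSet;

/// Count the upward arcs of the chordal supergraph G+ obtained by eliminating
/// vertices in `order` (first entry is eliminated first).
fn fill_in_arc_count(n: usize, edges: &[(usize, usize)], order: &[usize]) -> usize {
    // rank[v] = position of v in the elimination order.
    let mut rank = vec![0usize; n];
    for (r, &v) in order.iter().enumerate() {
        rank[v] = r;
    }
    // Adjacency in rank space.
    let mut nbrs: Vec<BTreeSet<usize>> = vec![BTreeSet::new(); n];
    for &(u, v) in edges {
        if u != v {
            nbrs[rank[u]].insert(rank[v]);
            nbrs[rank[v]].insert(rank[u]);
        }
    }
    let mut arcs = 0;
    for r in 0..n {
        // Higher-ranked neighbours of r become a clique when r is eliminated.
        let hi: Vec<usize> = nbrs[r].iter().copied().filter(|&x| x > r).collect();
        arcs += hi.len();
        for i in 0..hi.len() {
            for j in (i + 1)..hi.len() {
                nbrs[hi[i]].insert(hi[j]);
                nbrs[hi[j]].insert(hi[i]);
            }
        }
    }
    arcs
}

/// Ratio of upward arcs in G+ to original edges.
fn blowup_ratio(n: usize, edges: &[(usize, usize)], order: &[usize]) -> f64 {
    fill_in_arc_count(n, edges, order) as f64 / edges.len().max(1) as f64
}
```

On a 5-vertex star, eliminating the hub first turns the four leaves into a clique (10 arcs, ratio 2.5), while eliminating the leaves first adds no fill-in (4 arcs, ratio 1.0). The same graph can therefore look expander-like or road-like depending only on the order, which is why separator quality drives the measured number.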
+ +## Milestone sequence + +| Plan | Goal | Retires which risk | Gate | +|------|------|--------------------|------| +| [M0](M0-correctness-gate.md) | Separator-tree k-NN correct on toy graphs | Implementation correctness | k-NN == brute-force oracle | +| [M1](M1-blowup-measurement.md) | Blowup ratio on ogbn-arxiv (static metric) | **Research viability (decisive)** | blowup small + separators sublinear → GO | +| [M2](M2-customization-loop.md) | GNN metric → customization; self-learning payoff | Re-weight cost vs rebuild | customize ≪ HNSW rebuild | +| [M3](M3-full-hybrid.md) | HNSW entry + filters + rerank; multi-hop QA | End-task quality / crossover | win on multi-hop, parity on semantic | +| [M4](M4-integration.md) | Postgres fn + node bindings + snapshot | Productionization | `seprag_knn()` callable end-to-end | + +## Key sequencing principle + +M0 is a **thin correctness gate (~2–3 days), not a destination**. Its only job is to +make M1's go/no-go number trustworthy: if blowup looks bad, we must know it is the +*data* (expander-like), not a bug in a fresh implementation. M1 reuses M0's exact +code, pointed at real data. Do **not** over-invest in toy benchmarks. + +## Fallback ladder (if M1 blowup is catastrophic) + +1. Hyperbolic backbone (`ruvector-hyperbolic-hnsw`) — tree-like geometry → small separators. +2. GNN-learned contraction order (ADR-196 extension E1) — learn an order minimizing fill-in. +3. Bounded-degree contraction (cap fill-in, trade exactness; recall-test). +4. Abandon SepRAG for this data class; keep flat ANN. + +## Crate reuse (this is composition, not green-field) + +`ruvector-mincut` (`jtree`, `expander`, `cluster`, `sparsify`, `linkcut`, +`algorithm`) · `ruvector-solver` (`bmssp`, `forward_push`, `backward_push`, `simd`) · +`ruvector-sparsifier` · `ruvector-diskann` · `ruvector-hyperbolic-hnsw` · +`ruvector-rabitq` · `ruvector-gnn` (`ewc`, `graphmae`, `query`) · `ruvector-attn-mincut` · +`ruvector-graph::hybrid` · `ruvector-bench`. 
From f1d215d26b24dd4c40d0609009a024f809abe4a6 Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 02:25:07 -0400 Subject: [PATCH 02/15] =?UTF-8?q?feat(seprag):=20M0=20correctness=20gate?= =?UTF-8?q?=20=E2=80=94=20CCH=20separator-tree=20k-NN=20on=20toy=20graphs?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit New crate ruvector-seprag implementing the SepRAG M0 milestone (docs/plans/seprag-cch-retrieval/M0-correctness-gate.md): - graph: undirected weighted graph + Dijkstra brute-force k-NN oracle - order: nested-dissection ordering via BFS-layer separators + separator tree - contraction: metric-free symbolic contraction -> chordal upward graph + elim tree - customize: bottom-up triangle-sweep shortcut weighting (re-runnable per metric) - query: upward search, exhaustive CCH k-NN, bucket-based branch-and-bound k-NN with admissible early-stop and search-space accounting - gen: deterministic SBM/grid/path/clique generators (SplitMix64) - examples/blowup_report: M0->M1 diagnostic (blowup ratio, elim height, pruning) Tests (8 + doctest, all green): SepRAG k-NN == Dijkstra oracle on SBM/grid/ path/clique; pruned == unpruned (pruning sound); pruning reduces search space; determinism; bounded blowup. cargo clippy clean. Finding: query pruning eliminates 95-100% of scans regardless of structure; blowup/elim-height are dominated by separator quality, and the naive BFS separator degenerates on low-diameter dense graphs (SBM 18.6x) — motivating the ruvector-mincut separator swap planned for M1 (ADR-197). 
--- Cargo.toml | 1 + crates/ruvector-seprag/Cargo.toml | 27 +++ .../ruvector-seprag/examples/blowup_report.rs | 52 +++++ crates/ruvector-seprag/src/contraction.rs | 103 +++++++++ crates/ruvector-seprag/src/customize.rs | 51 ++++ crates/ruvector-seprag/src/gen.rs | 112 +++++++++ crates/ruvector-seprag/src/graph.rs | 126 ++++++++++ crates/ruvector-seprag/src/lib.rs | 110 +++++++++ crates/ruvector-seprag/src/order.rs | 218 ++++++++++++++++++ crates/ruvector-seprag/src/query.rs | 194 ++++++++++++++++ crates/ruvector-seprag/tests/correctness.rs | 174 ++++++++++++++ 11 files changed, 1168 insertions(+) create mode 100644 crates/ruvector-seprag/Cargo.toml create mode 100644 crates/ruvector-seprag/examples/blowup_report.rs create mode 100644 crates/ruvector-seprag/src/contraction.rs create mode 100644 crates/ruvector-seprag/src/customize.rs create mode 100644 crates/ruvector-seprag/src/gen.rs create mode 100644 crates/ruvector-seprag/src/graph.rs create mode 100644 crates/ruvector-seprag/src/lib.rs create mode 100644 crates/ruvector-seprag/src/order.rs create mode 100644 crates/ruvector-seprag/src/query.rs create mode 100644 crates/ruvector-seprag/tests/correctness.rs diff --git a/Cargo.toml b/Cargo.toml index 38128585a2..b9cd67416d 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -115,6 +115,7 @@ members = [ "crates/ruvector-solver", "crates/ruvector-solver-wasm", "crates/ruvector-solver-node", + "crates/ruvector-seprag", "examples/dna", "examples/OSpipe", "crates/ruvector-coherence", diff --git a/crates/ruvector-seprag/Cargo.toml b/crates/ruvector-seprag/Cargo.toml new file mode 100644 index 0000000000..9b871d195d --- /dev/null +++ b/crates/ruvector-seprag/Cargo.toml @@ -0,0 +1,27 @@ +[package] +name = "ruvector-seprag" +version.workspace = true +edition.workspace = true +rust-version.workspace = true +license.workspace = true +authors.workspace = true +repository.workspace = true +description = "SepRAG — CCH-inspired separator-tree retrieval for hybrid vector + graph memory 
(M0 correctness gate). See docs/plans/seprag-cch-retrieval/." +keywords = ["retrieval", "contraction-hierarchies", "nested-dissection", "knn", "graph"] +categories = ["algorithms", "data-structures"] + +[dependencies] +thiserror = { workspace = true } + +[dev-dependencies] +approx = "0.5" + +[lints.rust] +unexpected_cfgs = { level = "allow", priority = -1 } +dead_code = "allow" + +[lints.clippy] +all = { level = "warn", priority = -1 } +correctness = { level = "deny", priority = 0 } +suspicious = { level = "deny", priority = 0 } +needless_range_loop = "allow" diff --git a/crates/ruvector-seprag/examples/blowup_report.rs b/crates/ruvector-seprag/examples/blowup_report.rs new file mode 100644 index 0000000000..ba3a09df4d --- /dev/null +++ b/crates/ruvector-seprag/examples/blowup_report.rs @@ -0,0 +1,52 @@ +//! M0→M1 diagnostic: print the metrics that become M1's go/no-go signal +//! (ADR-199 §4) on synthetic graphs — shortcut-blowup ratio, elimination-tree +//! height, and pruned-vs-unpruned search space. +//! +//! 
Run: `cargo run -p ruvector-seprag --example blowup_report` + +use ruvector_seprag::query::{elim_depth, KnnIndex, QueryStats}; +use ruvector_seprag::{gen, Graph, SepRag}; + +fn report(name: &str, g: Graph) { + let n = g.n; + let m = g.edges().count(); + let pois: Vec<u32> = gen::sample_pois(n, (n / 2).max(1), 1); + let srcs = gen::sample_pois(n, 32.min(n), 2); + + let sr = SepRag::build(g); + let max_depth = (0..n as u32).map(|r| elim_depth(&sr.topo, r)).max().unwrap_or(0); + let idx = KnnIndex::build(&sr.topo, &sr.metric, &pois); + + let (mut pruned, mut unpruned, mut anc_vis, mut anc_prune) = (0usize, 0usize, 0usize, 0usize); + for &src in &srcs { + let mut sp = QueryStats::default(); + let _ = idx.knn(src, 10, true, &mut sp); + let mut su = QueryStats::default(); + let _ = idx.knn(src, 10, false, &mut su); + pruned += sp.bucket_entries_scanned; + unpruned += su.bucket_entries_scanned; + anc_vis += sp.ancestors_visited; + anc_prune += sp.ancestors_pruned; + } + let q = srcs.len().max(1); + println!( + "{name:<14} n={n:<5} m={m:<6} blowup={:>5.2}x elim_h={max_depth:<4} \ + scans/q: pruned={:<5} unpruned={:<5} ({:.0}% saved) anc_vis/q={} pruned/q={}", + sr.blowup_ratio(), + pruned / q, + unpruned / q, + 100.0 * (1.0 - pruned as f64 / unpruned.max(1) as f64), + anc_vis / q, + anc_prune / q, + ); +} + +fn main() { + println!("SepRAG M0 diagnostic — synthetic graphs (lower blowup + more pruning = more road-like)\n"); + report("grid-20x20", gen::grid(20, 20, 1)); + report("grid-40x40", gen::grid(40, 40, 1)); + report("sbm-clean", gen::sbm(8, 50, 0.25, 0.003, 1)); + report("sbm-dense", gen::sbm(8, 50, 0.25, 0.05, 1)); + report("path-1000", gen::path(1000, 1)); + println!("\nNote: synthetic only. 
The real go/no-go is M1 on ogbn-arxiv (ADR-199)."); +} diff --git a/crates/ruvector-seprag/src/contraction.rs b/crates/ruvector-seprag/src/contraction.rs new file mode 100644 index 0000000000..778b95ec59 --- /dev/null +++ b/crates/ruvector-seprag/src/contraction.rs @@ -0,0 +1,103 @@ +//! Symbolic contraction → chordal upward graph + elimination tree (ADR-197 §3). +//! +//! Everything here is in **rank space**: vertices are relabelled to their +//! contraction rank, so `up[r]` lists higher-ranked neighbours in ascending +//! order (cache-friendly, SIMD-amenable per ADR-197 §4). This phase is fully +//! metric-independent — the shortcut *set* depends only on order + topology. + +use crate::graph::{Graph, NodeId}; +use std::collections::BTreeSet; + +pub const NONE: u32 = u32::MAX; + +/// Metric-independent skeleton built once from `(Graph, order)`. +pub struct Topology { + pub n: usize, + /// `rank[orig] = contraction rank`. + pub rank: Vec<u32>, + /// `orig[rank] = original id` (inverse of `rank`). + pub orig: Vec<NodeId>, + /// Upward chordal arcs in rank space: `up[r]` = higher-ranked neighbours of + /// rank `r`, ascending. Includes original edges + fill-in shortcuts. + pub up: Vec<Vec<u32>>, + /// Initial weight of each upward arc, parallel to `up`. `+inf` for shortcuts + /// (filled in by customization); finite for original edges. + pub w0: Vec<Vec<f64>>, + /// Elimination-tree parent (rank space): lowest higher-ranked neighbour. + pub elim_parent: Vec<u32>, +} + +impl Topology { + /// Number of upward arcs (|G+| restricted to upward orientation) — the + /// numerator of the shortcut-blowup ratio in ADR-199. + #[must_use] + pub fn arc_count(&self) -> usize { + self.up.iter().map(Vec::len).sum() + } + + /// Index of arc `r -> hi` within `up[r]` (arcs are sorted, so binary search). + #[inline] + pub fn arc_pos(&self, r: u32, hi: u32) -> Option<usize> { + self.up[r as usize].binary_search(&hi).ok() + } +} + +/// Build the chordal upward graph and elimination tree from a contraction order. 
+#[must_use] +pub fn contract(g: &Graph, order: &[NodeId]) -> Topology { + let n = g.n; + let mut rank = vec![0u32; n]; + for (r, &v) in order.iter().enumerate() { + rank[v as usize] = r as u32; + } + let orig: Vec<NodeId> = order.to_vec(); + + // Original adjacency in rank space, plus original weights keyed by arc. + let mut nbrs: Vec<BTreeSet<u32>> = vec![BTreeSet::new(); n]; + // up-weight lookup during build: orig weight of (min,max) ranks. + let mut orig_w: Vec<std::collections::BTreeMap<u32, f64>> = vec![Default::default(); n]; + for (u, v, w) in g.edges() { + let (ru, rv) = (rank[u as usize], rank[v as usize]); + nbrs[ru as usize].insert(rv); + nbrs[rv as usize].insert(ru); + let (lo, hi) = if ru < rv { (ru, rv) } else { (rv, ru) }; + orig_w[lo as usize].insert(hi, w); + } + + let mut up: Vec<Vec<u32>> = vec![Vec::new(); n]; + let mut elim_parent = vec![NONE; n]; + + // Contract in increasing rank; eliminating r makes its higher neighbours a clique. + for r in 0..n as u32 { + let hi: Vec<u32> = nbrs[r as usize] + .iter() + .copied() + .filter(|&x| x > r) + .collect(); + elim_parent[r as usize] = hi.first().copied().unwrap_or(NONE); + // Fill-in: every pair of higher neighbours becomes adjacent. + for i in 0..hi.len() { + for j in (i + 1)..hi.len() { + let (a, b) = (hi[i], hi[j]); + nbrs[a as usize].insert(b); + nbrs[b as usize].insert(a); + } + } + up[r as usize] = hi; // already ascending (BTreeSet order preserved) + } + + // Initialise arc weights: original edges get their weight, shortcuts +inf. 
+ let mut w0: Vec<Vec<f64>> = up + .iter() + .map(|row| vec![f64::INFINITY; row.len()]) + .collect(); + for r in 0..n { + for (i, &hi) in up[r].iter().enumerate() { + if let Some(&w) = orig_w[r].get(&hi) { + w0[r][i] = w; + } + } + } + + Topology { n, rank, orig, up, w0, elim_parent } +} diff --git a/crates/ruvector-seprag/src/customize.rs b/crates/ruvector-seprag/src/customize.rs new file mode 100644 index 0000000000..cb6a8b35ff --- /dev/null +++ b/crates/ruvector-seprag/src/customize.rs @@ -0,0 +1,51 @@ +//! 
Metric-dependent customization (ADR-198, Phase 2). +//! +//! Computes the weight of every upward arc (original + shortcut) by a bottom-up +//! triangle sweep over the elimination order. Re-run whenever the metric changes +//! — topology is untouched. For each lower vertex `r` and each pair of its upward +//! neighbours `(a, b)`, relax `w(a,b) = min(w(a,b), w(r,a) + w(r,b))`. +//! +//! Processing in increasing rank is correct: arc `(r, a)` is finalised by the +//! time we process `r`, because any improvement to it comes from a triangle with +//! a strictly lower-ranked apex, already processed. + +use crate::contraction::Topology; + +/// Per-arc weights, parallel to `Topology::up`. One `Metric` per relevance lens. +#[derive(Clone, Debug)] +pub struct Metric { + pub w: Vec<Vec<f64>>, +} + +/// Run customization for the metric whose original-edge weights are already in +/// `topo.w0`. (At M2 the initial weights come from a GNN edge head; for M0 they +/// are the graph's own edge weights.) +#[must_use] +pub fn customize(topo: &Topology) -> Metric { + let mut w = topo.w0.clone(); + for r in 0..topo.n as u32 { + let hi = &topo.up[r as usize]; + for i in 0..hi.len() { + let wri = w[r as usize][i]; + if !wri.is_finite() { + continue; + } + for j in (i + 1)..hi.len() { + let wrj = w[r as usize][j]; + if !wrj.is_finite() { + continue; + } + let (a, b) = (hi[i], hi[j]); // a < b, both > r + let cand = wri + wrj; + // Relax arc (a -> b). + if let Some(idx) = topo.arc_pos(a, b) { + let slot = &mut w[a as usize][idx]; + if cand < *slot { + *slot = cand; + } + } + } + } + } + Metric { w } +} diff --git a/crates/ruvector-seprag/src/gen.rs b/crates/ruvector-seprag/src/gen.rs new file mode 100644 index 0000000000..5abc97bf3c --- /dev/null +++ b/crates/ruvector-seprag/src/gen.rs @@ -0,0 +1,112 @@ +//! Deterministic synthetic graph generators for the M0 correctness gate. +//! +//! All randomness uses a seeded SplitMix64 so runs are reproducible (an M0 exit +//! criterion). 
Weights are drawn from a wide continuous range to make shortest +//! paths effectively tie-free, so top-k membership is unambiguous in tests. + +use crate::graph::{Graph, NodeId}; + +/// Minimal deterministic PRNG (SplitMix64). Zero external deps. +pub struct Rng(u64); + +impl Rng { + #[must_use] + pub fn new(seed: u64) -> Self { + Rng(seed) + } + fn next_u64(&mut self) -> u64 { + self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15); + let mut z = self.0; + z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9); + z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB); + z ^ (z >> 31) + } + /// Uniform f64 in `[0, 1)`. + fn next_f64(&mut self) -> f64 { + (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64 + } + /// Edge weight in `[0.5, 1.5)` — positive, wide, tie-free in practice. + fn weight(&mut self) -> f64 { + 0.5 + self.next_f64() + } +} + +/// Stochastic Block Model: `blocks` communities of `per_block` vertices each, +/// dense intra-block (`p_in`), sparse inter-block (`p_out`). Clean separators by +/// construction — the SepRAG happy path. +#[must_use] +pub fn sbm(blocks: usize, per_block: usize, p_in: f64, p_out: f64, seed: u64) -> Graph { + let n = blocks * per_block; + let mut g = Graph::new(n); + let mut rng = Rng::new(seed); + let block_of = |v: usize| v / per_block; + for u in 0..n { + for v in (u + 1)..n { + let p = if block_of(u) == block_of(v) { p_in } else { p_out }; + if rng.next_f64() < p { + g.add_edge(u as NodeId, v as NodeId, rng.weight()); + } + } + } + g +} + +/// `w x h` 4-neighbour grid with random edge weights. Known ~min(w,h) separators. 
+#[must_use] +pub fn grid(w: usize, h: usize, seed: u64) -> Graph { + let mut g = Graph::new(w * h); + let mut rng = Rng::new(seed); + let id = |x: usize, y: usize| (y * w + x) as NodeId; + for y in 0..h { + for x in 0..w { + if x + 1 < w { + g.add_edge(id(x, y), id(x + 1, y), rng.weight()); + } + if y + 1 < h { + g.add_edge(id(x, y), id(x, y + 1), rng.weight()); + } + } + } + g +} + +/// A path graph (degenerate: separators are single vertices, deep elim tree). +#[must_use] +pub fn path(n: usize, seed: u64) -> Graph { + let mut g = Graph::new(n); + let mut rng = Rng::new(seed); + for i in 0..n.saturating_sub(1) { + g.add_edge(i as NodeId, (i + 1) as NodeId, rng.weight()); + } + g +} + +/// A clique (degenerate worst case: full fill-in, no layer separator). +#[must_use] +pub fn clique(n: usize, seed: u64) -> Graph { + let mut g = Graph::new(n); + let mut rng = Rng::new(seed); + for u in 0..n { + for v in (u + 1)..n { + g.add_edge(u as NodeId, v as NodeId, rng.weight()); + } + } + g +} + +/// Deterministically sample `count` distinct POIs from `0..n`. +#[must_use] +pub fn sample_pois(n: usize, count: usize, seed: u64) -> Vec<NodeId> { + let mut rng = Rng::new(seed ^ 0x504F_4953_504F_4953); // "POISPOIS" + let mut chosen = vec![false; n]; + let mut out = Vec::new(); + let target = count.min(n); + while out.len() < target { + let v = (rng.next_u64() as usize) % n; + if !chosen[v] { + chosen[v] = true; + out.push(v as NodeId); + } + } + out +} diff --git a/crates/ruvector-seprag/src/graph.rs b/crates/ruvector-seprag/src/graph.rs new file mode 100644 index 0000000000..66938d0d16 --- /dev/null +++ b/crates/ruvector-seprag/src/graph.rs @@ -0,0 +1,126 @@ +//! Undirected, positively-weighted graph + the brute-force k-NN oracle. +//! +//! The oracle (Dijkstra over the raw graph) is the ground truth that every +//! SepRAG result is validated against in the M0 correctness gate. 
+ +use std::cmp::Ordering; +use std::collections::BinaryHeap; + +pub type NodeId = u32; + +/// Simple adjacency-list graph. Edges are undirected; duplicate edges keep the +/// minimum weight; self-loops are ignored. Weights must be strictly positive +/// (a precondition for additive shortest-path semantics — see ADR-198). +#[derive(Clone, Debug, Default)] +pub struct Graph { + pub n: usize, + pub adj: Vec<Vec<(NodeId, f64)>>, +} + +impl Graph { + #[must_use] + pub fn new(n: usize) -> Self { + Graph { n, adj: vec![Vec::new(); n] } + } + + /// Insert/relax an undirected edge. O(deg) due to the dedup scan — fine at + /// M0 scale; replaced by CSR ingestion at M1. + pub fn add_edge(&mut self, u: NodeId, v: NodeId, w: f64) { + if u == v { + return; + } + debug_assert!(w > 0.0, "edge weights must be strictly positive"); + Self::relax_dir(&mut self.adj[u as usize], v, w); + Self::relax_dir(&mut self.adj[v as usize], u, w); + } + + fn relax_dir(row: &mut Vec<(NodeId, f64)>, to: NodeId, w: f64) { + if let Some(e) = row.iter_mut().find(|(x, _)| *x == to) { + if w < e.1 { + e.1 = w; + } + } else { + row.push((to, w)); + } + } + + /// Iterate canonical undirected edges `(u, v, w)` with `u < v`. + pub fn edges(&self) -> impl Iterator<Item = (NodeId, NodeId, f64)> + '_ { + self.adj.iter().enumerate().flat_map(|(u, row)| { + let u = u as NodeId; + row.iter() + .filter(move |(v, _)| *v > u) + .map(move |&(v, w)| (u, v, w)) + }) + } + + /// Single-source shortest paths from `src` (Dijkstra). `dist[v] = +inf` for + /// unreachable vertices. 
+ #[must_use] + pub fn dijkstra(&self, src: NodeId) -> Vec<f64> { + let mut dist = vec![f64::INFINITY; self.n]; + dist[src as usize] = 0.0; + let mut heap = BinaryHeap::new(); + heap.push(HeapItem { dist: 0.0, node: src }); + while let Some(HeapItem { dist: d, node }) = heap.pop() { + if d > dist[node as usize] { + continue; + } + for &(v, w) in &self.adj[node as usize] { + let nd = d + w; + if nd < dist[v as usize] { + dist[v as usize] = nd; + heap.push(HeapItem { dist: nd, node: v }); + } + } + } + dist + } + + /// Brute-force k nearest POIs from `src` by graph distance. Deterministic + /// tie-break: ascending `(distance, node id)`. This is the oracle. + #[must_use] + pub fn knn_oracle(&self, src: NodeId, pois: &[NodeId], k: usize) -> Vec<(NodeId, f64)> { + let dist = self.dijkstra(src); + let mut cand: Vec<(NodeId, f64)> = pois + .iter() + .map(|&p| (p, dist[p as usize])) + .filter(|(_, d)| d.is_finite()) + .collect(); + cand.sort_by(|a, b| cmp_dist_id(*a, *b)); + cand.truncate(k); + cand + } +} + +/// Canonical ordering for `(node, dist)` results: distance asc, then id asc. +pub fn cmp_dist_id(a: (NodeId, f64), b: (NodeId, f64)) -> Ordering { + a.1.partial_cmp(&b.1) + .unwrap_or(Ordering::Equal) + .then(a.0.cmp(&b.0)) +} + +struct HeapItem { + dist: f64, + node: NodeId, +} +impl PartialEq for HeapItem { + // Defined via `cmp` so equality stays consistent with the `Ord` impl below. + fn eq(&self, o: &Self) -> bool { + self.cmp(o) == Ordering::Equal + } +} +impl Eq for HeapItem {} +impl PartialOrd for HeapItem { + fn partial_cmp(&self, o: &Self) -> Option<Ordering> { + Some(self.cmp(o)) + } +} +impl Ord for HeapItem { + // Reversed: BinaryHeap is a max-heap, we want min-distance first. 
+ fn cmp(&self, o: &Self) -> Ordering { + o.dist + .partial_cmp(&self.dist) + .unwrap_or(Ordering::Equal) + .then(o.node.cmp(&self.node)) + } +} diff --git a/crates/ruvector-seprag/src/lib.rs b/crates/ruvector-seprag/src/lib.rs new file mode 100644 index 0000000000..ea61eb12e2 --- /dev/null +++ b/crates/ruvector-seprag/src/lib.rs @@ -0,0 +1,110 @@ +//! # ruvector-seprag +//! +//! 
**SepRAG** — CCH-inspired separator-tree retrieval for hybrid vector + graph +//! memory. This crate is the **M0 correctness gate** described in +//! `docs/plans/seprag-cch-retrieval/M0-correctness-gate.md` and the ADRs +//! 196–199 (`docs/adr/`). +//! +//! It adapts Customizable Contraction Hierarchies — nested dissection, balanced +//! separators, contraction shortcuts, elimination trees, and separator-tree +//! k-NN — to graph-distance retrieval. M0 validates **correctness** on synthetic +//! graphs against a brute-force Dijkstra oracle; it deliberately uses a simple +//! self-contained separator finder. Separator *quality* and real-data scale are +//! M1's concern (where `ruvector-mincut` machinery is swapped in). +//! +//! ## Pipeline +//! +//! ```text +//! Graph ─► nested_dissection ─► contract ─► customize ─► KnnIndex::knn +//! (order, sep tree) (G+, elim) (weights) (pruned k-NN) +//! ``` +//! +//! ## Example +//! +//! ``` +//! use ruvector_seprag::{gen, SepRag}; +//! +//! let g = gen::sbm(4, 25, 0.30, 0.01, 42); // 4 communities × 25 vertices +//! let pois: Vec<u32> = (0..100).step_by(3).collect(); +//! let sr = SepRag::build(g); +//! let idx = sr.index(&pois); +//! let topk = idx.query(7, 5); // 5 nearest POIs to vertex 7 +//! assert!(topk.len() <= 5); +//! ``` + +pub mod contraction; +pub mod customize; +pub mod gen; +pub mod graph; +pub mod order; +pub mod query; + +pub use contraction::Topology; +pub use customize::Metric; +pub use graph::{Graph, NodeId}; +pub use order::{SepNode, SepTree}; +pub use query::{KnnIndex, QueryStats}; + +/// A built SepRAG hierarchy: metric-independent topology + one customized metric. +pub struct SepRag { + pub graph: Graph, + pub topo: Topology, + pub metric: Metric, + pub sep_tree: SepTree, +} + +impl SepRag { + /// Build the full hierarchy from a graph (order → contract → customize) using + /// the graph's own edge weights as the metric. 
+ #[must_use] + pub fn build(graph: Graph) -> Self { + let ord = order::nested_dissection(&graph); + let topo = contraction::contract(&graph, &ord.order); + let metric = customize::customize(&topo); + SepRag { graph, topo, metric, sep_tree: ord.sep_tree } + } + + /// Build a bucket index for a fixed POI set. + #[must_use] + pub fn index<'a>(&'a self, pois: &[NodeId]) -> Index<'a> { + Index { + inner: KnnIndex::build(&self.topo, &self.metric, pois), + } + } + + /// Shortcut-blowup ratio `|G+| / |G_nav|` — the ADR-199 go/no-go metric. + #[must_use] + pub fn blowup_ratio(&self) -> f64 { + let base = self.graph.edges().count().max(1); + self.topo.arc_count() as f64 / base as f64 + } +} + +/// Query handle over a fixed POI set. +pub struct Index<'a> { + inner: KnnIndex<'a>, +} + +impl Index<'_> { + /// k nearest POIs to `src` by graph distance (pruned branch-and-bound). + #[must_use] + pub fn query(&self, src: NodeId, k: usize) -> Vec<(NodeId, f64)> { + let mut stats = QueryStats::default(); + self.inner.knn(src, k, true, &mut stats) + } + + /// Same, returning search-space diagnostics alongside the result. + #[must_use] + pub fn query_with_stats(&self, src: NodeId, k: usize) -> (Vec<(NodeId, f64)>, QueryStats) { + let mut stats = QueryStats::default(); + let r = self.inner.knn(src, k, true, &mut stats); + (r, stats) + } + + #[doc(hidden)] // exposed for the no-prune correctness oracle in tests + #[must_use] + pub fn query_unpruned(&self, src: NodeId, k: usize) -> Vec<(NodeId, f64)> { + let mut stats = QueryStats::default(); + self.inner.knn(src, k, false, &mut stats) + } +} diff --git a/crates/ruvector-seprag/src/order.rs b/crates/ruvector-seprag/src/order.rs new file mode 100644 index 0000000000..2b611aa979 --- /dev/null +++ b/crates/ruvector-seprag/src/order.rs @@ -0,0 +1,218 @@ +//! Metric-independent nested-dissection ordering (ADR-197, Phase 1). +//! +//! M0 uses a self-contained BFS-layer separator finder: from a pseudo-peripheral +//! 
start, the BFS frontier at a balanced layer is a valid vertex separator +//! (removing a whole layer disconnects earlier layers from later ones). This is +//! intentionally simple — CCH *correctness* is independent of separator quality; +//! only search-space size depends on it. At M1 scale this finder is swapped for +//! `ruvector-mincut`'s expander/cluster balanced cuts. + +use crate::graph::{Graph, NodeId}; +use std::collections::VecDeque; + +/// Cells smaller than this become leaves (no further dissection). +pub const LEAF: usize = 8; + +/// A node of the separator decomposition tree. Every original vertex belongs to +/// exactly one node (either as a separator member or a leaf member), which is +/// what lets POIs be bucketed unambiguously during query (see `query.rs`). +#[derive(Clone, Debug)] +pub struct SepNode { + /// Separator vertices (original ids) "owned" by this node. + pub separator: Vec<NodeId>, + /// Child node indices in `SepTree::nodes`. + pub children: Vec<usize>, + /// All vertices in this node's subtree (the cell), separator included. + pub cell: Vec<NodeId>, +} + +#[derive(Clone, Debug)] +pub struct SepTree { + pub nodes: Vec<SepNode>, + pub root: usize, +} + +/// Result of ordering: `order[rank] = original id` (rank 0 = contracted first = +/// lowest importance) and the separator decomposition tree. +pub struct Ordering { + pub order: Vec<NodeId>, + pub sep_tree: SepTree, +} + +/// Compute a nested-dissection order over all `n` vertices of `g`. 
+#[must_use] +pub fn nested_dissection(g: &Graph) -> Ordering { + let mut builder = NdBuilder { + g, + order: Vec::with_capacity(g.n), + nodes: Vec::new(), + }; + let all: Vec<NodeId> = (0..g.n as NodeId).collect(); + let root = builder.dissect(all); + Ordering { + order: builder.order, + sep_tree: SepTree { nodes: builder.nodes, root }, + } +} + +struct NdBuilder<'a> { + g: &'a Graph, + order: Vec<NodeId>, + nodes: Vec<SepNode>, +} + +impl NdBuilder<'_> { + /// Dissect `verts`; append ranks to `order` (children before separators, so + /// separators rank higher); return the new node index. + fn dissect(&mut self, verts: Vec<NodeId>) -> usize { + // Disconnected cell → recurse per component under an empty-separator node. + let comps = connected_components(self.g, &verts); + if comps.len() > 1 { + let children: Vec<usize> = comps.into_iter().map(|c| self.dissect(c)).collect(); + return self.push_node(Vec::new(), children, verts); + } + + if verts.len() <= LEAF { + return self.leaf(verts); + } + + match bfs_separator(self.g, &verts) { + Some((sep, a, b)) => { + let ca = self.dissect(a); + let cb = self.dissect(b); + // Separators appended AFTER both subtrees → higher rank. + let mut sorted_sep = sep.clone(); + sorted_sep.sort_unstable(); + self.order.extend_from_slice(&sorted_sep); + self.push_node(sorted_sep, vec![ca, cb], verts) + } + // No usable balanced separator (e.g. clique-like) → treat as leaf. + None => self.leaf(verts), + } + } + + fn leaf(&mut self, verts: Vec<NodeId>) -> usize { + let mut sorted = verts.clone(); + sorted.sort_unstable(); + self.order.extend_from_slice(&sorted); + self.push_node(sorted, Vec::new(), verts) + } + + fn push_node(&mut self, separator: Vec<NodeId>, children: Vec<usize>, cell: Vec<NodeId>) -> usize { + let id = self.nodes.len(); + self.nodes.push(SepNode { separator, children, cell }); + id + } +} + +/// Connected components of the subgraph induced by `verts`. 
+fn connected_components(g: &Graph, verts: &[NodeId]) -> Vec<Vec<NodeId>> { + let in_set = membership(g.n, verts); + let mut seen = vec![false; g.n]; + let mut comps = Vec::new(); + for &s in verts { + if seen[s as usize] { + continue; + } + let mut comp = Vec::new(); + let mut q = VecDeque::from([s]); + seen[s as usize] = true; + while let Some(u) = q.pop_front() { + comp.push(u); + for &(v, _) in &g.adj[u as usize] { + if in_set[v as usize] && !seen[v as usize] { + seen[v as usize] = true; + q.push_back(v); + } + } + } + comps.push(comp); + } + comps +} + +/// Find a balanced vertex separator of the connected cell `verts` via BFS layers. +/// Returns `(separator, side_a, side_b)` with both sides non-empty, or `None`. +fn bfs_separator(g: &Graph, verts: &[NodeId]) -> Option<(Vec<NodeId>, Vec<NodeId>, Vec<NodeId>)> { + let in_set = membership(g.n, verts); + // Pseudo-peripheral start: farthest vertex from an arbitrary one. + let start = farthest(g, &in_set, verts[0]); + let (layer, max_layer) = bfs_layers(g, &in_set, start); + if max_layer == 0 { + return None; // single layer (e.g. clique) — no layer separator exists + } + + // Pick the split layer L (1..max_layer) whose "before" side is closest to half. 
+ let half = verts.len() / 2; + let mut counts = vec![0usize; max_layer + 1]; + for &v in verts { + counts[layer[v as usize] as usize] += 1; + } + let mut prefix = 0usize; + let mut best_l = 1usize; + let mut best_bal = usize::MAX; + for l in 1..max_layer { + prefix += counts[l - 1]; // vertices in layers < l + let bal = prefix.abs_diff(half); + if bal < best_bal { + best_bal = bal; + best_l = l; + } + } + let l = best_l as u32; + + let mut sep = Vec::new(); + let mut a = Vec::new(); + let mut b = Vec::new(); + for &v in verts { + match layer[v as usize].cmp(&l) { + std::cmp::Ordering::Less => a.push(v), + std::cmp::Ordering::Equal => sep.push(v), + std::cmp::Ordering::Greater => b.push(v), + } + } + if a.is_empty() || b.is_empty() || sep.is_empty() { + return None; + } + Some((sep, a, b)) +} + +/// BFS hop-distance layers within the induced subgraph. Returns `(layer, max)`. +fn bfs_layers(g: &Graph, in_set: &[bool], start: NodeId) -> (Vec<u32>, usize) { + let mut layer = vec![u32::MAX; g.n]; + layer[start as usize] = 0; + let mut q = VecDeque::from([start]); + let mut max_layer = 0u32; + while let Some(u) = q.pop_front() { + let lu = layer[u as usize]; + for &(v, _) in &g.adj[u as usize] { + if in_set[v as usize] && layer[v as usize] == u32::MAX { + layer[v as usize] = lu + 1; + max_layer = max_layer.max(lu + 1); + q.push_back(v); + } + } + } + (layer, max_layer as usize) +} + +fn farthest(g: &Graph, in_set: &[bool], from: NodeId) -> NodeId { + let (layer, _) = bfs_layers(g, in_set, from); + let mut best = from; + let mut best_d = 0u32; + for (v, &d) in layer.iter().enumerate() { + if d != u32::MAX && d > best_d { + best_d = d; + best = v as NodeId; + } + } + best +} + +fn membership(n: usize, verts: &[NodeId]) -> Vec<bool> { + let mut m = vec![false; n]; + for &v in verts { + m[v as usize] = true; + } + m +} diff --git a/crates/ruvector-seprag/src/query.rs b/crates/ruvector-seprag/src/query.rs new file mode 100644 index 0000000000..01a24e6a07 --- /dev/null +++ 
b/crates/ruvector-seprag/src/query.rs @@ -0,0 +1,194 @@ +//! Query layer (ADR-196, Phase 3). +//! +//! Three k-NN paths, in increasing sophistication, all over the same topology: +//! +//! 1. [`upward`] — exact distances from a vertex to all its elimination-tree +//! ancestors (the CCH "search space"). +//! 2. [`knn_exhaustive`] — pairwise up-search meet for every POI. Exact by the +//! CH up-down theorem; used to validate customization against the Dijkstra +//! oracle. +//! 3. [`KnnIndex::knn`] — bucket-based branch-and-bound with admissible +//! early-stop. Must match (2) exactly while touching far fewer buckets. The +//! elimination-tree ancestors *are* the separator hierarchy, so stopping once +//! `d(s -> x) >= delta_k` prunes whole separator regions. + +use crate::contraction::{Topology, NONE}; +use crate::graph::{cmp_dist_id, NodeId}; +use std::collections::HashMap; + +/// Exact distances from rank `s` to every upward-reachable vertex (its ancestors +/// in the elimination tree), keyed by rank. +#[must_use] +pub fn upward(topo: &Topology, metric: &crate::customize::Metric, s: u32) -> HashMap<u32, f64> { + // Collect the upward closure, then relax in ascending rank (a DAG order). 
+ let mut reach: Vec<u32> = Vec::new(); + let mut seen = vec![false; topo.n]; + let mut stack = vec![s]; + seen[s as usize] = true; + while let Some(u) = stack.pop() { + reach.push(u); + for &x in &topo.up[u as usize] { + if !seen[x as usize] { + seen[x as usize] = true; + stack.push(x); + } + } + } + reach.sort_unstable(); + + let mut dist: HashMap<u32, f64> = HashMap::new(); + dist.insert(s, 0.0); + for &u in &reach { + let du = match dist.get(&u) { + Some(&d) => d, + None => continue, + }; + for (i, &x) in topo.up[u as usize].iter().enumerate() { + let w = metric.w[u as usize][i]; + if !w.is_finite() { + continue; + } + let nd = du + w; + let e = dist.entry(x).or_insert(f64::INFINITY); + if nd < *e { + *e = nd; + } + } + } + dist +} + +/// Exhaustive CCH k-NN: combine the query's up-search with each POI's up-search. +/// `d(s,p) = min over common ancestors m of d(s,m) + d(p,m)`. Exact. +#[must_use] +pub fn knn_exhaustive( + topo: &Topology, + metric: &crate::customize::Metric, + src: NodeId, + pois: &[NodeId], + k: usize, +) -> Vec<(NodeId, f64)> { + let ds = upward(topo, metric, topo.rank[src as usize]); + let mut out: Vec<(NodeId, f64)> = Vec::new(); + for &p in pois { + let dp = upward(topo, metric, topo.rank[p as usize]); + // Iterate the smaller map for the intersection. + let (small, big) = if ds.len() <= dp.len() { (&ds, &dp) } else { (&dp, &ds) }; + let mut best = f64::INFINITY; + for (m, &dm) in small { + if let Some(&dother) = big.get(m) { + best = best.min(dm + dother); + } + } + if best.is_finite() { + out.push((p, best)); + } + } + out.sort_by(|a, b| cmp_dist_id(*a, *b)); + out.truncate(k); + out +} + +/// Pre-built bucket index over a fixed POI set for fast repeated queries. +pub struct KnnIndex<'a> { + topo: &'a Topology, + metric: &'a crate::customize::Metric, + /// `bucket[rank]` = POIs `p` whose ancestor set includes `rank`, with the + /// exact distance `d(p, rank)`. Sorted ascending by distance for early-out. 
+ bucket: Vec<Vec<(NodeId, f64)>>, +} + +/// Diagnostics for one query — the M0 search-space-reduction evidence. +#[derive(Clone, Copy, Debug, Default)] +pub struct QueryStats { + /// Distinct ancestor vertices of the query that were examined. + pub ancestors_visited: usize, + /// Bucket entries (POI, dist) actually inspected. + pub bucket_entries_scanned: usize, + /// Ancestor vertices skipped by the admissible early-stop (region pruning). + pub ancestors_pruned: usize, +} + +impl<'a> KnnIndex<'a> { + #[must_use] + pub fn build(topo: &'a Topology, metric: &'a crate::customize::Metric, pois: &[NodeId]) -> Self { + let mut bucket: Vec<Vec<(NodeId, f64)>> = vec![Vec::new(); topo.n]; + for &p in pois { + for (anc, dp) in upward(topo, metric, topo.rank[p as usize]) { + bucket[anc as usize].push((p, dp)); + } + } + for row in &mut bucket { + row.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(std::cmp::Ordering::Equal)); + } + KnnIndex { topo, metric, bucket } + } + + /// k-NN with branch-and-bound. `prune = false` disables the early-stop (the + /// "no-prune oracle mode" of M0): results must be identical, proving the + /// pruning never drops a true top-k. + pub fn knn(&self, src: NodeId, k: usize, prune: bool, stats: &mut QueryStats) -> Vec<(NodeId, f64)> { + let ds = upward(self.topo, self.metric, self.topo.rank[src as usize]); + // Ancestors ordered by ascending d(s -> x): the key to admissible pruning. + let mut ancs: Vec<(u32, f64)> = ds.iter().map(|(&x, &d)| (x, d)).collect(); + ancs.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(std::cmp::Ordering::Equal)); + + let mut best: HashMap<NodeId, f64> = HashMap::new(); + for (x, dsx) in ancs { + let delta_k = kth_smallest(best.values().copied(), k); + if prune && dsx >= delta_k { + // d(s,p) >= d(s,x) for the minimising x; nothing further can enter top-k. 
+ stats.ancestors_pruned += 1; + continue; // (could break; continue keeps the count honest) + } + stats.ancestors_visited += 1; + let row = &self.bucket[x as usize]; + for &(p, dp) in row { + // Per-bucket early-out: rows are sorted by dp ascending. + if prune && dsx + dp >= delta_k && best.len() >= k { + break; + } + stats.bucket_entries_scanned += 1; + let cand = dsx + dp; + let e = best.entry(p).or_insert(f64::INFINITY); + if cand < *e { + *e = cand; + } + } + } + + let mut out: Vec<(NodeId, f64)> = best.into_iter().filter(|(_, d)| d.is_finite()).collect(); + out.sort_by(|a, b| cmp_dist_id(*a, *b)); + out.truncate(k); + out + } +} + +/// k-th smallest value of an iterator, or `+inf` if fewer than `k` present. +fn kth_smallest(vals: impl Iterator<Item = f64>, k: usize) -> f64 { + if k == 0 { + return 0.0; + } + let mut v: Vec<f64> = vals.collect(); + if v.len() < k { + return f64::INFINITY; + } + v.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal)); + v[k - 1] +} + +/// Convenience: elimination-tree depth (root-path length) of a rank — the query +/// search-space size bound. Useful for separator-quality diagnostics (ADR-199). +#[must_use] +pub fn elim_depth(topo: &Topology, mut r: u32) -> usize { + let mut d = 0; + while r != NONE { + let p = topo.elim_parent[r as usize]; + if p == NONE { + break; + } + r = p; + d += 1; + } + d +} diff --git a/crates/ruvector-seprag/tests/correctness.rs b/crates/ruvector-seprag/tests/correctness.rs new file mode 100644 index 0000000000..c6da15c659 --- /dev/null +++ b/crates/ruvector-seprag/tests/correctness.rs @@ -0,0 +1,174 @@ +//! M0 correctness gate — exit criteria from +//! `docs/plans/seprag-cch-retrieval/M0-correctness-gate.md`. +//! +//! 1. SepRAG k-NN == brute-force Dijkstra oracle (the gate). +//! 2. Pruned == unpruned (pruning never drops a true top-k). +//! 3. Pruning reduces search space (region pruning fires). +//! 4. Determinism. +//! 5. Blowup ratio is bounded on road-like synthetic graphs. 
+ +use ruvector_seprag::graph::{cmp_dist_id, Graph, NodeId}; +use ruvector_seprag::query::{knn_exhaustive, KnnIndex, QueryStats}; +use ruvector_seprag::{contraction, customize, gen, order, SepRag}; + +const TOL: f64 = 1e-9; + +/// Assert two result lists are equal: same nodes in order, distances within TOL. +fn assert_results_eq(got: &[(NodeId, f64)], want: &[(NodeId, f64)], ctx: &str) { + assert_eq!(got.len(), want.len(), "{ctx}: length mismatch\n got={got:?}\nwant={want:?}"); + for (i, (g, w)) in got.iter().zip(want.iter()).enumerate() { + assert!( + (g.1 - w.1).abs() < TOL, + "{ctx}: distance mismatch at {i}: got {g:?} want {w:?}" + ); + assert_eq!(g.0, w.0, "{ctx}: node mismatch at {i}: got {g:?} want {w:?}"); + } +} + +fn check_graph_against_oracle(g: Graph, pois: &[NodeId], srcs: &[NodeId], label: &str) { + let ord = order::nested_dissection(&g); + // Sanity: ordering is a permutation of 0..n. + { + let mut seen = vec![false; g.n]; + assert_eq!(ord.order.len(), g.n, "{label}: order length"); + for &v in &ord.order { + assert!(!seen[v as usize], "{label}: duplicate in order"); + seen[v as usize] = true; + } + } + let topo = contraction::contract(&g, &ord.order); + let metric = customize::customize(&topo); + let idx = KnnIndex::build(&topo, &metric, pois); + + for &src in srcs { + for &k in &[1usize, 5, 10, 50] { + let oracle = g.knn_oracle(src, pois, k); + let exhaustive = knn_exhaustive(&topo, &metric, src, pois, k); + let mut s = QueryStats::default(); + let pruned = idx.knn(src, k, true, &mut s); + let unpruned = idx.knn(src, k, false, &mut QueryStats::default()); + + let ctx = format!("{label} src={src} k={k}"); + // Exhaustive CCH validates order+contraction+customization vs ground truth. + assert_results_eq(&exhaustive, &oracle, &format!("{ctx} [exhaustive vs oracle]")); + // Bucket index (unpruned) must equal exhaustive. + assert_results_eq(&unpruned, &exhaustive, &format!("{ctx} [bucket vs exhaustive]")); + // Pruning must not change the answer. 
+ assert_results_eq(&pruned, &unpruned, &format!("{ctx} [pruned vs unpruned]")); + } + } +} + +#[test] +fn sbm_matches_oracle() { + for seed in [1u64, 7, 42, 1000] { + let g = gen::sbm(4, 25, 0.30, 0.01, seed); + let pois = gen::sample_pois(g.n, 40, seed); + let srcs = gen::sample_pois(g.n, 6, seed ^ 0xABCD); + check_graph_against_oracle(g, &pois, &srcs, &format!("sbm[seed={seed}]")); + } +} + +#[test] +fn grid_matches_oracle() { + for seed in [3u64, 99] { + let g = gen::grid(16, 16, seed); // 256 vertices, ~16-wide separators + let pois = gen::sample_pois(g.n, 50, seed); + let srcs = gen::sample_pois(g.n, 6, seed ^ 0x55); + check_graph_against_oracle(g, &pois, &srcs, &format!("grid[seed={seed}]")); + } +} + +#[test] +fn path_matches_oracle() { + // Degenerate: size-1 separators, deep elimination tree. + let g = gen::path(120, 5); + let pois = gen::sample_pois(g.n, 30, 5); + let srcs = vec![0, 17, 60, 119]; + check_graph_against_oracle(g, &pois, &srcs, "path"); +} + +#[test] +fn clique_matches_oracle() { + // Degenerate worst case: full fill-in, no layer separator (leaf fallback). + let g = gen::clique(24, 11); + let pois = gen::sample_pois(g.n, 20, 11); + let srcs = vec![0, 5, 23]; + check_graph_against_oracle(g, &pois, &srcs, "clique"); +} + +#[test] +fn pruning_reduces_search_space() { + // On a clean SBM, pruning should fire and scan fewer bucket entries. 
+ let g = gen::sbm(6, 40, 0.25, 0.004, 77); // 240 vertices, well-separated + let pois = gen::sample_pois(g.n, 120, 77); + let sr = SepRag::build(g); + let idx = sr.index(&pois); + + let mut total_pruned = 0usize; + let mut pruned_scans = 0usize; + let mut unpruned_scans = 0usize; + let srcs = gen::sample_pois(sr.graph.n, 20, 0x1234); + for &src in &srcs { + let (_r, sp) = idx.query_with_stats(src, 10); + total_pruned += sp.ancestors_pruned; + pruned_scans += sp.bucket_entries_scanned; + unpruned_scans += unpruned_entry_count(&sr, &pois, src, 10); + } + assert!(total_pruned > 0, "expected region pruning to fire on a clean SBM"); + assert!( + pruned_scans < unpruned_scans, + "pruned scans ({pruned_scans}) should be < unpruned ({unpruned_scans})" + ); +} + +/// Helper: bucket entries scanned with pruning disabled. +fn unpruned_entry_count(sr: &SepRag, pois: &[NodeId], src: NodeId, k: usize) -> usize { + let idx = KnnIndex::build(&sr.topo, &sr.metric, pois); + let mut s = QueryStats::default(); + let _ = idx.knn(src, k, false, &mut s); + s.bucket_entries_scanned +} + +#[test] +fn deterministic_across_runs() { + let build = || { + let g = gen::sbm(4, 30, 0.28, 0.01, 2024); + let pois = gen::sample_pois(g.n, 50, 2024); + let sr = SepRag::build(g); + let idx = sr.index(&pois); + let mut all = Vec::new(); + for src in [0u32, 11, 55, 119] { + all.push(idx.query(src, 7)); + } + all + }; + let a = build(); + let b = build(); + for (qa, qb) in a.iter().zip(b.iter()) { + assert_results_eq(qa, qb, "determinism"); + } +} + +#[test] +fn blowup_ratio_is_bounded() { + // Road-like synthetic graphs should not explode under contraction. + let grid = SepRag::build(gen::grid(20, 20, 1)); + let sbm = SepRag::build(gen::sbm(5, 40, 0.22, 0.005, 1)); + let n_grid = grid.graph.n as f64; + let n_sbm = sbm.graph.n as f64; + // Sanity bound: |G+| should stay well below a complete graph. 
+ assert!(grid.blowup_ratio() < n_grid, "grid blowup unbounded: {}", grid.blowup_ratio()); + assert!(sbm.blowup_ratio() < n_sbm, "sbm blowup unbounded: {}", sbm.blowup_ratio()); +} + +#[test] +fn results_are_canonically_sorted() { + let sr = SepRag::build(gen::grid(12, 12, 9)); + let pois = gen::sample_pois(sr.graph.n, 40, 9); + let idx = sr.index(&pois); + let r = idx.query(0, 10); + for w in r.windows(2) { + assert!(cmp_dist_id(w[0], w[1]) != std::cmp::Ordering::Greater, "not sorted: {r:?}"); + } +} From 80c1aef92c4ea481c7f18018b3b901bc6af869be Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 02:34:54 -0400 Subject: [PATCH 03/15] =?UTF-8?q?feat(seprag):=20M1=20first-pass=20harness?= =?UTF-8?q?=20=E2=80=94=20SepRAG=20on=20real=20ogbn-arxiv=20citation=20sub?= =?UTF-8?q?graph?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit examples/m1_arxiv: ingest the real ogbn-arxiv edge list (169K nodes / 1.16M edges), induce a connected BFS-ball subgraph, build the SepRAG hierarchy, and report the ADR-199 go/no-go metrics (blowup ratio, elim-tree height, build time) plus a sampled Dijkstra-oracle recall check. First-pass result (N=1500 citation-only subgraph, M0 BFS separator): - recall 50/50 vs Dijkstra oracle -> CORRECTNESS HOLDS on real data - query pruning saves ~100% of scans (364 vs 418555 bucket scans/query) - BUT blowup 56.9x and elim height ~= n -> the BFS separator degenerates completely on small-world citation graphs (picks a giant BFS layer as the separator), and the raw citation ball is dense (avg degree ~24). Verdict (ADR-199 fallback ladder): NO-GO for the naive separator + dense backbone. Trustworthy precisely because M0 proved the algorithm correct, so 57x is a separator-quality/backbone artifact, not a SepRAG refutation. Next: ruvector-mincut balanced separators + alpha-pruned sparse backbone (ADR-197). 
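Illustration (not part of this patch): the blowup figure quoted above is |G+| / |G_nav|, and it depends heavily on where hub-like vertices land in the elimination order. The sketch below is a toy under stated assumptions, not the crate's `contraction::contract`: `chordal_arc_count` is a hypothetical helper that plays the elimination game on a 5-vertex star. Contracting the hub first turns its four leaves into a clique; contracting it last (as a good separator would be) adds no fill at all.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not the crate API): number of arcs in the chordal
/// supergraph G+ obtained by contracting vertices in `order`, where
/// `order[rank] = vertex id` and rank 0 is contracted first.
fn chordal_arc_count(n: usize, edges: &[(usize, usize)], order: &[usize]) -> usize {
    let mut rank = vec![0usize; n];
    for (r, &v) in order.iter().enumerate() {
        rank[v] = r;
    }
    // up[r] = ranks of the higher-ranked neighbours of the vertex with rank r.
    let mut up: Vec<BTreeSet<usize>> = vec![BTreeSet::new(); n];
    for &(u, v) in edges {
        if u == v {
            continue; // ignore self-loops
        }
        let (lo, hi) = if rank[u] < rank[v] { (rank[u], rank[v]) } else { (rank[v], rank[u]) };
        up[lo].insert(hi);
    }
    let mut arcs = 0;
    for r in 0..n {
        let nbrs: Vec<usize> = up[r].iter().copied().collect();
        arcs += nbrs.len();
        // Contracting r makes its upward neighbours a clique. Merging them
        // into the lowest upward neighbour is enough: the remaining clique
        // arcs appear when that neighbour is contracted in turn.
        if let Some((&lowest, rest)) = nbrs.split_first() {
            for &x in rest {
                up[lowest].insert(x);
            }
        }
    }
    arcs
}

fn main() {
    // Star: hub 0 connected to leaves 1..=4, so |G_nav| = 4.
    let star: [(usize, usize); 4] = [(0, 1), (0, 2), (0, 3), (0, 4)];
    let base = star.len() as f64;
    let hub_first = chordal_arc_count(5, &star, &[0, 1, 2, 3, 4]);
    let hub_last = chordal_arc_count(5, &star, &[1, 2, 3, 4, 0]);
    println!("hub first: {hub_first} arcs, blowup {:.2}x", hub_first as f64 / base);
    println!("hub last:  {hub_last} arcs, blowup {:.2}x", hub_last as f64 / base);
}
```

On this toy the same graph yields 10 arcs (2.5x) or 4 arcs (1.0x) depending only on the order, which is the sense in which a high ratio can be an ordering artifact.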
--- crates/ruvector-seprag/examples/m1_arxiv.rs | 161 ++++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 crates/ruvector-seprag/examples/m1_arxiv.rs diff --git a/crates/ruvector-seprag/examples/m1_arxiv.rs b/crates/ruvector-seprag/examples/m1_arxiv.rs new file mode 100644 index 0000000000..b0fb070dad --- /dev/null +++ b/crates/ruvector-seprag/examples/m1_arxiv.rs @@ -0,0 +1,161 @@ +//! M1 first-pass (ADR-199): SepRAG on a real ogbn-arxiv citation subgraph. +//! +//! Measures the go/no-go signal — shortcut-blowup ratio, elimination-tree +//! height, build time — plus a sampled Dijkstra-oracle recall check, on a +//! connected BFS-ball subgraph of the real citation network. +//! +//! Scope honesty: this uses (a) the citation graph only — α-pruned kNN over node +//! features is the next pass — and (b) the M0 BFS separator, which M0 showed +//! degenerates on low-diameter graphs. So a high blowup here is expected to be a +//! *separator-quality* artifact, not a verdict on SepRAG; the verdict needs +//! ruvector-mincut balanced separators at full scale. Treat this as pipeline +//! validation + a first real-data data point. +//! +//! Run: +//! gunzip -kc target/m1-data/arxiv/raw/edge.csv.gz > .../edge.csv # once +//! 
cargo run --release -p ruvector-seprag --example m1_arxiv -- + +use ruvector_seprag::graph::{cmp_dist_id, Graph, NodeId}; +use ruvector_seprag::query::{elim_depth, KnnIndex, QueryStats}; +use ruvector_seprag::{gen, SepRag}; +use std::collections::VecDeque; +use std::time::Instant; + +fn main() { + let args: Vec<String> = std::env::args().collect(); + let path = args.get(1).cloned().unwrap_or_else(|| { + "target/m1-data/arxiv/raw/edge.csv".to_string() + }); + let n_target: usize = args.get(2).and_then(|s| s.parse().ok()).unwrap_or(6000); + let seed_node: u32 = args.get(3).and_then(|s| s.parse().ok()).unwrap_or(0); + + eprintln!("[m1] reading {path}"); + let t = Instant::now(); + let (adj, max_id) = read_edges(&path); + eprintln!( + "[m1] full graph: {} nodes, {} undirected edges, read in {:.1}s", + max_id + 1, + adj.iter().map(Vec::len).sum::<usize>() / 2, + t.elapsed().as_secs_f64() + ); + + // Connected induced subgraph via BFS ball from `seed_node`. + let (g, orig_ids) = bfs_ball(&adj, seed_node, n_target); + eprintln!( + "[m1] subgraph: {} nodes, {} edges (BFS ball from orig id {seed_node})", + g.n, + g.edges().count() + ); + + let t = Instant::now(); + let sr = SepRag::build(g); + let build_s = t.elapsed().as_secs_f64(); + let max_h = (0..sr.graph.n as u32).map(|r| elim_depth(&sr.topo, r)).max().unwrap_or(0); + + println!("\n=== M1 first-pass: ogbn-arxiv citation subgraph ==="); + println!("nodes {}", sr.graph.n); + println!("base edges |G_nav| {}", sr.graph.edges().count()); + println!("chordal arcs |G+| {}", sr.topo.arc_count()); + println!("BLOWUP RATIO {:.2}x (ADR-199 gate; target <=3-5x)", sr.blowup_ratio()); + println!("elim-tree height {max_h} (sublinear vs n={} is the goal)", sr.graph.n); + println!("build time {build_s:.2}s"); + + // Sampled-oracle recall: SepRAG k-NN vs Dijkstra over the subgraph. 
+ let pois = gen::sample_pois(sr.graph.n, sr.graph.n / 2, 7); + let srcs = gen::sample_pois(sr.graph.n, 50, 13); + let idx = KnnIndex::build(&sr.topo, &sr.metric, &pois); + let (mut ok, mut pruned_scans, mut unpruned_scans) = (0usize, 0usize, 0usize); + for &src in &srcs { + let oracle = sr.graph.knn_oracle(src, &pois, 10); + let mut sp = QueryStats::default(); + let got = idx.knn(src, 10, true, &mut sp); + let mut su = QueryStats::default(); + let _ = idx.knn(src, 10, false, &mut su); + pruned_scans += sp.bucket_entries_scanned; + unpruned_scans += su.bucket_entries_scanned; + if dist_multiset_eq(&got, &oracle) { + ok += 1; + } + } + let q = srcs.len(); + println!("\nrecall sanity {ok}/{q} queries match Dijkstra oracle (distance multiset)"); + println!( + "search space pruned {} vs unpruned {} bucket scans/query ({:.0}% saved)", + pruned_scans / q, + unpruned_scans / q, + 100.0 * (1.0 - pruned_scans as f64 / unpruned_scans.max(1) as f64) + ); + let _ = orig_ids; +} + +/// Read "src,dst" edge CSV → undirected adjacency (dense ids) + max id. +fn read_edges(path: &str) -> (Vec<Vec<u32>>, usize) { + let data = std::fs::read_to_string(path).expect("read edge csv"); + let mut edges: Vec<(u32, u32)> = Vec::new(); + let mut max_id = 0u32; + for line in data.lines() { + let mut it = line.split(','); + if let (Some(a), Some(b)) = (it.next(), it.next()) { + if let (Ok(u), Ok(v)) = (a.trim().parse::<u32>(), b.trim().parse::<u32>()) { + max_id = max_id.max(u).max(v); + edges.push((u, v)); + } + } + } + let mut adj = vec![Vec::new(); max_id as usize + 1]; + for (u, v) in edges { + if u != v { + adj[u as usize].push(v); + adj[v as usize].push(u); + } + } + (adj, max_id as usize) +} + +/// Induced connected subgraph: BFS from `seed` collecting up to `n_target` nodes. +/// Unit edge weights (hop distance). Returns the graph + original-id map. 
+fn bfs_ball(adj: &[Vec<u32>], seed: u32, n_target: usize) -> (Graph, Vec<u32>) { + let mut order = Vec::new(); + let mut seen = vec![false; adj.len()]; + let mut q = VecDeque::from([seed]); + seen[seed as usize] = true; + while let Some(u) = q.pop_front() { + order.push(u); + if order.len() >= n_target { + break; + } + for &v in &adj[u as usize] { + if !seen[v as usize] { + seen[v as usize] = true; + q.push_back(v); + } + } + } + let mut remap = vec![u32::MAX; adj.len()]; + for (new, &old) in order.iter().enumerate() { + remap[old as usize] = new as u32; + } + let mut g = Graph::new(order.len()); + for &old in &order { + let nu = remap[old as usize]; + for &v in &adj[old as usize] { + let nv = remap[v as usize]; + if nv != u32::MAX && nu < nv { + g.add_edge(nu, nv, 1.0); + } + } + } + (g, order) +} + +fn dist_multiset_eq(a: &[(NodeId, f64)], b: &[(NodeId, f64)]) -> bool { + if a.len() != b.len() { + return false; + } + let mut da: Vec<f64> = a.iter().map(|x| x.1).collect(); + let mut db: Vec<f64> = b.iter().map(|x| x.1).collect(); + da.sort_by(|x, y| x.partial_cmp(y).unwrap()); + db.sort_by(|x, y| x.partial_cmp(y).unwrap()); + da.iter().zip(&db).all(|(x, y)| (x - y).abs() < 1e-9) + && { let _ = cmp_dist_id; true } +} From 4377c1c6991dfa47a80f7eea1fd016bd7b11a80c Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 02:47:08 -0400 Subject: [PATCH 04/15] feat(seprag): balanced separator + backbone-sparsify knob; M1 attribution MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit order.rs: add SeparatorKind::Balanced (grow half-size region, take only its boundary) alongside the M0 BfsLayer strategy; make Balanced the default. lib.rs: SepRag::build_with(graph, kind). m1_arxiv: add max_degree backbone sparsification + separator-kind args for A/B attribution. 
M1 attribution (ogbn-arxiv N=1500 citation BFS-ball): raw + layer blowup 56.9x elim_h 1443 build 56s raw + balanced blowup 23.8x elim_h 941 build 13s <- best deg<=10 + layer blowup 89.6x elim_h 1295 build 38s deg<=10 + bal blowup 60.1x elim_h 1035 build 18s (recall 50/50 and ~100% query pruning in all configs) Findings: (1) balanced separator is a real win (2.4x less fill, 4x faster); (2) hub-dampening degree-bound BACKFIRES (shrinks denominator faster than |G+|, destroys good cuts) — discard it; (3) even best config leaves 23.8x / elim_h~0.6n: the dense small-world citation ball is intrinsically high-treewidth (ADR-197 expander risk, measured). Next: feature-manifold/hyperbolic backbone, not the citation topology. --- crates/ruvector-seprag/examples/m1_arxiv.rs | 35 +++++++-- crates/ruvector-seprag/src/lib.rs | 10 ++- crates/ruvector-seprag/src/order.rs | 82 ++++++++++++++++++++- 3 files changed, 118 insertions(+), 9 deletions(-) diff --git a/crates/ruvector-seprag/examples/m1_arxiv.rs b/crates/ruvector-seprag/examples/m1_arxiv.rs index b0fb070dad..8055ac7596 100644 --- a/crates/ruvector-seprag/examples/m1_arxiv.rs +++ b/crates/ruvector-seprag/examples/m1_arxiv.rs @@ -17,7 +17,7 @@ use ruvector_seprag::graph::{cmp_dist_id, Graph, NodeId}; use ruvector_seprag::query::{elim_depth, KnnIndex, QueryStats}; -use ruvector_seprag::{gen, SepRag}; +use ruvector_seprag::{gen, SepRag, SeparatorKind}; use std::collections::VecDeque; use std::time::Instant; @@ -28,6 +28,12 @@ fn main() { }); let n_target: usize = args.get(2).and_then(|s| s.parse().ok()).unwrap_or(6000); let seed_node: u32 = args.get(3).and_then(|s| s.parse().ok()).unwrap_or(0); + // arg4: max degree (0 = no backbone sparsification). arg5: "bal" | "layer". 
+ let max_degree: usize = args.get(4).and_then(|s| s.parse().ok()).unwrap_or(0); + let kind = match args.get(5).map(String::as_str) { + Some("layer") => SeparatorKind::BfsLayer, + _ => SeparatorKind::Balanced, + }; eprintln!("[m1] reading {path}"); let t = Instant::now(); @@ -40,15 +46,18 @@ fn main() { ); // Connected induced subgraph via BFS ball from `seed_node`. - let (g, orig_ids) = bfs_ball(&adj, seed_node, n_target); + let (g0, orig_ids) = bfs_ball(&adj, seed_node, n_target); + let g = if max_degree > 0 { degree_bound(&g0, max_degree) } else { g0 }; eprintln!( - "[m1] subgraph: {} nodes, {} edges (BFS ball from orig id {seed_node})", + "[m1] subgraph: {} nodes, {} edges (BFS ball from orig id {seed_node}); \ + backbone max_degree={max_degree} ({:?} separator)", g.n, - g.edges().count() + g.edges().count(), + kind, ); let t = Instant::now(); - let sr = SepRag::build(g); + let sr = SepRag::build_with(g, kind); let build_s = t.elapsed().as_secs_f64(); let max_h = (0..sr.graph.n as u32).map(|r| elim_depth(&sr.topo, r)).max().unwrap_or(0); @@ -148,6 +157,22 @@ fn bfs_ball(adj: &[Vec<u32>], seed: u32, n_target: usize) -> (Graph, Vec<u32>) { (g, order) } +/// Degree-bound backbone sparsification (ADR-197): keep, per node, edges to its +/// `d` lowest-degree neighbours (hub-dampening), unioned undirected. A cheap +/// stand-in for α-pruning when no vector metric is loaded yet. 
+fn degree_bound(g: &Graph, d: usize) -> Graph { + let deg: Vec<usize> = g.adj.iter().map(Vec::len).collect(); + let mut out = Graph::new(g.n); + for u in 0..g.n { + let mut nb = g.adj[u].clone(); + nb.sort_by(|a, b| deg[a.0 as usize].cmp(&deg[b.0 as usize]).then(a.0.cmp(&b.0))); + for &(v, w) in nb.iter().take(d) { + out.add_edge(u as NodeId, v, w); + } + } + out +} + fn dist_multiset_eq(a: &[(NodeId, f64)], b: &[(NodeId, f64)]) -> bool { if a.len() != b.len() { return false; diff --git a/crates/ruvector-seprag/src/lib.rs b/crates/ruvector-seprag/src/lib.rs index ea61eb12e2..66237e4791 100644 --- a/crates/ruvector-seprag/src/lib.rs +++ b/crates/ruvector-seprag/src/lib.rs @@ -42,7 +42,7 @@ pub mod query; pub use contraction::Topology; pub use customize::Metric; pub use graph::{Graph, NodeId}; -pub use order::{SepNode, SepTree}; +pub use order::{SepNode, SeparatorKind, SepTree}; pub use query::{KnnIndex, QueryStats}; /// A built SepRAG hierarchy: metric-independent topology + one customized metric. @@ -58,7 +58,13 @@ impl SepRag { /// the graph's own edge weights as the metric. #[must_use] pub fn build(graph: Graph) -> Self { - let ord = order::nested_dissection(&graph); + Self::build_with(graph, SeparatorKind::Balanced) + } + + /// Build with an explicit separator strategy (for M1 A/B attribution). + #[must_use] + pub fn build_with(graph: Graph, kind: SeparatorKind) -> Self { + let ord = order::nested_dissection_kind(&graph, kind); let topo = contraction::contract(&graph, &ord.order); let metric = customize::customize(&topo); SepRag { graph, topo, metric, sep_tree: ord.sep_tree } diff --git a/crates/ruvector-seprag/src/order.rs b/crates/ruvector-seprag/src/order.rs index 2b611aa979..8be0604c05 100644 --- a/crates/ruvector-seprag/src/order.rs +++ b/crates/ruvector-seprag/src/order.rs @@ -13,6 +13,16 @@ use std::collections::VecDeque; /// Cells smaller than this become leaves (no further dissection). pub const LEAF: usize = 8; +/// Separator-finding strategy. 
`BfsLayer` (M0 baseline) takes a whole BFS +/// frontier — fine on grids, degenerate on low-diameter graphs. `Balanced` +/// grows a half-size region and takes only its boundary, giving small +/// separators on sparse graphs (the M1 fix; ADR-197). +#[derive(Clone, Copy, Debug, PartialEq, Eq)] +pub enum SeparatorKind { + BfsLayer, + Balanced, +} + /// A node of the separator decomposition tree. Every original vertex belongs to /// exactly one node (either as a separator member or a leaf member), which is /// what lets POIs be bucketed unambiguously during query (see `query.rs`). @@ -39,11 +49,19 @@ pub struct Ordering { pub sep_tree: SepTree, } -/// Compute a nested-dissection order over all `n` vertices of `g`. +/// Compute a nested-dissection order over all `n` vertices of `g` using the +/// `Balanced` separator (the default since M1). #[must_use] pub fn nested_dissection(g: &Graph) -> Ordering { + nested_dissection_kind(g, SeparatorKind::Balanced) +} + +/// Nested dissection with an explicit separator strategy (for A/B attribution). 
+#[must_use] +pub fn nested_dissection_kind(g: &Graph, kind: SeparatorKind) -> Ordering { let mut builder = NdBuilder { g, + kind, order: Vec::with_capacity(g.n), nodes: Vec::new(), }; @@ -57,6 +75,7 @@ pub fn nested_dissection(g: &Graph) -> Ordering { struct NdBuilder<'a> { g: &'a Graph, + kind: SeparatorKind, order: Vec<NodeId>, nodes: Vec<SepNode>, } @@ -76,7 +95,11 @@ impl NdBuilder<'_> { return self.leaf(verts); } - match bfs_separator(self.g, &verts) { + let sep_result = match self.kind { + SeparatorKind::BfsLayer => bfs_separator(self.g, &verts), + SeparatorKind::Balanced => balanced_separator(self.g, &verts), + }; + match sep_result { Some((sep, a, b)) => { let ca = self.dissect(a); let cb = self.dissect(b); @@ -177,6 +200,61 @@ fn bfs_separator(g: &Graph, verts: &[NodeId]) -> Option<(Vec<NodeId>, Vec<NodeId +fn balanced_separator(g: &Graph, verts: &[NodeId]) -> Option<(Vec<NodeId>, Vec<NodeId>, Vec<NodeId>)> { + let in_set = membership(g.n, verts); + let start = farthest(g, &in_set, verts[0]); + let visit = bfs_order(g, &in_set, start); + if visit.len() < 2 { + return None; + } + let half = verts.len() / 2; + let mut in_a = vec![false; g.n]; + for &v in &visit[..half] { + in_a[v as usize] = true; + } + let mut sep = Vec::new(); + let mut a = Vec::new(); + for &v in &visit[..half] { + // Boundary iff some in-cell neighbour lies outside region A. + let on_boundary = g.adj[v as usize] + .iter() + .any(|&(u, _)| in_set[u as usize] && !in_a[u as usize]); + if on_boundary { + sep.push(v); + } else { + a.push(v); + } + } + let b: Vec<NodeId> = visit[half..].to_vec(); + if a.is_empty() || b.is_empty() || sep.is_empty() { + return None; + } + Some((sep, a, b)) +} + +/// BFS visitation order within the induced subgraph, starting at `start`. 
+fn bfs_order(g: &Graph, in_set: &[bool], start: NodeId) -> Vec<NodeId> { + let mut order = Vec::new(); + let mut seen = vec![false; g.n]; + seen[start as usize] = true; + let mut q = VecDeque::from([start]); + while let Some(u) = q.pop_front() { + order.push(u); + for &(v, _) in &g.adj[u as usize] { + if in_set[v as usize] && !seen[v as usize] { + seen[v as usize] = true; + q.push_back(v); + } + } + } + order +} + /// BFS hop-distance layers within the induced subgraph. Returns `(layer, max)`. fn bfs_layers(g: &Graph, in_set: &[bool], start: NodeId) -> (Vec<u32>, usize) { let mut layer = vec![u32::MAX; g.n]; From bf1a310c411a54490977d1dfcf50d83ed4a7f640 Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 08:14:42 -0400 Subject: [PATCH 05/15] feat(seprag): road-network control + feature-manifold backbone tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit m1_arxiv: robust edge reader (skip # comments, comma/tab/space separators) so SNAP road networks load through the same harness. m1_manifold: alpha-pruned kNN backbone over real ogbn-arxiv 128-d node features (Vamana RobustPrune) — the decisive ADR-197 thesis test. Results (N=1500, balanced separator, recall 50/50 everywhere): roadNet-PA blowup 7.6x elim_h 136 (~3.5 sqrt n) <- CCH works citation (arxiv) blowup 23.8x elim_h 941 (~0.6 n) feature-manifold k10 blowup 42.4x elim_h 837 (~0.56 n) <- worse feature-manifold k6 blowup 45.1x elim_h 699 (~0.47 n) Conclusion: the road control proves the implementation is sound (planar sqrt(n) separators -> 7.6x, instant build). Both embedding-derived backbones (citation small-world AND Euclidean feature kNN) are intrinsically high-treewidth (elim_h ~ 0.5n), so CCH contraction blows up regardless of separator quality or degree. The expander risk (ADR-197) is confirmed across two independent backbones. Query pruning stays ~100% effective; the cost is preprocessing. 
Last untested rung: hyperbolic backbone (needs real hyperbolic embeddings). --- crates/ruvector-seprag/examples/m1_arxiv.rs | 6 +- .../ruvector-seprag/examples/m1_manifold.rs | 119 ++++++++++++++++++ 2 files changed, 124 insertions(+), 1 deletion(-) create mode 100644 crates/ruvector-seprag/examples/m1_manifold.rs diff --git a/crates/ruvector-seprag/examples/m1_arxiv.rs b/crates/ruvector-seprag/examples/m1_arxiv.rs index 8055ac7596..94e9aa73c1 100644 --- a/crates/ruvector-seprag/examples/m1_arxiv.rs +++ b/crates/ruvector-seprag/examples/m1_arxiv.rs @@ -103,7 +103,11 @@ fn read_edges(path: &str) -> (Vec<Vec<u32>>, usize) { let mut edges: Vec<(u32, u32)> = Vec::new(); let mut max_id = 0u32; for line in data.lines() { - let mut it = line.split(','); + // Skip SNAP-style comment lines; accept comma/tab/space separators. + if line.starts_with('#') || line.is_empty() { + continue; + } + let mut it = line.split(|c| c == ',' || c == '\t' || c == ' ').filter(|s| !s.is_empty()); if let (Some(a), Some(b)) = (it.next(), it.next()) { if let (Ok(u), Ok(v)) = (a.trim().parse::<u32>(), b.trim().parse::<u32>()) { max_id = max_id.max(u).max(v); diff --git a/crates/ruvector-seprag/examples/m1_manifold.rs b/crates/ruvector-seprag/examples/m1_manifold.rs new file mode 100644 index 0000000000..94e1f91079 --- /dev/null +++ b/crates/ruvector-seprag/examples/m1_manifold.rs @@ -0,0 +1,119 @@ +//! M1 decisive thesis test (ADR-197/199): does the embedding *manifold* have +//! smaller separators than the citation topology? +//! +//! Builds an α-pruned kNN graph (DiskANN/Vamana-style RobustPrune) over real +//! ogbn-arxiv 128-d node features, then runs the SepRAG hierarchy and reports +//! the same go/no-go metrics as `m1_arxiv`. Compare blowup / elim-tree height +//! against the road control (~7.6× / ~136) and the citation graph (~23.8× / ~941). +//! +//! Run: +//! gunzip -c arxiv/raw/node-feat.csv.gz | head -2000 > node-feat-2000.csv +//! 
cargo run --release -p ruvector-seprag --example m1_manifold -- + +use ruvector_seprag::graph::{Graph, NodeId}; +use ruvector_seprag::query::{elim_depth, KnnIndex, QueryStats}; +use ruvector_seprag::{gen, SepRag}; +use std::time::Instant; + +fn main() { + let args: Vec<String> = std::env::args().collect(); + let path = args.get(1).cloned().unwrap_or_else(|| "target/m1-data/node-feat-2000.csv".into()); + let n: usize = args.get(2).and_then(|s| s.parse().ok()).unwrap_or(1500); + let k: usize = args.get(3).and_then(|s| s.parse().ok()).unwrap_or(10); + let alpha: f64 = args.get(4).and_then(|s| s.parse().ok()).unwrap_or(1.2); + + eprintln!("[manifold] reading features from {path}"); + let feats = read_features(&path, n); + let n = feats.len(); + let dim = feats.first().map_or(0, Vec::len); + eprintln!("[manifold] {n} nodes x {dim} dims; building k={k} graph, alpha-prune alpha={alpha}"); + + let norms: Vec<f64> = feats.iter().map(|v| v.iter().map(|x| x * x).sum::<f64>().sqrt().max(1e-12)).collect(); + let dist = |i: usize, j: usize| -> f64 { + let dot: f64 = feats[i].iter().zip(&feats[j]).map(|(a, b)| a * b).sum(); + (1.0 - dot / (norms[i] * norms[j])).max(1e-6) // cosine distance, kept positive + }; + + // Exact kNN per node (brute force; fine at this scale). + let t = Instant::now(); + let mut knn: Vec<Vec<(usize, f64)>> = vec![Vec::new(); n]; + for i in 0..n { + let mut cand: Vec<(usize, f64)> = (0..n).filter(|&j| j != i).map(|j| (j, dist(i, j))).collect(); + cand.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap()); + cand.truncate(k.max(8) * 2); // keep a candidate pool for the prune step + knn[i] = cand; + } + eprintln!("[manifold] kNN built in {:.1}s", t.elapsed().as_secs_f64()); + + // α-prune (Vamana RobustPrune): keep q unless a closer kept r dominates it. 
+ let mut g = Graph::new(n); + let mut kept_edges = 0usize; + for i in 0..n { + let mut kept: Vec<(usize, f64)> = Vec::new(); + for &(q, dq) in &knn[i] { + let dominated = kept.iter().any(|&(r, _)| alpha * dist(r, q) <= dq); + if !dominated { + kept.push((q, dq)); + if kept.len() >= k { + break; + } + } + } + for (q, dq) in kept { + g.add_edge(i as NodeId, q as NodeId, dq); + kept_edges += 1; + } + } + + let sr = SepRag::build(g); + let max_h = (0..sr.graph.n as u32).map(|r| elim_depth(&sr.topo, r)).max().unwrap_or(0); + let avg_deg = 2.0 * sr.graph.edges().count() as f64 / n as f64; + + println!("\n=== M1 manifold test: ogbn-arxiv feature kNN (k={k}, alpha={alpha}) ==="); + println!("nodes {n}"); + println!("base edges |G_nav| {} (avg degree {avg_deg:.1}, directed kept {kept_edges})", sr.graph.edges().count()); + println!("chordal arcs |G+| {}", sr.topo.arc_count()); + println!("BLOWUP RATIO {:.2}x (road ~7.6x, citation ~23.8x)", sr.blowup_ratio()); + println!("elim-tree height {max_h} (road ~136, citation ~941; sqrt(n)~{:.0})", (n as f64).sqrt()); + + // Recall sanity vs Dijkstra oracle on the manifold graph. 
+ let pois = gen::sample_pois(n, n / 2, 7); + let srcs = gen::sample_pois(n, 50, 13); + let idx = KnnIndex::build(&sr.topo, &sr.metric, &pois); + let (mut ok, mut pr, mut un) = (0usize, 0usize, 0usize); + for &src in &srcs { + let oracle = sr.graph.knn_oracle(src, &pois, 10); + let mut sp = QueryStats::default(); + let got = idx.knn(src, 10, true, &mut sp); + let mut su = QueryStats::default(); + let _ = idx.knn(src, 10, false, &mut su); + pr += sp.bucket_entries_scanned; + un += su.bucket_entries_scanned; + if multiset_eq(&got, &oracle) { + ok += 1; + } + } + let q = srcs.len(); + println!("recall sanity {ok}/{q} match Dijkstra oracle"); + println!("search space pruned {} vs unpruned {} scans/query ({:.0}% saved)", pr / q, un / q, 100.0 * (1.0 - pr as f64 / un.max(1) as f64)); +} + +fn read_features(path: &str, n: usize) -> Vec<Vec<f64>> { + let data = std::fs::read_to_string(path).expect("read features"); + data.lines() + .take(n) + .map(|line| line.split(',').filter_map(|s| s.trim().parse::<f64>().ok()).collect()) + .filter(|v: &Vec<f64>| !v.is_empty()) + .collect() +} + +fn multiset_eq(a: &[(NodeId, f64)], b: &[(NodeId, f64)]) -> bool { + if a.len() != b.len() { + return false; + } + let mut da: Vec<f64> = a.iter().map(|x| x.1).collect(); + let mut db: Vec<f64> = b.iter().map(|x| x.1).collect(); + da.sort_by(|x, y| x.partial_cmp(y).unwrap()); + db.sort_by(|x, y| x.partial_cmp(y).unwrap()); + da.iter().zip(&db).all(|(x, y)| (x - y).abs() < 1e-9) +} From 7189a7a757a4f477cd01dfa067014fe4f489ed2a Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 08:59:01 -0400 Subject: [PATCH 06/15] docs(seprag): record empirical NO-GO outcome in ADRs 196/197/199 + plan The M1 go/no-go gate ran on real public data and returned NO-GO for CCH full contraction on embedding/citation retrieval graphs. Recorded the measured evidence in ADR-199 (Empirical Outcome section), updated ADR-196/197 status, and marked the milestone tracker (M0 done, M1 NO-GO, M2-M4 not pursued). 
Evidence (N=1500, recall 50/50 everywhere): roadNet-PA control blowup 7.6x elim_h ~3.5 sqrt n (CCH works) ogbn-arxiv citation blowup 23.8x elim_h ~0.6 n ogbn-arxiv feature kNN blowup 42.4x elim_h ~0.56 n Implementation is sound (road control + exact recall); embedding backbones are intrinsically high-treewidth. Query pruning works (~100%); preprocessing fill-in is the blocker. ruvector-seprag retained as a validated reference. --- ...R-196-seprag-cch-hierarchical-retrieval.md | 13 +++-- ...ation-graph-metric-independent-ordering.md | 12 +++-- ...ADR-199-public-corpus-benchmark-harness.md | 48 ++++++++++++++++++- docs/plans/seprag-cch-retrieval/README.md | 37 ++++++++++---- 4 files changed, 94 insertions(+), 16 deletions(-) diff --git a/docs/adr/ADR-196-seprag-cch-hierarchical-retrieval.md b/docs/adr/ADR-196-seprag-cch-hierarchical-retrieval.md index 30bebacf83..1272107390 100644 --- a/docs/adr/ADR-196-seprag-cch-hierarchical-retrieval.md +++ b/docs/adr/ADR-196-seprag-cch-hierarchical-retrieval.md @@ -12,9 +12,16 @@ tags: [ruvector, retrieval, cch, contraction-hierarchies, graph-rag, mincut, jtr ## Status -**Proposed.** Keystone design ADR. Depends on the navigation-graph and ordering -decisions in [ADR-197], the metric layer in [ADR-198], and is validated by the -benchmark harness in [ADR-199]. No code yet; prototype lands behind a feature gate. +**Proposed → empirically NO-GO for embedding retrieval (2026-06-04).** Keystone design +ADR. Depends on the navigation-graph and ordering decisions in [ADR-197], the metric +layer in [ADR-198], and is validated by the benchmark harness in [ADR-199]. + +Prototyped in `crates/ruvector-seprag` (M0 + M1). The separator-tree **query** algorithm +is correct (exact recall) and prunes ~100% of search space — but CCH **full contraction** +blows up on embedding/citation backbones (high treewidth; see [ADR-199] Empirical +Outcome). 
The design's stated edge — *constrained / relational* retrieval over a sparse +structured backbone, not pure embedding kNN — remains the only plausible niche and is +unvalidated against HNSW. Treat the contraction-based core as not-fit-for-embedding-kNN. ## Context diff --git a/docs/adr/ADR-197-navigation-graph-metric-independent-ordering.md b/docs/adr/ADR-197-navigation-graph-metric-independent-ordering.md index edb5ad6a16..a4df8d57dd 100644 --- a/docs/adr/ADR-197-navigation-graph-metric-independent-ordering.md +++ b/docs/adr/ADR-197-navigation-graph-metric-independent-ordering.md @@ -12,9 +12,15 @@ tags: [ruvector, cch, nested-dissection, separators, mincut, jtree, diskann, hyp ## Status -**Proposed.** Implements Phase 1 of [ADR-196]. This is the **make-or-break** -technical decision: if the navigation graph has large separators, the whole SepRAG -approach fails, so the choices here directly determine viability. +**Proposed — make-or-break question answered NO for embedding backbones (2026-06-04).** +Implements Phase 1 of [ADR-196]. This was correctly identified as the make-or-break +decision: the navigation graph's separator size determines viability. Measured outcome +([ADR-199] Empirical Outcome): the road-network control has small separators (elim height +≈ 3.5·√n), but **both embedding backbones — citation small-world and Euclidean α-pruned +feature kNN — have near-linear treewidth (elim height ≈ 0.5·n)**. The "expander risk" +flagged below is real and confirmed. Degree-bounding the backbone made it worse, not +better. The hyperbolic-backbone mitigation remains untested (needs genuine hyperbolic +embeddings). 
## Context diff --git a/docs/adr/ADR-199-public-corpus-benchmark-harness.md b/docs/adr/ADR-199-public-corpus-benchmark-harness.md index 208cd31ee4..7957a42b40 100644 --- a/docs/adr/ADR-199-public-corpus-benchmark-harness.md +++ b/docs/adr/ADR-199-public-corpus-benchmark-harness.md @@ -12,7 +12,13 @@ tags: [ruvector, benchmark, evaluation, wikipedia, wikidata, beir, hotpotqa, ogb ## Status -**Proposed.** This is the experimental backbone for [ADR-196]–[ADR-198]. It exists to +**Accepted — executed; outcome recorded below (2026-06-04).** The go/no-go gate ran on +real data and returned **NO-GO for CCH full-contraction on embedding/citation backbones** +(see [Empirical Outcome](#empirical-outcome-2026-06-04)). The implementation is validated +(road-network control + exact recall everywhere); the thesis fails its structural +prerequisite (small separators) on embedding-derived graphs. + +This is the experimental backbone for [ADR-196]–[ADR-198]. It exists to *answer empirically* the design questions the other ADRs leave open, using large public datasets rather than synthetic graphs or a priori reasoning. @@ -121,3 +127,43 @@ the expander risk is always visible. the point. - **Embed everything ourselves.** Rejected for v1 — precomputed embeddings de-risk the experiment and isolate retrieval performance from embedder throughput ([ADR-194]). + +## Empirical Outcome (2026-06-04) + +Implemented in `crates/ruvector-seprag` and run on real public data. All runs use the +balanced separator at N=1500 (BFS-ball subgraph), **recall 50/50 vs the Dijkstra oracle +in every configuration** (the algorithm is correct everywhere). 
+ +| Backbone | avg degree | blowup `\|G+\|/\|G_nav\|` | elim-tree height | build | +|---|---|---|---|---| +| **roadNet-PA (control)** | 2.8 | **7.6×** | 136 (~3.5·√n) | 0.01s | +| ogbn-arxiv citation | 24 | 23.8× | 941 (~0.6·n) | 13s | +| ogbn-arxiv feature kNN (k=10, α=1.2) | 11.7 | 42.4× | 837 (~0.56·n) | — | +| ogbn-arxiv feature kNN (k=6, α=1.2) | 8.7 | 45.1× | 699 (~0.47·n) | — | + +**Findings.** +1. **Implementation is sound.** The road control reproduces textbook CCH behaviour — + planar O(√n) separators → 7.6× blowup, elim height ≈ 3.5·√n, instant build. +2. **Embedding backbones are intrinsically high-treewidth.** Both the citation + small-world graph *and* the Euclidean α-pruned feature manifold have elimination-tree + height ≈ 0.5·n. The feature manifold is *worse* than citation despite lower degree, so + the cause is structural (treewidth), not density or separator-heuristic quality. +3. **The cost is preprocessing, not query.** Separator-tree branch-and-bound pruning + eliminates ~100% of search-space scans on *every* backbone; recall is exact. CCH's + failure here is fill-in/build blowup, which does not scale. +4. Earlier negative result: hub-dampening degree-bounding makes blowup *worse* (shrinks + the denominator faster than `|G+|` and destroys good cuts). Judge by absolute `|G+|` + and elim-tree height, not the ratio. + +**Verdict: NO-GO for CCH full-contraction on embedding/citation retrieval graphs.** For +embedding kNN, a navigable small-world hierarchy (HNSW) already provides what CCH cannot +here. The expander risk in [ADR-197] is confirmed across two independent backbones. + +**Untested rung (future, not pursued):** a hyperbolic backbone (needs genuine hyperbolic +embeddings, not Euclidean features reinterpreted) — odds judged low given two negatives. 
+ +**Preserved value:** the validated separator-tree pruning *query* + the road-control +success indicate the only plausible niche is **sparse, structured, relational backbones +with re-customizable metrics** (the constrained/relational scope of [ADR-196], not pure +embedding kNN) — to be revisited only with evidence it beats HNSW. The `ruvector-seprag` +crate stands as a correct, tested reference implementation. diff --git a/docs/plans/seprag-cch-retrieval/README.md b/docs/plans/seprag-cch-retrieval/README.md index fa3047aa03..baf7b0dd9f 100644 --- a/docs/plans/seprag-cch-retrieval/README.md +++ b/docs/plans/seprag-cch-retrieval/README.md @@ -19,15 +19,34 @@ on real data. The milestones are sequenced to surface that number as cheaply and early as possible, on a *correctness-validated* implementation (so the signal is not confounded by bugs). -## Milestone sequence - -| Plan | Goal | Retires which risk | Gate | -|------|------|--------------------|------| -| [M0](M0-correctness-gate.md) | Separator-tree k-NN correct on toy graphs | Implementation correctness | k-NN == brute-force oracle | -| [M1](M1-blowup-measurement.md) | Blowup ratio on ogbn-arxiv (static metric) | **Research viability (decisive)** | blowup small + separators sublinear → GO | -| [M2](M2-customization-loop.md) | GNN metric → customization; self-learning payoff | Re-weight cost vs rebuild | customize ≪ HNSW rebuild | -| [M3](M3-full-hybrid.md) | HNSW entry + filters + rerank; multi-hop QA | End-task quality / crossover | win on multi-hop, parity on semantic | -| [M4](M4-integration.md) | Postgres fn + node bindings + snapshot | Productionization | `seprag_knn()` callable end-to-end | +## Outcome (2026-06-04): M0 ✅ · M1 ❌ NO-GO · M2–M4 not pursued + +M0 and M1 ran; the decisive M1 gate returned **NO-GO** (full evidence in +[ADR-199 Empirical Outcome](../../adr/ADR-199-public-corpus-benchmark-harness.md#empirical-outcome-2026-06-04)). 
+The `ruvector-seprag` crate is a correct, tested reference implementation; CCH full +contraction does **not** fit embedding/citation retrieval graphs (near-linear treewidth). +M2–M4 are not pursued. Summary: + +| Backbone (N=1500, recall 50/50) | blowup | elim height | +|---|---|---| +| roadNet-PA (control) | 7.6× | 136 (~3.5·√n) — CCH works | +| ogbn-arxiv citation | 23.8× | 941 (~0.6·n) | +| ogbn-arxiv feature kNN (k=10) | 42.4× | 837 (~0.56·n) | + +The separator-tree **query** (pruning ~100% of scans, exact recall) is the salvageable +piece; the failure is preprocessing fill-in. Only plausible future niche: sparse +relational backbone + re-customizable metrics (ADR-196 scope), revisited only if it beats +HNSW. + +## Milestone sequence (as originally planned) + +| Plan | Goal | Retires which risk | Gate | Status | +|------|------|--------------------|------|--------| +| [M0](M0-correctness-gate.md) | Separator-tree k-NN correct on toy graphs | Implementation correctness | k-NN == brute-force oracle | ✅ done | +| [M1](M1-blowup-measurement.md) | Blowup ratio on ogbn-arxiv (static metric) | **Research viability (decisive)** | blowup small + separators sublinear → GO | ❌ NO-GO | +| [M2](M2-customization-loop.md) | GNN metric → customization; self-learning payoff | Re-weight cost vs rebuild | customize ≪ HNSW rebuild | ⏸ not pursued | +| [M3](M3-full-hybrid.md) | HNSW entry + filters + rerank; multi-hop QA | End-task quality / crossover | win on multi-hop, parity on semantic | ⏸ not pursued | +| [M4](M4-integration.md) | Postgres fn + node bindings + snapshot | Productionization | `seprag_knn()` callable end-to-end | ⏸ not pursued | ## Key sequencing principle From 391816329fbe08658d094ca5cce7d20f2dc2e841 Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 09:17:08 -0400 Subject: [PATCH 07/15] =?UTF-8?q?feat(seprag):=20BET=201=20=E2=80=94=20cus?= =?UTF-8?q?tomizable=20re-weight=20vs=20rebuild=20(ADR-200,=20WIN=20on=20l?= =?UTF-8?q?inear=20drift)?= 
MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Salvage ADR-198's customizable-metric idea, decoupled from CCH, as a rigorous pre-registered head-to-head: does a fixed ANN topology + recomputed distances absorb metric drift as well as a full rebuild? examples/reweight_vs_rebuild.rs: self-contained Vamana-lite (RobustPrune + greedy beam search); drift modelled as a vector-space transform M=A^T A; sweeps diagonal AND adversarial dense-Mahalanobis (rotational) drift; A=reuse topology, B=rebuild, C=stale control. Result (n=2000 ogbn-arxiv embeddings, recall@10, pre-registered gate): A (re-weight, 0 rebuild) within 0.2% of B (full rebuild) up to 36% relevant-set churn, under both drift modes. C (stale) loses up to 29 points -> benchmark has teeth, A's parity is genuine adaptation. WIN. Honest claim: COST win at equal quality (rebuilds become free under LINEAR drift). Boundaries (ADR-200): non-linear/region-local drift + scale untested. Next: non-linear learned metric (decisive adversarial test). Also adds docs/plans/.../FUTURE-DIRECTIONS.md (4-bet backlog + prove-not-hype protocol) and ADR-200. --- .../examples/reweight_vs_rebuild.rs | 299 ++++++++++++++++++ ...omizable-reweighting-fixed-topology-ann.md | 108 +++++++ .../seprag-cch-retrieval/FUTURE-DIRECTIONS.md | 61 ++++ 3 files changed, 468 insertions(+) create mode 100644 crates/ruvector-seprag/examples/reweight_vs_rebuild.rs create mode 100644 docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md create mode 100644 docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md diff --git a/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs b/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs new file mode 100644 index 0000000000..9489df2358 --- /dev/null +++ b/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs @@ -0,0 +1,299 @@ +//! BET 1 (ADR-198, decoupled from CCH): does a FIXED proximity-graph topology + +//! 
cheap re-weighting absorb metric drift as well as a full rebuild? +//! +//! Self-learning systems change their relevance metric over time. A flat ANN +//! index (HNSW/Vamana) is built *for* a metric; when the metric drifts its graph +//! becomes suboptimal and the textbook fix is a costly rebuild. This harness +//! tests whether reusing the old topology under the new metric ("re-weight", +//! zero build cost) keeps recall close to a rebuild — and quantifies how much +//! drift fixed topology tolerates before a rebuild is actually required. +//! +//! Three strategies, recall@10 measured vs brute-force truth under the CURRENT +//! (drifted) metric: +//! A re-weight : graph built under w0, searched under w_t (build cost: 0) +//! B rebuild : graph rebuilt under w_t, searched under w_t (build cost: full) +//! C stale : graph built under w0, searched under w0 (ignores drift; floor) +//! +//! Pre-registered gate — WIN: recall(A) within 2% of recall(B) across the drift +//! sweep. KILL: recall(A) drops >2% below B even at small drift. +//! +//! 
Run: cargo run --release -p ruvector-seprag --example reweight_vs_rebuild -- + +use std::collections::HashSet; +use std::time::Instant; + +type Vec32 = Vec<f32>; + +// ----- deterministic RNG (SplitMix64) ----- +struct Rng(u64); +impl Rng { + fn new(s: u64) -> Self { Rng(s) } + fn next(&mut self) -> u64 { + self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15); + let mut z = self.0; + z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9); + z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB); + z ^ (z >> 31) + } + fn f32(&mut self) -> f32 { (self.next() >> 40) as f32 / (1u64 << 24) as f32 } + fn below(&mut self, n: usize) -> usize { (self.next() % n as u64) as usize } +} + +// ----- weighted squared-L2 metric ----- +#[inline] +fn dist(a: &[f32], b: &[f32], w: &[f32]) -> f32 { + let mut s = 0.0f32; + for i in 0..a.len() { + let d = a[i] - b[i]; + s += w[i] * d * d; + } + s +} + +fn brute_topk(vecs: &[Vec32], w: &[f32], q: usize, k: usize) -> Vec<u32> { + let mut d: Vec<(f32, u32)> = (0..vecs.len()) + .filter(|&j| j != q) + .map(|j| (dist(&vecs[q], &vecs[j], w), j as u32)) + .collect(); + d.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap()); + d.truncate(k); + d.into_iter().map(|(_, n)| n).collect() +} + +// ----- Vamana-lite proximity graph ----- +struct Params { r: usize, l: usize, alpha: f32, k: usize } + +fn medoid(vecs: &[Vec32], w: &[f32]) -> u32 { + let dim = vecs[0].len(); + let mut c = vec![0.0f32; dim]; + for v in vecs { + for i in 0..dim { c[i] += v[i]; } + } + for x in &mut c { *x /= vecs.len() as f32; } + (0..vecs.len()).min_by(|&a, &b| dist(&vecs[a], &c, w).partial_cmp(&dist(&vecs[b], &c, w)).unwrap()).unwrap() as u32 +} + +/// Greedy beam search. Returns (top-k, set of all visited nodes, #distance evals). 
+fn greedy(graph: &[Vec<u32>], vecs: &[Vec32], w: &[f32], entry: u32, q: &[f32], beam: usize, k: usize) -> (Vec<u32>, Vec<u32>, usize) { + let mut seen: HashSet<u32> = HashSet::new(); + let mut expanded: HashSet<u32> = HashSet::new(); + let mut pool: Vec<(f32, u32)> = vec![(dist(&vecs[entry as usize], q, w), entry)]; + seen.insert(entry); + let mut evals = 1usize; + loop { + let next = pool.iter().filter(|(_, n)| !expanded.contains(n)).min_by(|a, b| a.0.partial_cmp(&b.0).unwrap()).copied(); + let (_, u) = match next { Some(x) => x, None => break }; + expanded.insert(u); + for &v in &graph[u as usize] { + if seen.insert(v) { + pool.push((dist(&vecs[v as usize], q, w), v)); + evals += 1; + } + } + pool.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap()); + pool.truncate(beam); + } + pool.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap()); + let topk = pool.iter().take(k).map(|&(_, n)| n).collect(); + let visited = seen.into_iter().collect(); + (topk, visited, evals) +} + +/// RobustPrune: keep up to R diverse neighbours (Vamana alpha-pruning). +fn robust_prune(p: u32, cands: &[u32], vecs: &[Vec32], w: &[f32], alpha: f32, r: usize) -> Vec<u32> { + let mut pool: Vec<(f32, u32)> = cands.iter().filter(|&&c| c != p).map(|&c| (dist(&vecs[p as usize], &vecs[c as usize], w), c)).collect(); + pool.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap()); + let mut out: Vec<u32> = Vec::new(); + let mut i = 0; + while i < pool.len() && out.len() < r { + let (_, pstar) = pool[i]; + out.push(pstar); + pool.retain(|&(dq, q)| alpha * dist(&vecs[pstar as usize], &vecs[q as usize], w) > dq); + i = 0; // pool shrank; restart scan from front of remaining + // skip already-chosen + pool.retain(|&(_, q)| !out.contains(&q)); + } + out +} + +/// Build a Vamana-lite graph under metric `w`. Returns (graph, #distance evals). 
+fn build(vecs: &[Vec32], w: &[f32], p: &Params, seed: u64) -> (Vec<Vec<u32>>, usize) { + let n = vecs.len(); + let mut rng = Rng::new(seed); + // init: random R-regular + let mut graph: Vec<Vec<u32>> = (0..n) + .map(|i| { + let mut s = HashSet::new(); + while s.len() < p.r.min(n - 1) { + let j = rng.below(n); + if j != i { s.insert(j as u32); } + } + s.into_iter().collect() + }) + .collect(); + let med = medoid(vecs, w); + let mut order: Vec<usize> = (0..n).collect(); + for i in (1..n).rev() { order.swap(i, rng.below(i + 1)); } // shuffle + let mut evals = 0usize; + for &node in &order { + let (_, visited, e) = greedy(&graph, vecs, w, med, &vecs[node], p.l, p.k); + evals += e; + let nbrs = robust_prune(node as u32, &visited, vecs, w, p.alpha, p.r); + graph[node] = nbrs.clone(); + for q in nbrs { + let qi = q as usize; + if !graph[qi].contains(&(node as u32)) { + graph[qi].push(node as u32); + if graph[qi].len() > p.r { + let cand = graph[qi].clone(); + graph[qi] = robust_prune(q, &cand, vecs, w, p.alpha, p.r); + } + } + } + } + (graph, evals) +} + +fn recall(got: &[u32], truth: &[u32]) -> f64 { + let t: HashSet<u32> = truth.iter().copied().collect(); + got.iter().filter(|g| t.contains(g)).count() as f64 / truth.len() as f64 +} + +fn read_vectors(path: &str, n: usize) -> Vec<Vec32> { + let data = std::fs::read_to_string(path).expect("read features"); + data.lines().take(n) + .map(|l| l.split(',').filter_map(|s| s.trim().parse::<f32>().ok()).collect()) + .filter(|v: &Vec32| !v.is_empty()) + .collect() +} + +// ---- metric drift modelled as a vector-space transform A (row-major dim*dim) ---- +// metric M = A^T A; equivalently transform vectors by A and use plain L2. +fn gaussian(rng: &mut Rng) -> f32 { + let u1 = (rng.f32() as f64).max(1e-9); + let u2 = rng.f32() as f64; + ((-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos()) as f32 +} + +fn random_rotation(dim: usize, rng: &mut Rng) -> Vec<f32> { + // Gram-Schmidt on a Gaussian matrix → orthonormal rows. 
+ let mut m: Vec<Vec<f32>> = (0..dim).map(|_| (0..dim).map(|_| gaussian(rng)).collect()).collect(); + for i in 0..dim { + for j in 0..i { + let dot: f32 = (0..dim).map(|k| m[i][k] * m[j][k]).sum(); + for k in 0..dim { m[i][k] -= dot * m[j][k]; } + } + let norm: f32 = m[i].iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-9); + for k in 0..dim { m[i][k] /= norm; } + } + m.into_iter().flatten().collect() +} + +fn identity(dim: usize) -> Vec<f32> { + let mut a = vec![0.0f32; dim * dim]; + for i in 0..dim { a[i * dim + i] = 1.0; } + a +} + +/// Diagonal drift target: A = diag(sqrt(scale)), scale in [0.2, 3.0]. +fn target_diag(dim: usize, rng: &mut Rng) -> Vec<f32> { + let mut a = vec![0.0f32; dim * dim]; + for i in 0..dim { a[i * dim + i] = (0.2 + 2.8 * rng.f32()).sqrt(); } + a +} + +/// Dense/rotational drift target: A = diag(sqrt(scale)) · R (anisotropic scaling +/// along rotated axes — a general Mahalanobis metric; the adversarial case). +fn target_rot(dim: usize, rng: &mut Rng) -> Vec<f32> { + let r = random_rotation(dim, rng); + let mut a = vec![0.0f32; dim * dim]; + for i in 0..dim { + let s = (0.2 + 2.8 * rng.f32()).sqrt(); + for j in 0..dim { a[i * dim + j] = s * r[i * dim + j]; } + } + a +} + +fn lerp_mat(a0: &[f32], a1: &[f32], t: f32) -> Vec<f32> { + a0.iter().zip(a1).map(|(x, y)| x * (1.0 - t) + y * t).collect() +} + +fn apply(a: &[f32], vecs: &[Vec32], dim: usize) -> Vec<Vec32> { + vecs.iter().map(|v| { + (0..dim).map(|i| { + let row = &a[i * dim..(i + 1) * dim]; + row.iter().zip(v).map(|(x, y)| x * y).sum() + }).collect() + }).collect() +} + +fn main() { + let args: Vec<String> = std::env::args().collect(); + let path = args.get(1).cloned().unwrap_or_else(|| "target/m1-data/node-feat-2000.csv".into()); + let n: usize = args.get(2).and_then(|s| s.parse().ok()).unwrap_or(2000); + let vecs = read_vectors(&path, n); + let n = vecs.len(); + let dim = vecs[0].len(); + let p = Params { r: 24, l: 64, alpha: 1.2, k: 10 }; + + // Query set: 100 sampled nodes (their own vectors as queries; self excluded). 
+ let mut qrng = Rng::new(999); + let queries: Vec<usize> = (0..100.min(n)).map(|_| qrng.below(n)).collect(); + let ones = vec![1.0f32; dim]; + + eprintln!("[bet1] {n} vectors x {dim} dims; Vamana R={} L={} alpha={} k={}", p.r, p.l, p.alpha, p.k); + // Base graph built once in the ORIGINAL space (drift t=0 == identity transform). + let t0 = Instant::now(); + let (g0, e0) = build(&vecs, &ones, &p, 7); + let med0 = medoid(&vecs, &ones); + eprintln!("[bet1] base graph built once in {:.2}s ({e0} dist evals)\n", t0.elapsed().as_secs_f64()); + + run_mode("DIAGONAL drift (per-axis rescale)", &vecs, &g0, med0, &queries, &p, dim, target_diag(dim, &mut Rng::new(12345))); + run_mode("ROTATIONAL drift (anisotropic scale on rotated axes — adversarial)", &vecs, &g0, med0, &queries, &p, dim, target_rot(dim, &mut Rng::new(54321))); + + println!("\nGate: WIN if A within 2% of B across the sweep; KILL if A drops >2% below B."); + println!("A's rebuild cost is 0 (topology reused); B pays a full rebuild per drift step."); +} + +#[allow(clippy::too_many_arguments)] +fn run_mode(label: &str, vecs: &[Vec32], g0: &[Vec<u32>], med0: u32, queries: &[usize], p: &Params, dim: usize, a_target: Vec<f32>) { + let ones = vec![1.0f32; dim]; + let id = identity(dim); + let truth0: Vec<Vec<u32>> = queries.iter().map(|&q| brute_topk(vecs, &ones, q, p.k)).collect(); + + println!("=== BET 1: {label} ==="); + println!("{:>5} {:>10} | {:>9} {:>9} {:>9} | {:>9} {:>8}", "t", "churn", "A reweit", "B rebuild", "C stale", "B build s", "A-B"); + println!("{}", "-".repeat(74)); + + for &t in &[0.0f32, 0.1, 0.25, 0.5, 0.75, 1.0] { + let at = lerp_mat(&id, &a_target, t); + let vt = apply(&at, vecs, dim); // vectors in the drifted metric space + let truth: Vec<Vec<u32>> = queries.iter().map(|&q| brute_topk(&vt, &ones, q, p.k)).collect(); + let churn: f64 = truth.iter().zip(&truth0).map(|(a, b)| 1.0 - recall(a, b)).sum::<f64>() / queries.len() as f64; + + // A: reuse the original-space graph, but compute distances in the drifted space. 
+ let ra: f64 = queries.iter().zip(&truth).map(|(&q, tr)| { + let (got, _, _) = greedy(g0, &vt, &ones, med0, &vt[q], p.l, p.k); + recall(&got, tr) + }).sum::<f64>() / queries.len() as f64; + + // B: rebuild the graph in the drifted space. + let tb = Instant::now(); + let (gt, _) = build(&vt, &ones, p, 7); + let bt = tb.elapsed().as_secs_f64(); + let medt = medoid(&vt, &ones); + let rb: f64 = queries.iter().zip(&truth).map(|(&q, tr)| { + let (got, _, _) = greedy(&gt, &vt, &ones, medt, &vt[q], p.l, p.k); + recall(&got, tr) + }).sum::<f64>() / queries.len() as f64; + + // C: stale — search the original graph in the ORIGINAL space, score vs drifted truth. + let rc: f64 = queries.iter().zip(&truth).map(|(&q, tr)| { + let (got, _, _) = greedy(g0, vecs, &ones, med0, &vecs[q], p.l, p.k); + recall(&got, tr) + }).sum::<f64>() / queries.len() as f64; + + println!("{:>5.2} {:>9.0}% | {:>8.1}% {:>8.1}% {:>8.1}% | {:>9.2} {:>+7.1}%", t, churn * 100.0, ra * 100.0, rb * 100.0, rc * 100.0, bt, (ra - rb) * 100.0); + } + println!(); +} diff --git a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md new file mode 100644 index 0000000000..86561d3bb9 --- /dev/null +++ b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md @@ -0,0 +1,108 @@ +--- +adr: 200 +title: "Customizable Re-Weighting: Fixed-Topology ANN Under Metric Drift" +status: proposed +date: 2026-06-04 +authors: [ofershaal, claude-flow] +related: [ADR-196, ADR-198, ADR-199] +tags: [ruvector, retrieval, ann, vamana, hnsw, self-learning, metric-drift, customization] +--- + +# ADR-200 — Customizable Re-Weighting: Fixed-Topology ANN Under Metric Drift + +## Status + +**Proposed — experimentally validated for linear drift, bounded (2026-06-04).** This +salvages the one idea from the SepRAG exploration ([ADR-196]) that survived every test — +the *customizable metric* of [ADR-198] — and re-tests it **standalone, decoupled from +CCH**, since CCH full-contraction 
was found NO-GO on embedding graphs ([ADR-199]). + +## Context + +RuVector is a self-learning memory: a GNN continuously re-estimates relevance, so the +effective distance/relevance metric **drifts** over time. A flat ANN index +(HNSW / `ruvector-diskann` Vamana) is built *for* a metric; when the metric drifts, its +proximity graph becomes suboptimal and the textbook remedy is a costly **rebuild** +(superlinear; minutes-to-hours at corpus scale). + +ADR-198 proposed that topology and metric can be decoupled — re-weight cheaply, rebuild +rarely. CCH was one (failed) vehicle for that. The question this ADR answers: **does a +fixed ANN topology, with only distances recomputed under the new metric, retain recall +as well as a full rebuild — and for how much drift?** + +## Decision / Finding + +**Reuse the navigation topology under metric drift; recompute only distances. Rebuild is +deferred, not per-update.** Validated head-to-head (pre-registered gate) against a full +rebuild, on real ogbn-arxiv embeddings, with a stale-index negative control. + +Harness: `crates/ruvector-seprag/examples/reweight_vs_rebuild.rs` (self-contained +Vamana-lite: RobustPrune + greedy beam search). Drift modelled as a vector-space +transform `A`, metric `M = AᵀA`; sweep `A(t) = (1−t)I + t·A_target`. + +Strategies (recall@10 vs brute-force truth **under the drifted metric**): +- **A re-weight** — graph built once in the original space, searched under the drifted + metric. Rebuild cost: **0**. +- **B rebuild** — graph rebuilt under the drifted metric. Rebuild cost: full. +- **C stale** — original graph searched under the *original* metric (ignores drift). Floor. 
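
The drift model used by the harness can be illustrated with a minimal sketch. It is not the harness code: the helper names (`apply_transform`, `sq_l2`, `mahalanobis_sq`, `drift_matrix`) and the 2×2 target matrix are assumptions made up for this example. It checks numerically that transforming the vectors by `A` and taking plain squared L2 gives the same distance as the quadratic form with `M = AᵀA`, and that `A(t) = (1−t)I + t·A_target` starts at the identity.

```rust
/// Row-major dim x dim matrix times a vector.
fn apply_transform(a: &[f32], v: &[f32]) -> Vec<f32> {
    let dim = v.len();
    (0..dim)
        .map(|i| a[i * dim..(i + 1) * dim].iter().zip(v).map(|(x, y)| x * y).sum())
        .collect()
}

/// Plain squared Euclidean distance.
fn sq_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// (x - y)^T (A^T A) (x - y), computed explicitly through M = A^T A.
fn mahalanobis_sq(a: &[f32], x: &[f32], y: &[f32]) -> f32 {
    let dim = x.len();
    let d: Vec<f32> = x.iter().zip(y).map(|(p, q)| p - q).collect();
    let mut total = 0.0f32;
    for i in 0..dim {
        for j in 0..dim {
            // M[i][j] = sum_k A[k][i] * A[k][j]
            let m_ij: f32 = (0..dim).map(|k| a[k * dim + i] * a[k * dim + j]).sum();
            total += d[i] * m_ij * d[j];
        }
    }
    total
}

/// Drift sweep: A(t) = (1 - t) * I + t * A_target (row-major).
fn drift_matrix(a_target: &[f32], dim: usize, t: f32) -> Vec<f32> {
    (0..dim * dim)
        .map(|idx| {
            let id = if idx / dim == idx % dim { 1.0 } else { 0.0 };
            (1.0 - t) * id + t * a_target[idx]
        })
        .collect()
}

fn main() {
    // Hypothetical 2x2 drift target and two sample points.
    let a_target = vec![2.0f32, 1.0, 0.0, 0.5];
    let (x, y) = (vec![1.0f32, 3.0], vec![-2.0f32, 0.5]);
    for t in [0.0f32, 0.5, 1.0] {
        let a = drift_matrix(&a_target, 2, t);
        let via_transform = sq_l2(&apply_transform(&a, &x), &apply_transform(&a, &y));
        let via_metric = mahalanobis_sq(&a, &x, &y);
        println!("t={t}: transformed L2 = {via_transform}, quadratic form = {via_metric}");
    }
}
```

Working in the transformed space is what lets strategies A and B share one distance routine: only the vectors change between drift steps, while the graph used by A stays fixed.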
+ +### Evidence (n=2000, dim=128, Vamana R=24 L=64 α=1.2, k=10) + +DIAGONAL drift (per-axis rescale): + +| t | set churn | A re-weight | B rebuild | C stale | A−B | +|---|---|---|---|---|---| +| 0.25 | 8% | 90.1% | 90.2% | 86.0% | −0.1% | +| 0.50 | 15% | 90.1% | 90.0% | 80.4% | +0.1% | +| 1.00 | 27% | 90.0% | 90.0% | 70.0% | +0.0% | + +ROTATIONAL drift (anisotropic scale on rotated axes — adversarial, general Mahalanobis): + +| t | set churn | A re-weight | B rebuild | C stale | A−B | +|---|---|---|---|---|---| +| 0.10 | 10% | 90.1% | 90.1% | 84.3% | +0.0% | +| 0.25 | 25% | 90.1% | 90.0% | 70.3% | +0.1% | +| 0.50 | 36% | 90.0% | 90.1% | 61.0% | −0.1% | +| 1.00 | 23% | 90.1% | 90.0% | 73.0% | +0.1% | + +**Gate (pre-registered): WIN** — A within 0.2% of B across *both* drift modes, up to 36% +relevant-set churn. The C control degrades up to 29 points, proving the graph matters +(the benchmark is not insensitive) — so A's parity is genuine adaptation. + +**Mechanism:** a RobustPrune graph is a *navigation scaffold* with diversified long-range +edges; greedy search uses the *new* distances to choose direction, while the *old* edges +remain sufficient to navigate. Navigability is robust to smooth (linear) remetrization. + +## Consequences + +**Positive.** +- A self-learning system can **defer/avoid index rebuilds under linear metric drift** at + no recall cost — the customizable-metric capability HNSW lacks. Cost asymmetry grows + with corpus size (rebuild is superlinear; re-weight is free), so the value increases at + scale. +- This is a *cost* win at equal *quality* (not higher recall) — stated precisely to avoid + overclaiming. + +**Boundaries / not yet proven (the honest caveats).** +- **Linear drift only.** Diagonal + dense Mahalanobis tested; a **non-linear** learned + metric is the next adversarial frontier and could break navigability. +- **Scale.** n=2000; recall-at-scale (n≥10⁵) and the rebuild-cost curve unconfirmed. 
+- **Global drift.** Same transform for all points; **region-local** metric change (different + relevance in different regions) is harder and untested. +- **Baseline.** Compared vs *full* rebuild; an *incremental*-update baseline is not yet in. + +## Next steps + +1. Non-linear drift (a small learned/MLP metric) — the decisive adversarial test. +2. Scale to n≥10⁵ on a real ANN index (`ruvector-diskann`) + measure rebuild-cost curve. +3. Region-local drift. +4. Incremental-rebuild baseline for a fair cost comparison. +5. If 1–2 hold: wire re-weight-on-drift into the `ruvector-diskann`/GNN loop behind a flag. + +## Alternatives considered + +- **Rebuild on every metric update** — the incumbent; the cost this ADR removes (kept as + the baseline B). +- **CCH customization** ([ADR-198] via [ADR-196]) — rejected: contraction blows up on + embedding graphs ([ADR-199]). The *idea* (cheap re-weight) is retained; the *vehicle* + (CCH) is dropped in favour of plain fixed-topology ANN. diff --git a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md new file mode 100644 index 0000000000..29b052fa8d --- /dev/null +++ b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md @@ -0,0 +1,61 @@ +# SepRAG — Future Directions & Research Backlog + +After the CCH-contraction NO-GO on embedding graphs ([ADR-199 Empirical +Outcome](../../adr/ADR-199-public-corpus-benchmark-harness.md#empirical-outcome-2026-06-04)), +two ideas survived every test and several promising directions remain. This file keeps +them alive so they're explored deliberately, each under the same discipline. + +## The "prove not hype" protocol (mandatory for every bet) + +A result only counts if it satisfies **all five**: + +1. **One claim, one number.** e.g. "N× cheaper at equal recall@10," not "faster." +2. **Beat the strongest in-repo incumbent, tuned** — HNSW / `ruvector-diskann` (Vamana) / + `ruvector-acorn` (filtered ANN) — never a strawman. +3. 
**Public data + ground truth** (ogbn-arxiv in hand; BEIR / filtered-ANN sets available). +4. **Pre-register the win AND kill condition** before running. A loss is an acceptable, + reportable outcome. +5. **Adversarial check.** Explicitly ask "would the baseline win if tuned harder?" and + include that variant. + +## Backlog (ranked by upside × provability) + +### BET 1 — Customizable re-weight vs rebuild ✅ WIN (linear drift), see [ADR-200] +Salvages ADR-198 (the customizable metric), decoupled from CCH. Result: a **fixed ANN +topology + recomputed distances** matches full Vamana **rebuild** recall within 0.2% up to +**36% relevant-set churn**, under *both* diagonal and adversarial dense-Mahalanobis +(rotational) drift — at **zero** rebuild cost. Stale-index control loses up to 29 points +(benchmark has teeth). Full evidence + boundaries in +[ADR-200](../../adr/ADR-200-customizable-reweighting-fixed-topology-ann.md). +Harness: `crates/ruvector-seprag/examples/reweight_vs_rebuild.rs`. +- **Open (the honest caveats, ranked):** (1) **non-linear** learned metric — the decisive + next adversarial test; (2) scale to n≥10⁵ on `ruvector-diskann` + rebuild-cost curve; + (3) region-local drift; (4) incremental-rebuild baseline. Do (1) next. + +### BET 2 — Filtered ANN vs `ruvector-acorn` +Region/predicate pruning for constrained ("nearest among items matching X") retrieval — a +real flat-ANN weakness. Higher effort; ACORN is a strong specialized incumbent in-repo, so +it's a harder, longer fight. Needs a filtered-ANN benchmark with selectivity sweeps. + +### BET 3 — Multi-hop Graph-RAG on a sparse curated KG +ADR-196's true scope: structural + semantic retrieval on a Wikidata-style KG (sparse, +bounded relation degree — possibly more road-like than the dense citation graph that +failed), with multi-hop QA ground truth (HotpotQA / MuSiQue / 2WikiMultiHop). Biggest +upside, most data engineering, least clean head-to-head. 
+ +### BET 4 — Region pruning on an IVF/clustering hierarchy +Structural pivot: move the validated separator-tree **pruning query** off separators (which +need small treewidth) and onto a **clustering/IVF hierarchy** (`rairs-ivf`, ADR-193), which +is treewidth-immune. Most novel; define the baseline (plain IVF probe) before building. + +## Salvaged, validated assets (reusable regardless of bet) + +- `ruvector-seprag` — correct, tested CCH nested-dissection + separator-tree k-NN reference. +- The separator-tree **branch-and-bound pruning query** — exact recall, ~100% search-space + reduction, *treewidth-independent*. The reusable kernel. +- Road-control + manifold + citation harnesses — a treewidth probe for any new backbone. + +## Dead (do not revisit without new evidence) + +- CCH **full contraction on embedding / dense-similarity graphs** — high treewidth, + confirmed across citation + Euclidean feature backbones. HNSW already owns embedding kNN. From 8ceab24576487189ea259c87e65abaa3b7293253 Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 09:25:39 -0400 Subject: [PATCH 08/15] =?UTF-8?q?feat(seprag):=20BET=201=20complete=20?= =?UTF-8?q?=E2=80=94=20non-linear=20drift=20+=20query-cost=20check=20(tota?= =?UTF-8?q?l=20WIN)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extend the re-weight-vs-rebuild harness with the decisive adversarial cases and close the proof: - non-linear drift mode (residual tanh warp v + s*tanh(Wv)) — removes the 'linear only' caveat; A still matches B within 0.2% up to 35% churn. - per-query distance-eval columns — A and B match within ~1%, disproving any hidden query-cost trade. Reuse is equal recall AND equal query cost. - fix display bug (C/stale was double-divided by query count; control now correctly reads 90% at t=0, validating the negative control). - drift modelled via transform closure (diag/rot/nonlin share one code path). 
- clippy: idiomatic char-class split in m1_arxiv reader. ADR-200 + FUTURE-DIRECTIONS updated: WIN across diagonal/rotational/non-linear drift; only open caveats are scale (n>=1e5, decisive next), region-local drift, incremental-rebuild baseline. --- crates/ruvector-seprag/examples/m1_arxiv.rs | 2 +- .../examples/reweight_vs_rebuild.rs | 66 ++++++++++++++----- ...omizable-reweighting-fixed-topology-ann.md | 60 ++++++++++++----- .../seprag-cch-retrieval/FUTURE-DIRECTIONS.md | 18 ++--- 4 files changed, 100 insertions(+), 46 deletions(-) diff --git a/crates/ruvector-seprag/examples/m1_arxiv.rs b/crates/ruvector-seprag/examples/m1_arxiv.rs index 94e9aa73c1..2459aa2fc9 100644 --- a/crates/ruvector-seprag/examples/m1_arxiv.rs +++ b/crates/ruvector-seprag/examples/m1_arxiv.rs @@ -107,7 +107,7 @@ fn read_edges(path: &str) -> (Vec>, usize) { if line.starts_with('#') || line.is_empty() { continue; } - let mut it = line.split(|c| c == ',' || c == '\t' || c == ' ').filter(|s| !s.is_empty()); + let mut it = line.split(|c: char| matches!(c, ',' | '\t' | ' ')).filter(|s| !s.is_empty()); if let (Some(a), Some(b)) = (it.next(), it.next()) { if let (Ok(u), Ok(v)) = (a.trim().parse::(), b.trim().parse::()) { max_id = max_id.max(u).max(v); diff --git a/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs b/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs index 9489df2358..92a22b4bc4 100644 --- a/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs +++ b/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs @@ -227,6 +227,19 @@ fn apply(a: &[f32], vecs: &[Vec32], dim: usize) -> Vec { }).collect() } +/// Non-linear residual warp: f_s(v) = v + s · tanh(W v). At s=0 it is the +/// identity; growing s bends the space non-linearly (the adversarial case the +/// "navigability survives *linear* remetrization" argument does NOT cover). 
+fn apply_nonlin(w: &[f32], vecs: &[Vec32], s: f32, dim: usize) -> Vec<Vec32> { + vecs.iter().map(|v| { + (0..dim).map(|i| { + let row = &w[i * dim..(i + 1) * dim]; + let u: f32 = row.iter().zip(v).map(|(x, y)| x * y).sum(); + v[i] + s * u.tanh() + }).collect() + }).collect() +} + fn main() { let args: Vec<String> = std::env::args().collect(); let path = args.get(1).cloned().unwrap_or_else(|| "target/m1-data/node-feat-2000.csv".into()); @@ -248,52 +261,69 @@ fn main() { let med0 = medoid(&vecs, &ones); eprintln!("[bet1] base graph built once in {:.2}s ({e0} dist evals)\n", t0.elapsed().as_secs_f64()); - run_mode("DIAGONAL drift (per-axis rescale)", &vecs, &g0, med0, &queries, &p, dim, target_diag(dim, &mut Rng::new(12345))); - run_mode("ROTATIONAL drift (anisotropic scale on rotated axes — adversarial)", &vecs, &g0, med0, &queries, &p, dim, target_rot(dim, &mut Rng::new(54321))); + let id = identity(dim); + let diag = target_diag(dim, &mut Rng::new(12345)); + let rot = target_rot(dim, &mut Rng::new(54321)); + // Non-linear warp matrix (scaled so tanh operates in its non-linear regime). 
+ let warp = random_rotation(dim, &mut Rng::new(7)); + let beta = 4.0f32; + + run_mode("DIAGONAL drift (per-axis rescale)", &g0, med0, &queries, &p, dim, + |t| apply(&lerp_mat(&id, &diag, t), &vecs, dim)); + run_mode("ROTATIONAL drift (anisotropic scale on rotated axes — adversarial linear)", &g0, med0, &queries, &p, dim, + |t| apply(&lerp_mat(&id, &rot, t), &vecs, dim)); + run_mode("NON-LINEAR drift (residual tanh warp — adversarial non-linear)", &g0, med0, &queries, &p, dim, + |t| apply_nonlin(&warp, &vecs, t * beta, dim)); println!("\nGate: WIN if A within 2% of B across the sweep; KILL if A drops >2% below B."); println!("A's rebuild cost is 0 (topology reused); B pays a full rebuild per drift step."); } #[allow(clippy::too_many_arguments)] -fn run_mode(label: &str, vecs: &[Vec32], g0: &[Vec<u32>], med0: u32, queries: &[usize], p: &Params, dim: usize, a_target: Vec<f32>) { +fn run_mode<F: Fn(f32) -> Vec<Vec32>>(label: &str, g0: &[Vec<u32>], med0: u32, queries: &[usize], p: &Params, dim: usize, vt_of: F) { let ones = vec![1.0f32; dim]; - let id = identity(dim); - let truth0: Vec<Vec<u32>> = queries.iter().map(|&q| brute_topk(vecs, &ones, q, p.k)).collect(); + let v0 = vt_of(0.0); // original space (drift t=0) + let truth0: Vec<Vec<u32>> = queries.iter().map(|&q| brute_topk(&v0, &ones, q, p.k)).collect(); println!("=== BET 1: {label} ==="); - println!("{:>5} {:>10} | {:>9} {:>9} {:>9} | {:>9} {:>8}", "t", "churn", "A reweit", "B rebuild", "C stale", "B build s", "A-B"); + println!("{:>5} {:>7} | {:>8} {:>8} {:>8} | {:>8} | {:>7} {:>7}", "t", "churn", "A rewt", "B rebld", "C stale", "B bld s", "A ev/q", "B ev/q"); println!("{}", "-".repeat(74)); for &t in &[0.0f32, 0.1, 0.25, 0.5, 0.75, 1.0] { - let at = lerp_mat(&id, &a_target, t); - let vt = apply(&at, vecs, dim); // vectors in the drifted metric space + let vt = vt_of(t); // vectors in the drifted metric space let truth: Vec<Vec<u32>> = queries.iter().map(|&q| brute_topk(&vt, &ones, q, p.k)).collect(); let churn: f64 = truth.iter().zip(&truth0).map(|(a, b)| 1.0 - recall(a, 
b)).sum::<f64>() / queries.len() as f64; // A: reuse the original-space graph, but compute distances in the drifted space. - let ra: f64 = queries.iter().zip(&truth).map(|(&q, tr)| { - let (got, _, _) = greedy(g0, &vt, &ones, med0, &vt[q], p.l, p.k); - recall(&got, tr) - }).sum::<f64>() / queries.len() as f64; + let (mut ra, mut a_ev) = (0.0f64, 0usize); + for (&q, tr) in queries.iter().zip(&truth) { + let (got, _, ev) = greedy(g0, &vt, &ones, med0, &vt[q], p.l, p.k); + ra += recall(&got, tr); + a_ev += ev; + } // B: rebuild the graph in the drifted space. let tb = Instant::now(); let (gt, _) = build(&vt, &ones, p, 7); let bt = tb.elapsed().as_secs_f64(); let medt = medoid(&vt, &ones); - let rb: f64 = queries.iter().zip(&truth).map(|(&q, tr)| { - let (got, _, _) = greedy(&gt, &vt, &ones, medt, &vt[q], p.l, p.k); - recall(&got, tr) - }).sum::<f64>() / queries.len() as f64; + let (mut rb, mut b_ev) = (0.0f64, 0usize); + for (&q, tr) in queries.iter().zip(&truth) { + let (got, _, ev) = greedy(&gt, &vt, &ones, medt, &vt[q], p.l, p.k); + rb += recall(&got, tr); + b_ev += ev; + } // C: stale — search the original graph in the ORIGINAL space, score vs drifted truth. let rc: f64 = queries.iter().zip(&truth).map(|(&q, tr)| { - let (got, _, _) = greedy(g0, vecs, &ones, med0, &vecs[q], p.l, p.k); + let (got, _, _) = greedy(g0, &v0, &ones, med0, &v0[q], p.l, p.k); recall(&got, tr) }).sum::<f64>() / queries.len() as f64; - println!("{:>5.2} {:>9.0}% | {:>8.1}% {:>8.1}% {:>8.1}% | {:>9.2} {:>+7.1}%", t, churn * 100.0, ra * 100.0, rb * 100.0, rc * 100.0, bt, (ra - rb) * 100.0); + let nq = queries.len() as f64; + // ra, rb are sums (divide here); rc is already a mean. 
+ println!("{:>5.2} {:>6.0}% | {:>7.1}% {:>7.1}% {:>7.1}% | {:>8.2} | {:>7.0} {:>7.0}", + t, churn * 100.0, ra / nq * 100.0, rb / nq * 100.0, rc * 100.0, bt, a_ev as f64 / nq, b_ev as f64 / nq); } println!(); } diff --git a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md index 86561d3bb9..f321807b9c 100644 --- a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md +++ b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md @@ -12,10 +12,13 @@ tags: [ruvector, retrieval, ann, vamana, hnsw, self-learning, metric-drift, cust ## Status -**Proposed — experimentally validated for linear drift, bounded (2026-06-04).** This -salvages the one idea from the SepRAG exploration ([ADR-196]) that survived every test — -the *customizable metric* of [ADR-198] — and re-tests it **standalone, decoupled from -CCH**, since CCH full-contraction was found NO-GO on embedding graphs ([ADR-199]). +**Proposed — experimentally validated across diagonal, rotational, and non-linear drift; +bounded by scale (2026-06-04).** This salvages the one idea from the SepRAG exploration +([ADR-196]) that survived every test — the *customizable metric* of [ADR-198] — and +re-tests it **standalone, decoupled from CCH**, since CCH full-contraction was found NO-GO +on embedding graphs ([ADR-199]). The fixed topology matches full rebuild on **both recall +and per-query cost** at **zero** rebuild cost; the only open caveats are scale, region-local +drift, and an incremental-rebuild baseline. ## Context @@ -65,13 +68,29 @@ ROTATIONAL drift (anisotropic scale on rotated axes — adversarial, general Mah | 0.50 | 36% | 90.0% | 90.1% | 61.0% | −0.1% | | 1.00 | 23% | 90.1% | 90.0% | 73.0% | +0.1% | -**Gate (pre-registered): WIN** — A within 0.2% of B across *both* drift modes, up to 36% -relevant-set churn. 
The C control degrades up to 29 points, proving the graph matters -(the benchmark is not insensitive) — so A's parity is genuine adaptation. +NON-LINEAR drift (residual tanh warp `v + s·tanh(Wv)` — adversarial non-linear): -**Mechanism:** a RobustPrune graph is a *navigation scaffold* with diversified long-range -edges; greedy search uses the *new* distances to choose direction, while the *old* edges -remain sufficient to navigate. Navigability is robust to smooth (linear) remetrization. +| t | set churn | A re-weight | B rebuild | C stale | A−B | +|---|---|---|---|---|---| +| 0.10 | 24% | 90.1% | 90.1% | 72.1% | +0.0% | +| 0.25 | 35% | 90.0% | 90.1% | 61.6% | −0.1% | +| 0.50 | 29% | 90.0% | 90.0% | 67.2% | +0.0% | +| 1.00 | 18% | 90.1% | 89.9% | 77.7% | +0.2% | + +**Gate (pre-registered): WIN** — A within 0.2% of B across *all three* drift modes, up to +36% relevant-set churn. The C control degrades up to 29 points, proving the graph matters +(the benchmark is not insensitive) — so A's parity is genuine adaptation, not insensitivity. + +**Query cost is also equal.** Mean distance-evals/query: A ≈ B within ~1% in every row +(e.g. 590 vs 583 at peak churn). So reuse does **not** trade build savings for slower +queries — it matches B on recall *and* per-query work. + +**Mechanism:** a RobustPrune graph is a *navigation scaffold* of diversified directions; +greedy search uses the *new* distances to choose direction, while the *old* edges remain +sufficient to navigate. For navigable graphs, top-k recall is governed by navigability + +beam width, not edge metric-optimality — and navigability survives smooth remetrization, +linear *or* non-linear. (Edge optimality would matter more for path length / efficiency, +which is why we also checked per-query evals and found them equal.) ## Consequences @@ -84,20 +103,25 @@ remain sufficient to navigate. Navigability is robust to smooth (linear) remetri overclaiming. 
**Boundaries / not yet proven (the honest caveats).** -- **Linear drift only.** Diagonal + dense Mahalanobis tested; a **non-linear** learned - metric is the next adversarial frontier and could break navigability. -- **Scale.** n=2000; recall-at-scale (n≥10⁵) and the rebuild-cost curve unconfirmed. +- **Scale.** n=2000; recall-at-scale (n≥10⁵) and the rebuild-cost curve unconfirmed. This + is now the *primary* open question — and the cost asymmetry only grows with n. - **Global drift.** Same transform for all points; **region-local** metric change (different relevance in different regions) is harder and untested. - **Baseline.** Compared vs *full* rebuild; an *incremental*-update baseline is not yet in. +- **Synthetic drift.** Drift is parametric (diag/rot/tanh), not a real learned-GNN metric + trajectory — realistic, but the live GNN loop is the eventual proof. + +*(Resolved: the "linear drift only" caveat — non-linear tanh-warp drift now tested and +passes, so navigability robustness is not limited to linear remetrization.)* ## Next steps -1. Non-linear drift (a small learned/MLP metric) — the decisive adversarial test. -2. Scale to n≥10⁵ on a real ANN index (`ruvector-diskann`) + measure rebuild-cost curve. -3. Region-local drift. -4. Incremental-rebuild baseline for a fair cost comparison. -5. If 1–2 hold: wire re-weight-on-drift into the `ruvector-diskann`/GNN loop behind a flag. +1. **Scale to n≥10⁵** on a real ANN index (`ruvector-diskann`) + measure the rebuild-cost + curve — the decisive remaining test (cost asymmetry grows with n). +2. Region-local drift. +3. Incremental-rebuild baseline for a fair cost comparison. +4. Wire re-weight-on-drift into the `ruvector-diskann`/GNN loop behind a flag and validate + on a real learned-metric trajectory. 
## Alternatives considered diff --git a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md index 29b052fa8d..5651e63655 100644 --- a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md +++ b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md @@ -20,17 +20,17 @@ A result only counts if it satisfies **all five**: ## Backlog (ranked by upside × provability) -### BET 1 — Customizable re-weight vs rebuild ✅ WIN (linear drift), see [ADR-200] +### BET 1 — Customizable re-weight vs rebuild ✅ WIN (diag + rot + non-linear), see [ADR-200] Salvages ADR-198 (the customizable metric), decoupled from CCH. Result: a **fixed ANN -topology + recomputed distances** matches full Vamana **rebuild** recall within 0.2% up to -**36% relevant-set churn**, under *both* diagonal and adversarial dense-Mahalanobis -(rotational) drift — at **zero** rebuild cost. Stale-index control loses up to 29 points -(benchmark has teeth). Full evidence + boundaries in -[ADR-200](../../adr/ADR-200-customizable-reweighting-fixed-topology-ann.md). +topology + recomputed distances** matches full Vamana **rebuild** on **both recall (±0.2%) +and per-query cost (±1%)** up to **36% relevant-set churn**, across diagonal, dense- +Mahalanobis (rotational), AND non-linear (tanh-warp) drift — at **zero** rebuild cost. +Stale-index control loses up to 29 points (benchmark has teeth). Full evidence + boundaries +in [ADR-200](../../adr/ADR-200-customizable-reweighting-fixed-topology-ann.md). Harness: `crates/ruvector-seprag/examples/reweight_vs_rebuild.rs`. -- **Open (the honest caveats, ranked):** (1) **non-linear** learned metric — the decisive - next adversarial test; (2) scale to n≥10⁵ on `ruvector-diskann` + rebuild-cost curve; - (3) region-local drift; (4) incremental-rebuild baseline. Do (1) next. 
+- **Open (ranked):** (1) **scale to n≥10⁵** on `ruvector-diskann` + rebuild-cost curve — + the decisive remaining test; (2) region-local drift; (3) incremental-rebuild baseline; + (4) wire into the real GNN loop behind a flag. Do (1) next. ### BET 2 — Filtered ANN vs `ruvector-acorn` Region/predicate pruning for constrained ("nearest among items matching X") retrieval — a From 11bef01e31d97380e723af0581b9d55463eaa1c7 Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 09:25:59 -0400 Subject: [PATCH 09/15] chore(seprag): idiomatic char-array split (clippy clean) --- crates/ruvector-seprag/examples/m1_arxiv.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/crates/ruvector-seprag/examples/m1_arxiv.rs b/crates/ruvector-seprag/examples/m1_arxiv.rs index 2459aa2fc9..f6c6fda7cb 100644 --- a/crates/ruvector-seprag/examples/m1_arxiv.rs +++ b/crates/ruvector-seprag/examples/m1_arxiv.rs @@ -107,7 +107,7 @@ fn read_edges(path: &str) -> (Vec>, usize) { if line.starts_with('#') || line.is_empty() { continue; } - let mut it = line.split(|c: char| matches!(c, ',' | '\t' | ' ')).filter(|s| !s.is_empty()); + let mut it = line.split([',', '\t', ' ']).filter(|s| !s.is_empty()); if let (Some(a), Some(b)) = (it.next(), it.next()) { if let (Ok(u), Ok(v)) = (a.trim().parse::(), b.trim().parse::()) { max_id = max_id.max(u).max(v); From 413db9fa4fcf4fe550b47ae20631f5b3f108eb40 Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 09:45:42 -0400 Subject: [PATCH 10/15] refactor(seprag): shared ann engine + scale harness for BET 1 Extract the Vamana-lite ANN + metric-drift helpers into a reusable lib module (src/ann.rs) with an efficient two-heap greedy beam search (replaces the O(L) linear-scan beam, ~2x faster, needed for n>=1e5). Thin reweight_vs_rebuild to use it (regression-checked: identical recall/evals at n=2000). 
Add scale_drift example: sweeps N (5k..100k), measures recall(reuse) vs recall(rebuild) at the adversarial rotational drift point plus the rebuild-cost curve. --- .../examples/reweight_vs_rebuild.rs | 283 ++---------------- .../ruvector-seprag/examples/scale_drift.rs | 89 ++++++ crates/ruvector-seprag/src/ann.rs | 249 +++++++++++++++ crates/ruvector-seprag/src/lib.rs | 1 + 4 files changed, 360 insertions(+), 262 deletions(-) create mode 100644 crates/ruvector-seprag/examples/scale_drift.rs create mode 100644 crates/ruvector-seprag/src/ann.rs diff --git a/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs b/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs index 92a22b4bc4..cab112d602 100644 --- a/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs +++ b/crates/ruvector-seprag/examples/reweight_vs_rebuild.rs @@ -1,245 +1,13 @@ -//! BET 1 (ADR-198, decoupled from CCH): does a FIXED proximity-graph topology + -//! cheap re-weighting absorb metric drift as well as a full rebuild? -//! -//! Self-learning systems change their relevance metric over time. A flat ANN -//! index (HNSW/Vamana) is built *for* a metric; when the metric drifts its graph -//! becomes suboptimal and the textbook fix is a costly rebuild. This harness -//! tests whether reusing the old topology under the new metric ("re-weight", -//! zero build cost) keeps recall close to a rebuild — and quantifies how much -//! drift fixed topology tolerates before a rebuild is actually required. -//! -//! Three strategies, recall@10 measured vs brute-force truth under the CURRENT -//! (drifted) metric: -//! A re-weight : graph built under w0, searched under w_t (build cost: 0) -//! B rebuild : graph rebuilt under w_t, searched under w_t (build cost: full) -//! C stale : graph built under w0, searched under w0 (ignores drift; floor) -//! -//! Pre-registered gate — WIN: recall(A) within 2% of recall(B) across the drift -//! sweep. KILL: recall(A) drops >2% below B even at small drift. +//! 
BET 1 (ADR-200): does a FIXED ANN topology + recomputed distances absorb +//! metric drift as well as a full rebuild? Three drift modes (diagonal, +//! rotational, non-linear), recall@10 + per-query cost vs full rebuild, with a +//! stale-index negative control. Shared engine in `ruvector_seprag::ann`. //! //! Run: cargo run --release -p ruvector-seprag --example reweight_vs_rebuild -- -use std::collections::HashSet; +use ruvector_seprag::ann::*; use std::time::Instant; -type Vec32 = Vec; - -// ----- deterministic RNG (SplitMix64) ----- -struct Rng(u64); -impl Rng { - fn new(s: u64) -> Self { Rng(s) } - fn next(&mut self) -> u64 { - self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15); - let mut z = self.0; - z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9); - z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB); - z ^ (z >> 31) - } - fn f32(&mut self) -> f32 { (self.next() >> 40) as f32 / (1u64 << 24) as f32 } - fn below(&mut self, n: usize) -> usize { (self.next() % n as u64) as usize } -} - -// ----- weighted squared-L2 metric ----- -#[inline] -fn dist(a: &[f32], b: &[f32], w: &[f32]) -> f32 { - let mut s = 0.0f32; - for i in 0..a.len() { - let d = a[i] - b[i]; - s += w[i] * d * d; - } - s -} - -fn brute_topk(vecs: &[Vec32], w: &[f32], q: usize, k: usize) -> Vec { - let mut d: Vec<(f32, u32)> = (0..vecs.len()) - .filter(|&j| j != q) - .map(|j| (dist(&vecs[q], &vecs[j], w), j as u32)) - .collect(); - d.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap()); - d.truncate(k); - d.into_iter().map(|(_, n)| n).collect() -} - -// ----- Vamana-lite proximity graph ----- -struct Params { r: usize, l: usize, alpha: f32, k: usize } - -fn medoid(vecs: &[Vec32], w: &[f32]) -> u32 { - let dim = vecs[0].len(); - let mut c = vec![0.0f32; dim]; - for v in vecs { - for i in 0..dim { c[i] += v[i]; } - } - for x in &mut c { *x /= vecs.len() as f32; } - (0..vecs.len()).min_by(|&a, &b| dist(&vecs[a], &c, w).partial_cmp(&dist(&vecs[b], &c, w)).unwrap()).unwrap() as u32 -} - -/// 
Greedy beam search. Returns (top-k, set of all visited nodes, #distance evals). -fn greedy(graph: &[Vec], vecs: &[Vec32], w: &[f32], entry: u32, q: &[f32], beam: usize, k: usize) -> (Vec, Vec, usize) { - let mut seen: HashSet = HashSet::new(); - let mut expanded: HashSet = HashSet::new(); - let mut pool: Vec<(f32, u32)> = vec![(dist(&vecs[entry as usize], q, w), entry)]; - seen.insert(entry); - let mut evals = 1usize; - loop { - let next = pool.iter().filter(|(_, n)| !expanded.contains(n)).min_by(|a, b| a.0.partial_cmp(&b.0).unwrap()).copied(); - let (_, u) = match next { Some(x) => x, None => break }; - expanded.insert(u); - for &v in &graph[u as usize] { - if seen.insert(v) { - pool.push((dist(&vecs[v as usize], q, w), v)); - evals += 1; - } - } - pool.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap()); - pool.truncate(beam); - } - pool.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap()); - let topk = pool.iter().take(k).map(|&(_, n)| n).collect(); - let visited = seen.into_iter().collect(); - (topk, visited, evals) -} - -/// RobustPrune: keep up to R diverse neighbours (Vamana alpha-pruning). -fn robust_prune(p: u32, cands: &[u32], vecs: &[Vec32], w: &[f32], alpha: f32, r: usize) -> Vec { - let mut pool: Vec<(f32, u32)> = cands.iter().filter(|&&c| c != p).map(|&c| (dist(&vecs[p as usize], &vecs[c as usize], w), c)).collect(); - pool.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap()); - let mut out: Vec = Vec::new(); - let mut i = 0; - while i < pool.len() && out.len() < r { - let (_, pstar) = pool[i]; - out.push(pstar); - pool.retain(|&(dq, q)| alpha * dist(&vecs[pstar as usize], &vecs[q as usize], w) > dq); - i = 0; // pool shrank; restart scan from front of remaining - // skip already-chosen - pool.retain(|&(_, q)| !out.contains(&q)); - } - out -} - -/// Build a Vamana-lite graph under metric `w`. Returns (graph, #distance evals). 
-fn build(vecs: &[Vec32], w: &[f32], p: &Params, seed: u64) -> (Vec>, usize) { - let n = vecs.len(); - let mut rng = Rng::new(seed); - // init: random R-regular - let mut graph: Vec> = (0..n) - .map(|i| { - let mut s = HashSet::new(); - while s.len() < p.r.min(n - 1) { - let j = rng.below(n); - if j != i { s.insert(j as u32); } - } - s.into_iter().collect() - }) - .collect(); - let med = medoid(vecs, w); - let mut order: Vec = (0..n).collect(); - for i in (1..n).rev() { order.swap(i, rng.below(i + 1)); } // shuffle - let mut evals = 0usize; - for &node in &order { - let (_, visited, e) = greedy(&graph, vecs, w, med, &vecs[node], p.l, p.k); - evals += e; - let nbrs = robust_prune(node as u32, &visited, vecs, w, p.alpha, p.r); - graph[node] = nbrs.clone(); - for q in nbrs { - let qi = q as usize; - if !graph[qi].contains(&(node as u32)) { - graph[qi].push(node as u32); - if graph[qi].len() > p.r { - let cand = graph[qi].clone(); - graph[qi] = robust_prune(q, &cand, vecs, w, p.alpha, p.r); - } - } - } - } - (graph, evals) -} - -fn recall(got: &[u32], truth: &[u32]) -> f64 { - let t: HashSet = truth.iter().copied().collect(); - got.iter().filter(|g| t.contains(g)).count() as f64 / truth.len() as f64 -} - -fn read_vectors(path: &str, n: usize) -> Vec { - let data = std::fs::read_to_string(path).expect("read features"); - data.lines().take(n) - .map(|l| l.split(',').filter_map(|s| s.trim().parse::().ok()).collect()) - .filter(|v: &Vec32| !v.is_empty()) - .collect() -} - -// ---- metric drift modelled as a vector-space transform A (row-major dim*dim) ---- -// metric M = A^T A; equivalently transform vectors by A and use plain L2. -fn gaussian(rng: &mut Rng) -> f32 { - let u1 = (rng.f32() as f64).max(1e-9); - let u2 = rng.f32() as f64; - ((-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos()) as f32 -} - -fn random_rotation(dim: usize, rng: &mut Rng) -> Vec { - // Gram-Schmidt on a Gaussian matrix → orthonormal rows. 
- let mut m: Vec> = (0..dim).map(|_| (0..dim).map(|_| gaussian(rng)).collect()).collect(); - for i in 0..dim { - for j in 0..i { - let dot: f32 = (0..dim).map(|k| m[i][k] * m[j][k]).sum(); - for k in 0..dim { m[i][k] -= dot * m[j][k]; } - } - let norm: f32 = m[i].iter().map(|x| x * x).sum::().sqrt().max(1e-9); - for k in 0..dim { m[i][k] /= norm; } - } - m.into_iter().flatten().collect() -} - -fn identity(dim: usize) -> Vec { - let mut a = vec![0.0f32; dim * dim]; - for i in 0..dim { a[i * dim + i] = 1.0; } - a -} - -/// Diagonal drift target: A = diag(sqrt(scale)), scale in [0.2, 3.0]. -fn target_diag(dim: usize, rng: &mut Rng) -> Vec { - let mut a = vec![0.0f32; dim * dim]; - for i in 0..dim { a[i * dim + i] = (0.2 + 2.8 * rng.f32()).sqrt(); } - a -} - -/// Dense/rotational drift target: A = diag(sqrt(scale)) · R (anisotropic scaling -/// along rotated axes — a general Mahalanobis metric; the adversarial case). -fn target_rot(dim: usize, rng: &mut Rng) -> Vec { - let r = random_rotation(dim, rng); - let mut a = vec![0.0f32; dim * dim]; - for i in 0..dim { - let s = (0.2 + 2.8 * rng.f32()).sqrt(); - for j in 0..dim { a[i * dim + j] = s * r[i * dim + j]; } - } - a -} - -fn lerp_mat(a0: &[f32], a1: &[f32], t: f32) -> Vec { - a0.iter().zip(a1).map(|(x, y)| x * (1.0 - t) + y * t).collect() -} - -fn apply(a: &[f32], vecs: &[Vec32], dim: usize) -> Vec { - vecs.iter().map(|v| { - (0..dim).map(|i| { - let row = &a[i * dim..(i + 1) * dim]; - row.iter().zip(v).map(|(x, y)| x * y).sum() - }).collect() - }).collect() -} - -/// Non-linear residual warp: f_s(v) = v + s · tanh(W v). At s=0 it is the -/// identity; growing s bends the space non-linearly (the adversarial case the -/// "navigability survives *linear* remetrization" argument does NOT cover). 
-fn apply_nonlin(w: &[f32], vecs: &[Vec32], s: f32, dim: usize) -> Vec { - vecs.iter().map(|v| { - (0..dim).map(|i| { - let row = &w[i * dim..(i + 1) * dim]; - let u: f32 = row.iter().zip(v).map(|(x, y)| x * y).sum(); - v[i] + s * u.tanh() - }).collect() - }).collect() -} - fn main() { let args: Vec = std::env::args().collect(); let path = args.get(1).cloned().unwrap_or_else(|| "target/m1-data/node-feat-2000.csv".into()); @@ -247,31 +15,27 @@ fn main() { let vecs = read_vectors(&path, n); let n = vecs.len(); let dim = vecs[0].len(); - let p = Params { r: 24, l: 64, alpha: 1.2, k: 10 }; + let p = AnnParams { r: 24, l: 64, alpha: 1.2, k: 10 }; - // Query set: 100 sampled nodes (their own vectors as queries; self excluded). let mut qrng = Rng::new(999); let queries: Vec = (0..100.min(n)).map(|_| qrng.below(n)).collect(); - let ones = vec![1.0f32; dim]; eprintln!("[bet1] {n} vectors x {dim} dims; Vamana R={} L={} alpha={} k={}", p.r, p.l, p.alpha, p.k); - // Base graph built once in the ORIGINAL space (drift t=0 == identity transform). let t0 = Instant::now(); - let (g0, e0) = build(&vecs, &ones, &p, 7); - let med0 = medoid(&vecs, &ones); - eprintln!("[bet1] base graph built once in {:.2}s ({e0} dist evals)\n", t0.elapsed().as_secs_f64()); + let g0 = build(&vecs, &p, 7); + let med0 = medoid(&vecs); + eprintln!("[bet1] base graph built once in {:.2}s\n", t0.elapsed().as_secs_f64()); let id = identity(dim); let diag = target_diag(dim, &mut Rng::new(12345)); let rot = target_rot(dim, &mut Rng::new(54321)); - // Non-linear warp matrix (scaled so tanh operates in its non-linear regime). 
let warp = random_rotation(dim, &mut Rng::new(7)); let beta = 4.0f32; run_mode("DIAGONAL drift (per-axis rescale)", &g0, med0, &queries, &p, dim, - |t| apply(&lerp_mat(&id, &diag, t), &vecs, dim)); + |t| apply_linear(&lerp_mat(&id, &diag, t), &vecs, dim)); run_mode("ROTATIONAL drift (anisotropic scale on rotated axes — adversarial linear)", &g0, med0, &queries, &p, dim, - |t| apply(&lerp_mat(&id, &rot, t), &vecs, dim)); + |t| apply_linear(&lerp_mat(&id, &rot, t), &vecs, dim)); run_mode("NON-LINEAR drift (residual tanh warp — adversarial non-linear)", &g0, med0, &queries, &p, dim, |t| apply_nonlin(&warp, &vecs, t * beta, dim)); @@ -280,48 +44,43 @@ fn main() { } #[allow(clippy::too_many_arguments)] -fn run_mode Vec>(label: &str, g0: &[Vec], med0: u32, queries: &[usize], p: &Params, dim: usize, vt_of: F) { - let ones = vec![1.0f32; dim]; - let v0 = vt_of(0.0); // original space (drift t=0) - let truth0: Vec> = queries.iter().map(|&q| brute_topk(&v0, &ones, q, p.k)).collect(); +fn run_mode Vec>(label: &str, g0: &[Vec], med0: u32, queries: &[usize], p: &AnnParams, _dim: usize, vt_of: F) { + let v0 = vt_of(0.0); + let truth0: Vec> = queries.iter().map(|&q| brute_topk(&v0, q, p.k)).collect(); println!("=== BET 1: {label} ==="); println!("{:>5} {:>7} | {:>8} {:>8} {:>8} | {:>8} | {:>7} {:>7}", "t", "churn", "A rewt", "B rebld", "C stale", "B bld s", "A ev/q", "B ev/q"); println!("{}", "-".repeat(74)); for &t in &[0.0f32, 0.1, 0.25, 0.5, 0.75, 1.0] { - let vt = vt_of(t); // vectors in the drifted metric space - let truth: Vec> = queries.iter().map(|&q| brute_topk(&vt, &ones, q, p.k)).collect(); + let vt = vt_of(t); + let truth: Vec> = queries.iter().map(|&q| brute_topk(&vt, q, p.k)).collect(); let churn: f64 = truth.iter().zip(&truth0).map(|(a, b)| 1.0 - recall(a, b)).sum::() / queries.len() as f64; - // A: reuse the original-space graph, but compute distances in the drifted space. 
         let (mut ra, mut a_ev) = (0.0f64, 0usize);
         for (&q, tr) in queries.iter().zip(&truth) {
-            let (got, _, ev) = greedy(g0, &vt, &ones, med0, &vt[q], p.l, p.k);
+            let (got, _, ev) = search(g0, &vt, med0, &vt[q], p.l, p.k);
             ra += recall(&got, tr);
             a_ev += ev;
         }
 
-        // B: rebuild the graph in the drifted space.
         let tb = Instant::now();
-        let (gt, _) = build(&vt, &ones, p, 7);
+        let gt = build(&vt, p, 7);
         let bt = tb.elapsed().as_secs_f64();
-        let medt = medoid(&vt, &ones);
+        let medt = medoid(&vt);
         let (mut rb, mut b_ev) = (0.0f64, 0usize);
         for (&q, tr) in queries.iter().zip(&truth) {
-            let (got, _, ev) = greedy(&gt, &vt, &ones, medt, &vt[q], p.l, p.k);
+            let (got, _, ev) = search(&gt, &vt, medt, &vt[q], p.l, p.k);
             rb += recall(&got, tr);
             b_ev += ev;
         }
 
-        // C: stale — search the original graph in the ORIGINAL space, score vs drifted truth.
         let rc: f64 = queries.iter().zip(&truth).map(|(&q, tr)| {
-            let (got, _, _) = greedy(g0, &v0, &ones, med0, &v0[q], p.l, p.k);
+            let (got, _, _) = search(g0, &v0, med0, &v0[q], p.l, p.k);
             recall(&got, tr)
         }).sum::<f64>() / queries.len() as f64;
 
         let nq = queries.len() as f64;
-        // ra, rb are sums (divide here); rc is already a mean.
         println!("{:>5.2} {:>6.0}% | {:>7.1}% {:>7.1}% {:>7.1}% | {:>8.2} | {:>7.0} {:>7.0}",
                  t, churn * 100.0, ra / nq * 100.0, rb / nq * 100.0, rc * 100.0, bt, a_ev as f64 / nq, b_ev as f64 / nq);
     }
diff --git a/crates/ruvector-seprag/examples/scale_drift.rs b/crates/ruvector-seprag/examples/scale_drift.rs
new file mode 100644
index 0000000000..f497808752
--- /dev/null
+++ b/crates/ruvector-seprag/examples/scale_drift.rs
@@ -0,0 +1,89 @@
+//! BET 1 scale test (ADR-200 next step): does the re-weight-vs-rebuild win hold
+//! at n≥10⁵, and how big is the rebuild-cost gap?
+//!
+//! For each N, build a base graph, apply the adversarial ROTATIONAL drift
+//! (t=0.5, the ~36%-churn point), then compare:
+//!   A re-weight : reuse base graph under the drifted metric (rebuild cost: 0)
+//!   B rebuild   : rebuild under the drifted metric (rebuild cost: measured)
+//! Recall@10 vs brute-force truth under the drifted metric. The B-build-seconds
+//! column is the rebuild-cost curve; A's update cost is ~0 (topology reused).
+//!
+//! Pre-registered gate: recall(A) within 2% of recall(B) at every N, AND rebuild
+//! cost grows with N (so the saved cost grows). Win = scale-robust + large gap.
+//!
+//! Run: cargo run --release -p ruvector-seprag --example scale_drift -- [Ns...]
+
+use ruvector_seprag::ann::*;
+use std::time::Instant;
+
+fn main() {
+    let args: Vec<String> = std::env::args().collect();
+    let path = args.get(1).cloned().unwrap_or_else(|| "target/m1-data/node-feat-100k.csv".into());
+    let ns: Vec<usize> = if args.len() > 2 {
+        args[2..].iter().filter_map(|s| s.parse().ok()).collect()
+    } else {
+        vec![5000, 10000, 25000, 50000, 100000]
+    };
+    let max_n = *ns.iter().max().unwrap();
+
+    eprintln!("[scale] loading up to {max_n} vectors from {path}");
+    let all = read_vectors(&path, max_n);
+    let dim = all[0].len();
+    eprintln!("[scale] loaded {} vectors x {dim} dims\n", all.len());
+
+    let p = AnnParams { r: 24, l: 64, alpha: 1.2, k: 10 };
+    let id = identity(dim);
+    let rot = target_rot(dim, &mut Rng::new(54321));
+    let drift = lerp_mat(&id, &rot, 0.5); // adversarial ~36%-churn point
+
+    println!("=== BET 1 @ scale: rotational drift (t=0.5), recall@{} ===", p.k);
+    println!("{:>8} | {:>8} {:>8} {:>6} | {:>9} {:>10} | {:>7} {:>7}",
+             "N", "A rewt", "B rebld", "churn", "B build s", "A update s", "A ev/q", "B ev/q");
+    println!("{}", "-".repeat(80));
+
+    for &n in &ns {
+        if n > all.len() { continue; }
+        let vecs: Vec<Vec32> = all[..n].to_vec();
+        let vt = apply_linear(&drift, &vecs, dim);
+
+        // queries + ground truth under the drifted metric.
+        let mut qrng = Rng::new(999);
+        let queries: Vec<usize> = (0..100).map(|_| qrng.below(n)).collect();
+        let truth: Vec<Vec<u32>> = queries.iter().map(|&q| brute_topk(&vt, q, p.k)).collect();
+        let truth0: Vec<Vec<u32>> = queries.iter().map(|&q| brute_topk(&vecs, q, p.k)).collect();
+        let churn: f64 = truth.iter().zip(&truth0).map(|(a, b)| 1.0 - recall(a, b)).sum::<f64>() / queries.len() as f64;
+
+        // Base graph (built once under the original metric; this IS the cost A avoids re-paying).
+        let g0 = build(&vecs, &p, 7);
+
+        // A: reuse g0 under the drifted metric. "Update cost" = recompute medoid only.
+        let ta = Instant::now();
+        let medt_for_a = medoid(&vt); // A needs an entry point in the new metric; O(N), cheap
+        let a_update = ta.elapsed().as_secs_f64();
+        let (mut ra, mut a_ev) = (0.0f64, 0usize);
+        for (&q, tr) in queries.iter().zip(&truth) {
+            let (got, _, ev) = search(&g0, &vt, medt_for_a, &vt[q], p.l, p.k);
+            ra += recall(&got, tr);
+            a_ev += ev;
+        }
+
+        // B: full rebuild under the drifted metric (the cost we are trying to avoid).
+        let tb = Instant::now();
+        let gt = build(&vt, &p, 7);
+        let b_build = tb.elapsed().as_secs_f64();
+        let medt = medoid(&vt);
+        let (mut rb, mut b_ev) = (0.0f64, 0usize);
+        for (&q, tr) in queries.iter().zip(&truth) {
+            let (got, _, ev) = search(&gt, &vt, medt, &vt[q], p.l, p.k);
+            rb += recall(&got, tr);
+            b_ev += ev;
+        }
+
+        let nq = queries.len() as f64;
+        println!("{:>8} | {:>7.1}% {:>7.1}% {:>5.0}% | {:>9.2} {:>10.3} | {:>7.0} {:>7.0}",
+                 n, ra / nq * 100.0, rb / nq * 100.0, churn * 100.0, b_build, a_update, a_ev as f64 / nq, b_ev as f64 / nq);
+    }
+
+    println!("\nGate: WIN if A within 2% of B at every N AND rebuild cost grows with N.");
+    println!("'A update s' = re-weight cost (medoid recompute only); B build s = rebuild cost avoided.");
+}
diff --git a/crates/ruvector-seprag/src/ann.rs b/crates/ruvector-seprag/src/ann.rs
new file mode 100644
index 0000000000..3c4ec930b2
--- /dev/null
+++ b/crates/ruvector-seprag/src/ann.rs
@@ -0,0 +1,249 @@
+//! Minimal Vamana-style approximate-nearest-neighbour engine + metric-drift
+//! helpers, shared by the BET-1 experiments (ADR-200).
+//!
+//! Vectors are plain `Vec<f32>` and the metric is squared-L2; metric *drift* is
+//! modelled by transforming the vectors (a re-metrization `M = AᵀA` is L2 in the
+//! transformed space), so the ANN code itself never needs a weight vector.
+//!
+//! The search is a standard two-heap greedy beam search (frontier min-heap +
+//! bounded result max-heap), which scales to n≥10⁵ where the earlier
+//! linear-scan beam did not.
+
+use std::cmp::Ordering;
+use std::collections::BinaryHeap;
+
+pub type Vec32 = Vec<f32>;
+
+/// Squared-L2 distance.
+#[inline]
+pub fn l2(a: &[f32], b: &[f32]) -> f32 {
+    let mut s = 0.0f32;
+    for i in 0..a.len() {
+        let d = a[i] - b[i];
+        s += d * d;
+    }
+    s
+}
+
+/// Total-order wrapper for f32 so it can live in a `BinaryHeap`.
+#[derive(Clone, Copy, PartialEq)]
+struct F(f32);
+impl Eq for F {}
+impl PartialOrd for F {
+    fn partial_cmp(&self, o: &Self) -> Option<Ordering> { Some(self.cmp(o)) }
+}
+impl Ord for F {
+    fn cmp(&self, o: &Self) -> Ordering { self.0.total_cmp(&o.0) }
+}
+
+pub struct AnnParams { pub r: usize, pub l: usize, pub alpha: f32, pub k: usize }
+
+/// Brute-force exact top-k (the ground-truth oracle), excluding `q` itself.
+#[must_use]
+pub fn brute_topk(vecs: &[Vec32], q: usize, k: usize) -> Vec<u32> {
+    let mut d: Vec<(f32, u32)> = (0..vecs.len())
+        .filter(|&j| j != q)
+        .map(|j| (l2(&vecs[q], &vecs[j]), j as u32))
+        .collect();
+    d.sort_by(|a, b| a.0.total_cmp(&b.0));
+    d.truncate(k);
+    d.into_iter().map(|(_, n)| n).collect()
+}
+
+#[must_use]
+pub fn recall(got: &[u32], truth: &[u32]) -> f64 {
+    use std::collections::HashSet;
+    let t: HashSet<u32> = truth.iter().copied().collect();
+    got.iter().filter(|g| t.contains(g)).count() as f64 / truth.len().max(1) as f64
+}
+
+#[must_use]
+pub fn medoid(vecs: &[Vec32]) -> u32 {
+    let dim = vecs[0].len();
+    let mut c = vec![0.0f32; dim];
+    for v in vecs {
+        for i in 0..dim { c[i] += v[i]; }
+    }
+    for x in &mut c { *x /= vecs.len() as f32; }
+    (0..vecs.len()).min_by(|&a, &b| l2(&vecs[a], &c).total_cmp(&l2(&vecs[b], &c))).unwrap() as u32
+}
+
+/// Two-heap greedy beam search. Returns (top-k, visited nodes, #distance evals).
+#[must_use]
+pub fn search(graph: &[Vec<u32>], vecs: &[Vec32], entry: u32, query: &[f32], beam: usize, k: usize) -> (Vec<u32>, Vec<u32>, usize) {
+    let mut visited = vec![false; vecs.len()];
+    let mut frontier: BinaryHeap<std::cmp::Reverse<(F, u32)>> = BinaryHeap::new(); // nearest first
+    let mut result: BinaryHeap<(F, u32)> = BinaryHeap::new(); // worst (max) on top, capped at beam
+    let d0 = l2(&vecs[entry as usize], query);
+    visited[entry as usize] = true;
+    frontier.push(std::cmp::Reverse((F(d0), entry)));
+    result.push((F(d0), entry));
+    let mut visited_list = vec![entry];
+    let mut evals = 1usize;
+
+    while let Some(std::cmp::Reverse((F(d), u))) = frontier.pop() {
+        if result.len() >= beam && d > result.peek().unwrap().0 .0 {
+            break;
+        }
+        for &v in &graph[u as usize] {
+            if visited[v as usize] {
+                continue;
+            }
+            visited[v as usize] = true;
+            visited_list.push(v);
+            let dv = l2(&vecs[v as usize], query);
+            evals += 1;
+            if result.len() < beam || dv < result.peek().unwrap().0 .0 {
+                frontier.push(std::cmp::Reverse((F(dv), v)));
+                result.push((F(dv), v));
+                if result.len() > beam {
+                    result.pop();
+                }
+            }
+        }
+    }
+    let mut out: Vec<(f32, u32)> = result.into_iter().map(|(F(d), n)| (d, n)).collect();
+    out.sort_by(|a, b| a.0.total_cmp(&b.0));
+    out.truncate(k);
+    (out.into_iter().map(|(_, n)| n).collect(), visited_list, evals)
+}
+
+/// Vamana RobustPrune: keep up to R diverse neighbours.
+fn robust_prune(p: u32, cands: &[u32], vecs: &[Vec32], alpha: f32, r: usize) -> Vec<u32> {
+    let mut pool: Vec<(f32, u32)> = cands.iter().filter(|&&c| c != p).map(|&c| (l2(&vecs[p as usize], &vecs[c as usize]), c)).collect();
+    pool.sort_by(|a, b| a.0.total_cmp(&b.0));
+    let mut out: Vec<u32> = Vec::new();
+    while !pool.is_empty() && out.len() < r {
+        let (_, pstar) = pool[0];
+        out.push(pstar);
+        pool.retain(|&(dq, q)| alpha * l2(&vecs[pstar as usize], &vecs[q as usize]) > dq && q != pstar);
+    }
+    out
+}
+
+/// Build a Vamana-lite graph. Returns (graph, #distance evals, wall-build helper deferred to caller).
+#[must_use]
+pub fn build(vecs: &[Vec32], p: &AnnParams, seed: u64) -> Vec<Vec<u32>> {
+    let n = vecs.len();
+    let mut rng = Rng::new(seed);
+    let mut graph: Vec<Vec<u32>> = (0..n)
+        .map(|i| {
+            let mut s = std::collections::HashSet::new();
+            while s.len() < p.r.min(n - 1) {
+                let j = rng.below(n);
+                if j != i { s.insert(j as u32); }
+            }
+            s.into_iter().collect()
+        })
+        .collect();
+    let med = medoid(vecs);
+    let mut order: Vec<usize> = (0..n).collect();
+    for i in (1..n).rev() { order.swap(i, rng.below(i + 1)); }
+    for &node in &order {
+        let (_, visited, _) = search(&graph, vecs, med, &vecs[node], p.l, p.k);
+        let nbrs = robust_prune(node as u32, &visited, vecs, p.alpha, p.r);
+        graph[node] = nbrs.clone();
+        for q in nbrs {
+            let qi = q as usize;
+            if !graph[qi].contains(&(node as u32)) {
+                graph[qi].push(node as u32);
+                if graph[qi].len() > p.r {
+                    let cand = graph[qi].clone();
+                    graph[qi] = robust_prune(q, &cand, vecs, p.alpha, p.r);
+                }
+            }
+        }
+    }
+    graph
+}
+
+// ---------------- deterministic RNG ----------------
+pub struct Rng(u64);
+impl Rng {
+    #[must_use]
+    pub fn new(s: u64) -> Self { Rng(s) }
+    pub fn next(&mut self) -> u64 {
+        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
+        let mut z = self.0;
+        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
+        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
+        z ^ (z >> 31)
+    }
+    pub fn f32(&mut self) -> f32 { (self.next() >> 40) as f32 / (1u64 << 24) as f32 }
+    pub fn below(&mut self, n: usize) -> usize { (self.next() % n as u64) as usize }
+}
+
+// ---------------- metric-drift transforms ----------------
+pub fn gaussian(rng: &mut Rng) -> f32 {
+    let u1 = (rng.f32() as f64).max(1e-9);
+    let u2 = rng.f32() as f64;
+    ((-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos()) as f32
+}
+
+pub fn random_rotation(dim: usize, rng: &mut Rng) -> Vec<f32> {
+    let mut m: Vec<Vec<f32>> = (0..dim).map(|_| (0..dim).map(|_| gaussian(rng)).collect()).collect();
+    for i in 0..dim {
+        for j in 0..i {
+            let dot: f32 = (0..dim).map(|k| m[i][k] * m[j][k]).sum();
+            for k in 0..dim { m[i][k] -= dot * m[j][k]; }
+        }
+        let norm: f32 = m[i].iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-9);
+        for k in 0..dim { m[i][k] /= norm; }
+    }
+    m.into_iter().flatten().collect()
+}
+
+pub fn identity(dim: usize) -> Vec<f32> {
+    let mut a = vec![0.0f32; dim * dim];
+    for i in 0..dim { a[i * dim + i] = 1.0; }
+    a
+}
+
+pub fn target_diag(dim: usize, rng: &mut Rng) -> Vec<f32> {
+    let mut a = vec![0.0f32; dim * dim];
+    for i in 0..dim { a[i * dim + i] = (0.2 + 2.8 * rng.f32()).sqrt(); }
+    a
+}
+
+pub fn target_rot(dim: usize, rng: &mut Rng) -> Vec<f32> {
+    let r = random_rotation(dim, rng);
+    let mut a = vec![0.0f32; dim * dim];
+    for i in 0..dim {
+        let s = (0.2 + 2.8 * rng.f32()).sqrt();
+        for j in 0..dim { a[i * dim + j] = s * r[i * dim + j]; }
+    }
+    a
+}
+
+pub fn lerp_mat(a0: &[f32], a1: &[f32], t: f32) -> Vec<f32> {
+    a0.iter().zip(a1).map(|(x, y)| x * (1.0 - t) + y * t).collect()
+}
+
+pub fn apply_linear(a: &[f32], vecs: &[Vec32], dim: usize) -> Vec<Vec32> {
+    vecs.iter().map(|v| {
+        (0..dim).map(|i| {
+            let row = &a[i * dim..(i + 1) * dim];
+            row.iter().zip(v).map(|(x, y)| x * y).sum()
+        }).collect()
+    }).collect()
+}
+
+pub fn apply_nonlin(w: &[f32], vecs: &[Vec32], s: f32, dim: usize) -> Vec<Vec32> {
+    vecs.iter().map(|v| {
+        (0..dim).map(|i| {
+            let row = &w[i * dim..(i + 1) * dim];
+            let u: f32 = row.iter().zip(v).map(|(x, y)| x * y).sum();
+            v[i] + s * u.tanh()
+        }).collect()
+    }).collect()
+}
+
+/// Read up to `n` comma-separated f32 rows from a CSV.
+#[must_use]
+pub fn read_vectors(path: &str, n: usize) -> Vec<Vec32> {
+    let data = std::fs::read_to_string(path).expect("read features");
+    data.lines().take(n)
+        .map(|l| l.split(',').filter_map(|s| s.trim().parse::<f32>().ok()).collect())
+        .filter(|v: &Vec32| !v.is_empty())
+        .collect()
+}
diff --git a/crates/ruvector-seprag/src/lib.rs b/crates/ruvector-seprag/src/lib.rs
index 66237e4791..0cfdb7a08f 100644
--- a/crates/ruvector-seprag/src/lib.rs
+++ b/crates/ruvector-seprag/src/lib.rs
@@ -32,6 +32,7 @@
 //! assert!(topk.len() <= 5);
 //! ```
 
+pub mod ann;
 pub mod contraction;
 pub mod customize;
 pub mod gen;
From 9ec8de47723bff3bfaedf718ab86f8c08c461b42 Mon Sep 17 00:00:00 2001
From: Ofer Shaal
Date: Thu, 4 Jun 2026 09:54:17 -0400
Subject: [PATCH 11/15] =?UTF-8?q?feat(seprag):=20BET=201=20scale=20result?=
 =?UTF-8?q?=20to=20n=3D100k=20(ADR-200)=20=E2=80=94=20win=20holds,=20gap?=
 =?UTF-8?q?=20widens?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

scale_drift sweep (5k->100k, rotational drift ~40% churn, recall@10):

  N       A reuse  B rebuild  gap    rebuild  update  ratio
  5000    90.2%    90.0%      +0.2%  3.6s     0.001s  ~3600x
  10000   89.5%    90.3%      -0.8%  10.2s    0.004s  ~2500x
  25000   88.5%    89.2%      -0.7%  21.4s    0.009s  ~2400x
  50000   87.7%    88.6%      -0.9%  47.1s    0.043s  ~1100x
  100000  85.0%    86.7%      -1.7%  141.8s   0.035s  ~4000x

Verdict: WIN within the 2% gate through 100k at ~1000-4000x lower update cost,
BUT the recall gap widens with N (-0.2%->-1.7%) => defer/batch rebuilds, not
never-rebuild.
Honest caveats: both A&B recall fall with N (fixed beam); 100 queries => ~+-1%
noise, confirm trend with more queries. Also: rename Rng::next->next_u64 (clippy).

ADR-200 + FUTURE-DIRECTIONS updated with scale evidence, widening-gap caveat,
and a hybrid re-weight+periodic-rebuild policy as a next step.
---
 crates/ruvector-seprag/src/ann.rs             |  6 +--
 ...omizable-reweighting-fixed-topology-ann.md | 52 ++++++++++++++-----
 .../seprag-cch-retrieval/FUTURE-DIRECTIONS.md | 10 ++--
 3 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/crates/ruvector-seprag/src/ann.rs b/crates/ruvector-seprag/src/ann.rs
index 3c4ec930b2..671abf9850 100644
--- a/crates/ruvector-seprag/src/ann.rs
+++ b/crates/ruvector-seprag/src/ann.rs
@@ -162,15 +162,15 @@ pub struct Rng(u64);
 impl Rng {
     #[must_use]
     pub fn new(s: u64) -> Self { Rng(s) }
-    pub fn next(&mut self) -> u64 {
+    pub fn next_u64(&mut self) -> u64 {
         self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
         let mut z = self.0;
         z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
         z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
         z ^ (z >> 31)
     }
-    pub fn f32(&mut self) -> f32 { (self.next() >> 40) as f32 / (1u64 << 24) as f32 }
-    pub fn below(&mut self, n: usize) -> usize { (self.next() % n as u64) as usize }
+    pub fn f32(&mut self) -> f32 { (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32 }
+    pub fn below(&mut self, n: usize) -> usize { (self.next_u64() % n as u64) as usize }
 }
 
 // ---------------- metric-drift transforms ----------------
diff --git a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
index f321807b9c..36afc26c63 100644
--- a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
+++ b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
@@ -12,13 +12,16 @@ tags: [ruvector, retrieval, ann, vamana, hnsw, self-learning, metric-drift, cust
 
 ## Status
 
-**Proposed — experimentally validated across diagonal, rotational, and non-linear drift;
-bounded by scale (2026-06-04).** This salvages the one idea from the SepRAG exploration
-([ADR-196]) that survived every test — the *customizable metric* of [ADR-198] — and
-re-tests it **standalone, decoupled from CCH**, since CCH full-contraction was found NO-GO
-on embedding graphs ([ADR-199]). The fixed topology matches full rebuild on **both recall
-and per-query cost** at **zero** rebuild cost; the only open caveats are scale, region-local
-drift, and an incremental-rebuild baseline.
+**Proposed — validated across drift types AND to n=10⁵ (2026-06-04).** This salvages the
+one idea from the SepRAG exploration ([ADR-196]) that survived every test — the
+*customizable metric* of [ADR-198] — re-tested **standalone, decoupled from CCH** (CCH
+full-contraction was NO-GO on embedding graphs, [ADR-199]). The fixed topology matches full
+rebuild within the pre-registered 2% recall gate across diagonal/rotational/non-linear drift
+and across n=5k…100k, at **~1,000–4,000× lower update cost**. **Caveat (honest):** the
+recall gap widens mildly with scale (−0.2% → −1.7% at 100k), so this is a *defer/batch
+rebuilds* strategy, not *never rebuild*. Remaining open: region-local drift, an incremental
+baseline, a real GNN-metric trajectory, and tighter (more-query) confirmation of the
+scale-gap trend.
 
 ## Context
 
@@ -81,6 +84,25 @@ NON-LINEAR drift (residual tanh warp `v + s·tanh(Wv)` — adversarial non-linea
 36% relevant-set churn. The C control degrades up to 29 points, proving the graph matters
 (the benchmark is not insensitive) — so A's parity is genuine adaptation, not insensitivity.
 
+### Scale (n = 5k…100k, rotational drift t=0.5, ~40% churn)
+
+`scale_drift.rs`, recall@10, 100 queries:
+
+| N | A re-weight | B rebuild | gap | rebuild cost | re-weight update cost | cost ratio |
+|---|---|---|---|---|---|---|
+| 5,000 | 90.2% | 90.0% | +0.2% | 3.6s | 0.001s | ~3,600× |
+| 10,000 | 89.5% | 90.3% | −0.8% | 10.2s | 0.004s | ~2,500× |
+| 25,000 | 88.5% | 89.2% | −0.7% | 21.4s | 0.009s | ~2,400× |
+| 50,000 | 87.7% | 88.6% | −0.9% | 47.1s | 0.043s | ~1,100× |
+| 100,000 | 85.0% | 86.7% | −1.7% | 141.8s | 0.035s | ~4,000× |
+
+**Read:** recall parity stays within the 2% gate through 100k at ~10³–10⁴× lower update
+cost (rebuild is super-linear; re-weight ≈ a medoid recompute). The gap **widens mildly**
+with N (−0.2% → −1.7%), so the honest framing is *defer/batch rebuilds*, not *never
+rebuild*. (Both A and B recall fall with N — fixed beam L=64 weakens relatively as N grows;
+the A−B gap, not the absolute, is the signal.) With 100 queries, per-point noise is ~±1%,
+so the trend should be confirmed with more queries before being treated as definitive.
+
 **Query cost is also equal.** Mean distance-evals/query: A ≈ B within ~1% in every row
 (e.g. 590 vs 583 at peak churn). So reuse does **not** trade build savings for slower
 queries — it matches B on recall *and* per-query work.
@@ -116,12 +138,18 @@ passes, so navigability robustness is not limited to linear remetrization.)*
 
 ## Next steps
 
-1. **Scale to n≥10⁵** on a real ANN index (`ruvector-diskann`) + measure the rebuild-cost
-   curve — the decisive remaining test (cost asymmetry grows with n).
-2. Region-local drift.
-3. Incremental-rebuild baseline for a fair cost comparison.
+1. ~~Scale to n≥10⁵~~ **done** (self-contained Vamana-lite; recall parity within 2% at
+   ~10³–10⁴× lower update cost). Follow-up: re-run with more queries (≥500) to confirm
+   whether the −1.7% gap at 100k is a real trend or noise; and port to the production
+   `ruvector-diskann` index to confirm on its graph.
+2. **Region-local drift** — the most likely thing to break reuse (different metric in
+   different regions could strand the old topology locally).
+3. Incremental-rebuild baseline for a fair cost comparison (vs full rebuild).
 4. Wire re-weight-on-drift into the `ruvector-diskann`/GNN loop behind a flag and validate
-   on a real learned-metric trajectory.
+   on a real learned-metric trajectory (the eventual production proof).
+5. A *hybrid policy*: cheap re-weight every step + a full rebuild every K steps (or when a
+   drift-monitor predicts the gap will cross a threshold) — captures most of the cost win
+   while bounding recall loss.
 
 ## Alternatives considered
 
diff --git a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md
index 5651e63655..7e41801639 100644
--- a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md
+++ b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md
@@ -28,9 +28,13 @@ Mahalanobis (rotational), AND non-linear (tanh-warp) drift — at **zero** rebui
 Stale-index control loses up to 29 points (benchmark has teeth). Full evidence + boundaries
 in [ADR-200](../../adr/ADR-200-customizable-reweighting-fixed-topology-ann.md).
 Harness: `crates/ruvector-seprag/examples/reweight_vs_rebuild.rs`.
-- **Open (ranked):** (1) **scale to n≥10⁵** on `ruvector-diskann` + rebuild-cost curve —
-  the decisive remaining test; (2) region-local drift; (3) incremental-rebuild baseline;
-  (4) wire into the real GNN loop behind a flag. Do (1) next.
+- **Scale ✅** (`examples/scale_drift.rs`): recall parity within 2% from 5k→100k at
+  **~1,000–4,000× lower update cost** (rebuild 142s vs reuse 0.035s at 100k). Honest caveat:
+  gap widens with N (−0.2%→−1.7%) → *defer/batch rebuilds*, not *never*.
+- **Open (ranked):** (1) confirm the 100k gap trend with ≥500 queries + port to + `ruvector-diskann`; (2) **region-local drift** (most likely to break reuse); + (3) hybrid policy (re-weight every step + rebuild every K); (4) incremental-rebuild baseline; + (5) wire into the real GNN loop behind a flag. ### BET 2 — Filtered ANN vs `ruvector-acorn` Region/predicate pruning for constrained ("nearest among items matching X") retrieval — a From 96e91608167bfd430c03732378e197ce906dfa4c Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 10:16:04 -0400 Subject: [PATCH 12/15] =?UTF-8?q?feat(seprag):=20region-local=20drift=20te?= =?UTF-8?q?st=20(ADR-200)=20=E2=80=94=20reuse=20holds=20locally,=20gate=20?= =?UTF-8?q?PASS?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit examples/region_drift.rs: warp only a 15% local cluster, grade recall separately for queries INSIDE vs OUTSIDE the drifted region (a global average would hide a local failure). Result (n=20k, recall@10): t churnIn A_in B_in | churnOut A_out B_out 0.25 44% 89.8% 81.4% | 21% 87.9% 89.0% 0.50 53% 89.3% 90.0% | 21% 87.9% 89.0% 1.00 45% 89.5% 90.0% | 21% 87.9% 89.0% Gate PASS: reuse holds inside the drifted region (A_in within 0.7% of B_in, and ABOVE it at t=0.25) even at 53% in-region churn; out-region ~unchanged. Region- local drift did NOT break reuse. Honest caveat: the t=0.25 B_in dip to 81% (reuse beats rebuild by 8pts) is a build-variance artifact of the simplified single-pass Vamana baseline, not a smooth effect — strongest argument to port the baseline to production ruvector-diskann. ADR-200 + FUTURE-DIRECTIONS + status updated. 
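A minimal sketch of the split grading described in this message, assuming recall and churn are plain set overlaps on id lists. `recall_at_k`, `churn` and `mean_recall` are illustrative names only; they are not the `ruvector_seprag::ann` helpers, whose signatures this patch does not show.

```rust
use std::collections::HashSet;

/// Fraction of the true top-k ids that the search also returned.
/// Hypothetical stand-in, not the crate's `recall`.
fn recall_at_k(got: &[u32], truth: &[u32]) -> f64 {
    if truth.is_empty() {
        return 0.0;
    }
    let got: HashSet<u32> = got.iter().copied().collect();
    let hits = truth.iter().filter(|id| got.contains(*id)).count();
    hits as f64 / truth.len() as f64
}

/// Relevant-set churn: share of the old true top-k that is no longer
/// in the new true top-k after drift.
fn churn(truth_new: &[u32], truth_old: &[u32]) -> f64 {
    1.0 - recall_at_k(truth_new, truth_old)
}

/// Mean recall over only the queries whose in-region flag equals `inside`.
/// Each sample is (query lies inside the warped region, returned ids, true ids).
fn mean_recall(samples: &[(bool, Vec<u32>, Vec<u32>)], inside: bool) -> f64 {
    let picked: Vec<f64> = samples
        .iter()
        .filter(|s| s.0 == inside)
        .map(|s| recall_at_k(&s.1, &s.2))
        .collect();
    if picked.is_empty() {
        return 0.0;
    }
    picked.iter().sum::<f64>() / picked.len() as f64
}
```

Calling `mean_recall` once with `inside = true` and once with `false` gives the two columns that a single global average would blend together.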
--- .../ruvector-seprag/examples/region_drift.rs | 112 ++++++++++++++++++ ...omizable-reweighting-fixed-topology-ann.md | 46 +++++-- .../seprag-cch-retrieval/FUTURE-DIRECTIONS.md | 12 +- 3 files changed, 158 insertions(+), 12 deletions(-) create mode 100644 crates/ruvector-seprag/examples/region_drift.rs diff --git a/crates/ruvector-seprag/examples/region_drift.rs b/crates/ruvector-seprag/examples/region_drift.rs new file mode 100644 index 0000000000..c9ee3d8ba4 --- /dev/null +++ b/crates/ruvector-seprag/examples/region_drift.rs @@ -0,0 +1,112 @@ +//! BET 1 adversarial test (ADR-200): REGION-LOCAL metric drift. +//! +//! All prior drift was global (one transform for every point). The realistic +//! harder case for a self-learning system is *local*: the metric/embedding for +//! ONE region of the space changes a lot while the rest is stationary (e.g. the +//! GNN re-learns structure for one topic). This is the scenario most likely to +//! strand a reused topology *locally*. +//! +//! Method: pick a local cluster R (the nearest `region_frac` of points to a +//! random centre); apply a strong rotational warp to ONLY those vectors. Then +//! compare reuse (A) vs rebuild (B), **reporting recall separately for queries +//! inside R (the drifted region) vs outside** — a global average would hide a +//! local failure. +//! +//! Gate: WIN if A within 2% of B for IN-region queries (not just overall). KILL +//! if A_in drops >2% below B_in → reuse fails locally → need local/periodic rebuild. +//! +//! 
Run: cargo run --release -p ruvector-seprag --example region_drift --
+
+use ruvector_seprag::ann::*;
+use std::time::Instant;
+
+fn matvec(a: &[f32], v: &[f32], dim: usize) -> Vec<f32> {
+    (0..dim).map(|i| {
+        let row = &a[i * dim..(i + 1) * dim];
+        row.iter().zip(v).map(|(x, y)| x * y).sum()
+    }).collect()
+}
+
+fn main() {
+    let args: Vec<String> = std::env::args().collect();
+    let path = args.get(1).cloned().unwrap_or_else(|| "target/m1-data/node-feat-100k.csv".into());
+    let n: usize = args.get(2).and_then(|s| s.parse().ok()).unwrap_or(20000);
+    let region_frac = 0.15f32;
+
+    let vecs = read_vectors(&path, n);
+    let n = vecs.len();
+    let dim = vecs[0].len();
+    let p = AnnParams { r: 24, l: 64, alpha: 1.2, k: 10 };
+
+    // Region R = the nearest `region_frac` of points to a random centre.
+    let mut rng = Rng::new(2024);
+    let centre = vecs[rng.below(n)].clone();
+    let mut by_dist: Vec<(f32, usize)> = (0..n).map(|i| (l2(&vecs[i], &centre), i)).collect();
+    by_dist.sort_by(|a, b| a.0.total_cmp(&b.0));
+    let region_size = (n as f32 * region_frac) as usize;
+    let mut in_region = vec![false; n];
+    for &(_, i) in by_dist.iter().take(region_size) {
+        in_region[i] = true;
+    }
+    let region_ids: Vec<usize> = (0..n).filter(|&i| in_region[i]).collect();
+    let outside_ids: Vec<usize> = (0..n).filter(|&i| !in_region[i]).collect();
+
+    // Query sets: 100 inside R (the stressed region), 100 outside.
+    let mut qr = Rng::new(77);
+    let q_in: Vec<usize> = (0..100).map(|_| region_ids[qr.below(region_ids.len())]).collect();
+    let q_out: Vec<usize> = (0..100).map(|_| outside_ids[qr.below(outside_ids.len())]).collect();
+
+    eprintln!("[region] n={n} dim={dim}; region R = {region_size} pts ({:.0}%) warped, rest stationary", region_frac * 100.0);
+    let t0 = Instant::now();
+    let g0 = build(&vecs, &p, 7);
+    eprintln!("[region] base graph built in {:.1}s\n", t0.elapsed().as_secs_f64());
+
+    let id = identity(dim);
+    let rot = target_rot(dim, &mut Rng::new(54321));
+
+    println!("=== BET 1: REGION-LOCAL drift ({:.0}% of space warped) ===", region_frac * 100.0);
+    println!("recall@{} split by query location; gate = A_in within 2% of B_in\n", p.k);
+    println!("{:>5} | {:>7} {:>7} {:>7} | {:>7} {:>7} {:>7} | {:>8}",
+        "t", "churnIn", "A_in", "B_in", "chrnOut", "A_out", "B_out", "B bld s");
+    println!("{}", "-".repeat(72));
+
+    for &t in &[0.0f32, 0.25, 0.5, 0.75, 1.0] {
+        let a = lerp_mat(&id, &rot, t);
+        // Warp ONLY region-R vectors; everything else stays put.
+        let mut vt = vecs.clone();
+        for &i in &region_ids {
+            vt[i] = matvec(&a, &vecs[i], dim);
+        }
+
+        let med0 = medoid(&vt); // entry point in the (mostly stationary) drifted space
+        let tb = Instant::now();
+        let gt = build(&vt, &p, 7);
+        let bt = tb.elapsed().as_secs_f64();
+        let medt = medoid(&vt);
+
+        let eval = |qs: &[usize]| -> (f64, f64, f64) {
+            let mut churn = 0.0;
+            let mut ra = 0.0;
+            let mut rb = 0.0;
+            for &q in qs {
+                let truth = brute_topk(&vt, q, p.k);
+                let truth0 = brute_topk(&vecs, q, p.k);
+                churn += 1.0 - recall(&truth, &truth0);
+                let (ga, _, _) = search(&g0, &vt, med0, &vt[q], p.l, p.k);
+                ra += recall(&ga, &truth);
+                let (gb, _, _) = search(&gt, &vt, medt, &vt[q], p.l, p.k);
+                rb += recall(&gb, &truth);
+            }
+            let m = qs.len() as f64;
+            (churn / m * 100.0, ra / m * 100.0, rb / m * 100.0)
+        };
+
+        let (ci, ai, bi) = eval(&q_in);
+        let (co, ao, bo) = eval(&q_out);
+        println!("{:>5.2} | {:>6.0}% {:>6.1}% {:>6.1}% | {:>6.0}% {:>6.1}% {:>6.1}% | {:>8.2}",
+            t, ci, ai, bi, co, ao, bo, bt);
+    }
+
+    println!("\nA_in/B_in = recall for queries INSIDE the drifted region (the stress test).");
+    println!("A_out/B_out = queries outside it (should stay ~unchanged).");
+}
diff --git a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
index 36afc26c63..9645a9028b 100644
--- a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
+++ b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
@@ -16,12 +16,15 @@ tags: [ruvector, retrieval, ann, vamana, hnsw, self-learning, metric-drift, cust
 one idea from the SepRAG exploration ([ADR-196]) that survived every test — the
 *customizable metric* of [ADR-198] — re-tested **standalone, decoupled from CCH** (CCH
 full-contraction was NO-GO on embedding graphs, [ADR-199]).
The fixed topology matches full -rebuild within the pre-registered 2% recall gate across diagonal/rotational/non-linear drift -and across n=5k…100k, at **~1,000–4,000× lower update cost**. **Caveat (honest):** the -recall gap widens mildly with scale (−0.2% → −1.7% at 100k), so this is a *defer/batch -rebuilds* strategy, not *never rebuild*. Remaining open: region-local drift, an incremental -baseline, a real GNN-metric trajectory, and tighter (more-query) confirmation of the -scale-gap trend. +rebuild within the pre-registered 2% recall gate across diagonal/rotational/non-linear drift, +across n=5k…100k, **and** under region-local drift (warping only a 15% cluster), at +**~1,000–4,000× lower update cost**. **Caveats (honest):** (1) the recall gap widens mildly +with scale (−0.2% → −1.7% at 100k), so this is a *defer/batch rebuilds* strategy, not *never +rebuild*; (2) the rebuild baseline is a simplified single-pass Vamana with build variance +(a transient B dip surfaced under region-local drift), so results should be re-confirmed on +the production `ruvector-diskann` index. Remaining open: production-index port, a real +GNN-metric trajectory, an incremental baseline, and more-query confirmation of the scale-gap +trend. ## Context @@ -103,6 +106,33 @@ rebuild*. (Both A and B recall fall with N — fixed beam L=64 weakens relativel the A−B gap, not the absolute, is the signal.) With 100 queries, per-point noise is ~±1%, so the trend should be confirmed with more queries before being treated as definitive. +### Region-local drift (n=20k; warp only a 15% local cluster) + +The hardest realistic case: the metric changes a lot in ONE region (e.g. one topic the +GNN re-learns) while the rest is stationary. Recall reported **separately** for queries +inside vs outside the warped region (a global average would hide a local failure). 
+`region_drift.rs`: + +| t | churn-in | A_in (reuse) | B_in (rebuild) | A_out | B_out | +|---|---|---|---|---|---| +| 0.00 | 0% | 89.7% | 89.7% | 88.0% | 88.0% | +| 0.25 | 44% | 89.8% | **81.4%** | 87.9% | 89.0% | +| 0.50 | 53% | 89.3% | 90.0% | 87.9% | 89.0% | +| 1.00 | 45% | 89.5% | 90.0% | 87.9% | 89.0% | + +**Gate: PASS.** Reuse holds *inside* the drifted region — A_in within 0.7% of B_in (and +**above** it at t=0.25) even at 53% in-region churn. Out-region recall is essentially +unchanged (A_out ~1.1% under B_out, within gate). Region-local drift did **not** break +reuse. + +**Honest caveat — the t=0.25 anomaly.** B_in transiently fell to 81.4% (reuse beat rebuild +by 8 pts) then recovered. This non-monotonic dip is a **build-stability artifact of the +simplified single-pass Vamana** (random init, one seed, α=1.2) on the quarter-warped +geometry — *not* a smooth property. It cuts two ways: (i) it shows reuse can be *more +stable* than a fresh build during drift; (ii) it shows the rebuild baseline `B` has +build variance, so "A matches B" partly depends on B being a fair baseline. This is the +strongest argument for porting the baseline to the production `ruvector-diskann` index. + **Query cost is also equal.** Mean distance-evals/query: A ≈ B within ~1% in every row (e.g. 590 vs 583 at peak churn). So reuse does **not** trade build savings for slower queries — it matches B on recall *and* per-query work. @@ -142,8 +172,8 @@ passes, so navigability robustness is not limited to linear remetrization.)* ~10³–10⁴× lower update cost). Follow-up: re-run with more queries (≥500) to confirm whether the −1.7% gap at 100k is a real trend or noise; and port to the production `ruvector-diskann` index to confirm on its graph. -2. **Region-local drift** — the most likely thing to break reuse (different metric in - different regions could strand the old topology locally). +2. ~~Region-local drift~~ **done** (warp a 15% cluster; reuse held in-region within 0.7%, + gate PASS). 
Surfaced a build-variance dip in the lite-Vamana baseline → reinforces #1. 3. Incremental-rebuild baseline for a fair cost comparison (vs full rebuild). 4. Wire re-weight-on-drift into the `ruvector-diskann`/GNN loop behind a flag and validate on a real learned-metric trajectory (the eventual production proof). diff --git a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md index 7e41801639..37a973bf1d 100644 --- a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md +++ b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md @@ -31,10 +31,14 @@ Harness: `crates/ruvector-seprag/examples/reweight_vs_rebuild.rs`. - **Scale ✅** (`examples/scale_drift.rs`): recall parity within 2% from 5k→100k at **~1,000–4,000× lower update cost** (rebuild 142s vs reuse 0.035s at 100k). Honest caveat: gap widens with N (−0.2%→−1.7%) → *defer/batch rebuilds*, not *never*. -- **Open (ranked):** (1) confirm the 100k gap trend with ≥500 queries + port to - `ruvector-diskann`; (2) **region-local drift** (most likely to break reuse); - (3) hybrid policy (re-weight every step + rebuild every K); (4) incremental-rebuild baseline; - (5) wire into the real GNN loop behind a flag. +- **Region-local drift ✅** (`examples/region_drift.rs`): warp only a 15% cluster, grade + in-region vs out-region separately. Reuse held *inside* the drifted region (A_in within + 0.7% of B_in, gate PASS) even at 53% in-region churn. Surfaced a transient rebuild dip + (B_in 81% at t=0.25) = lite-Vamana build variance → motivates the diskann port. +- **Open (ranked):** (1) **port baseline to `ruvector-diskann`** (firms B, removes lite + build variance, confirms on the production index) + confirm 100k gap with ≥500 queries; + (2) hybrid policy (re-weight every step + rebuild every K / on drift-monitor trigger); + (3) incremental-rebuild baseline; (4) wire into the real GNN loop behind a flag. 
### BET 2 — Filtered ANN vs `ruvector-acorn` Region/predicate pruning for constrained ("nearest among items matching X") retrieval — a From 5526e34cd695429cb40410e37051b487b3ca6dff Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 11:08:06 -0400 Subject: [PATCH 13/15] =?UTF-8?q?feat(seprag):=20port=20BET=201=20to=20pro?= =?UTF-8?q?duction=20ruvector-diskann=20(ADR-200)=20=E2=80=94=20confirmed?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit examples/diskann_drift.rs: re-run the re-weight-vs-rebuild test on the shipping ruvector-diskann VamanaGraph (added as dev-dependency). The reuse trick is native - the graph stores only topology and greedy_search takes vectors externally, so drift = search a graph-built-on-original with the transformed vectors. Result (n=20k, recall@10): GLOBAL rotational: A reuse vs B rebuild within 2% (95.6 vs 97.1 worst, t=0.5) REGION-LOCAL in-region: A_in within ~1.5% of B_in; A_in 98.6 vs B_in 94.5 at t=0.25 absolute recall 96-99% (stronger/fairer baseline than lite-Vamana ~90%) Confirms BET 1 on the production index. The t=0.25 reuse-beats-rebuild dip REPRODUCED on diskann => it is a real property (fresh Vamana build on a half-warped region underperforms reuse), not lite-Vamana noise. Baseline-variance caveat RESOLVED. Remaining caveat: gap widens with scale/churn (defer/batch rebuilds). ADR-200 + FUTURE-DIRECTIONS + status updated. 
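A toy sketch of why reuse is "native" when the graph stores only topology and the search takes vectors as an argument. Everything here (`ToyGraph`, `greedy_nearest`, `scaled`) is hypothetical and deliberately tiny; it is not the `VamanaGraph` API, and the uniform-scale "drift" only illustrates the mechanics, not any recall claim.

```rust
/// Topology only: neighbour lists, no stored vectors (toy stand-in, not `VamanaGraph`).
struct ToyGraph {
    adj: Vec<Vec<usize>>,
}

/// Squared Euclidean distance.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Greedy walk from `entry`: hop to the closest neighbour until none is closer.
/// The vectors are an argument, so the same graph can be searched under
/// new coordinates without touching the adjacency lists.
fn greedy_nearest(g: &ToyGraph, vecs: &[Vec<f32>], entry: usize, query: &[f32]) -> usize {
    let mut cur = entry;
    let mut cur_d = l2_sq(&vecs[cur], query);
    loop {
        let mut best = cur;
        let mut best_d = cur_d;
        for &nb in &g.adj[cur] {
            let d = l2_sq(&vecs[nb], query);
            if d < best_d {
                best = nb;
                best_d = d;
            }
        }
        if best == cur {
            return cur;
        }
        cur = best;
        cur_d = best_d;
    }
}

/// Apply a uniform scale to every vector (a trivially simple "drift").
fn scaled(vecs: &[Vec<f32>], s: f32) -> Vec<Vec<f32>> {
    vecs.iter().map(|v| v.iter().map(|x| x * s).collect()).collect()
}
```

Building `ToyGraph` once and then calling `greedy_nearest` with either the original vectors or `scaled(&vecs, 2.0)` mirrors the A (reuse) arm: only the vector argument changes.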
--- crates/ruvector-seprag/Cargo.toml | 3 + .../ruvector-seprag/examples/diskann_drift.rs | 135 ++++++++++++++++++ ...omizable-reweighting-fixed-topology-ann.md | 52 ++++--- .../seprag-cch-retrieval/FUTURE-DIRECTIONS.md | 12 +- 4 files changed, 180 insertions(+), 22 deletions(-) create mode 100644 crates/ruvector-seprag/examples/diskann_drift.rs diff --git a/crates/ruvector-seprag/Cargo.toml b/crates/ruvector-seprag/Cargo.toml index 9b871d195d..b645cbbde5 100644 --- a/crates/ruvector-seprag/Cargo.toml +++ b/crates/ruvector-seprag/Cargo.toml @@ -15,6 +15,9 @@ thiserror = { workspace = true } [dev-dependencies] approx = "0.5" +# Production DiskANN/Vamana index — used by the diskann_drift example to confirm +# BET 1 (ADR-200) on the real index rather than the lite reference Vamana. +ruvector-diskann = { path = "../ruvector-diskann" } [lints.rust] unexpected_cfgs = { level = "allow", priority = -1 } diff --git a/crates/ruvector-seprag/examples/diskann_drift.rs b/crates/ruvector-seprag/examples/diskann_drift.rs new file mode 100644 index 0000000000..3307c48288 --- /dev/null +++ b/crates/ruvector-seprag/examples/diskann_drift.rs @@ -0,0 +1,135 @@ +//! BET 1 on the PRODUCTION index (ADR-200 next step): re-run the re-weight-vs- +//! rebuild test on `ruvector-diskann`'s real Vamana graph, not the lite +//! reference Vamana. This (a) confirms the result on the shipping index and +//! (b) firms the rebuild baseline (the lite Vamana showed build variance). +//! +//! The reuse trick is native to `VamanaGraph`: the graph stores only topology; +//! `greedy_search(vectors, query, beam)` takes the vectors externally. So drift +//! = pass the *transformed* vectors to a graph built on the *original* ones. +//! +//! 
Run: cargo run --release -p ruvector-seprag --example diskann_drift --
+
+use ruvector_diskann::distance::FlatVectors;
+use ruvector_diskann::graph::VamanaGraph;
+use ruvector_seprag::ann::{
+    apply_linear, brute_topk, identity, l2, lerp_mat, read_vectors, recall, target_rot, Rng, Vec32,
+};
+use std::time::Instant;
+
+const R: usize = 32;
+const BUILD_BEAM: usize = 64;
+const SEARCH_BEAM: usize = 64;
+const ALPHA: f32 = 1.2;
+const K: usize = 10;
+
+fn flat(vecs: &[Vec32], dim: usize) -> FlatVectors {
+    let mut f = FlatVectors::with_capacity(dim, vecs.len());
+    for v in vecs {
+        f.push(v);
+    }
+    f
+}
+
+fn build_graph(vecs: &[Vec32], dim: usize) -> VamanaGraph {
+    let f = flat(vecs, dim);
+    let mut g = VamanaGraph::new(vecs.len(), R, BUILD_BEAM, ALPHA);
+    g.build(&f).expect("vamana build");
+    g
+}
+
+/// Top-k from a graph search over `vecs`, re-ranked by exact distance to the query.
+fn topk(g: &VamanaGraph, vecs: &[Vec32], f: &FlatVectors, q: usize) -> Vec<u32> {
+    let (cands, _) = g.greedy_search(f, &vecs[q], SEARCH_BEAM);
+    let mut scored: Vec<(f32, u32)> = cands.iter().map(|&c| (l2(&vecs[c as usize], &vecs[q]), c)).collect();
+    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
+    scored.into_iter().filter(|&(_, c)| c as usize != q).take(K).map(|(_, c)| c).collect()
+}
+
+fn main() {
+    let args: Vec<String> = std::env::args().collect();
+    let path = args.get(1).cloned().unwrap_or_else(|| "target/m1-data/node-feat-100k.csv".into());
+    let n: usize = args.get(2).and_then(|s| s.parse().ok()).unwrap_or(20000);
+    let vecs = read_vectors(&path, n);
+    let n = vecs.len();
+    let dim = vecs[0].len();
+
+    eprintln!("[diskann] n={n} dim={dim}; ruvector-diskann Vamana R={R} L={BUILD_BEAM} alpha={ALPHA}");
+    let t0 = Instant::now();
+    let g0 = build_graph(&vecs, dim);
+    eprintln!("[diskann] base graph built in {:.1}s\n", t0.elapsed().as_secs_f64());
+
+    let id = identity(dim);
+    let rot = target_rot(dim, &mut Rng::new(54321));
+
+    // ---- Part 1: global rotational drift ----
+    println!("=== diskann BET 1: GLOBAL rotational drift (recall@{K}) ===");
+    println!("{:>5} {:>7} | {:>8} {:>8} | {:>9}", "t", "churn", "A reuse", "B rebld", "B build s");
+    println!("{}", "-".repeat(46));
+    let mut qrng = Rng::new(999);
+    let queries: Vec<usize> = (0..100).map(|_| qrng.below(n)).collect();
+    let base_truth: Vec<Vec<u32>> = queries.iter().map(|&q| brute_topk(&vecs, q, K)).collect();
+
+    for &t in &[0.0f32, 0.25, 0.5, 1.0] {
+        let vt = apply_linear(&lerp_mat(&id, &rot, t), &vecs, dim);
+        let ft = flat(&vt, dim);
+        let truth: Vec<Vec<u32>> = queries.iter().map(|&q| brute_topk(&vt, q, K)).collect();
+        let churn: f64 = truth.iter().zip(&base_truth).map(|(a, b)| 1.0 - recall(a, b)).sum::<f64>() / queries.len() as f64;
+
+        let ra: f64 = queries.iter().zip(&truth).map(|(&q, tr)| recall(&topk(&g0, &vt, &ft, q), tr)).sum::<f64>() / queries.len() as f64;
+
+        let tb = Instant::now();
+        let gt = build_graph(&vt, dim);
+        let bt = tb.elapsed().as_secs_f64();
+        let rb: f64 = queries.iter().zip(&truth).map(|(&q, tr)| recall(&topk(&gt, &vt, &ft, q), tr)).sum::<f64>() / queries.len() as f64;
+
+        println!("{:>5.2} {:>6.0}% | {:>7.1}% {:>7.1}% | {:>9.2}", t, churn * 100.0, ra * 100.0, rb * 100.0, bt);
+    }
+
+    // ---- Part 2: region-local drift (does the lite-Vamana t=0.25 dip reproduce?) ----
+    println!("\n=== diskann BET 1: REGION-LOCAL drift (warp 15% cluster, recall@{K}) ===");
+    let region_frac = 0.15f32;
+    let mut rng = Rng::new(2024);
+    let centre = vecs[rng.below(n)].clone();
+    let mut by_dist: Vec<(f32, usize)> = (0..n).map(|i| (l2(&vecs[i], &centre), i)).collect();
+    by_dist.sort_by(|a, b| a.0.total_cmp(&b.0));
+    let region_size = (n as f32 * region_frac) as usize;
+    let mut in_region = vec![false; n];
+    for &(_, i) in by_dist.iter().take(region_size) {
+        in_region[i] = true;
+    }
+    let region_ids: Vec<usize> = (0..n).filter(|&i| in_region[i]).collect();
+    let outside_ids: Vec<usize> = (0..n).filter(|&i| !in_region[i]).collect();
+    let mut qr = Rng::new(77);
+    let q_in: Vec<usize> = (0..100).map(|_| region_ids[qr.below(region_ids.len())]).collect();
+    let q_out: Vec<usize> = (0..100).map(|_| outside_ids[qr.below(outside_ids.len())]).collect();
+
+    println!("{:>5} | {:>7} {:>7} {:>7} | {:>7} {:>7}", "t", "chrnIn", "A_in", "B_in", "A_out", "B_out");
+    println!("{}", "-".repeat(54));
+    for &t in &[0.0f32, 0.25, 0.5, 1.0] {
+        let a = lerp_mat(&id, &rot, t);
+        let mut vt = vecs.clone();
+        for &i in &region_ids {
+            vt[i] = (0..dim).map(|r| { let row = &a[r * dim..(r + 1) * dim]; row.iter().zip(&vecs[i]).map(|(x, y)| x * y).sum() }).collect();
+        }
+        let ft = flat(&vt, dim);
+        let gt = build_graph(&vt, dim);
+
+        let eval = |qs: &[usize]| -> (f64, f64, f64) {
+            let (mut churn, mut ra, mut rb) = (0.0, 0.0, 0.0);
+            for &q in qs {
+                let truth = brute_topk(&vt, q, K);
+                let truth0 = brute_topk(&vecs, q, K);
+                churn += 1.0 - recall(&truth, &truth0);
+                ra += recall(&topk(&g0, &vt, &ft, q), &truth);
+                rb += recall(&topk(&gt, &vt, &ft, q), &truth);
+            }
+            let m = qs.len() as f64;
+            (churn / m * 100.0, ra / m * 100.0, rb / m * 100.0)
+        };
+        let (ci, ai, bi) = eval(&q_in);
+        let (_co, ao, bo) = eval(&q_out);
+        println!("{:>5.2} | {:>6.0}% {:>6.1}% {:>6.1}% | {:>6.1}% {:>6.1}%", t, ci, ai, bi, ao, bo);
+    }
+
+    println!("\nGate: A within 2% of B (overall and in-region). Production-index confirmation of ADR-200.");
+}
diff --git a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
index 9645a9028b..197c82e29b 100644
--- a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
+++ b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
@@ -18,13 +18,14 @@ one idea from the SepRAG exploration ([ADR-196]) that survived every test — th
 full-contraction was NO-GO on embedding graphs, [ADR-199]). The fixed topology matches full
 rebuild within the pre-registered 2% recall gate across diagonal/rotational/non-linear drift,
 across n=5k…100k, **and** under region-local drift (warping only a 15% cluster), at
-**~1,000–4,000× lower update cost**. **Caveats (honest):** (1) the recall gap widens mildly
-with scale (−0.2% → −1.7% at 100k), so this is a *defer/batch rebuilds* strategy, not *never
-rebuild*; (2) the rebuild baseline is a simplified single-pass Vamana with build variance
-(a transient B dip surfaced under region-local drift), so results should be re-confirmed on
-the production `ruvector-diskann` index. Remaining open: production-index port, a real
-GNN-metric trajectory, an incremental baseline, and more-query confirmation of the scale-gap
-trend.
+**~1,000–4,000× lower update cost**. Confirmed on the **production `ruvector-diskann` Vamana** (96–99% recall, reuse within 2% of
+rebuild). **Caveat (honest):** the recall gap widens mildly with scale/churn (−0.2% → −1.7%
+at 100k; −1.5% at peak production-index churn), so this is a *defer/batch rebuilds* strategy,
+not *never rebuild*. The earlier "rebuild-baseline variance" caveat is **resolved** — the
+production index reaches the same conclusion, and the t=0.25 reuse-beats-rebuild dip
+reproduced (it is a real property, not lite-Vamana noise).
Remaining open: a real GNN-metric +trajectory, an incremental-rebuild baseline, larger-N on diskann, and more-query +confirmation of the gap trend. ## Context @@ -125,13 +126,29 @@ inside vs outside the warped region (a global average would hide a local failure unchanged (A_out ~1.1% under B_out, within gate). Region-local drift did **not** break reuse. -**Honest caveat — the t=0.25 anomaly.** B_in transiently fell to 81.4% (reuse beat rebuild -by 8 pts) then recovered. This non-monotonic dip is a **build-stability artifact of the -simplified single-pass Vamana** (random init, one seed, α=1.2) on the quarter-warped -geometry — *not* a smooth property. It cuts two ways: (i) it shows reuse can be *more -stable* than a fresh build during drift; (ii) it shows the rebuild baseline `B` has -build variance, so "A matches B" partly depends on B being a fair baseline. This is the -strongest argument for porting the baseline to the production `ruvector-diskann` index. +**The t=0.25 anomaly.** B_in transiently fell to 81.4% (reuse beat rebuild by 8 pts) then +recovered — a non-monotonic dip where a fresh build on the quarter-warped geometry produced +a worse in-region graph than reuse did. Initially suspected as lite-Vamana build variance; +the production-index run below **reproduced it** (smaller, but real), so it is a genuine +property, not an artifact: a fresh Vamana build on a partially-warped region can underperform +reuse, which keeps the original's good global connectivity. + +### Production-index confirmation (`ruvector-diskann`, n=20k) + +Re-run on the **shipping** Vamana (`ruvector_diskann::graph::VamanaGraph`, R=32) instead of +the lite reference Vamana — the reuse trick is native (the graph stores only topology; +`greedy_search(vectors, query, beam)` takes vectors externally, so drift = pass transformed +vectors to a graph built on the originals). Harness: `diskann_drift.rs`. 
recall@10: + +Global rotational drift: A reuse vs B rebuild = 95.9/95.8 (t0), 96.2/96.5 (t.25, 29% churn), +95.6/97.1 (t.5, 41% churn), 95.8/96.4 (t1). Region-local (warp 15% cluster), in-region: +A_in/B_in = 98.6/99.0 (t0), **98.6/94.5** (t.25), 98.0/97.9 (t.5, 53% churn), 98.5/99.5 (t1). + +**Confirmed:** reuse stays within the 2% gate on the production index (largest gap −1.5% at +peak global churn), at much higher absolute recall (96–99% vs lite ~90%) — a stronger, fairer +baseline. The t=0.25 reuse-beats-rebuild effect reproduces (B_in 94.5 vs A_in 98.6). **The +"rebuild baseline variance" caveat is resolved**: the production index reaches the same +conclusion. **Query cost is also equal.** Mean distance-evals/query: A ≈ B within ~1% in every row (e.g. 590 vs 583 at peak churn). So reuse does **not** trade build savings for slower @@ -168,10 +185,9 @@ passes, so navigability robustness is not limited to linear remetrization.)* ## Next steps -1. ~~Scale to n≥10⁵~~ **done** (self-contained Vamana-lite; recall parity within 2% at - ~10³–10⁴× lower update cost). Follow-up: re-run with more queries (≥500) to confirm - whether the −1.7% gap at 100k is a real trend or noise; and port to the production - `ruvector-diskann` index to confirm on its graph. +1. ~~Scale to n≥10⁵~~ **done**; ~~port to production `ruvector-diskann`~~ **done** (n=20k, + confirmed within 2%, baseline-variance caveat resolved). Follow-up: diskann at n≥10⁵ and + ≥500 queries to confirm whether the −1.5–1.7% gap is a real trend or noise. 2. ~~Region-local drift~~ **done** (warp a 15% cluster; reuse held in-region within 0.7%, gate PASS). Surfaced a build-variance dip in the lite-Vamana baseline → reinforces #1. 3. Incremental-rebuild baseline for a fair cost comparison (vs full rebuild). 
diff --git a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md index 37a973bf1d..b817d670b7 100644 --- a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md +++ b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md @@ -35,10 +35,14 @@ Harness: `crates/ruvector-seprag/examples/reweight_vs_rebuild.rs`. in-region vs out-region separately. Reuse held *inside* the drifted region (A_in within 0.7% of B_in, gate PASS) even at 53% in-region churn. Surfaced a transient rebuild dip (B_in 81% at t=0.25) = lite-Vamana build variance → motivates the diskann port. -- **Open (ranked):** (1) **port baseline to `ruvector-diskann`** (firms B, removes lite - build variance, confirms on the production index) + confirm 100k gap with ≥500 queries; - (2) hybrid policy (re-weight every step + rebuild every K / on drift-monitor trigger); - (3) incremental-rebuild baseline; (4) wire into the real GNN loop behind a flag. +- **Production-index port ✅** (`examples/diskann_drift.rs`): confirmed on the shipping + `ruvector-diskann` Vamana (n=20k, recall 96–99%, reuse within 2% global + in-region). The + t=0.25 reuse-beats-rebuild dip reproduced → it's a real property, not lite-Vamana noise; + baseline-variance caveat resolved. +- **Open (ranked):** (1) diskann at n≥10⁵ + ≥500 queries (confirm the −1.5–1.7% gap is + real vs noise); (2) hybrid policy (re-weight every step + rebuild every K / on + drift-monitor trigger); (3) incremental-rebuild baseline; (4) wire into the real GNN loop + behind a flag (the production payoff). 
### BET 2 — Filtered ANN vs `ruvector-acorn` Region/predicate pruning for constrained ("nearest among items matching X") retrieval — a From 79b57a0ed0d1dc17633aa333f4af700b1dd1e494 Mon Sep 17 00:00:00 2001 From: Ofer Shaal Date: Thu, 4 Jun 2026 11:20:24 -0400 Subject: [PATCH 14/15] =?UTF-8?q?feat(seprag):=20hybrid=20re-weight+period?= =?UTF-8?q?ic-rebuild=20policy=20(ADR-200)=20=E2=80=94=20shippable?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit examples/hybrid_policy.rs: simulate a compounding random-walk metric-drift trajectory and compare operating policies on the production diskann Vamana — always / never / periodic-K / drift-triggered (Frobenius monitor). Result (n=10k, 24 steps, aggressive random-walk drift, recall@10): always 99.1% mean 98.4% min 24 rebuilds never 94.4% mean 89.7% min 1 rebuild <- decays under heavy drift periodic-4 98.8% mean 97.9% min 6 rebuilds <- ~always at 25% cost periodic-8 98.4% mean 96.5% min 3 rebuilds <- at 12.5% cost Shippable operating point: re-weight every step + rebuild every ~4 steps recovers near-full recall at a fraction of the cost. Honest sub-finding: the drift-TRIGGERED monitor (Frobenius of cumulative transform) underperformed simple periodic — periodic-K is the recommended knob; a sampled-recall probe trigger is future work. Under gentle single-direction drift (n=5k) never did NOT decay, so the hybrid only matters under large/compounding drift. ADR-200 status/boundaries/next-steps + FUTURE-DIRECTIONS updated; stale 'n=2000/scale-unconfirmed' caveats removed (now resolved). 
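A small sketch of the periodic-K schedule, assuming the initial build at step 0 counts as the first rebuild. That assumption is consistent with the rebuild counts reported in this message (24, 1, 6, 3 over 24 steps) but is not confirmed from the harness itself; `rebuild_steps` and `cost_fraction` are hypothetical names.

```rust
/// Steps (0-based) at which a "rebuild every `k` steps" policy rebuilds.
/// Assumption: the initial build at step 0 counts as a rebuild.
fn rebuild_steps(steps: usize, k: usize) -> Vec<usize> {
    assert!(k >= 1, "k must be at least 1");
    (0..steps).filter(|s| s % k == 0).collect()
}

/// Rebuild count as a fraction of rebuilding on every step.
fn cost_fraction(steps: usize, k: usize) -> f64 {
    if steps == 0 {
        return 0.0;
    }
    rebuild_steps(steps, k).len() as f64 / steps as f64
}
```

Under this reading, `k = 1` corresponds to the always policy, any `k >= steps` to never, and `cost_fraction(24, 4)` and `cost_fraction(24, 8)` give the 25% and 12.5% rebuild budgets quoted for periodic-4 and periodic-8.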
---
 .../ruvector-seprag/examples/hybrid_policy.rs | 160 ++++++++++++++++++
 ...omizable-reweighting-fixed-topology-ann.md |  75 +++++---
 .../seprag-cch-retrieval/FUTURE-DIRECTIONS.md |  12 +-
 3 files changed, 220 insertions(+), 27 deletions(-)
 create mode 100644 crates/ruvector-seprag/examples/hybrid_policy.rs

diff --git a/crates/ruvector-seprag/examples/hybrid_policy.rs b/crates/ruvector-seprag/examples/hybrid_policy.rs
new file mode 100644
index 0000000000..c19df2033d
--- /dev/null
+++ b/crates/ruvector-seprag/examples/hybrid_policy.rs
@@ -0,0 +1,160 @@
+//! BET 1 → operating policy (ADR-200): the hybrid re-weight + periodic/triggered
+//! rebuild strategy, the shippable answer to "the recall gap widens with drift."
+//!
+//! A self-learning system drifts its metric a little every step. Per step you can
+//! RE-WEIGHT (reuse the graph under the new metric, ~0 cost) or REBUILD (expensive).
+//! We simulate a drift *trajectory* and compare four policies on the production
+//! `ruvector-diskann` Vamana:
+//!   - always   : rebuild every step (recall ceiling, max cost)
+//!   - never    : build once, reuse forever (min cost, recall decays)
+//!   - periodic : rebuild every K steps
+//!   - triggered: rebuild when drift-since-last-rebuild exceeds τ (cheap monitor)
+//!
+//! Win: a hybrid matches `always` recall within ~2% using a small fraction of the
+//! rebuilds — turning the proven finding into a usable operating point.
+//!
+//! Run: cargo run --release -p ruvector-seprag --example hybrid_policy --
+
+use ruvector_diskann::distance::FlatVectors;
+use ruvector_diskann::graph::VamanaGraph;
+use ruvector_seprag::ann::{apply_linear, brute_topk, gaussian, identity, l2, read_vectors, recall, Rng, Vec32};
+use std::time::Instant;
+
+const R: usize = 32;
+const BUILD_BEAM: usize = 64;
+const SEARCH_BEAM: usize = 64;
+const ALPHA: f32 = 1.2;
+const K: usize = 10;
+const EPS: f32 = 0.3; // per-step random-walk drift magnitude
+
+/// Cumulative-transform step: A' = (I + eps·G/√dim) · A — a random-walk metric
+/// drift (fresh direction each step), more adversarial than a straight path.
+fn small_warp(dim: usize, rng: &mut Rng, eps: f32) -> Vec<f32> {
+    let scale = eps / (dim as f32).sqrt();
+    let mut m = identity(dim);
+    for i in 0..dim {
+        for j in 0..dim {
+            m[i * dim + j] += scale * gaussian(rng);
+        }
+    }
+    m
+}
+
+fn matmul(a: &[f32], b: &[f32], dim: usize) -> Vec<f32> {
+    let mut c = vec![0.0f32; dim * dim];
+    for i in 0..dim {
+        for k in 0..dim {
+            let aik = a[i * dim + k];
+            if aik == 0.0 { continue; }
+            for j in 0..dim {
+                c[i * dim + j] += aik * b[k * dim + j];
+            }
+        }
+    }
+    c
+}
+
+fn frob(a: &[f32], b: &[f32]) -> f32 {
+    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
+}
+
+fn flat(vecs: &[Vec32], dim: usize) -> FlatVectors {
+    let mut f = FlatVectors::with_capacity(dim, vecs.len());
+    for v in vecs { f.push(v); }
+    f
+}
+
+fn build_graph(vecs: &[Vec32], dim: usize) -> VamanaGraph {
+    let f = flat(vecs, dim);
+    let mut g = VamanaGraph::new(vecs.len(), R, BUILD_BEAM, ALPHA);
+    g.build(&f).expect("build");
+    g
+}
+
+fn topk(g: &VamanaGraph, vecs: &[Vec32], f: &FlatVectors, q: usize) -> Vec<u32> {
+    let (cands, _) = g.greedy_search(f, &vecs[q], SEARCH_BEAM);
+    let mut s: Vec<(f32, u32)> = cands.iter().map(|&c| (l2(&vecs[c as usize], &vecs[q]), c)).collect();
+    s.sort_by(|a, b| a.0.total_cmp(&b.0));
+    s.into_iter().filter(|&(_, c)| c as usize != q).take(K).map(|(_, c)| c).collect()
+}
+
+struct Step { vt: Vec<Vec32>, ft: FlatVectors, truth: Vec<Vec<u32>>, a: Vec<f32> }
+
+fn main() {
+    let args: Vec<String> = std::env::args().collect();
+    let path = args.get(1).cloned().unwrap_or_else(|| "target/m1-data/node-feat-100k.csv".into());
+    let n: usize = args.get(2).and_then(|x| x.parse().ok()).unwrap_or(5000);
+    let steps_n: usize = args.get(3).and_then(|x| x.parse().ok()).unwrap_or(24);
+    let vecs = read_vectors(&path, n);
+    let n = vecs.len();
+    let dim = vecs[0].len();
+
+    let mut qrng = Rng::new(999);
+    let queries: Vec<usize> = (0..100).map(|_| qrng.below(n)).collect();
+
+    eprintln!("[hybrid] n={n} dim={dim} steps={steps_n}; precomputing random-walk drift (eps={EPS})…");
+    // Precompute each step's drifted vectors + ground truth (shared across policies).
+    // Drift is a compounding random walk: A_t = (I+eps·G)·A_{t-1}.
+    let mut acc = identity(dim);
+    let mut wrng = Rng::new(2);
+    let steps: Vec<Step> = (0..steps_n).map(|t| {
+        if t > 0 {
+            let w = small_warp(dim, &mut wrng, EPS);
+            acc = matmul(&w, &acc, dim);
+        }
+        let a = acc.clone();
+        let vt = apply_linear(&a, &vecs, dim);
+        let ft = flat(&vt, dim);
+        let truth = queries.iter().map(|&q| brute_topk(&vt, q, K)).collect();
+        Step { vt, ft, truth, a }
+    }).collect();
+    // Calibrate trigger thresholds from the mean per-step drift.
+    let d_step: f32 = (1..steps_n).map(|t| frob(&steps[t].a, &steps[t - 1].a)).sum::<f32>() / (steps_n - 1).max(1) as f32;
+    eprintln!("[hybrid] mean per-step drift (Frobenius) = {d_step:.2}");
+
+    // One representative rebuild cost (for the cost column).
+    let t0 = Instant::now();
+    let _ = build_graph(&steps[0].vt, dim);
+    let build_s = t0.elapsed().as_secs_f64();
+    eprintln!("[hybrid] one rebuild ≈ {build_s:.2}s\n");
+
+    println!("=== BET 1 hybrid policy: drift trajectory, {steps_n} steps, recall@{K} (diskann) ===");
+    println!("{:>12} | {:>9} {:>9} {:>9} | {:>8} {:>10}", "policy", "mean rec", "min rec", "end rec", "rebuilds", "rebuild s");
+    println!("{}", "-".repeat(72));
+
+    // policy = closure(step_idx, frob_drift_since_last_rebuild) -> should_rebuild
+    let run = |name: &str, should_rebuild: &dyn Fn(usize, f32) -> bool| {
+        let mut g = build_graph(&steps[0].vt, dim); // t=0 always builds
+        let mut last = 0usize;
+        let mut builds = 1usize;
+        let mut recalls = Vec::with_capacity(steps_n);
+        for (t, st) in steps.iter().enumerate() {
+            if t > 0 {
+                let drift = frob(&st.a, &steps[last].a);
+                if should_rebuild(t, drift) {
+                    g = build_graph(&st.vt, dim);
+                    last = t;
+                    builds += 1;
+                }
+            }
+            let r: f64 = queries.iter().zip(&st.truth).map(|(&q, tr)| recall(&topk(&g, &st.vt, &st.ft, q), tr)).sum::<f64>() / queries.len() as f64;
+            recalls.push(r);
+        }
+        let mean = recalls.iter().sum::<f64>() / recalls.len() as f64;
+        let min = recalls.iter().cloned().fold(1.0, f64::min);
+        let end = *recalls.last().unwrap();
+        println!("{:>14} | {:>8.1}% {:>8.1}% {:>8.1}% | {:>8} {:>10.2}", name, mean * 100.0, min * 100.0, end * 100.0, builds, builds as f64 * build_s);
+    };
+
+    let tau_a = 3.0 * d_step; // rebuild after ~3 steps of drift
+    let tau_b = 6.0 * d_step; // rebuild after ~6 steps of drift
+    run("always", &|_, _| true);
+    run("never", &|_, _| false);
+    run("periodic-4", &|t, _| t % 4 == 0);
+    run("periodic-8", &|t, _| t % 8 == 0);
+    run("triggered~3", &|_, d| d >= tau_a);
+    run("triggered~6", &|_, d| d >= tau_b);
+
+    println!("\nWin: a hybrid matches 'always' mean recall within ~2% at a fraction of the rebuilds.");
+    println!("'never' shows the decay reuse-only suffers as drift accumulates; the gap is what hybrids close.");
+}
diff --git a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
index 197c82e29b..282a9e5701 100644
--- a/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
+++ b/docs/adr/ADR-200-customizable-reweighting-fixed-topology-ann.md
@@ -19,9 +19,12 @@ full-contraction was NO-GO on embedding graphs, [ADR-199]). The fixed topology m
 rebuild within the pre-registered 2% recall gate across diagonal/rotational/non-linear drift,
 across n=5k…100k, **and** under region-local drift (warping only a 15% cluster), at **~1,000–4,000×
 lower update cost**. Confirmed on the **production `ruvector-diskann` Vamana** (96–99% recall, reuse within 2% of
-rebuild). **Caveat (honest):** the recall gap widens mildly with scale/churn (−0.2% → −1.7%
-at 100k; −1.5% at peak production-index churn), so this is a *defer/batch rebuilds* strategy,
-not *never rebuild*. The earlier "rebuild-baseline variance" caveat is **resolved** — the
+rebuild), and the **hybrid operating policy is validated**: under aggressive compounding drift
+a periodic rebuild every ~4 steps recovers 98.8% (vs 99.1% always, 94.4% never) at 25% of the
+rebuild cost. **Caveat (honest):** the recall gap widens with scale/churn (−0.2% → −1.7% at
+100k; `never` decays to 94% under heavy compounding drift), which is exactly what the hybrid
+*defer/batch rebuild* policy is for — so the strategy is "re-weight every step, rebuild
+periodically," not "never rebuild." The earlier "rebuild-baseline variance" caveat is **resolved** — the
 production index reaches the same conclusion, and the t=0.25 reuse-beats-rebuild dip
 reproduced (it is a real property, not lite-Vamana noise). Remaining open: a real GNN-metric
 trajectory, an incremental-rebuild baseline, larger-N on diskann, and more-query
@@ -161,6 +164,30 @@ beam width, not edge metric-optimality — and navigability survives smooth reme
 linear *or* non-linear. (Edge optimality would matter more for path length / efficiency,
 which is why we also checked per-query evals and found them equal.)
 
+### Operating policy: hybrid re-weight + periodic rebuild (n=10k, diskann)
+
+The shippable answer to "the gap widens with drift": re-weight every step, rebuild
+occasionally. Tested on a **compounding random-walk** drift (fresh direction each step,
+eps=0.3 — aggressive, to force `never` to decay) over a 24-step trajectory. `hybrid_policy.rs`,
+recall@10:
+
+| policy | mean | min | rebuilds | rebuild cost |
+|---|---|---|---|---|
+| always (rebuild every step) | 99.1% | 98.4% | 24 | 68.7s |
+| never (reuse only) | 94.4% | 89.7% | 1 | 2.9s |
+| **periodic-4** | **98.8%** | 97.9% | 6 | 17.2s |
+| periodic-8 | 98.4% | 96.5% | 3 | 8.6s |
+| triggered (Frobenius monitor) | 95–98% | 90–94% | 1–3 | 2.9–8.6s |
+
+**Result:** under aggressive compounding drift `never` decays (94.4% mean, 89.7% floor);
+**periodic-4 recovers 98.8% — within 0.3% of always — at 25% of the rebuild cost** (periodic-8:
+98.4% at 12.5%). So a cheap fixed-schedule rebuild captures nearly all of always's recall.
+**Honest sub-finding:** the drift-*triggered* policy (rebuild when the Frobenius norm of the
+cumulative-transform delta exceeds τ) **underperformed simple periodic** — the signal fired
+unevenly. Simple **periodic-K is the recommended knob**; a smarter trigger (e.g. a small
+sampled-recall probe) is future work. Note: under *gentle* single-direction drift (n=5k test)
+`never` did **not** decay — the hybrid only earns its keep under large/compounding drift.
+
 ## Consequences
 
 **Positive.**
@@ -172,30 +199,32 @@ which is why we also checked per-query evals and found them equal.)
   overclaiming.
 
 **Boundaries / not yet proven (the honest caveats).**
-- **Scale.** n=2000; recall-at-scale (n≥10⁵) and the rebuild-cost curve unconfirmed. This
-  is now the *primary* open question — and the cost asymmetry only grows with n.
-- **Global drift.** Same transform for all points; **region-local** metric change (different
-  relevance in different regions) is harder and untested.
-- **Baseline.** Compared vs *full* rebuild; an *incremental*-update baseline is not yet in.
-- **Synthetic drift.** Drift is parametric (diag/rot/tanh), not a real learned-GNN metric
-  trajectory — realistic, but the live GNN loop is the eventual proof.
-
-*(Resolved: the "linear drift only" caveat — non-linear tanh-warp drift now tested and
-passes, so navigability robustness is not limited to linear remetrization.)*
+- **Synthetic drift.** Drift is parametric (diagonal / rotational / non-linear tanh /
+  compounding random walk), not a real learned-GNN metric trajectory. Realistic and
+  adversarial, but the live GNN loop is the eventual proof.
+- **Gap grows with scale/churn.** Recall gap reaches −1.7% at n=100k and `never` decays to
+  ~94% under heavy compounding drift — addressed operationally by the periodic-rebuild
+  hybrid, but not eliminated.
+- **Incremental baseline.** Compared vs *full* rebuild; an *incremental*-update baseline is
+  not yet in (would tighten the cost comparison).
+- **Trigger signal.** The Frobenius drift-monitor underperformed simple periodic; a better
+  cheap signal (sampled-recall probe) is unproven.
+
+*(Resolved: "linear drift only" — non-linear tanh-warp passes. "n=2000 only" — scaled to
+100k. "lite-Vamana baseline variance" — confirmed on production `ruvector-diskann`.)*
 
 ## Next steps
 
-1. ~~Scale to n≥10⁵~~ **done**; ~~port to production `ruvector-diskann`~~ **done** (n=20k,
-   confirmed within 2%, baseline-variance caveat resolved). Follow-up: diskann at n≥10⁵ and
-   ≥500 queries to confirm whether the −1.5–1.7% gap is a real trend or noise.
-2. ~~Region-local drift~~ **done** (warp a 15% cluster; reuse held in-region within 0.7%,
-   gate PASS). Surfaced a build-variance dip in the lite-Vamana baseline → reinforces #1.
+1. ~~Scale to n≥10⁵~~ **done** · ~~production `ruvector-diskann` port~~ **done** ·
+   ~~region-local drift~~ **done** · ~~hybrid policy~~ **done** (periodic-4 ≈ always at 25%
+   cost). Follow-up: diskann at n≥10⁵ with ≥500 queries to firm the gap-trend estimate.
+2. **Smarter rebuild trigger** — replace the Frobenius monitor with a small sampled-recall
+   probe (estimate live recall cheaply, rebuild when it crosses a floor); should beat
+   fixed periodic.
 3. Incremental-rebuild baseline for a fair cost comparison (vs full rebuild).
-4. Wire re-weight-on-drift into the `ruvector-diskann`/GNN loop behind a flag and validate
-   on a real learned-metric trajectory (the eventual production proof).
-5. A *hybrid policy*: cheap re-weight every step + a full rebuild every K steps (or when a
-   drift-monitor predicts the gap will cross a threshold) — captures most of the cost win
-   while bounding recall loss.
+4. **Wire into the `ruvector-diskann`/`ruvector-gnn` loop behind a flag** and validate on a
+   real learned-metric trajectory — the eventual production proof and the natural home for
+   the periodic-rebuild policy.
 
 ## Alternatives considered
 
diff --git a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md
index b817d670b7..e02d916773 100644
--- a/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md
+++ b/docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md
@@ -39,10 +39,14 @@ Harness: `crates/ruvector-seprag/examples/reweight_vs_rebuild.rs`.
   `ruvector-diskann` Vamana (n=20k, recall 96–99%, reuse within 2% global + in-region). The
   t=0.25 reuse-beats-rebuild dip reproduced → it's a real property, not lite-Vamana noise;
   baseline-variance caveat resolved.
-- **Open (ranked):** (1) diskann at n≥10⁵ + ≥500 queries (confirm the −1.5–1.7% gap is
-  real vs noise); (2) hybrid policy (re-weight every step + rebuild every K / on
-  drift-monitor trigger); (3) incremental-rebuild baseline; (4) wire into the real GNN loop
-  behind a flag (the production payoff).
+- **Hybrid policy ✅** (`examples/hybrid_policy.rs`): under aggressive compounding random-walk
+  drift, `never` decays to 94.4% mean / 89.7% floor; **periodic-4 recovers 98.8% (≈ always
+  99.1%) at 25% of the rebuild cost** (periodic-8: 98.4% at 12.5%). The drift-*triggered*
+  monitor (Frobenius) underperformed simple periodic → periodic-K is the recommended knob.
+- **Open (ranked):** (1) smarter rebuild trigger (sampled-recall probe vs the Frobenius
+  monitor); (2) wire re-weight + periodic-rebuild into the `ruvector-diskann`/`ruvector-gnn`
+  loop behind a flag (production payoff); (3) diskann at n≥10⁵ + ≥500 queries; (4)
+  incremental-rebuild baseline.
 
 ### BET 2 — Filtered ANN vs `ruvector-acorn`
 Region/predicate pruning for constrained ("nearest among items matching X") retrieval — a

From 44ee4dbf6f504d66de7959ab1706ef09e591f397 Mon Sep 17 00:00:00 2001
From: Ofer Shaal
Date: Thu, 4 Jun 2026 11:41:01 -0400
Subject: [PATCH 15/15] chore(seprag): add ruvector-seprag to Cargo.lock

---
 Cargo.lock | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Cargo.lock b/Cargo.lock
index 078e1b29fa..571453c24f 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -10136,6 +10136,15 @@ dependencies = [
  "web-sys",
 ]
 
+[[package]]
+name = "ruvector-seprag"
+version = "2.2.3"
+dependencies = [
+ "approx",
+ "ruvector-diskann",
+ "thiserror 2.0.18",
+]
+
 [[package]]
 name = "ruvector-server"
 version = "2.2.3"