Neverdecel · Jun 17, 2026
diff --git a/docs/eval.md b/docs/eval.md
@@ -224,6 +224,12 @@ ambiguous and the embedder already matches them well. So the code-side default i
 for larger repos where exact-string recall matters more. Off by default pending larger-repo
 validation; enable with `CODERAG_ADAPTIVE_FUSION=1`.
 
+> ⚠️ **This did NOT generalize — keep it off.** On `pydantic` (4 155-chunk corpus, commit-message
+> queries) adaptive *hurt* (MRR 0.286 vs hybrid 0.361): those queries embed exact API names, so
+> "lean dense for NL" is backwards. The dense-vs-BM25 ranking flips by repo/query style, and
+> **fixed 1:1 hybrid is the robust default.** Full write-up:
+> [research/external-validation.md](research/external-validation.md).
+
 ## Dataset format
 
 JSONL, one case per line:

diff --git a/docs/research/code-retrieval-strategy.md b/docs/research/code-retrieval-strategy.md
@@ -138,6 +138,11 @@ bolt-on. Treat as a later experiment, not a v1 move.
 > half of the hypothesis was refuted** — up-weighting BM25 there *hurt* (short identifiers are
 > lexically ambiguous; the embedder already matches them), so the code-side default is neutral.
 > Off by default pending larger-repo validation. See [docs/eval.md](../eval.md).
+>
+> **Generalization update — refuted on a larger repo.** On `pydantic` adaptive *hurt* (MRR
+> 0.286 vs hybrid 0.361) because commit-message queries embed exact API names; the dense-vs-BM25
+> ranking flips by repo, and fixed 1:1 hybrid is the robust default. Keep adaptive off. See
+> [external-validation.md](external-validation.md).
 
 CodeRAG already does dense + BM25 + RRF — the literature says that's the right foundation; the wins
 are in **routing and tuning**:

diff --git a/docs/research/external-validation.md b/docs/research/external-validation.md
@@ -0,0 +1,81 @@
+# External-repo validation: do the symbol-level findings generalize?
+
+> The CodeRAG-on-itself experiments (see [eval.md](../eval.md)) produced three "levers":
+> dense-leaning adaptive fusion, reranking for top-1, and "a bigger embedder doesn't help."
+> A single 93-file repo is a weak basis for a default change, so this validates them on a
+> **larger external repo** — and the headline result is that **the biggest lever did not
+> generalize.** External validation paid for itself.
+
+## Setup
+
+- **Repo:** `pydantic/pydantic` (depth-500 clone), 404 Python files. Indexed with the default
+  `bge-small-en-v1.5`; **161 files / 4 155 chunks** indexed (7× CodeRAG's corpus — full index
+  was CPU-bound at ~25 min, so a partial index was used and the dataset filtered to it).
+- **Dataset:** 109 symbol-level cases mined from commit history (`build_from_git(symbols=True)`),
+  filtered to the 30 whose changed files are all in the indexed set. Queries are **commit
+  subjects** (e.g. *"Fix tuple order in `AliasGenerator.generate_aliases()`"*) — note these
+  often embed exact API/symbol names.
+- All offline, `bge-small`, symbol level, on the existing index (no re-indexing).
+
+## Results (30 cases, 4 155-chunk corpus)
+
+```
+mode           MRR    R@1    R@5    R@10   nDCG@10  Hit@10
+dense          0.150  0.067  0.253  0.253  0.166    0.300
+bm25           0.384  0.317  0.425  0.465  0.403    0.533
+hybrid         0.361  0.283  0.408  0.408  0.369    0.433
+hybrid+rerank  0.353  0.253  0.386  0.419  0.344    0.467
+adaptive       0.286  0.183  0.372  0.372  0.302    0.400
+```
+
+## Findings — three of them overturn or qualify the single-repo conclusions
+
+1. **The dense-vs-BM25 ranking *flips* by repo/query style.** On CodeRAG (clean, hand-written
+   NL queries) dense crushed BM25 (0.675 vs 0.427). On pydantic (commit-message queries that
+   embed API names) **BM25 crushes dense** (0.384 vs 0.150). Neither modality wins universally —
+   it depends on whether the discriminating signal is semantic or an exact identifier in the
+   query text.
+
+2. **Adaptive fusion did not generalize — it *hurt* here** (0.286 vs hybrid 0.361). It leans
+   dense for "natural-language" queries, but pydantic's prose-shaped commit messages contain the
+   exact symbol names BM25 needs, so leaning dense is exactly backwards. The `looks_like_identifier`
+   classifier keys on prose *shape* and is fooled by identifier-laden sentences. **Conclusion:
+   keep `adaptive_fusion` OFF by default (as shipped); the "lean dense for NL" rule was overfit
+   to CodeRAG's curated queries.** It remains useful only with per-corpus calibration.
+
+3. **Reranking (ms-marco) did not help either** (0.353 ≈ hybrid 0.361) — it lifted top-1 on
+   CodeRAG (+12–55% R@1) but was neutral/slightly negative here. A web-trained cross-encoder is
+   not a reliable code reranker; a code-aware one (untested at this scale on CPU) is the open
+   question.
+
+4. **Fixed 1:1 hybrid is the robust default.** It is never the best but never the worst, and
+   stays within ~15% of the winner on *both* repos (0.573 vs dense 0.675 on CodeRAG, ~15%
+   behind; 0.361 vs bm25 0.384 here, ~6% behind). This directly validates CodeRAG's existing
+   default and the decision to keep the new levers opt-in.
+
+## The meta-lesson
+
+Single-repo tuning overfits. Every lever retested here proved fragile: the dense-vs-BM25
+ranking flipped, the adaptive-fusion lever reversed sign, and reranking's gain evaporated.
+The robust configuration is exactly the **shipped defaults**: 1:1 hybrid, adaptive off,
+rerank opt-in. This is the harness earning its keep: it caught the overfitting before any
+of it became a default.
+
+**Actionable next steps** (none change a default):
+- Make `looks_like_identifier` smarter — detect identifiers *embedded* in prose queries (so
+  "Fix `AliasGenerator.generate_aliases`" routes BM25-up, not dense-up). That could make adaptive
+  fusion a net win across query styles instead of fragile.
+- Test a **code-aware reranker** (`bge-reranker-base`, `jina-reranker-v2`) at scale on GPU —
+  the only lever not yet fairly evaluated.
+- Build a multi-repo eval set (several external repos) so future tuning is judged on
+  generalization, not a single codebase.
+
+### Reproduce
+
+```bash
+git clone --depth 500 https://github.com/pydantic/pydantic /tmp/extrepo
+coderag index --watched-dir /tmp/extrepo --store-dir /tmp/pyd_store   # ~25 min on CPU
+coderag eval --build --level symbol --watched-dir /tmp/extrepo --dataset pyd.jsonl
+coderag eval --dataset pyd.jsonl --level symbol --compare --adaptive --rerank \
+  --watched-dir /tmp/extrepo --store-dir /tmp/pyd_store
+```
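
The first next step in `external-validation.md` (detecting identifiers *embedded* in prose queries) could look roughly like the sketch below. Everything in it is an assumption for illustration: `contains_identifier`, `fusion_weights`, `weighted_rrf`, the regex heuristics, the 1:2 weighting, and `k=60` are hypothetical and are not CodeRAG's actual `looks_like_identifier` or fusion code. Note that it falls back to fixed 1:1 for plain prose instead of leaning dense, in line with finding 4.

```python
import re
from collections import defaultdict

# Hypothetical heuristics for "this query contains a code identifier".
# They are deliberately simple and will misfire on some prose (e.g. "GitHub").
_IDENTIFIER_PATTERNS = [
    re.compile(r"`[^`]+`"),                                 # backticked code span
    re.compile(r"\b[A-Za-z_]\w+(?:\.[A-Za-z_]\w+)+\b"),     # dotted path: pkg.mod.func
    re.compile(r"\b[a-z]+_[a-z0-9_]+\b"),                   # snake_case
    re.compile(r"\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b"),   # CamelCase
    re.compile(r"\b\w+\(\)"),                               # call syntax: foo()
]


def contains_identifier(query: str) -> bool:
    """True if any identifier-like token appears anywhere in the query."""
    return any(p.search(query) for p in _IDENTIFIER_PATTERNS)


def fusion_weights(query: str) -> tuple[float, float]:
    """Return (dense_weight, bm25_weight) for a query.

    Identifier-bearing queries lean BM25; everything else stays at 1:1.
    The 1:2 ratio is an untuned placeholder.
    """
    if contains_identifier(query):
        return 1.0, 2.0
    return 1.0, 1.0


def weighted_rrf(dense_ranked, bm25_ranked, w_dense=1.0, w_bm25=1.0, k=60):
    """Weighted reciprocal rank fusion over two ranked lists of document ids."""
    scores = defaultdict(float)
    for weight, ranking in ((w_dense, dense_ranked), (w_bm25, bm25_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


query = "Fix tuple order in `AliasGenerator.generate_aliases()`"
w_dense, w_bm25 = fusion_weights(query)
print(weighted_rrf(["a", "b", "c"], ["c", "b", "a"], w_dense, w_bm25))
# prints ['c', 'b', 'a']: the BM25 ordering wins for this identifier-laden query
```

Whether any such rule beats fixed 1:1 would still need to be shown on the multi-repo eval set proposed above; the results in this PR are a reminder that a heuristic tuned on one query style can reverse sign on another.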