diff --git a/docs/eval.md b/docs/eval.md
index 51145ce..69d6f72 100644
--- a/docs/eval.md
+++ b/docs/eval.md
@@ -224,6 +224,12 @@ ambiguous and the embedder already matches them well. So the code-side default i
 for larger repos where exact-string recall matters more. Off by default pending larger-repo
 validation; enable with `CODERAG_ADAPTIVE_FUSION=1`.
 
+> ⚠️ **This did NOT generalize — keep it off.** On `pydantic` (4 155-chunk corpus, commit-message
+> queries) adaptive *hurt* (MRR 0.286 vs hybrid 0.361): those queries embed exact API names, so
+> "lean dense for NL" is backwards. The dense-vs-BM25 ranking flips by repo/query style, and
+> **fixed 1:1 hybrid is the robust default.** Full write-up:
+> [research/external-validation.md](research/external-validation.md).
+
 ## Dataset format
 
 JSONL, one case per line:
diff --git a/docs/research/code-retrieval-strategy.md b/docs/research/code-retrieval-strategy.md
index 4a9aea1..3e8a717 100644
--- a/docs/research/code-retrieval-strategy.md
+++ b/docs/research/code-retrieval-strategy.md
@@ -138,6 +138,11 @@ bolt-on. Treat as a later experiment, not a v1 move.
 > half of the hypothesis was refuted** — up-weighting BM25 there *hurt* (short identifiers are
 > lexically ambiguous; the embedder already matches them), so the code-side default is neutral.
 > Off by default pending larger-repo validation. See [docs/eval.md](../eval.md).
+>
+> **Generalization update — refuted on a larger repo.** On `pydantic` adaptive *hurt* (MRR
+> 0.286 vs hybrid 0.361) because commit-message queries embed exact API names; the dense-vs-BM25
+> ranking flips by repo, and fixed 1:1 hybrid is the robust default. Keep adaptive off. See
+> [external-validation.md](external-validation.md).
 
 CodeRAG already does dense + BM25 + RRF — the literature says that's the right foundation; the wins
 are in **routing and tuning**:
diff --git a/docs/research/external-validation.md b/docs/research/external-validation.md
new file mode 100644
index 0000000..afec0b6
--- /dev/null
+++ b/docs/research/external-validation.md
@@ -0,0 +1,81 @@
+# External-repo validation: do the symbol-level findings generalize?
+
+> The CodeRAG-on-itself experiments (see [eval.md](../eval.md)) produced three "levers":
+> dense-leaning adaptive fusion, reranking for top-1, and "a bigger embedder doesn't help."
+> A single 93-file repo is a weak basis for a default change, so this validates them on a
+> **larger external repo** — and the headline result is that **the biggest lever did not
+> generalize.** External validation paid for itself.
+
+## Setup
+
+- **Repo:** `pydantic/pydantic` (depth-500 clone), 404 Python files. Indexed with the default
+  `bge-small-en-v1.5`; **161 files / 4 155 chunks** indexed (7× CodeRAG's corpus — full index
+  was CPU-bound at ~25 min, so a partial index was used and the dataset filtered to it).
+- **Dataset:** 109 symbol-level cases mined from commit history (`build_from_git(symbols=True)`),
+  filtered to the 30 whose changed files are all in the indexed set. Queries are **commit
+  subjects** (e.g. *"Fix tuple order in `AliasGenerator.generate_aliases()`"*) — note these
+  often embed exact API/symbol names.
+- All offline, `bge-small`, symbol level, on the existing index (no re-indexing).
+
+## Results (30 cases, 4 155-chunk corpus)
+
+```
+mode            MRR    R@1    R@5    R@10   nDCG@10  Hit@10
+dense           0.150  0.067  0.253  0.253  0.166    0.300
+bm25            0.384  0.317  0.425  0.465  0.403    0.533
+hybrid          0.361  0.283  0.408  0.408  0.369    0.433
+hybrid+rerank   0.353  0.253  0.386  0.419  0.344    0.467
+adaptive        0.286  0.183  0.372  0.372  0.302    0.400
+```
+
+## Findings — three of them overturn or qualify the single-repo conclusions
+
+1. **The dense-vs-BM25 ranking *flips* by repo/query style.** On CodeRAG (clean, hand-written
+   NL queries) dense crushed BM25 (0.675 vs 0.427). On pydantic (commit-message queries that
+   embed API names) **BM25 crushes dense** (0.384 vs 0.150). Neither modality wins universally —
+   it depends on whether the discriminating signal is semantic or an exact identifier in the
+   query text.
+
+2. **Adaptive fusion did not generalize — it *hurt* here** (0.286 vs hybrid 0.361). It leans
+   dense for "natural-language" queries, but pydantic's prose-shaped commit messages contain the
+   exact symbol names BM25 needs, so leaning dense is exactly backwards. The `looks_like_identifier`
+   classifier keys on prose *shape* and is fooled by identifier-laden sentences. **Conclusion:
+   keep `adaptive_fusion` OFF by default (as shipped); the "lean dense for NL" rule was overfit
+   to CodeRAG's curated queries.** It remains useful only with per-corpus calibration.
+
+3. **Reranking (ms-marco) did not help either** (0.353 ≈ hybrid 0.361) — it lifted top-1 on
+   CodeRAG (+12–55% R@1) but was neutral/slightly negative here. A web-trained cross-encoder is
+   not a reliable code reranker; a code-aware one (untested at this scale on CPU) is the open
+   question.
+
+4. **Fixed 1:1 hybrid is the robust default.** It is never the best but never the worst: about
+   15% below the winner on CodeRAG (0.573 vs dense 0.675) and about 6% below it here (0.361 vs
+   bm25 0.384). This directly validates CodeRAG's existing default and the decision to keep
+   the new levers opt-in.
+
+## The meta-lesson
+
+Single-repo tuning overfits. Every "improvement" measured on CodeRAG-on-itself was fragile:
+the dense-vs-BM25 ranking flipped, the adaptive-fusion lever reversed sign, and reranking's gain
+evaporated. The robust configuration is exactly the **shipped defaults** — 1:1 hybrid, adaptive
+off, rerank opt-in. This is the harness earning its keep: it caught the overfitting before any
+of it became a default.
+
+**Actionable next steps** (none change a default):
+- Make `looks_like_identifier` smarter — detect identifiers *embedded* in prose queries (so
+  "Fix `AliasGenerator.generate_aliases`" routes BM25-up, not dense-up). That could make adaptive
+  fusion a net win across query styles rather than a fragile one.
+- Test a **code-aware reranker** (`bge-reranker-base`, `jina-reranker-v2`) at scale on GPU —
+  the only lever not yet fairly evaluated.
+- Build a multi-repo eval set (several external repos) so future tuning is judged on
+  generalization, not a single codebase.
+
+### Reproduce
+
+```bash
+git clone --depth 500 https://github.com/pydantic/pydantic /tmp/extrepo
+coderag index --watched-dir /tmp/extrepo --store-dir /tmp/pyd_store  # ~25 min on CPU
+coderag eval --build --level symbol --watched-dir /tmp/extrepo --dataset pyd.jsonl
+coderag eval --dataset pyd.jsonl --level symbol --compare --adaptive --rerank \
+    --watched-dir /tmp/extrepo --store-dir /tmp/pyd_store
+```
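
Reviewer sketch for the first next step (detecting identifiers *embedded* in prose queries). This is a minimal illustration under stated assumptions, not CodeRAG's `looks_like_identifier`: the helper names `embedded_identifiers` and `prefers_lexical` and the regex heuristics are hypothetical, and nothing here has been evaluated on either repo.

```python
import re

# Hypothetical heuristics (assumptions for illustration only, not CodeRAG code).
# A token is a word that may contain dots and may end with "()".
_TOKEN = re.compile(r"[A-Za-z_][\w.]*(?:\(\))?")
# Dotted path such as "AliasGenerator.generate_aliases"; each segment has
# at least two characters so abbreviations like "e.g" are not flagged.
_DOTTED = re.compile(r"[A-Za-z_]\w+(?:\.[A-Za-z_]\w+)+")
# CamelCase with at least two humps, such as "BaseModel".
_CAMEL = re.compile(r"[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+")


def _is_code_like(token: str) -> bool:
    """Return True if a single token looks like a code identifier."""
    if token.endswith("()"):
        return True  # call syntax, e.g. "generate_aliases()"
    name = token.rstrip(".")  # drop sentence-final punctuation
    return (
        "_" in name  # snake_case or dunder names
        or _DOTTED.fullmatch(name) is not None
        or _CAMEL.fullmatch(name) is not None
    )


def embedded_identifiers(query: str) -> list[str]:
    """Return code-like tokens found anywhere in the query, in order."""
    return [t.rstrip(".") for t in _TOKEN.findall(query) if _is_code_like(t)]


def prefers_lexical(query: str) -> bool:
    """True when the query carries exact names, so fusion should not lean dense."""
    return bool(embedded_identifiers(query))


if __name__ == "__main__":
    # Commit-subject query taken from the Setup section of the write-up.
    print(embedded_identifiers(
        "Fix tuple order in `AliasGenerator.generate_aliases()`"
    ))
    print(prefers_lexical("how does the indexer decide which files to skip"))
```

The check runs per token rather than on the whole query, which is the point of the proposed change: a prose-shaped sentence that contains one exact symbol name is still flagged. Any routing built on top of it would need the per-corpus calibration noted in finding 2.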