Merged
6 changes: 6 additions & 0 deletions docs/eval.md
@@ -224,6 +224,12 @@ ambiguous and the embedder already matches them well. So the code-side default i
for larger repos where exact-string recall matters more. Off by default pending larger-repo
validation; enable with `CODERAG_ADAPTIVE_FUSION=1`.

> ⚠️ **This did NOT generalize — keep it off.** On `pydantic` (4 155-chunk corpus, commit-message
> queries) adaptive *hurt* (MRR 0.286 vs hybrid 0.361): those queries embed exact API names, so
> "lean dense for NL" is backwards. The dense-vs-BM25 ranking flips by repo/query style, and
> **fixed 1:1 hybrid is the robust default.** Full write-up:
> [research/external-validation.md](research/external-validation.md).

## Dataset format

JSONL, one case per line:
5 changes: 5 additions & 0 deletions docs/research/code-retrieval-strategy.md
@@ -138,6 +138,11 @@ bolt-on. Treat as a later experiment, not a v1 move.
> half of the hypothesis was refuted** — up-weighting BM25 there *hurt* (short identifiers are
> lexically ambiguous; the embedder already matches them), so the code-side default is neutral.
> Off by default pending larger-repo validation. See [docs/eval.md](../eval.md).
>
> **Generalization update — refuted on a larger repo.** On `pydantic` adaptive *hurt* (MRR
> 0.286 vs hybrid 0.361) because commit-message queries embed exact API names; the dense-vs-BM25
> ranking flips by repo, and fixed 1:1 hybrid is the robust default. Keep adaptive off. See
> [external-validation.md](external-validation.md).

CodeRAG already does dense + BM25 + RRF — the literature says that's the right foundation; the wins
are in **routing and tuning**:
81 changes: 81 additions & 0 deletions docs/research/external-validation.md
@@ -0,0 +1,81 @@
# External-repo validation: do the symbol-level findings generalize?

> The CodeRAG-on-itself experiments (see [eval.md](../eval.md)) produced three "levers":
> dense-leaning adaptive fusion, reranking for top-1, and "a bigger embedder doesn't help."
> A single 93-file repo is a weak basis for a default change, so this validates them on a
> **larger external repo** — and the headline result is that **the biggest lever did not
> generalize.** External validation paid for itself.

## Setup

- **Repo:** `pydantic/pydantic` (depth-500 clone), 404 Python files. Indexed with the default
`bge-small-en-v1.5`; **161 files / 4 155 chunks** indexed (7× CodeRAG's corpus — full index
was CPU-bound at ~25 min, so a partial index was used and the dataset filtered to it).
- **Dataset:** 109 symbol-level cases mined from commit history (`build_from_git(symbols=True)`),
filtered to the 30 whose changed files are all in the indexed set. Queries are **commit
subjects** (e.g. *"Fix tuple order in `AliasGenerator.generate_aliases()`"*) — note these
often embed exact API/symbol names.
- All offline, `bge-small`, symbol level, on the existing index (no re-indexing).
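
The filtering step in the Dataset bullet (keep only cases whose changed files are all indexed) can be sketched as below. This is a minimal illustration, not CodeRAG's code: the JSONL field names `query` and `files` and the helper `filter_cases` are hypothetical, since the actual case schema is not shown here.

```python
import json
from typing import Iterable, List, Set


def filter_cases(lines: Iterable[str], indexed_files: Set[str],
                 files_key: str = "files") -> List[dict]:
    """Keep cases whose changed files are ALL present in the index.

    `files_key` is a hypothetical field name for the list of changed files.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        case = json.loads(line)
        changed = case.get(files_key, [])
        # A case with no files, or with any file outside the index, is dropped.
        if changed and all(path in indexed_files for path in changed):
            kept.append(case)
    return kept


# Toy data: only the first case is fully covered by the partial index.
sample_lines = [
    '{"query": "Fix alias order", "files": ["pkg/a.py"]}',
    '{"query": "Refactor core", "files": ["pkg/a.py", "pkg/missing.py"]}',
    '{"query": "Docs only", "files": []}',
]
kept_cases = filter_cases(sample_lines, indexed_files={"pkg/a.py", "pkg/b.py"})
```

Requiring every changed file to be indexed, rather than any one of them, avoids penalizing retrieval for targets it could never return.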

## Results (30 cases, 4 155-chunk corpus)

```
mode            MRR    R@1    R@5    R@10   nDCG@10  Hit@10
dense           0.150  0.067  0.253  0.253  0.166    0.300
bm25            0.384  0.317  0.425  0.465  0.403    0.533
hybrid          0.361  0.283  0.408  0.408  0.369    0.433
hybrid+rerank   0.353  0.253  0.386  0.419  0.344    0.467
adaptive        0.286  0.183  0.372  0.372  0.302    0.400
```
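
For readers unfamiliar with the column headers, here is a minimal sketch of the textbook definitions of MRR, recall@k, and hit@k. It assumes the harness uses these standard forms, which this document does not confirm; the function names are illustrative, not CodeRAG's API.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def evaluate(cases: List[Tuple[Sequence[str], Set[str]]], k: int = 10) -> dict:
    """Average each metric over all (ranking, relevant-set) cases."""
    n = len(cases)
    return {
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in cases) / n,
        f"R@{k}": sum(recall_at_k(r, rel, k) for r, rel in cases) / n,
        f"Hit@{k}": sum(hit_at_k(r, rel, k) for r, rel in cases) / n,
    }


# Toy rankings (not real results): one hit at rank 2, one miss, one two-target case.
toy_cases = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"q"}),
    (["m", "n"], {"m", "n"}),
]
toy_metrics = evaluate(toy_cases, k=1)
```

Under these definitions recall@k is fractional when a case has several target symbols, so it can sit below hit@k, which is the pattern the R@10 and Hit@10 columns show.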

## Findings — three of them overturn or qualify the single-repo conclusions

1. **The dense-vs-BM25 ranking *flips* by repo/query style.** On CodeRAG (clean, hand-written
NL queries) dense crushed BM25 (0.675 vs 0.427). On pydantic (commit-message queries that
embed API names) **BM25 crushes dense** (0.384 vs 0.150). Neither modality wins universally —
it depends on whether the discriminating signal is semantic or an exact identifier in the
query text.

2. **Adaptive fusion did not generalize — it *hurt* here** (0.286 vs hybrid 0.361). It leans
dense for "natural-language" queries, but pydantic's prose-shaped commit messages contain the
exact symbol names BM25 needs, so leaning dense is exactly backwards. The `looks_like_identifier`
classifier keys on prose *shape* and is fooled by identifier-laden sentences. **Conclusion:
keep `adaptive_fusion` OFF by default (as shipped); the "lean dense for NL" rule was overfit
to CodeRAG's curated queries.** It remains useful only with per-corpus calibration.

3. **Reranking (ms-marco) did not help either** (0.353 ≈ hybrid 0.361) — it lifted top-1 on
CodeRAG (+12–55% R@1) but was neutral/slightly negative here. A web-trained cross-encoder is
not a reliable code reranker; a code-aware one (untested at this scale on CPU) is the open
question.

4. **Fixed 1:1 hybrid is the robust default.** It is never the best but never the worst, and
stays within ~15% of the winner on *both* repos (0.573 vs dense 0.675 on CodeRAG, about 15%
lower; 0.361 vs bm25 0.384 here, about 6% lower). This directly validates CodeRAG's existing
default and the decision to keep the new levers opt-in.
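
The gap between hybrid and the per-repo winner can be recomputed from the figures quoted in the findings. This is a small worked check; it assumes the CodeRAG numbers (0.675, 0.427, 0.573) are MRR like the pydantic ones, which the text implies but does not state.

```python
# Scores quoted in the findings above (CodeRAG metric assumed to be MRR).
scores = {
    "CodeRAG": {"dense": 0.675, "bm25": 0.427, "hybrid": 0.573},
    "pydantic": {"dense": 0.150, "bm25": 0.384, "hybrid": 0.361},
}


def relative_gap(repo_scores: dict, mode: str = "hybrid") -> float:
    """Shortfall of `mode` relative to the best mode on the same repo."""
    best = max(repo_scores.values())
    return (best - repo_scores[mode]) / best


gaps = {repo: relative_gap(s) for repo, s in scores.items()}
# gaps["CodeRAG"]  is about 0.151 (hybrid vs dense)
# gaps["pydantic"] is about 0.060 (hybrid vs bm25)
```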

## The meta-lesson

Single-repo tuning overfits. Every "improvement" measured on CodeRAG-on-itself was fragile:
the dense-vs-BM25 ranking flipped, the adaptive-fusion lever reversed sign, and reranking's gain
evaporated. The robust configuration is exactly the **shipped defaults** — 1:1 hybrid, adaptive
off, rerank opt-in. This is the harness earning its keep: it caught the overfitting before any
of it became a default.
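
To make the "1:1 hybrid" versus "lean dense" distinction concrete, here is a minimal sketch of weighted reciprocal rank fusion. It assumes adaptive fusion amounts to changing per-retriever weights in RRF and uses the conventional constant `k = 60`; neither is confirmed as CodeRAG's implementation, and `weighted_rrf` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(rankings: Dict[str, List[str]],
                 weights: Dict[str, float],
                 k: int = 60) -> List[str]:
    """Fuse ranked lists: score(doc) = sum over retrievers of w / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for name, ranked in rankings.items():
        w = weights.get(name, 1.0)
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += w / (k + rank)
    # Sort by descending score; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings (not real results).
rankings = {
    "dense": ["a", "b", "c"],
    "bm25": ["b", "d", "a"],
}

fixed_1_to_1 = weighted_rrf(rankings, {"dense": 1.0, "bm25": 1.0})
dense_leaning = weighted_rrf(rankings, {"dense": 2.0, "bm25": 1.0})
# fixed_1_to_1  -> ['b', 'a', 'd', 'c']
# dense_leaning -> ['a', 'b', 'c', 'd']
```

Even in this toy case the top result changes with the weights, which is why a weighting rule tuned on one query style can reverse sign on another.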

**Actionable next steps** (none change a default):
- Make `looks_like_identifier` smarter — detect identifiers *embedded* in prose queries (so
"Fix `AliasGenerator.generate_aliases`" routes BM25-up, not dense-up). That could make adaptive
fusion a net win across query styles instead of fragile.
- Test a **code-aware reranker** (`bge-reranker-base`, `jina-reranker-v2`) at scale on GPU —
the only lever not yet fairly evaluated.
- Build a multi-repo eval set (several external repos) so future tuning is judged on
generalization, not a single codebase.
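
As a starting point for the first step, here is a rough sketch of detecting identifiers embedded in prose queries. The regexes and the name `embedded_identifiers` are illustrative assumptions, not the existing `looks_like_identifier` logic, and they would need calibration against real queries before driving any routing.

```python
import re
from typing import List

# Heuristic patterns (assumptions, not CodeRAG's rules).
_PATTERNS = [
    re.compile(r"`[^`]+`"),                                   # `backticked spans`
    re.compile(r"\b[A-Za-z_]\w+(?:\.[A-Za-z_]\w+)+\b"),       # dotted.paths
    re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b"),                 # snake_case
    re.compile(r"\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b"),     # CamelCase
    re.compile(r"\b[A-Za-z_]\w*\(\)"),                        # call_syntax()
]


def embedded_identifiers(query: str) -> List[str]:
    """Return identifier-like tokens found anywhere in a prose query."""
    found: List[str] = []
    for pattern in _PATTERNS:
        for match in pattern.finditer(query):
            token = match.group(0).strip("`")
            if token not in found:
                found.append(token)
    return found


commit_query = "Fix tuple order in `AliasGenerator.generate_aliases()`"
prose_query = "how does the retriever combine results"

has_ids_commit = bool(embedded_identifiers(commit_query))
has_ids_prose = bool(embedded_identifiers(prose_query))
```

A query that returns a non-empty list here is prose-shaped but identifier-bearing, the case that fooled the shape-based classifier in finding 2. Dotted segments require at least two characters so abbreviations such as "e.g." are not flagged.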

### Reproduce

```bash
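# Shallow-clone the external repo (depth 500 keeps enough history to mine cases)
git clone --depth 500 https://github.com/pydantic/pydantic /tmp/extrepo
# Build the index (~25 min on CPU)
coderag index --watched-dir /tmp/extrepo --store-dir /tmp/pyd_store
# Mine symbol-level cases from commit history into pyd.jsonl
coderag eval --build --level symbol --watched-dir /tmp/extrepo --dataset pyd.jsonl
# Compare retrieval modes, including adaptive fusion and reranking
coderag eval --dataset pyd.jsonl --level symbol --compare --adaptive --rerank \
  --watched-dir /tmp/extrepo --store-dir /tmp/pyd_store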
```