Retrieval eval harness

A small, offline harness for measuring retrieval quality — "did we surface the right file/symbol for this query?" — so accuracy claims are provable and regressions are caught. It implements move #0 of the code-retrieval strategy: nothing else in that plan (a better embedder, a reranker, fusion tuning) is worth shipping until we can measure it.

Metrics

Standard localization metrics, matching the SWE-bench / Agentless / SweRank and CoIR / CodeSearchNet conventions:

recall@k — fraction of relevant items found in the top k.
hit@k (Acc@k) — 1 if any relevant item is in the top k.
MRR — reciprocal rank of the first relevant item.
nDCG@k — rank-discounted gain (binary relevance).

Scored at file level (default) or symbol level (--level symbol).
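
As a rough sketch of how the four metrics above can be computed for a single query under binary relevance: the function names and the edge-case handling (empty ground truth, an already-deduplicated ranking) are illustrative assumptions, not the harness's actual code. MRR is the mean of `reciprocal_rank` over all cases.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item is in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item (0.0 if none is retrieved)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: gain 1 per relevant hit, discounted by log2(rank + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    # Ideal ranking puts every relevant item first (at most k of them fit).
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy example (made-up file names): two relevant files, found at ranks 2 and 4.
ranked = ["a.py", "b.py", "c.py", "d.py"]
relevant = {"b.py", "d.py"}
print(recall_at_k(ranked, relevant, 2))   # one of two relevant files in the top 2
print(hit_at_k(ranked, relevant, 2))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, relevant, 4))
```

The same functions serve both levels: pass file paths for file-level scoring or symbol names for symbol-level scoring.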

Quick start

# 1. Mine a dataset from the repo's own git history (no network, no LLM):
#    query = commit subject, ground truth = files that commit changed (and still exist).
coderag eval --build --dataset coderag-eval.jsonl

# 2. Score the current hybrid retriever:
coderag eval --dataset coderag-eval.jsonl

# 3. Contrast dense-only vs BM25-only vs hybrid on one index:
coderag eval --dataset coderag-eval.jsonl --compare

# 4. Add the optional two-stage cross-encoder reranker (adds a hybrid+rerank row):
coderag eval --dataset coderag-eval.jsonl --compare --rerank

# 5. Harder, non-saturated: symbol-level (find the right function, not just file).
coderag eval --build --level symbol --dataset sym.jsonl   # mines relevant_symbols too
coderag eval --dataset sym.jsonl --level symbol --compare --rerank

# 6. Structure-aware 1-hop call-graph (callee) expansion (adds a hybrid+graph row):
coderag eval --dataset sym.jsonl --level symbol --compare --graph
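
Step 1 mines cases from git history: the commit subject becomes the query and the files that commit changed (and that still exist) become the ground truth. The sketch below shows one way that idea could work. It is not the code behind `--build`; the helper names, the log format string, and the `exists` callback are all assumptions made for illustration.

```python
import os
import subprocess
from typing import Callable, Dict, List

# Assumed log format: record separator (0x1e), commit hash, unit separator (0x1f), subject.
LOG_FORMAT = "--pretty=format:%x1e%H%x1f%s"


def read_git_log(repo: str, max_commits: int = 200) -> str:
    """Return `git log` text with one record per commit plus its changed paths."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--no-merges", "--name-only",
         f"--max-count={max_commits}", LOG_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def cases_from_log(log_text: str, exists: Callable[[str], bool]) -> List[Dict]:
    """Turn log text into eval cases, keeping only files that still exist."""
    cases = []
    for record in log_text.split("\x1e"):
        if not record.strip():
            continue
        header, *paths = record.split("\n")
        sha, subject = header.split("\x1f", 1)
        files = [p for p in paths if p.strip() and exists(p)]
        if files:  # a commit whose files are all gone has no ground truth left
            cases.append({"query": subject, "relevant_files": files,
                          "id": sha, "source": "git"})
    return cases


def mine(repo: str) -> List[Dict]:
    """Convenience wrapper: check file existence against the working tree."""
    return cases_from_log(
        read_git_log(repo),
        exists=lambda path: os.path.exists(os.path.join(repo, path)),
    )
```

Keeping the parser separate from the `git` call makes it easy to test on canned log text. Paths containing unusual characters may be quoted by git and would need extra handling that this sketch omits.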

Reranking is opt-in at search time too: set CODERAG_RERANK=1 (model via CODERAG_RERANK_MODEL, pool depth via CODERAG_RERANK_CANDIDATES) and every coderag search / API / UI query runs two-stage retrieve-then-rerank. Likewise CODERAG_GRAPH_EXPANSION=1 turns on 1-hop call-graph expansion — pulling in the definitions of what the top hits call (their callees) — for every query. See configuration.md and §4 of research/code-retrieval-strategy.md.

mode    n   MRR    R@1    R@5    R@10   nDCG@1  nDCG@5  nDCG@10  Hit@1  Hit@5  Hit@10
------  --  -----  -----  -----  -----  ------  ------  -------  -----  -----  ------
dense   …
bm25    …
hybrid  …

Add --json for machine-readable output, --ks 1,3,5,10 to change cutoffs, and --level symbol for function/class-level localization (needs relevant_symbols in the dataset). The usual --watched-dir / --store-dir / --provider / --model flags apply.

The default fake provider is for tests only: its vectors are random, so dense looks near-zero. Run real evals against fastembed (the local default) or whatever model you're evaluating, e.g. coderag eval --dataset … --compare --model BAAI/bge-small-en-v1.5, then run it again with a candidate like CodeRankEmbed to measure the lift.

Measured results (this repo)

Move #1 experiment — current default vs a code-specific model — run with scripts/bench_embedders.py on the curated dataset (coderag/eval/datasets/coderag_self.jsonl, 24 natural-language → file cases, 90 files / 553 chunks):

mode                                   n   MRR    R@1    R@5    R@10   nDCG@10  Hit@10
bge-small-en-v1.5 · dense              24  0.784  0.604  0.938  1.000  0.831    1.000
bge-small-en-v1.5 · bm25               24  0.751  0.604  0.854  1.000  0.802    1.000
bge-small-en-v1.5 · hybrid             24  0.822  0.688  1.000  1.000  0.860    1.000
jina-embeddings-v2-base-code · dense   24  0.759  0.583  0.938  0.979  0.810    1.000
jina-embeddings-v2-base-code · bm25    24  0.751  0.604  0.854  1.000  0.802    1.000
jina-embeddings-v2-base-code · hybrid  24  0.835  0.729  0.938  0.958  0.858    0.958

Two findings, one expected and one cautionary:

Hybrid beats either modality alone, for both models (bge hybrid MRR 0.822 > dense 0.784 > bm25 0.751; jina hybrid 0.835 > dense 0.759 > bm25 0.751). This is the core thesis — fusion is the differentiator vs pure-grep agents and single-modality embedding tools. The identical BM25 rows across models are a sanity check that the harness isolates the embedding variable correctly.
The code-specific model did not clearly beat bge-small here. jina-code's hybrid is marginally ahead on MRR/R@1 but behind on R@5/R@10/Hit@10. The reason is saturation: on a 90-file repo with lexical-rich NL queries, bge-small already hits Hit@10 = 1.0 and R@5 ≈ 1.0 — there's no recall headroom for a better model to capture. The large published CoIR gap (bge ~45.8 vs code models ~60) is measured on big, hard, cross-language corpora and does not transfer to a small single-repo file-localization task.

Takeaways: (a) don't flip the default to a 10×-larger model on this evidence — keep bge-small, offer code models as an option (coderag eval --list-models); (b) discriminating embedders needs a larger/harder benchmark (a big external repo, or harder cross-file/conceptual queries with less lexical leakage); (c) the remaining headroom is at rank 1 (R@1 ≈ 0.6–0.73), which is exactly what a cross-encoder reranker (strategy move #2) targets. This is the harness doing its job: it stopped a plausible-sounding upgrade that the data doesn't support.

Reranker experiment (move #2)

Adding the optional cross-encoder reranker (--rerank, default Xenova/ms-marco-MiniLM-L-12-v2) on the same 24-case dataset:

mode                               MRR    R@1    R@5    R@10   nDCG@10  Hit@10
bge-small-en-v1.5 · dense          0.805  0.646  0.938  1.000  0.845    1.000
bge-small-en-v1.5 · bm25           0.747  0.604  0.812  1.000  0.798    1.000
bge-small-en-v1.5 · hybrid         0.801  0.646  1.000  1.000  0.845    1.000
bge-small-en-v1.5 · hybrid+rerank  0.790  0.646  0.958  1.000  0.836    1.000

The reranker did not help here — it marginally hurt (hybrid+rerank MRR 0.790 < hybrid 0.801; R@5 0.958 < 1.000). Same lesson as move #1, plus a model-fit issue:

Saturation, again. Hybrid already gets R@5 = 1.0 / Hit@10 = 1.0 and the headroom is only at rank 1 (R@1 = 0.646). A reranker reorders within the candidate pool, so on file-level metrics where the right files are already in the pool, it can only shuffle — and any mistake shows up as a small regression.
Model fit. ms-marco-MiniLM is trained on web-passage relevance, not code. The research explicitly flagged that small-cross-encoder code reranking lift is inferred, not measured — this run is consistent with that caveat. A code-aware reranker (CODERAG_RERANK_MODEL=jinaai/jina-reranker-v2-base-multilingual or BAAI/bge-reranker-base) is worth trying, but those are larger.
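
The saturation point can be made concrete with a toy example. A reranker only permutes the first-stage pool, so membership-based metrics at the pool size cannot change, while the rank of the first relevant item can move in either direction. The file names and scores below are made up, and the dictionary lookups stand in for a real cross-encoder.

```python
from typing import Callable, List, Set


def rerank(pool: List[str], score: Callable[[str], float]) -> List[str]:
    """Second stage: reorder the first-stage pool by a relevance score."""
    return sorted(pool, key=score, reverse=True)


def first_relevant_rank(ranked: List[str], relevant: Set[str]) -> int:
    """1-based rank of the first relevant item, or 0 if none is present."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return rank
    return 0


# First-stage pool: the relevant file is already retrieved, at rank 2.
pool = ["util.py", "llm.py", "cli.py", "store.py"]
relevant = {"llm.py"}

# Two hypothetical rerankers: one well matched to the task, one poorly matched.
good_scores = {"llm.py": 0.9, "util.py": 0.4, "cli.py": 0.2, "store.py": 0.1}
bad_scores = {"cli.py": 0.9, "util.py": 0.8, "llm.py": 0.3, "store.py": 0.1}

good = rerank(pool, good_scores.__getitem__)
bad = rerank(pool, bad_scores.__getitem__)

print(first_relevant_rank(pool, relevant))  # before reranking
print(first_relevant_rank(good, relevant))  # moved up
print(first_relevant_rank(bad, relevant))   # moved down: a small regression
print(set(good) == set(pool) == set(bad))   # pool membership never changes
```

When the first stage already places the right files near the top, the upside of a good reranker is small and any scoring mistake shows up as a loss, which matches the file-level table above.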

Conclusion across moves #1 and #2 (file level): the recurring blocker was that file-level evaluation on this small repo is too saturated to discriminate between retrieval improvements. The fix is a harder benchmark: the symbol-level results in the next section provide one.

Symbol-level: the non-saturated benchmark (and the reranker, validated)

Build a symbol-level dataset (coderag eval --build --level symbol, or build_from_git(..., symbols=True)) — the functions/classes a commit touched that still exist at HEAD — and score with --level symbol. Finding the right function (not just file) is far harder, so the benchmark stops saturating (Hit@10 ≈ 0.5 instead of 1.0). On 10 symbol-level cases from this repo's history:

mode                               MRR    R@1    R@5    R@10   nDCG@10  Hit@10
bge-small-en-v1.5 · dense          0.400  0.183  0.292  0.317  0.327    0.400
bge-small-en-v1.5 · bm25           0.417  0.183  0.317  0.342  0.345    0.500
bge-small-en-v1.5 · hybrid         0.420  0.183  0.417  0.417  0.369    0.500
bge-small-en-v1.5 · hybrid+rerank  0.514  0.283  0.392  0.442  0.448    0.600

With headroom, the reranker delivers the predicted lift: R@1 0.183 → 0.283 (+55%), MRR 0.420 → 0.514, nDCG@10 0.369 → 0.448, R@10 0.417 → 0.442, Hit@10 0.500 → 0.600. The one exception is R@5, which dips slightly (0.417 → 0.392). All of this comes from the same off-the-shelf ms-marco-MiniLM that looked useless at the saturated file level. This both validates move #2 and confirms the saturation diagnosis: the file-level null result was a benchmark artifact, not a property of the technique. (Caveat: 10 cases is small/noisy; the direction is strong and largely consistent, but widen the dataset before quoting exact numbers.)

Net guidance: evaluate retrieval changes at symbol level — it's where the signal is.

Symbol-level model comparison (curated 22-case set)

Re-run on coderag/eval/datasets/coderag_self_symbols.jsonl (22 hand-verified natural-language → function/method cases, much less noisy than the git-mined set):

mode                                   MRR    R@1    R@5    R@10   nDCG@10  Hit@10
bge-small-en-v1.5 · dense              0.675  0.591  0.818  0.864  0.720    0.864
bge-small-en-v1.5 · bm25               0.427  0.318  0.636  0.727  0.498    0.727
bge-small-en-v1.5 · hybrid             0.573  0.364  0.864  0.864  0.647    0.864
bge-small-en-v1.5 · hybrid+rerank      0.580  0.409  0.864  0.864  0.651    0.864
jina-embeddings-v2-base-code · dense   0.483  0.318  0.682  0.773  0.554    0.773
jina-embeddings-v2-base-code · hybrid  0.604  0.455  0.818  0.864  0.668    0.864

Three findings, all actionable:

The code-specific model does not win on NL→symbol queries. bge-small · dense (MRR 0.675, R@1 0.591) clearly beats jina-code · dense (0.483 / 0.318). jina-v2-base-code is older and tuned more for code↔code; for natural-language "where is X" queries a good general text embedder is stronger. (jina-code's hybrid is competitive only because BM25 props up its weaker dense signal.)
Equal-weight hybrid is not universally better. For the strong bge-small retriever, dense alone (0.675) beats hybrid (0.573): on NL queries BM25 is weak (0.427) and equal-weight RRF drags the strong dense ranking down. For the weaker jina-code, BM25 helps (hybrid 0.604 > dense 0.483). Takeaway: fusion weights should depend on query type — weight dense up for natural-language queries; a fixed 1:1 is a compromise, not an optimum. (Implemented and validated below — note the "BM25-up for code" half of this intuition was refuted by the data.)
Reranking improves top-1 precision. hybrid+rerank lifts R@1 0.364 → 0.409 (+12%) over hybrid with the tiny ms-marco model — consistent with the git-mined result (+55% on 10 noisier cases). The reranker reliably sharpens the top of the list; it operates on the hybrid pool, so it can't fully recover the dense-vs-hybrid gap above (reranking a dense-weighted pool is the natural follow-up).
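
Finding 2 is easier to see with a small weighted reciprocal rank fusion (RRF) sketch. This is a generic formulation, not CodeRAG's fusion code: the constant k = 60 is the value commonly used for RRF, and the symbol names, rankings, and weights are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Fuse ranked lists: each list adds weight / (k + rank) to an item's score."""
    scores: Dict[str, float] = defaultdict(float)
    for name, ranked in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, item in enumerate(ranked, start=1):
            scores[item] += weight / (k + rank)
    # Highest fused score first; ties broken by name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Dense ranks the right symbol first; BM25 buries it at rank 5.
rankings = {
    "dense": ["stream_answer", "retry", "parse_args"],
    "bm25": ["load_config", "retry", "main", "open_store", "stream_answer"],
}

equal = weighted_rrf(rankings, {"dense": 1.0, "bm25": 1.0})
dense_up = weighted_rrf(rankings, {"dense": 1.0, "bm25": 0.25})

print(equal[0])     # "retry": mediocre in both lists, but it wins at 1:1
print(dense_up[0])  # "stream_answer": down-weighting BM25 restores the dense top hit
```

At 1:1 an item that is second in both lists outscores one that is first in the strong list and fifth in the weak one, which is the "drags the strong dense ranking down" effect described above.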

Larger code-aware rerankers (bge-reranker-base, jina-reranker-v2) are registered (coderag eval --list-models) but are ~1 GB and slow to rerank on CPU — test them on a GPU or a smaller candidate pool. The MiniLM default is the pragmatic local choice.

Bottom line for "win the eval": the biggest lever found here is query-type-aware fusion weighting (finding 2), then reranking for top-1 (finding 3) — not a bigger embedding model (finding 1). Validate these on a larger external repo next.

Adaptive fusion weighting (finding #2, implemented)

CODERAG_ADAPTIVE_FUSION=1 (or coderag eval --adaptive) routes the fusion weights by query type: a cheap local heuristic (looks_like_identifier) leans dense for natural-language queries and stays neutral for identifier-like ones. Validated on bge-small at symbol level against fixed 1:1 hybrid, on two 22-case sets:

NL queries (coderag_self_symbols.jsonl)     MRR    R@1    nDCG@10
  hybrid (fixed 1:1)                         0.604  0.455  0.669
  adaptive                                   0.674  0.545  0.722     ← +0.070 MRR, +20% R@1

identifier queries (coderag_self_identifiers.jsonl)
  hybrid (fixed 1:1)                         0.685  0.545  0.741
  adaptive                                   0.685  0.545  0.741     ← unchanged (no regression)

So adaptive is a Pareto improvement over fixed hybrid here: big gain on NL, no loss on identifiers. Honest caveat that shaped the defaults: the literature's "BM25-up for code" intuition was refuted by the data — up-weighting BM25 for identifier queries actively hurt (MRR 0.685 → 0.613), because short/common identifiers (search, index) are lexically ambiguous and the embedder already matches them well. So the code-side default is neutral (1:1), not BM25-leaning; BM25-leaning is left configurable (CODERAG_CODE_LEXICAL_WEIGHT) for larger repos where exact-string recall matters more. Off by default pending larger-repo validation; enable with CODERAG_ADAPTIVE_FUSION=1.

⚠️ First cut did not generalize — then was fixed. The original classifier keyed on query shape and mis-read prose queries that name a symbol. On pydantic (commit-message queries) it leaned dense and hurt (MRR 0.286 vs hybrid 0.361). The classifier now detects identifiers embedded in prose (references_identifier) and routes them to the neutral code weights, so adaptive falls back to plain hybrid whenever a symbol is named. Full write-up: research/external-validation.md.
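
A minimal sketch of the routing idea, including the embedded-identifier check, might look like this. The regular expressions, the function names, and the 2:1 dense weight are all hypothetical stand-ins; the real `looks_like_identifier` / `references_identifier` heuristics and weights may differ.

```python
import re
from typing import Tuple

# Whole query is one identifier-like token, optionally dotted (e.g. "search", "a.b.c").
_WHOLE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")
# A token inside prose that looks like code: snake_case, camelCase, call(), dotted.path.
_CODE_TOKEN = re.compile(r"[A-Za-z0-9]_[A-Za-z0-9]|[a-z][A-Z]|\w\(\)|\w\.\w")


def is_identifier_query(query: str) -> bool:
    """True when the entire query is a single identifier-like token."""
    return bool(_WHOLE_IDENTIFIER.match(query.strip()))


def mentions_identifier(query: str) -> bool:
    """True when a prose query names a symbol somewhere in it."""
    return any(_CODE_TOKEN.search(token) for token in query.split())


def fusion_weights(query: str, nl_dense_weight: float = 2.0) -> Tuple[float, float]:
    """Return (dense_weight, bm25_weight) for a query.

    Identifier-like queries, and prose that names a symbol, get neutral 1:1
    weights; only clean natural-language queries lean dense.
    """
    if is_identifier_query(query) or mentions_identifier(query):
        return (1.0, 1.0)
    return (nl_dense_weight, 1.0)


print(fusion_weights("search"))                          # identifier: neutral
print(fusion_weights("where is retry backoff handled"))  # clean prose: lean dense
print(fusion_weights("fix stream_answer retry on 429"))  # prose naming a symbol: neutral
```

A heuristic this cheap will misfire on some inputs (an abbreviation such as "e.g." matches the dotted pattern here), which is one reason to validate any such classifier on several repos before trusting it.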

Adaptive fusion, after the classifier fix

The classifier fix removed the catastrophic regression the first cut had (on pydantic, adaptive went from 0.286 → 0.458 = hybrid, no longer hurting). On two early datasets it looked like a clear win:

CodeRAG curated, symbol level          hybrid  dense  adaptive
  natural-language queries (MRR)        0.581   0.675   0.706   ← beats both
  identifier queries (MRR)              0.685   0.686   0.715   ← beats both

pydantic, symbol level (172 cases, 22 071-chunk corpus)
  dense 0.328 · bm25 0.398 · hybrid 0.458 · adaptive 0.458     ← equals hybrid (was 0.286)

⚠️ But a 4-repo sweep (627 git-mined cases) shows it is not an aggregate win — hybrid 0.442 vs adaptive 0.423 MRR; adaptive is a wash on the well-powered repos and the big CodeRAG-curated gain turned out to be an artifact of unusually dense-friendly clean-NL queries (see the Multi-repo evaluation section below). So adaptive stays off by default — it's a safe opt-in (no catastrophic regression after this fix), not a default. Fixed 1:1 hybrid remains the default. Enable per-session with CODERAG_ADAPTIVE_FUSION=1.

Multi-repo evaluation (judging generalization)

Single-repo tuning overfits — the external-repo validation showed levers that won on CodeRAG reversing on pydantic. So a config should only be promoted to a default once it wins on the average of several repos. scripts/bench_multirepo.py runs the eval across a manifest of repos and prints each repo's table plus a macro-averaged aggregate (each repo weighted equally, so a big repo can't dominate):

python scripts/bench_multirepo.py --manifest repos.json --level symbol --adaptive --rerank

[
  {"name": "coderag",  "watched_dir": ".",            "dataset": "coderag/eval/datasets/coderag_self_symbols.jsonl"},
  {"name": "pydantic", "watched_dir": "/tmp/pydantic", "store_dir": "/tmp/pyd_store", "dataset": "/tmp/pyd_sym.jsonl"}
]

Each entry reuses a prepared index + dataset (indexing is the slow part); pass --index to build them first and --build to mine a symbol dataset from git history. The aggregate rows are labelled mean:<mode>. See coderag/eval/datasets/multirepo.example.json. Programmatic API: from coderag.eval import aggregate_by_mode, mean_results.

This is the gate for promoting adaptive fusion to default-on: it should be ≥ hybrid on the aggregate and on every individual repo before the default flips.

Result: adaptive fusion does not earn default-on

Run across four repos (627 git-mined symbol-level cases, bge-small, with the embedded-identifier classifier):

                 coderag  flask  requests  click  | AGGREGATE (macro-avg, MRR)
  dense           0.423   0.297   0.354    0.351  |  0.356
  bm25            0.500   0.371   0.371    0.401  |  0.411
  hybrid          0.564   0.363   0.415    0.427  |  0.442   ← best
  adaptive        0.487   0.357   0.415    0.431  |  0.423
  (cases)          13      219     126      269   |  627

Hybrid 1:1 wins the aggregate (0.442) and is best or within 0.01 MRR of the best mode on every repo: it leads on coderag, ties adaptive on requests, and narrowly trails bm25 on flask (0.363 vs 0.371) and adaptive on click (0.427 vs 0.431). Adaptive does not clear the bar. On the three well-powered repos adaptive ≈ hybrid (a wash), and it trails on the small/noisy coderag set. The large curated-CodeRAG adaptive win reported above was an artifact of unusually dense-friendly, clean natural-language queries; on realistic git-mined commit queries dense is consistently the weakest modality, so "lean dense for NL" stops paying off.
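
The macro-average and the promotion gate can be re-derived from the per-repo MRR values in the table. The sketch below uses hypothetical helper names (it is not `aggregate_by_mode` or `mean_results`), and because the inputs are the rounded table values, a recomputed aggregate can differ from the table in the last digit.

```python
from statistics import fmean
from typing import Dict, List, Tuple

# Per-repo MRR, copied from the table above.
mrr: Dict[str, Dict[str, float]] = {
    "dense":    {"coderag": 0.423, "flask": 0.297, "requests": 0.354, "click": 0.351},
    "bm25":     {"coderag": 0.500, "flask": 0.371, "requests": 0.371, "click": 0.401},
    "hybrid":   {"coderag": 0.564, "flask": 0.363, "requests": 0.415, "click": 0.427},
    "adaptive": {"coderag": 0.487, "flask": 0.357, "requests": 0.415, "click": 0.431},
}


def macro_average(per_repo: Dict[str, float]) -> float:
    """Unweighted mean over repos, so a large repo cannot dominate."""
    return fmean(per_repo.values())


def clears_gate(
    candidate: Dict[str, float], baseline: Dict[str, float]
) -> Tuple[bool, List[str]]:
    """Gate: candidate >= baseline on the aggregate AND on every repo.

    Returns the verdict plus the repos where the candidate falls short.
    """
    losing = [repo for repo in baseline if candidate[repo] < baseline[repo]]
    aggregate_ok = macro_average(candidate) >= macro_average(baseline)
    return aggregate_ok and not losing, losing


for mode, per_repo in mrr.items():
    print(f"mean:{mode}  {macro_average(per_repo):.3f}")

passed, losing = clears_gate(mrr["adaptive"], mrr["hybrid"])
print(passed, losing)  # adaptive fails: below hybrid on the aggregate and on two repos
```

Checking every repo as well as the mean is what stops a single favorable repo from carrying a config to default-on.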

Decisions this locks in: keep adaptive_fusion off by default (it's a safe opt-in after the classifier fix — no catastrophic regression — but not an aggregate win); keep 1:1 hybrid as the default. The harness did its job: a single-repo "win" was correctly blocked from becoming a default. (Reranking across repos and a code-aware reranker remain the open levers.)

Dataset format

JSONL, one case per line:

{"query": "fix retry backoff on 429", "relevant_files": ["coderag/llm.py"], "relevant_symbols": ["stream_answer"], "id": "abc123", "source": "git"}

relevant_symbols and id/source are optional. Mine with --build, or hand-author cases for queries you care about (the natural-language "where is X handled" questions where semantic retrieval should beat grep).
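
For hand-authored datasets, a small loader that validates each line helps catch mistakes before a run. This is a sketch under stated assumptions: `EvalCase` and `parse_cases` are hypothetical names (the library may expose its own types), and treating `query` and `relevant_files` as required is inferred from the note that only the other fields are optional.

```python
import json
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class EvalCase:
    query: str
    relevant_files: List[str]
    relevant_symbols: List[str] = field(default_factory=list)  # optional
    id: Optional[str] = None                                    # optional
    source: Optional[str] = None                                # optional


def parse_cases(lines: Iterable[str]) -> List[EvalCase]:
    """Parse JSONL text lines into cases, skipping blank lines."""
    cases = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not record.get("query") or not record.get("relevant_files"):
            raise ValueError(f"line {lineno}: 'query' and 'relevant_files' are required")
        cases.append(
            EvalCase(
                query=record["query"],
                relevant_files=list(record["relevant_files"]),
                relevant_symbols=list(record.get("relevant_symbols", [])),
                id=record.get("id"),
                source=record.get("source"),
            )
        )
    return cases


example = (
    '{"query": "fix retry backoff on 429", "relevant_files": ["coderag/llm.py"], '
    '"relevant_symbols": ["stream_answer"], "id": "abc123", "source": "git"}'
)
cases = parse_cases([example, ""])
print(cases[0].query, cases[0].relevant_files, cases[0].relevant_symbols)
```

To load a file, pass an open handle: `with open("coderag-eval.jsonl") as fh: cases = parse_cases(fh)`.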

Library API

from coderag import CodeRAG, Config
from coderag.eval import build_from_git, compare_modes, evaluate

cr = CodeRAG(Config.from_env())
cr.index()
cases = build_from_git(cr.config.watched_dir, max_cases=200)

for r in compare_modes(cr, cases):          # dense / bm25 / hybrid
    print(r.label, r.as_dict())

# Or score any retriever callable directly:
res = evaluate(cr.search, cases, level="file")
print(res.recall, res.mrr)
