A small, offline harness for measuring retrieval quality — "did we surface the right file/symbol for this query?" — so accuracy claims are provable and regressions are caught. It implements move #0 of the code-retrieval strategy: nothing else in that plan (a better embedder, a reranker, fusion tuning) is worth shipping until we can measure it.
Standard localization metrics, matching the SWE-bench / Agentless / SweRank and CoIR / CodeSearchNet conventions:
- recall@k — fraction of relevant items found in the top k.
- hit@k (Acc@k) — 1 if any relevant item is in the top k.
- MRR — reciprocal rank of the first relevant item.
- nDCG@k — rank-discounted gain (binary relevance).
Scored at file level (default) or symbol level (--level symbol).
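The four metrics above can be written down in a few lines. This is a minimal sketch under binary relevance, assuming each ranked list contains no duplicates; the function names are illustrative and are not claimed to match the harness's internals.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item is in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item (0.0 if none is retrieved)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(runs) -> float:
    """MRR over an iterable of (ranked, relevant) pairs, one pair per query."""
    values = [reciprocal_rank(ranked, relevant) for ranked, relevant in runs]
    return sum(values) / len(values) if values else 0.0
```

For a query whose relevant files are ranked 2nd and 4th, recall@3 is 0.5 and the reciprocal rank is 0.5, while hit@3 is already 1.0. That is why hit@k saturates earlier than recall@k or MRR.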
# 1. Mine a dataset from the repo's own git history (no network, no LLM):
# query = commit subject, ground truth = files that commit changed (and still exist).
coderag eval --build --dataset coderag-eval.jsonl
# 2. Score the current hybrid retriever:
coderag eval --dataset coderag-eval.jsonl
# 3. Contrast dense-only vs BM25-only vs hybrid on one index:
coderag eval --dataset coderag-eval.jsonl --compare
# 4. Add the optional two-stage cross-encoder reranker (adds a hybrid+rerank row):
coderag eval --dataset coderag-eval.jsonl --compare --rerank
# 5. Harder, non-saturated: symbol-level (find the right function, not just file).
coderag eval --build --level symbol --dataset sym.jsonl # mines relevant_symbols too
coderag eval --dataset sym.jsonl --level symbol --compare --rerank
# 6. Structure-aware 1-hop call-graph (callee) expansion (adds a hybrid+graph row):
coderag eval --dataset sym.jsonl --level symbol --compare --graph

Reranking is opt-in at search time too: set CODERAG_RERANK=1 (model via
CODERAG_RERANK_MODEL, pool depth via CODERAG_RERANK_CANDIDATES) and every coderag search / API / UI query runs two-stage retrieve-then-rerank. Likewise
CODERAG_GRAPH_EXPANSION=1 turns on 1-hop call-graph expansion — pulling in the
definitions of what the top hits call (their callees) — for every query. See
configuration.md and §4 of
research/code-retrieval-strategy.md.
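To make "1-hop call-graph (callee) expansion" concrete, here is a small sketch. The call graph is a plain dict, and the `top_n` / `max_added` limits and the choice to append callees at the tail of the ranking are assumptions made for illustration; how CodeRAG actually merges expanded definitions is not described here.

```python
from typing import Dict, List


def expand_with_callees(
    hits: List[str],
    call_graph: Dict[str, List[str]],
    top_n: int = 3,
    max_added: int = 5,
) -> List[str]:
    """Append the direct callees of the top hits, keeping the original hits first."""
    seen = set(hits)
    added: List[str] = []
    for symbol in hits[:top_n]:
        # One hop only: look up what this hit calls, never the callees' callees.
        for callee in call_graph.get(symbol, []):
            if callee not in seen and len(added) < max_added:
                seen.add(callee)
                added.append(callee)
    return hits + added


# Hypothetical call graph: symbol -> symbols it calls.
graph = {
    "search": ["embed", "fuse"],
    "index": ["embed", "chunk"],
    "embed": ["tokenize"],
}
expanded = expand_with_callees(["search", "index"], graph)
```

Here `tokenize` is not pulled in, because it is two hops away from the original hits.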
mode n MRR R@1 R@5 R@10 nDCG@1 nDCG@5 nDCG@10 Hit@1 Hit@5 Hit@10
------ -- ----- ----- ----- ----- ------ ------ ------- ----- ----- ------
dense …
bm25 …
hybrid …
Add --json for machine-readable output, --ks 1,3,5,10 to change cutoffs, and
--level symbol for function/class-level localization (needs relevant_symbols in the
dataset). The usual --watched-dir / --store-dir / --provider / --model flags apply.
The default fake provider is for tests only: its vectors are random, so dense looks near-zero. Run real evals against fastembed (the local default) or whatever model you're evaluating, e.g. coderag eval --dataset … --compare --model BAAI/bge-small-en-v1.5, then again with a candidate like CodeRankEmbed to measure the lift.
Move #1 experiment — current default vs a code-specific model — run with
scripts/bench_embedders.py on the curated dataset
(coderag/eval/datasets/coderag_self.jsonl, 24 natural-language → file cases, 90 files /
553 chunks):
mode                                   n   MRR    R@1    R@5    R@10   nDCG@10  Hit@10
bge-small-en-v1.5 · dense              24  0.784  0.604  0.938  1.000  0.831    1.000
bge-small-en-v1.5 · bm25               24  0.751  0.604  0.854  1.000  0.802    1.000
bge-small-en-v1.5 · hybrid             24  0.822  0.688  1.000  1.000  0.860    1.000
jina-embeddings-v2-base-code · dense   24  0.759  0.583  0.938  0.979  0.810    1.000
jina-embeddings-v2-base-code · bm25    24  0.751  0.604  0.854  1.000  0.802    1.000
jina-embeddings-v2-base-code · hybrid  24  0.835  0.729  0.938  0.958  0.858    0.958
Two findings, one expected and one cautionary:
- Hybrid beats either modality alone, for both models (bge hybrid MRR 0.822 > dense 0.784 > bm25 0.751; jina hybrid 0.835 > dense 0.759 > bm25 0.751). This is the core thesis — fusion is the differentiator vs pure-grep agents and single-modality embedding tools. The identical BM25 rows across models are a sanity check that the harness isolates the embedding variable correctly.
- The code-specific model did not clearly beat bge-small here. jina-code's hybrid is marginally ahead on MRR/R@1 but behind on R@5/R@10/Hit@10. The reason is saturation: on a 90-file repo with lexical-rich NL queries, bge-small already hits Hit@10 = 1.0 and R@5 ≈ 1.0 — there's no recall headroom for a better model to capture. The large published CoIR gap (bge ~45.8 vs code models ~60) is measured on big, hard, cross-language corpora and does not transfer to a small single-repo file-localization task.
Takeaways: (a) don't flip the default to a 10×-larger model on this evidence — keep
bge-small, offer code models as an option (coderag eval --list-models); (b) discriminating
embedders needs a larger/harder benchmark (a big external repo, or harder
cross-file/conceptual queries with less lexical leakage); (c) the remaining headroom is at
rank 1 (R@1 ≈ 0.6–0.73), which is exactly what a cross-encoder reranker (strategy move
#2) targets. This is the harness doing its job: it stopped a plausible-sounding upgrade that
the data doesn't support.
Adding the optional cross-encoder reranker (--rerank, default
Xenova/ms-marco-MiniLM-L-12-v2) on the same 24-case dataset:
mode                               MRR    R@1    R@5    R@10   nDCG@10  Hit@10
bge-small-en-v1.5 · dense          0.805  0.646  0.938  1.000  0.845    1.000
bge-small-en-v1.5 · bm25           0.747  0.604  0.812  1.000  0.798    1.000
bge-small-en-v1.5 · hybrid         0.801  0.646  1.000  1.000  0.845    1.000
bge-small-en-v1.5 · hybrid+rerank  0.790  0.646  0.958  1.000  0.836    1.000
The reranker did not help here — it marginally hurt (hybrid+rerank MRR 0.790 < hybrid 0.801; R@5 0.958 < 1.000). Same lesson as move #1, plus a model-fit issue:
- Saturation, again. Hybrid already gets R@5 = 1.0 / Hit@10 = 1.0 and the headroom is only at rank 1 (R@1 = 0.646). A reranker reorders within the candidate pool, so on file-level metrics where the right files are already in the pool, it can only shuffle — and any mistake shows up as a small regression.
- Model fit. ms-marco-MiniLM is trained on web-passage relevance, not code. The research explicitly flagged that small-cross-encoder code reranking lift is inferred, not measured; this run is consistent with that caveat. A code-aware reranker (CODERAG_RERANK_MODEL=jinaai/jina-reranker-v2-base-multilingual or BAAI/bge-reranker-base) is worth trying, but those are larger.
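The "it can only shuffle" point from the saturation bullet can be illustrated with a toy two-stage pipeline. This is a sketch, not CodeRAG's implementation: `retrieve` and `score` are hypothetical callables standing in for the first-stage retriever and the cross-encoder.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    pool_size: int = 20,
    top_k: int = 10,
) -> List[str]:
    """Stage 1 fetches a candidate pool; stage 2 only reorders that pool."""
    pool = retrieve(query, pool_size)
    # sorted() is stable, so candidates with equal scores keep their first-stage order.
    reranked = sorted(pool, key=lambda doc: score(query, doc), reverse=True)
    return reranked[:top_k]


# Toy stand-ins (hypothetical): a fixed first-stage ranking and a lookup-table scorer.
FIRST_STAGE = ["retry.py", "llm.py", "backoff.py"]
TOY_SCORES = {"retry.py": 0.2, "llm.py": 0.9, "backoff.py": 0.5}


def toy_retrieve(query: str, n: int) -> List[str]:
    return FIRST_STAGE[:n]


def toy_score(query: str, doc: str) -> float:
    return TOY_SCORES[doc]


full = retrieve_then_rerank("q", toy_retrieve, toy_score, pool_size=3, top_k=3)
narrow = retrieve_then_rerank("q", toy_retrieve, toy_score, pool_size=2, top_k=3)
```

With the full pool the reranker changes the order but not the membership, so recall at the pool depth is unchanged and only rank-sensitive metrics move. With `pool_size=2`, `backoff.py` can never be returned however well it would have scored, which is why the pool depth matters.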
Conclusion across moves #1 and #2 (file level): the recurring blocker was that file-level on this small repo is too saturated to discriminate any retrieval improvement. The fix is a harder benchmark — see the symbol-level results next, which resolve it.
Build a symbol-level dataset (coderag eval --build --level symbol, or build_from_git(..., symbols=True)) — the functions/classes a commit touched that still exist at HEAD — and score
with --level symbol. Finding the right function (not just file) is far harder, so the
benchmark stops saturating (Hit@10 ≈ 0.5 instead of 1.0). On 10 symbol-level cases from this
repo's history:
mode                               MRR    R@1    R@5    R@10   nDCG@10  Hit@10
bge-small-en-v1.5 · dense          0.400  0.183  0.292  0.317  0.327    0.400
bge-small-en-v1.5 · bm25           0.417  0.183  0.317  0.342  0.345    0.500
bge-small-en-v1.5 · hybrid         0.420  0.183  0.417  0.417  0.369    0.500
bge-small-en-v1.5 · hybrid+rerank  0.514  0.283  0.392  0.442  0.448    0.600
With headroom, the reranker delivers exactly the predicted lift: R@1 0.183 → 0.283
(+55%), MRR 0.420 → 0.514, nDCG@10 0.369 → 0.448, Hit@10 0.500 → 0.600. Every metric except
R@5 (0.417 → 0.392) improves, using the same off-the-shelf ms-marco-MiniLM that looked useless
at the saturated file level. This both validates move #2 and confirms the saturation
diagnosis: the file-level null result was a benchmark artifact, not a property of the
technique. (Caveat: 10 cases is small/noisy; the direction is strong and consistent, but
widen the dataset before quoting exact numbers.)
Net guidance: evaluate retrieval changes at symbol level — it's where the signal is.
Re-run on coderag/eval/datasets/coderag_self_symbols.jsonl (22 hand-verified
natural-language → function/method cases, much less noisy than the git-mined set):
mode                                   MRR    R@1    R@5    R@10   nDCG@10  Hit@10
bge-small-en-v1.5 · dense              0.675  0.591  0.818  0.864  0.720    0.864
bge-small-en-v1.5 · bm25               0.427  0.318  0.636  0.727  0.498    0.727
bge-small-en-v1.5 · hybrid             0.573  0.364  0.864  0.864  0.647    0.864
bge-small-en-v1.5 · hybrid+rerank      0.580  0.409  0.864  0.864  0.651    0.864
jina-embeddings-v2-base-code · dense   0.483  0.318  0.682  0.773  0.554    0.773
jina-embeddings-v2-base-code · hybrid  0.604  0.455  0.818  0.864  0.668    0.864
Three findings, all actionable:
- The code-specific model does not win on NL→symbol queries. bge-small · dense (MRR 0.675, R@1 0.591) clearly beats jina-code · dense (0.483 / 0.318). jina-v2-base-code is older and tuned more for code↔code; for natural-language "where is X" queries a good general text embedder is stronger. (jina-code's hybrid is competitive only because BM25 props up its weaker dense signal.)
- Equal-weight hybrid is not universally better. For the strong bge-small retriever, dense alone (0.675) beats hybrid (0.573): on NL queries BM25 is weak (0.427) and equal-weight RRF drags the strong dense ranking down. For the weaker jina-code, BM25 helps (hybrid 0.604 > dense 0.483). Takeaway: fusion weights should depend on query type (weight dense up for natural-language queries); a fixed 1:1 is a compromise, not an optimum. (Implemented and validated below; note the "BM25-up for code" half of this intuition was refuted by the data.)
- Reranking improves top-1 precision. hybrid+rerank lifts R@1 0.364 → 0.409 (+12%) over hybrid with the tiny ms-marco model, consistent with the git-mined result (+55% on 10 noisier cases). The reranker reliably sharpens the top of the list; it operates on the hybrid pool, so it can't fully recover the dense-vs-hybrid gap above (reranking a dense-weighted pool is the natural follow-up).
Larger code-aware rerankers (bge-reranker-base, jina-reranker-v2) are registered
(coderag eval --list-models) but are ~1 GB and slow to rerank on CPU — test them on a GPU
or a smaller candidate pool. The MiniLM default is the pragmatic local choice.
Bottom line for "win the eval": the biggest lever found here is query-type-aware fusion weighting (finding 2), then reranking for top-1 (finding 3) — not a bigger embedding model (finding 1). Validate these on a larger external repo next.
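For readers unfamiliar with the fusion step that finding 2 refers to, weighted reciprocal rank fusion (RRF) can be sketched as follows. The smoothing constant `k = 60` is the value commonly used in the RRF literature; whether CodeRAG uses that constant, and how it breaks ties, are assumptions here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Dict[str, Sequence[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Fuse several ranked lists: each list adds weight / (k + rank) to an item's score."""
    scores: Dict[str, float] = defaultdict(float)
    for name, ranking in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, item in enumerate(ranking, start=1):
            scores[item] += weight / (k + rank)
    # Highest fused score first; ties are broken by item name so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Two retrievers that disagree completely about the order of three items.
rankings = {"dense": ["a", "b", "c"], "bm25": ["c", "b", "a"]}

equal = weighted_rrf(rankings, {"dense": 1.0, "bm25": 1.0})
dense_up = weighted_rrf(rankings, {"dense": 2.0, "bm25": 1.0})
```

With equal weights the two lists cancel each other out and the item both place in the middle ends up last; doubling the dense weight restores the dense order. This is the mechanism by which a weak BM25 ranking can pull a strong dense ranking down under a fixed 1:1 fusion.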
CODERAG_ADAPTIVE_FUSION=1 (or coderag eval --adaptive) routes the fusion weights by query
type: a cheap local heuristic (looks_like_identifier) leans dense for natural-language
queries and stays neutral for identifier-like ones. Validated on bge-small at symbol
level against fixed 1:1 hybrid, on two 22-case sets:
NL queries (coderag_self_symbols.jsonl)               MRR    R@1    nDCG@10
  hybrid (fixed 1:1)                                  0.604  0.455  0.669
  adaptive                                            0.674  0.545  0.722   ← +0.070 MRR, +20% R@1
identifier queries (coderag_self_identifiers.jsonl)
  hybrid (fixed 1:1)                                  0.685  0.545  0.741
  adaptive                                            0.685  0.545  0.741   ← unchanged (no regression)
So adaptive is a Pareto improvement over fixed hybrid here: big gain on NL, no loss on
identifiers. Honest caveat that shaped the defaults: the literature's "BM25-up for code"
intuition was refuted by the data — up-weighting BM25 for identifier queries actively hurt
(MRR 0.685 → 0.613), because short/common identifiers (search, index) are lexically
ambiguous and the embedder already matches them well. So the code-side default is neutral
(1:1), not BM25-leaning; BM25-leaning is left configurable (CODERAG_CODE_LEXICAL_WEIGHT)
for larger repos where exact-string recall matters more. Off by default pending larger-repo
validation; enable with CODERAG_ADAPTIVE_FUSION=1.
⚠️ First cut did not generalize, then was fixed. The original classifier keyed on query shape and misread prose queries that name a symbol. On pydantic (commit-message queries) it leaned dense and hurt (MRR 0.286 vs hybrid 0.361). The classifier now detects identifiers embedded in prose (references_identifier) and routes them to the neutral code weights, so adaptive falls back to plain hybrid whenever a symbol is named. Full write-up: research/external-validation.md.
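A rough sketch of this kind of routing is shown below. The two function names are borrowed from the text, but the bodies are hypothetical regex heuristics written for illustration, not CodeRAG's actual classifier, and the 2:1 dense lean is an assumed value.

```python
import re
from typing import Tuple

_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_CODE_TOKEN = re.compile(
    rf"(?:{_IDENT}(?:\.{_IDENT})+"            # dotted path, e.g. coderag.eval
    r"|_*[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+"      # snake_case, e.g. stream_answer
    r"|[a-z]+[a-z0-9]*(?:[A-Z][a-z0-9]*)+"    # camelCase, e.g. streamAnswer
    r"|(?:[A-Z][a-z0-9]+){2,})"               # PascalCase, e.g. BaseModel
)
_BARE_TOKEN = re.compile(r"[A-Za-z_][\w.]*")


def looks_like_identifier(query: str) -> bool:
    """Shape check: the whole query is a single identifier-like token."""
    tokens = query.split()
    return len(tokens) == 1 and _BARE_TOKEN.fullmatch(tokens[0]) is not None


def references_identifier(query: str) -> bool:
    """Content check: some word inside a prose query is written like a code symbol."""
    words = (word.strip(".,:;!?`'\"()") for word in query.split())
    return any(_CODE_TOKEN.fullmatch(word) for word in words)


def fusion_weights(query: str) -> Tuple[float, float]:
    """Return (dense_weight, bm25_weight); neutral whenever a symbol is named."""
    if looks_like_identifier(query) or references_identifier(query):
        return (1.0, 1.0)  # neutral code weights, i.e. plain hybrid
    return (2.0, 1.0)      # assumed lean-dense weights for pure natural language
```

A commit-style query such as "fix retry in stream_answer when 429" fails the shape check (it has several words) but passes the embedded-identifier check, so it gets the neutral weights. A regex like this will also produce false positives on abbreviations such as "e.g."; a real classifier would need to handle those.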
The classifier fix removed the catastrophic regression the first cut had (on pydantic,
adaptive went from 0.286 → 0.458 = hybrid, no longer hurting). On two early datasets it
looked like a clear win:
CodeRAG curated, symbol level        hybrid  dense  adaptive
  natural-language queries (MRR)     0.581   0.675  0.706   ← beats both
  identifier queries (MRR)           0.685   0.686  0.715   ← beats both
pydantic, symbol level (172 cases, 22 071-chunk corpus)
  dense 0.328 · bm25 0.398 · hybrid 0.458 · adaptive 0.458   ← equals hybrid (was 0.286)
Enable with CODERAG_ADAPTIVE_FUSION=1.
Single-repo tuning overfits — the external-repo validation
showed levers that won on CodeRAG reversing on pydantic. So a config should only be promoted
to a default once it wins on the average of several repos. scripts/bench_multirepo.py
runs the eval across a manifest of repos and prints each repo's table plus a macro-averaged
aggregate (each repo weighted equally, so a big repo can't dominate):
python scripts/bench_multirepo.py --manifest repos.json --level symbol --adaptive --rerank

[
  {"name": "coderag", "watched_dir": ".", "dataset": "coderag/eval/datasets/coderag_self_symbols.jsonl"},
  {"name": "pydantic", "watched_dir": "/tmp/pydantic", "store_dir": "/tmp/pyd_store", "dataset": "/tmp/pyd_sym.jsonl"}
]

Each entry reuses a prepared index + dataset (indexing is the slow part); pass --index to
build them first and --build to mine a symbol dataset from git history. The aggregate rows are
labelled mean:<mode>. See coderag/eval/datasets/multirepo.example.json. Programmatic API:
from coderag.eval import aggregate_by_mode, mean_results.
This is the gate for promoting adaptive fusion to default-on: it should be ≥ hybrid on the aggregate and on every individual repo before the default flips.
Run across four repos (627 git-mined symbol-level cases, bge-small, with the embedded-identifier
classifier):
          coderag  flask  requests  click  | AGGREGATE (macro-avg, MRR)
dense     0.423    0.297  0.354     0.351  | 0.356
bm25      0.500    0.371  0.371     0.401  | 0.411
hybrid    0.564    0.363  0.415     0.427  | 0.442  ← best
adaptive  0.487    0.357  0.415     0.431  | 0.423
(cases)   13       219    126       269    | 627
Hybrid 1:1 wins the aggregate (0.442) and is first or within 0.01 of first on every repo (bm25 edges it on flask, 0.371 vs 0.363, and adaptive on click, 0.431 vs 0.427); adaptive does not clear the bar. On the three well-powered repos adaptive ≈ hybrid (a wash), and it trails on the small/noisy coderag set. The large curated-CodeRAG adaptive win reported above was an artifact of unusually dense-friendly, clean natural-language queries; on realistic git-mined commit queries dense is consistently the weakest modality, so "lean dense for NL" stops paying off.
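The aggregate column and the promotion gate can be reproduced from the per-repo MRR values in the table above. This is a standalone sketch that does not use the project's aggregate_by_mode / mean_results helpers; the function names here are illustrative.

```python
from statistics import mean
from typing import Dict

# Per-repo MRR, copied from the table above.
MRR: Dict[str, Dict[str, float]] = {
    "dense":    {"coderag": 0.423, "flask": 0.297, "requests": 0.354, "click": 0.351},
    "bm25":     {"coderag": 0.500, "flask": 0.371, "requests": 0.371, "click": 0.401},
    "hybrid":   {"coderag": 0.564, "flask": 0.363, "requests": 0.415, "click": 0.427},
    "adaptive": {"coderag": 0.487, "flask": 0.357, "requests": 0.415, "click": 0.431},
}
CASES = {"coderag": 13, "flask": 219, "requests": 126, "click": 269}


def macro_average(per_repo: Dict[str, float]) -> float:
    """Each repo counts once, however many cases it contributes."""
    return mean(list(per_repo.values()))


def micro_average(per_repo: Dict[str, float], cases: Dict[str, int]) -> float:
    """Case-weighted alternative, shown for contrast: big repos dominate."""
    total = sum(cases[repo] for repo in per_repo)
    return sum(per_repo[repo] * cases[repo] for repo in per_repo) / total


def clears_gate(candidate: Dict[str, float], baseline: Dict[str, float]) -> bool:
    """Candidate must be >= baseline on the macro aggregate and on every repo."""
    if macro_average(candidate) < macro_average(baseline):
        return False
    return all(candidate[repo] >= baseline[repo] for repo in baseline)


macro = {mode: macro_average(per_repo) for mode, per_repo in MRR.items()}
best_mode = max(macro, key=macro.get)
adaptive_promoted = clears_gate(MRR["adaptive"], MRR["hybrid"])
```

On these numbers adaptive fails the gate twice over: it is below hybrid on the aggregate and on the coderag and flask repos, even though it matches or beats hybrid on requests and click.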
Decisions this locks in: keep adaptive_fusion off by default (it's a safe opt-in after
the classifier fix — no catastrophic regression — but not an aggregate win); keep 1:1 hybrid as
the default. The harness did its job: a single-repo "win" was correctly blocked from becoming a
default. (Reranking across repos and a code-aware reranker remain the open levers.)
JSONL, one case per line:
{"query": "fix retry backoff on 429", "relevant_files": ["coderag/llm.py"], "relevant_symbols": ["stream_answer"], "id": "abc123", "source": "git"}

relevant_symbols and id/source are optional. Mine with --build, or hand-author cases
for queries you care about (the natural-language "where is X handled" questions where semantic
retrieval should beat grep).
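To show roughly what mining file-level cases from git history involves, here is a standalone sketch. It is not the project's build_from_git: the log format string, the 7-character id, and the helper names are assumptions chosen for the example, and it ignores git's quoting of unusual path names.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, Dict, List

# \x01 starts each commit record; \x00 separates the hash from the subject line.
LOG_FORMAT = "%x01%H%x00%s"


def parse_git_log(text: str, exists: Callable[[str], bool]) -> List[Dict]:
    """Turn `git log --name-only` output into cases: query = subject, truth = surviving files."""
    cases = []
    for record in text.split("\x01"):
        if not record.strip():
            continue
        header, _, body = record.partition("\n")
        commit, _, subject = header.partition("\x00")
        # Keep only changed files that still exist; deleted or renamed paths are dropped.
        files = [line for line in body.splitlines() if line.strip() and exists(line)]
        if subject and files:
            cases.append(
                {"query": subject, "relevant_files": files, "id": commit[:7], "source": "git"}
            )
    return cases


def mine_cases(repo: str, max_cases: int = 200) -> List[Dict]:
    """Run git locally (no network) and parse its output."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--no-merges", "--name-only",
         f"--pretty=format:{LOG_FORMAT}"],
        check=True, capture_output=True, text=True,
    ).stdout
    root = Path(repo)
    return parse_git_log(out, lambda path: (root / path).is_file())[:max_cases]


def to_jsonl(cases: List[Dict]) -> str:
    """One JSON object per line, matching the dataset format described above."""
    return "\n".join(json.dumps(case) for case in cases)
```

Commits whose changed files have all since been removed produce no case at all, which is one reason git-mined sets are smaller than the commit count suggests.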
from coderag import CodeRAG, Config
from coderag.eval import build_from_git, compare_modes, evaluate
cr = CodeRAG(Config.from_env())
cr.index()
cases = build_from_git(cr.config.watched_dir, max_cases=200)
for r in compare_modes(cr, cases):  # dense / bm25 / hybrid
    print(r.label, r.as_dict())
# Or score any retriever callable directly:
res = evaluate(cr.search, cases, level="file")
print(res.recall, res.mrr)