Add a code-retrieval eval harness (+ embedder/reranker experiments) by Neverdecel · Pull Request #37 · Neverdecel/CodeRAG

Neverdecel · 2026-06-17T08:12:41Z

Why

To position CodeRAG as a retrieval layer that agentic tools (Claude Code, Codex) plug into — "better code search," Firecrawl-style — we first need to be able to prove retrieval quality. You can't claim "more accurate / faster / more efficient" without a measurement. This PR builds that measurement, then uses it to run the first two accuracy experiments — honestly reporting what worked and what didn't.

The headline deliverable is the eval harness. The embedder and reranker experiments are included as working, opt-in infrastructure with documented (null) results.

What's in here

1. Eval harness (`coderag/eval/`) — the centerpiece

A small, offline harness following the SWE-bench / Agentless / SweRank localization protocol:

Metrics (metrics.py): recall@k, hit@k (Acc@k), MRR, nDCG@k, with rank de-duplication so multiple chunks per file don't inflate scores.
Dataset (dataset.py): JSONL EvalCase format + a git-history miner (commit subject → files it changed that still exist at HEAD), filtering merges/reverts/bots/diffuse commits.
Harness (harness.py): evaluate() scores any retriever; compare_modes() contrasts dense-only vs BM25-only vs hybrid (and optional hybrid+rerank) on one index by swapping RRF weights.
CLI: coderag eval [--build] [--compare] [--rerank] [--level file|symbol] [--list-models], a thin adapter over the engine.
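The metric definitions above can be sketched as follows. This is a minimal illustration of file-level recall@k, hit@k, MRR, and nDCG@k with rank de-duplication, not the code in `metrics.py`; the function names and signatures are assumptions made for the example.

```python
import math
from typing import Iterable


def dedupe_ranked(paths: Iterable[str]) -> list[str]:
    """Keep only the first occurrence of each file, preserving rank order."""
    seen: set[str] = set()
    unique: list[str] = []
    for path in paths:
        if path not in seen:
            seen.add(path)
            unique.append(path)
    return unique


def recall_at_k(ranked: Iterable[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant files found in the de-duplicated top k."""
    if not relevant:
        return 0.0
    top = set(dedupe_ranked(ranked)[:k])
    return len(top & relevant) / len(relevant)


def hit_at_k(ranked: Iterable[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant file appears in the de-duplicated top k, else 0.0."""
    return 1.0 if set(dedupe_ranked(ranked)[:k]) & relevant else 0.0


def mrr(ranked: Iterable[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant file (0.0 if none is retrieved)."""
    for rank, path in enumerate(dedupe_ranked(ranked), start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Iterable[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG over the de-duplicated top k."""
    top = dedupe_ranked(ranked)[:k]
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, path in enumerate(top, start=1)
        if path in relevant
    )
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

With the ranked chunk list `["a.py", "a.py", "b.py"]` and `b.py` as the only relevant file, de-duplication places `b.py` at rank 2 rather than rank 3, so a file that contributes many chunks neither pushes other files down nor counts more than once.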

2. Optional cross-encoder reranker (`coderag/retrieval/rerank.py`)

Two-stage retrieve-then-rerank, off by default, zero new dependencies (fastembed's TextCrossEncoder, default Xenova/ms-marco-MiniLM-L-12-v2). Enable with CODERAG_RERANK=1. Reranker protocol + factory mirror the embeddings provider pattern.
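A rough sketch of the two-stage flow, assuming a `Reranker` protocol with a `model_id` property plus a hypothetical `score()` method (the method name and signature are illustrative, not taken from `rerank.py`). A deterministic token-overlap scorer stands in for the cross-encoder so the example runs offline.

```python
from typing import Callable, Protocol, Sequence


class Reranker(Protocol):
    """Scores how well each document answers the query (higher = more relevant)."""

    @property
    def model_id(self) -> str: ...

    # Hypothetical method name; the real protocol may differ.
    def score(self, query: str, documents: Sequence[str]) -> list[float]: ...


class OverlapReranker:
    """Deterministic stand-in that counts lowercase tokens shared with the query."""

    @property
    def model_id(self) -> str:
        return "fake/token-overlap"

    def score(self, query: str, documents: Sequence[str]) -> list[float]:
        query_tokens = set(query.lower().split())
        return [float(len(query_tokens & set(d.lower().split()))) for d in documents]


def retrieve_then_rerank(
    query: str,
    first_stage: Callable[[str, int], list[str]],
    reranker: Reranker,
    top_k: int = 5,
    candidates: int = 20,
) -> list[str]:
    """Fetch a deeper candidate pool, re-score it, reorder, and trim to top_k."""
    pool = first_stage(query, candidates)
    scores = reranker.score(query, pool)
    # sorted() is stable, so ties keep their first-stage order.
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in order[:top_k]]
```

The first stage is asked for more candidates than `top_k` so the reranker has room to promote a result that the fused ranking placed lower.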

3. Tooling & assets

coderag/embeddings/models.py + coderag eval --list-models: curated registry of local code-search embedders.
scripts/bench_embedders.py: reproducible per-model comparison.
coderag/eval/datasets/coderag_self.jsonl: 24 curated NL→file cases for benchmarking CodeRAG on itself.
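To make the NL→file case format concrete, here is a small round-trip sketch for a JSONL dataset. The `EvalCase` field names (`query`, `expected_files`) are assumptions for illustration; the actual schema in `dataset.py` may differ.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalCase:
    # Hypothetical field names, chosen only for this example.
    query: str
    expected_files: tuple[str, ...]


def dumps_cases(cases: Iterable[EvalCase]) -> str:
    """Serialize cases as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps({"query": c.query, "expected_files": list(c.expected_files)})
        for c in cases
    )


def loads_cases(text: str) -> list[EvalCase]:
    """Parse JSONL text back into cases, skipping blank lines."""
    cases: list[EvalCase] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        cases.append(EvalCase(row["query"], tuple(row["expected_files"])))
    return cases
```

One object per line keeps the dataset easy to diff and to append to by hand, which suits a small curated set.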

4. Docs

docs/research/code-retrieval-strategy.md: cited research synthesis and the prioritized plan.
docs/eval.md: harness usage + the measured results below.

5. Cleanup

New chunking.languages.extensions_for() as the single canonical language→extension reverse lookup, replacing two hardcoded copies (CLI + dataset miner).
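A reverse lookup of this kind might look like the sketch below. The table contents and the return type are assumptions for the example; only the name `extensions_for` comes from this PR.

```python
# Hypothetical forward table (extension -> language); the real one lives in
# chunking.languages and is likely larger.
EXTENSION_TO_LANGUAGE: dict[str, str] = {
    ".py": "python",
    ".pyi": "python",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".go": "go",
}


def extensions_for(language: str) -> tuple[str, ...]:
    """Return every file extension mapped to `language`, sorted for stability."""
    wanted = language.lower()
    return tuple(
        sorted(ext for ext, lang in EXTENSION_TO_LANGUAGE.items() if lang == wanted)
    )
```

Deriving the reverse mapping from a single forward table means the CLI and the dataset miner cannot drift apart when a language is added.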
README documents coderag eval.

Measured results (this repo, 24 cases)

mode                               MRR    R@1    R@5    nDCG@10
bge-small · dense                  0.805  0.646  0.938  0.845
bge-small · bm25                   0.747  0.604  0.812  0.798
bge-small · hybrid                 0.801  0.646  1.000  0.845
bge-small · hybrid+rerank          0.790  0.646  0.958  0.836
jina-code · hybrid                 0.835  0.729  0.938  0.858

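What worked (validated): hybrid fusion (dense+BM25+RRF) gives the best recall on this set, reaching R@5 = 1.000 versus 0.938 for dense-only and 0.812 for BM25-only, while staying level with dense-only on MRR, R@1, and nDCG@10. This is the core differentiator vs pure-grep agents and single-modality embedding tools.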
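For readers unfamiliar with the fusion step, the sketch below shows weighted reciprocal rank fusion and how setting one weight to zero yields the dense-only or BM25-only modes compared in the table. The function name, parameters, and tie-breaking rule are assumptions for illustration; `k=60` is the conventional RRF constant, not necessarily the value CodeRAG uses.

```python
from typing import Sequence


def rrf_fuse(
    dense: Sequence[str],
    bm25: Sequence[str],
    w_dense: float = 1.0,
    w_bm25: float = 1.0,
    k: int = 60,
) -> list[str]:
    """Fuse two ranked lists; each list adds weight / (k + rank) per document."""
    scores: dict[str, float] = {}
    for weight, ranking in ((w_dense, dense), (w_bm25, bm25)):
        if weight == 0:
            continue  # a zero weight removes that modality entirely
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


dense_hits = ["a.py", "b.py", "c.py"]
bm25_hits = ["c.py", "a.py", "d.py"]

hybrid = rrf_fuse(dense_hits, bm25_hits)
dense_only = rrf_fuse(dense_hits, bm25_hits, w_bm25=0.0)
bm25_only = rrf_fuse(dense_hits, bm25_hits, w_dense=0.0)
```

Because only the weights change, all three modes can be scored against a single index without re-embedding anything.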

What didn't (honest null results): the code-specific embedder and the reranker showed no clear lift here. The root cause is the benchmark, not the techniques: this repo is small and saturated (hybrid already hits R@5≈1.0 / Hit@10=1.0), so there is little headroom to measure improvements, and the generic MS-MARCO reranker is trained on web search data, not code. This is the harness doing its job: stopping cargo-cult "upgrades."

Follow-ups (not in this PR)

A larger, harder, non-saturated benchmark (1k+-file external repo; symbol-level + cross-file conceptual queries). This is the true critical path before re-validating the embedder swap and a code-aware reranker.
CodeRankEmbed via a custom ONNX export (fastembed doesn't ship it).
An MCP surface so agents adopt CodeRAG as a retrieval tool.

Testing

New offline tests: harness/metrics/dataset/git-miner, model registry, and reranker (via a deterministic fake reranker). Full pytest -m "not integration" suite green; ruff lint+format clean; mypy clean on all new code.

🤖 Generated with Claude Code


Synthesizes multi-source research on making CodeRAG win a code-retrieval eval harness against agentic-grep loops (Claude Code, Codex) and commercial semantic search (Cursor, Cody, Augment), under a local/zero-key constraint. Key findings and prioritized plan: build a SweRank/Agentless-style eval harness first; swap the default embedder (bge-small ~45.8 CoIR -> CodeRankEmbed ~60.1); add a local ONNX cross-encoder reranker; route/tune hybrid fusion; then structure-aware graph expansion. Includes cited accuracy-vs-cost tradeoffs and the honest grep-vs-embeddings debate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Implements move #0 of the retrieval strategy: a small, offline harness to measure retrieval quality so accuracy claims are provable and regressions are caught.

- coderag/eval/metrics.py: recall@k, hit@k (Acc@k), MRR, nDCG@k with rank de-duplication so multiple chunks per file don't inflate scores.
- coderag/eval/dataset.py: JSONL EvalCase format + a git miner that builds datasets SWE-bench/SweRank-style (commit subject -> changed files that still exist at HEAD), filtering merges/reverts/bots/diffuse commits.
- coderag/eval/harness.py: evaluate() scores any search callable; compare_modes() contrasts dense-only vs BM25-only vs hybrid on one index by swapping RRF fusion weights (the index is mode-independent).
- coderag eval [--build] [--compare] [--level file|symbol] CLI surface, a thin adapter over the engine.
- docs/eval.md usage guide; tests cover metrics, dataset round-trip, the git miner, and end-to-end scoring via the deterministic fake provider.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
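The git miner's filtering rules can be sketched as a pure function over already-parsed commits. Everything here is illustrative: the `Commit` record, the `max_files` threshold for "diffuse" commits, and the bot and revert heuristics are assumptions, not the logic in `dataset.py`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Commit:
    # Hypothetical record; a real miner would fill this from `git log`.
    subject: str
    author: str
    parent_count: int
    files: tuple[str, ...]


def mine_cases(
    commits: Iterable[Commit],
    files_at_head: set[str],
    max_files: int = 5,
) -> list[tuple[str, list[str]]]:
    """Turn commits into (query, expected files) pairs, skipping noisy commits."""
    cases: list[tuple[str, list[str]]] = []
    for commit in commits:
        if commit.parent_count > 1:
            continue  # merge commit
        if commit.subject.lower().startswith("revert"):
            continue  # revert commit
        if "[bot]" in commit.author.lower():
            continue  # automated author
        if len(commit.files) > max_files:
            continue  # diffuse change touching too many files
        surviving = [f for f in commit.files if f in files_at_head]
        if surviving:
            cases.append((commit.subject, surviving))
    return cases
```

Restricting the expected files to those still present at HEAD keeps every case answerable against the current index.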

Runs move #1 (the embedder experiment) on real local models via the eval harness, and records the honest result.

- scripts/bench_embedders.py: reproducible model comparison: index a repo per model into an isolated store, score dense/bm25/hybrid via the harness.
- coderag/embeddings/models.py + `coderag eval --list-models`: curated registry of local code-search embedders with size/accuracy notes. Note: fastembed does not ship CodeRankEmbed (needs custom ONNX export, follow-up); jina-embeddings-v2-base-code is the best out-of-the-box code-specific option.
- coderag/eval/datasets/coderag_self.jsonl: 24 curated natural-language -> file cases for benchmarking CodeRAG on itself.

Measured (this repo, 24 cases): hybrid > dense > BM25 for BOTH models (validates the fusion thesis), but the code-specific model did NOT clearly beat bge-small: the small repo saturates (bge already Hit@10=1.0), so the published CoIR gap does not transfer. Conclusion: keep bge-small default; model swaps need a larger/harder benchmark; rank-1 headroom points at the reranker (move #2). Documented in docs/eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Two-stage retrieve-then-rerank: first-stage hybrid (dense+BM25+RRF) for recall, then a local ONNX cross-encoder re-scores the top candidates jointly with the query for top-of-list precision. Opt-in (config.rerank, default off) so the zero-config engine stays tiny/fast; uses fastembed's TextCrossEncoder (default Xenova/ms-marco-MiniLM-L-12-v2) so it needs no API key and no new dependency.

- coderag/retrieval/rerank.py: Reranker protocol + CrossEncoderReranker + get_reranker() factory (mirrors the embeddings provider pattern).
- HybridSearcher: deeper candidate pool when reranking, re-score, reorder, trim to top_k; reranker injected by the facade from config.
- config: rerank / rerank_model / rerank_candidates (+ CODERAG_RERANK* env).
- status() reports rerank state; eval compare_modes adds a hybrid+rerank row; `coderag eval --rerank` and `bench_embedders.py --rerank`.
- Tests via a deterministic fake reranker (offline).

Measured (this repo, 24 cases): the generic ms-marco reranker gave no lift / a marginal regression. The benchmark is saturated (hybrid already R@5~1.0) and ms-marco is web-trained, not code. Documented in docs/eval.md: the critical path is now a larger/harder benchmark, after which a code-aware reranker should be re-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

…README

- Add chunking.languages.extensions_for() as the single canonical reverse lookup, and use it from the CLI and the git dataset miner instead of two separate hardcoded copies of the language->suffix table.
- Surface `coderag eval` in the README CLI list and link the eval + strategy docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

+
+if TYPE_CHECKING:
+    from coderag.api import CodeRAG
+    from coderag.retrieval.rerank import Reranker


+    """Scores how well each document answers the query (higher = more relevant)."""
+
+    @property
+    def model_id(self) -> str: ...


claude added 5 commits June 17, 2026 06:25

github-advanced-security AI found potential problems Jun 17, 2026


Neverdecel merged commit d8a97ce into master Jun 17, 2026
13 checks passed

Neverdecel mentioned this pull request Jun 17, 2026

Symbol-level eval: non-saturated benchmark + model comparison findings #38

Merged

Neverdecel deleted the claude/coderag-vs-agentic-tools-358h97 branch June 18, 2026 08:01
