Add a code-retrieval eval harness (+ embedder/reranker experiments) #37
Merged
Synthesizes multi-source research on making CodeRAG win a code-retrieval eval harness against agentic-grep loops (Claude Code, Codex) and commercial semantic search (Cursor, Cody, Augment), under a local/zero-key constraint. Key findings and prioritized plan: build a SweRank/Agentless-style eval harness first; swap the default embedder (bge-small ~45.8 CoIR -> CodeRankEmbed ~60.1); add a local ONNX cross-encoder reranker; route/tune hybrid fusion; then structure-aware graph expansion. Includes cited accuracy-vs-cost tradeoffs and the honest grep-vs-embeddings debate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Implements move #0 of the retrieval strategy: a small, offline harness to measure retrieval quality so accuracy claims are provable and regressions are caught.

- coderag/eval/metrics.py: recall@k, hit@k (Acc@k), MRR, nDCG@k with rank de-duplication so multiple chunks per file don't inflate scores.
- coderag/eval/dataset.py: JSONL EvalCase format + a git miner that builds datasets SWE-bench/SweRank-style (commit subject -> changed files that still exist at HEAD), filtering merges/reverts/bots/diffuse commits.
- coderag/eval/harness.py: evaluate() scores any search callable; compare_modes() contrasts dense-only vs BM25-only vs hybrid on one index by swapping RRF fusion weights (the index is mode-independent).
- coderag eval [--build] [--compare] [--level file|symbol] CLI surface, a thin adapter over the engine.
- docs/eval.md usage guide; tests cover metrics, dataset round-trip, the git miner, and end-to-end scoring via the deterministic fake provider.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
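The rank de-duplication described for the metrics can be sketched as follows. This is a minimal illustration, not the contents of coderag/eval/metrics.py: the function names, signatures, and the binary-relevance form of nDCG are assumptions made for the example.

```python
import math
from typing import Iterable


def dedupe_ranked(paths: Iterable[str]) -> list[str]:
    """Keep only the first (best-ranked) occurrence of each file path."""
    seen: set[str] = set()
    unique: list[str] = []
    for path in paths:
        if path not in seen:
            seen.add(path)
            unique.append(path)
    return unique


def recall_at_k(ranked: Iterable[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant files found in the top k distinct files."""
    if not relevant:
        return 0.0
    top = set(dedupe_ranked(ranked)[:k])
    return len(top & relevant) / len(relevant)


def hit_at_k(ranked: Iterable[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant file is in the top k distinct files, else 0.0."""
    return 1.0 if set(dedupe_ranked(ranked)[:k]) & relevant else 0.0


def mrr(ranked: Iterable[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant file, after de-duplication."""
    for rank, path in enumerate(dedupe_ranked(ranked), start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Iterable[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG over the top k distinct files (assumed form)."""
    top = dedupe_ranked(ranked)[:k]
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, path in enumerate(top, start=1)
        if path in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0
```

With chunk-level hits `["a.py", "a.py", "b.py"]` and `b.py` as the only relevant file, the de-duplicated rank of `b.py` is 2 rather than 3, so the repeated `a.py` chunks do not push it further down the list.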
Runs move #1 (the embedder experiment) on real local models via the eval harness, and records the honest result.

- scripts/bench_embedders.py: reproducible model comparison that indexes a repo per model into an isolated store and scores dense/bm25/hybrid via the harness.
- coderag/embeddings/models.py + `coderag eval --list-models`: curated registry of local code-search embedders with size/accuracy notes. Note: fastembed does not ship CodeRankEmbed (needs custom ONNX export, follow-up); jina-embeddings-v2-base-code is the best out-of-the-box code-specific option.
- coderag/eval/datasets/coderag_self.jsonl: 24 curated natural-language -> file cases for benchmarking CodeRAG on itself.

Measured (this repo, 24 cases): hybrid > dense > BM25 for BOTH models (validates the fusion thesis), but the code-specific model did NOT clearly beat bge-small. The small repo saturates (bge already Hit@10=1.0), so the published CoIR gap does not transfer.

Conclusion: keep bge-small default; model swaps need a larger/harder benchmark; rank-1 headroom points at the reranker (move #2). Documented in docs/eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
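For readers unfamiliar with the natural-language -> file case format, a JSONL round-trip might look roughly like the sketch below. The field names `query` and `expected` are hypothetical; the actual EvalCase schema in coderag/eval/dataset.py is not shown on this page and may differ.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalCase:
    """One benchmark case. Field names are illustrative assumptions."""

    query: str                 # natural-language question or commit subject
    expected: tuple[str, ...]  # repo-relative files that should be retrieved


def dumps_cases(cases: Iterable[EvalCase]) -> str:
    """Serialize cases as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps({"query": c.query, "expected": list(c.expected)})
        for c in cases
    )


def loads_cases(text: str) -> list[EvalCase]:
    """Parse JSONL back into cases, skipping blank lines."""
    cases: list[EvalCase] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        cases.append(EvalCase(query=row["query"], expected=tuple(row["expected"])))
    return cases
```

One object per line keeps a curated dataset easy to diff and append to in review, which suits a small hand-maintained file such as the 24-case set.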
Two-stage retrieve-then-rerank: first-stage hybrid (dense+BM25+RRF) for recall, then a local ONNX cross-encoder re-scores the top candidates jointly with the query for top-of-list precision. Opt-in (config.rerank, default off) so the zero-config engine stays tiny/fast; uses fastembed's TextCrossEncoder (default Xenova/ms-marco-MiniLM-L-12-v2) so it needs no API key and no new dependency.

- coderag/retrieval/rerank.py: Reranker protocol + CrossEncoderReranker + get_reranker() factory (mirrors the embeddings provider pattern).
- HybridSearcher: deeper candidate pool when reranking, re-score, reorder, trim to top_k; reranker injected by the facade from config.
- config: rerank / rerank_model / rerank_candidates (+ CODERAG_RERANK* env).
- status() reports rerank state; eval compare_modes adds a hybrid+rerank row; `coderag eval --rerank` and `bench_embedders.py --rerank`.
- Tests via a deterministic fake reranker (offline).

Measured (this repo, 24 cases): the generic ms-marco reranker gave no lift / a marginal regression. The benchmark is saturated (hybrid already R@5~1.0) and ms-marco is web-trained, not code. Documented in docs/eval.md: the critical path is now a larger/harder benchmark, after which a code-aware reranker should be re-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
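The two stages can be illustrated with a small, dependency-free sketch: weighted reciprocal rank fusion over a dense and a BM25 ranking, followed by a re-scoring pass over a deeper candidate pool. The function names, the `k=60` smoothing constant, and the `score_fn` callable are assumptions for illustration; the real HybridSearcher and CrossEncoderReranker are not reproduced here.

```python
from typing import Callable, Sequence


def rrf_fuse(
    dense: Sequence[str],
    bm25: Sequence[str],
    w_dense: float = 1.0,
    w_bm25: float = 1.0,
    k: int = 60,  # common RRF smoothing constant; assumed, not CodeRAG's value
) -> list[str]:
    """Weighted reciprocal rank fusion of two best-first rankings."""
    scores: dict[str, float] = {}
    for weight, ranking in ((w_dense, dense), (w_bm25, bm25)):
        if weight == 0:
            continue  # a zero weight switches that modality off entirely
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


def retrieve_then_rerank(
    query: str,
    dense: Sequence[str],
    bm25: Sequence[str],
    score_fn: Callable[[str, str], float],  # stand-in for a cross-encoder
    top_k: int = 5,
    candidates: int = 20,
) -> list[str]:
    """Fuse for recall, then re-score a deeper pool for top-of-list precision."""
    pool = rrf_fuse(dense, bm25)[:candidates]
    scored = [(score_fn(query, doc), doc) for doc in pool]
    # Stable sort: documents with equal scores keep their fused order.
    scored.sort(key=lambda pair: -pair[0])
    return [doc for _, doc in scored[:top_k]]
```

Setting one weight to zero yields the dense-only or BM25-only ranking from the same inputs, which is the idea behind comparing modes on a single index by swapping fusion weights.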
…README

- Add chunking.languages.extensions_for() as the single canonical reverse lookup, and use it from the CLI and the git dataset miner instead of two separate hardcoded copies of the language->suffix table.
- Surface `coderag eval` in the README CLI list and link the eval + strategy docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
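A reverse lookup of this kind could be as small as the sketch below. The suffix table and the signature of `extensions_for()` are hypothetical stand-ins; the actual table and function in chunking.languages are not shown on this page.

```python
# Hypothetical suffix -> language table; the real one lives in CodeRAG.
SUFFIX_TO_LANGUAGE: dict[str, str] = {
    ".py": "python",
    ".pyi": "python",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".go": "go",
}


def extensions_for(language: str) -> tuple[str, ...]:
    """Return every suffix mapped to `language`, sorted for stable output."""
    wanted = language.lower()
    return tuple(
        sorted(
            suffix
            for suffix, lang in SUFFIX_TO_LANGUAGE.items()
            if lang == wanted
        )
    )
```

Deriving the language -> suffix direction from one table means the CLI and the dataset miner cannot drift apart when a new extension is added.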
    if TYPE_CHECKING:
        from coderag.api import CodeRAG
        from coderag.retrieval.rerank import Reranker

    """Scores how well each document answers the query (higher = more relevant)."""

    @property
    def model_id(self) -> str: ...
Why
To position CodeRAG as a retrieval layer that agentic tools (Claude Code, Codex) plug into — "better code search," Firecrawl-style — we first need to be able to prove retrieval quality. You can't claim "more accurate / faster / more efficient" without a measurement. This PR builds that measurement, then uses it to run the first two accuracy experiments — honestly reporting what worked and what didn't.
The headline deliverable is the eval harness. The embedder and reranker experiments are included as working, opt-in infrastructure with documented (null) results.
What's in here
1. Eval harness (`coderag/eval/`): the centerpiece

A small, offline harness following the SWE-bench / Agentless / SweRank localization protocol:

- `metrics.py`: recall@k, hit@k (Acc@k), MRR, nDCG@k, with rank de-duplication so multiple chunks per file don't inflate scores.
- `dataset.py`: JSONL `EvalCase` format + a git-history miner (commit subject → files it changed that still exist at HEAD), filtering merges/reverts/bots/diffuse commits.
- `harness.py`: `evaluate()` scores any retriever; `compare_modes()` contrasts dense-only vs BM25-only vs hybrid (and optional `hybrid+rerank`) on one index by swapping RRF weights.
- `coderag eval [--build] [--compare] [--rerank] [--level file|symbol] [--list-models]`, a thin adapter over the engine.

2. Optional cross-encoder reranker (`coderag/retrieval/rerank.py`)

Two-stage retrieve-then-rerank, off by default, zero new dependencies (fastembed's `TextCrossEncoder`, default `Xenova/ms-marco-MiniLM-L-12-v2`). Enable with `CODERAG_RERANK=1`. `Reranker` protocol + factory mirror the embeddings provider pattern.

3. Tooling & assets

- `coderag/embeddings/models.py` + `coderag eval --list-models`: curated registry of local code-search embedders.
- `scripts/bench_embedders.py`: reproducible per-model comparison.
- `coderag/eval/datasets/coderag_self.jsonl`: 24 curated NL→file cases for benchmarking CodeRAG on itself.

4. Docs

- `docs/research/code-retrieval-strategy.md`: cited research synthesis and the prioritized plan.
- `docs/eval.md`: harness usage + the measured results below.

5. Cleanup

- `chunking.languages.extensions_for()` as the single canonical language→extension reverse lookup, replacing two hardcoded copies (CLI + dataset miner).
- README surfaces `coderag eval`.

Measured results (this repo, 24 cases)
What worked (validated): hybrid fusion (dense+BM25+RRF) beats either modality alone — the core differentiator vs pure-grep agents and single-modality embedding tools.
What didn't (honest null results): the code-specific embedder and the reranker showed no clear lift here. The root cause is the benchmark, not the techniques: this repo is small and saturated (hybrid already hits R@5≈1.0 / Hit@10=1.0), so there is no headroom in which to measure improvements, and the generic MS-MARCO reranker was trained on web text, not code. This is the harness doing its job: stopping cargo-cult "upgrades."
Follow-ups (not in this PR)
Testing
`pytest -m "not integration"` suite green; `ruff` lint+format clean; `mypy` clean on all new code.

🤖 Generated with Claude Code