
Add a code-retrieval eval harness (+ embedder/reranker experiments) #37

Merged: Neverdecel merged 5 commits into master from claude/coderag-vs-agentic-tools-358h97 on Jun 17, 2026


@Neverdecel

Owner

Why

To position CodeRAG as a retrieval layer that agentic tools (Claude Code, Codex) plug into ("better code search," Firecrawl-style), we first need to prove retrieval quality. A claim of "more accurate / faster / more efficient" needs a measurement behind it. This PR builds that measurement, then uses it to run the first two accuracy experiments, reporting honestly what worked and what didn't.

The headline deliverable is the eval harness. The embedder and reranker experiments are included as working, opt-in infrastructure with documented (null) results.

What's in here

1. Eval harness (coderag/eval/) — the centerpiece

A small, offline harness following the SWE-bench / Agentless / SweRank localization protocol:

  • Metrics (metrics.py): recall@k, hit@k (Acc@k), MRR, nDCG@k, with rank de-duplication so multiple chunks per file don't inflate scores.
  • Dataset (dataset.py): JSONL EvalCase format + a git-history miner (commit subject → files it changed that still exist at HEAD), filtering merges/reverts/bots/diffuse commits.
  • Harness (harness.py): evaluate() scores any retriever; compare_modes() contrasts dense-only vs BM25-only vs hybrid (and optional hybrid+rerank) on one index by swapping RRF weights.
  • CLI: coderag eval [--build] [--compare] [--rerank] [--level file|symbol] [--list-models], a thin adapter over the engine.
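The metrics bullet above can be illustrated with a minimal sketch. This is not the code in metrics.py; the function names and signatures here are assumptions chosen for illustration. It shows file-level recall@k, hit@k, reciprocal rank (MRR is its mean over cases), and binary-relevance nDCG@k, each computed after collapsing repeated chunks of the same file to their best rank.

```python
import math
from typing import Iterable, Sequence


def dedupe_ranked(paths: Iterable[str]) -> list[str]:
    """Keep only the first (best-ranked) occurrence of each file path."""
    seen: set[str] = set()
    unique: list[str] = []
    for path in paths:
        if path not in seen:
            seen.add(path)
            unique.append(path)
    return unique


def recall_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant files found in the de-duplicated top k."""
    if not relevant:
        return 0.0
    top = dedupe_ranked(ranked)[:k]
    return len(relevant.intersection(top)) / len(relevant)


def hit_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant file appears in the de-duplicated top k."""
    return 1.0 if relevant.intersection(dedupe_ranked(ranked)[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant file; 0.0 if none is retrieved."""
    for rank, path in enumerate(dedupe_ranked(ranked), start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG over the de-duplicated top k."""
    top = dedupe_ranked(ranked)[:k]
    dcg = sum(1.0 / math.log2(i + 2) for i, p in enumerate(top) if p in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

For a ranking of ["a.py", "a.py", "b.py", "c.py"] with b.py and c.py relevant, the first relevant file sits at rank 3 in the raw list but at rank 2 once the duplicate a.py chunk is collapsed, so the reciprocal rank is 0.5 rather than 1/3.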

2. Optional cross-encoder reranker (coderag/retrieval/rerank.py)

Two-stage retrieve-then-rerank, off by default, zero new dependencies (fastembed's TextCrossEncoder, default Xenova/ms-marco-MiniLM-L-12-v2). Enable with CODERAG_RERANK=1. Reranker protocol + factory mirror the embeddings provider pattern.
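A rough sketch of the second stage, assuming a protocol with a model_id property and a score() method. The score() method name, the rerank() helper, and the OverlapReranker stand-in are hypothetical and are not taken from rerank.py; the stand-in plays the role of the deterministic fake used in offline tests, where a real cross-encoder would otherwise be plugged in.

```python
from typing import Protocol, Sequence


class Reranker(Protocol):
    """Hypothetical protocol shape; the real one in the PR may differ."""

    @property
    def model_id(self) -> str: ...

    def score(self, query: str, documents: Sequence[str]) -> list[float]: ...


class OverlapReranker:
    """Deterministic stand-in: scores a document by shared lowercase tokens."""

    @property
    def model_id(self) -> str:
        return "fake/token-overlap"

    def score(self, query: str, documents: Sequence[str]) -> list[float]:
        query_tokens = set(query.lower().split())
        return [float(len(query_tokens & set(d.lower().split()))) for d in documents]


def rerank(
    query: str, candidates: Sequence[str], reranker: Reranker, top_k: int
) -> list[str]:
    """Re-score first-stage candidates, reorder by score, trim to top_k.

    sorted() is stable, so candidates with equal scores keep their
    first-stage order.
    """
    scores = reranker.score(query, candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```

Passing a deeper candidate list than top_k is what gives the reranker room to promote a result the first stage ranked low.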

3. Tooling & assets

  • coderag/embeddings/models.py + coderag eval --list-models: curated registry of local code-search embedders.
  • scripts/bench_embedders.py: reproducible per-model comparison.
  • coderag/eval/datasets/coderag_self.jsonl: 24 curated NL→file cases for benchmarking CodeRAG on itself.
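For readers unfamiliar with JSONL datasets, here is a small sketch of what an NL→file case and its round-trip could look like. The field names query and relevant_files are assumptions made for this example; the actual EvalCase schema in dataset.py is not shown in this PR description and may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical case shape: a natural-language query plus the files it should retrieve."""

    query: str
    relevant_files: list[str] = field(default_factory=list)


def dumps_cases(cases: list[EvalCase]) -> str:
    """Serialize cases as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(asdict(case), sort_keys=True) for case in cases)


def loads_cases(text: str) -> list[EvalCase]:
    """Parse JSONL text back into cases, skipping blank lines."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

One object per line keeps the dataset easy to diff and append to, which suits a hand-curated set of a few dozen cases.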

4. Docs

  • docs/research/code-retrieval-strategy.md: cited research synthesis and the prioritized plan.
  • docs/eval.md: harness usage + the measured results below.

5. Cleanup

  • New chunking.languages.extensions_for() as the single canonical language→extension reverse lookup, replacing two hardcoded copies (CLI + dataset miner).
  • README documents coderag eval.
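The reverse lookup in the cleanup item can be pictured as follows. This is a sketch only: the table contents, the LANGUAGE_BY_EXTENSION name, and the exact signature and return type of extensions_for() are assumptions, not the code in chunking.languages.

```python
# Hypothetical extension -> language table; the real one lives in the chunking module.
LANGUAGE_BY_EXTENSION: dict[str, str] = {
    ".py": "python",
    ".pyi": "python",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".go": "go",
}


def extensions_for(language: str) -> tuple[str, ...]:
    """Return the sorted file extensions mapped to a language (empty if unknown)."""
    wanted = language.lower()
    return tuple(
        sorted(ext for ext, lang in LANGUAGE_BY_EXTENSION.items() if lang == wanted)
    )
```

Deriving the reverse mapping from the single forward table means the CLI and the dataset miner cannot drift apart the way two hardcoded copies can.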

Measured results (this repo, 24 cases)

mode                               MRR    R@1    R@5    nDCG@10
bge-small · dense                  0.805  0.646  0.938  0.845
bge-small · bm25                   0.747  0.604  0.812  0.798
bge-small · hybrid                 0.801  0.646  1.000  0.845
bge-small · hybrid+rerank          0.790  0.646  0.958  0.836
jina-code · hybrid                 0.835  0.729  0.938  0.858

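What worked (validated): hybrid fusion (dense+BM25+RRF) is the only bge-small mode that reaches R@5 = 1.000 (dense 0.938, BM25 0.812), while matching dense on R@1 and nDCG@10 and staying within 0.004 MRR of it. Fusion is the core differentiator vs pure-grep agents and single-modality embedding tools.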
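As an illustration of how one index can serve dense-only, BM25-only, and hybrid modes by swapping fusion weights, here is a minimal weighted reciprocal rank fusion sketch. The function name, the weight dictionary, and the constant k = 60 (the value commonly used for RRF) are assumptions; the PR description does not state CodeRAG's actual constant or API.

```python
from collections import defaultdict


def weighted_rrf(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,  # assumed smoothing constant, not taken from the PR
) -> list[str]:
    """Fuse ranked lists: each list adds weight / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for name, ranked in rankings.items():
        weight = weights.get(name, 0.0)
        if weight == 0.0:
            continue  # a zero weight switches that retriever off entirely
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += weight / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


rankings = {"dense": ["a", "b", "c"], "bm25": ["c", "b", "d"]}
hybrid = weighted_rrf(rankings, {"dense": 1.0, "bm25": 1.0})
dense_only = weighted_rrf(rankings, {"dense": 1.0, "bm25": 0.0})
```

Here document c is promoted to the top of the hybrid list because it scores in both rankings, while setting one weight to zero reproduces that single retriever's own ordering.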

What didn't (honest null results): neither the code-specific embedder nor the reranker showed a clear lift here. jina-code raised R@1 (0.729 vs 0.646) but lowered R@5 (0.938 vs 1.000), and the reranker was marginally worse than plain hybrid (MRR 0.790 vs 0.801). The likely root cause is the benchmark, not the techniques: this repo is small and saturated (hybrid already hits R@5≈1.0 / Hit@10=1.0), so there is little headroom to measure improvements, and the generic MS-MARCO reranker was trained on web search data, not code. This is the harness doing its job: stopping cargo-cult "upgrades."

Follow-ups (not in this PR)

  • A larger, harder, non-saturated benchmark (1k+-file external repo; symbol-level + cross-file conceptual queries). This is the true critical path before re-validating the embedder swap and a code-aware reranker.
  • CodeRankEmbed via a custom ONNX export (fastembed doesn't ship it).
  • An MCP surface so agents adopt CodeRAG as a retrieval tool.

Testing

  • New offline tests: harness/metrics/dataset/git-miner, model registry, and reranker (via a deterministic fake reranker). Full pytest -m "not integration" suite green; ruff lint+format clean; mypy clean on all new code.

🤖 Generated with Claude Code



claude added 5 commits June 17, 2026 06:25
Synthesizes multi-source research on making CodeRAG win a code-retrieval
eval harness against agentic-grep loops (Claude Code, Codex) and commercial
semantic search (Cursor, Cody, Augment), under a local/zero-key constraint.

Key findings and prioritized plan: build a SweRank/Agentless-style eval
harness first; swap the default embedder (bge-small ~45.8 CoIR ->
CodeRankEmbed ~60.1); add a local ONNX cross-encoder reranker; route/tune
hybrid fusion; then structure-aware graph expansion. Includes cited
accuracy-vs-cost tradeoffs and the honest grep-vs-embeddings debate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Implements move #0 of the retrieval strategy: a small, offline harness to
measure retrieval quality so accuracy claims are provable and regressions
are caught.

- coderag/eval/metrics.py: recall@k, hit@k (Acc@k), MRR, nDCG@k with
  rank de-duplication so multiple chunks per file don't inflate scores.
- coderag/eval/dataset.py: JSONL EvalCase format + a git miner that builds
  datasets SWE-bench/SweRank-style (commit subject -> changed files that
  still exist at HEAD), filtering merges/reverts/bots/diffuse commits.
- coderag/eval/harness.py: evaluate() scores any search callable;
  compare_modes() contrasts dense-only vs BM25-only vs hybrid on one index
  by swapping RRF fusion weights (the index is mode-independent).
- coderag eval [--build] [--compare] [--level file|symbol] CLI surface,
  a thin adapter over the engine.
- docs/eval.md usage guide; tests cover metrics, dataset round-trip, the
  git miner, and end-to-end scoring via the deterministic fake provider.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Runs move #1 (the embedder experiment) on real local models via the eval
harness, and records the honest result.

- scripts/bench_embedders.py: reproducible model comparison — index a repo
  per model into an isolated store, score dense/bm25/hybrid via the harness.
- coderag/embeddings/models.py + `coderag eval --list-models`: curated
  registry of local code-search embedders with size/accuracy notes. Note:
  fastembed does not ship CodeRankEmbed (needs custom ONNX export, follow-up);
  jina-embeddings-v2-base-code is the best out-of-the-box code-specific option.
- coderag/eval/datasets/coderag_self.jsonl: 24 curated natural-language ->
  file cases for benchmarking CodeRAG on itself.

Measured (this repo, 24 cases): hybrid > dense > BM25 for BOTH models
(validates the fusion thesis), but the code-specific model did NOT clearly
beat bge-small — the small repo saturates (bge already Hit@10=1.0), so the
published CoIR gap does not transfer. Conclusion: keep bge-small default;
model swaps need a larger/harder benchmark; rank-1 headroom points at the
reranker (move #2). Documented in docs/eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Two-stage retrieve-then-rerank: first-stage hybrid (dense+BM25+RRF) for
recall, then a local ONNX cross-encoder re-scores the top candidates jointly
with the query for top-of-list precision. Opt-in (config.rerank, default off)
so the zero-config engine stays tiny/fast; uses fastembed's TextCrossEncoder
(default Xenova/ms-marco-MiniLM-L-12-v2) so it needs no API key and no new
dependency.

- coderag/retrieval/rerank.py: Reranker protocol + CrossEncoderReranker +
  get_reranker() factory (mirrors the embeddings provider pattern).
- HybridSearcher: deeper candidate pool when reranking, re-score, reorder,
  trim to top_k; reranker injected by the facade from config.
- config: rerank / rerank_model / rerank_candidates (+ CODERAG_RERANK* env).
- status() reports rerank state; eval compare_modes adds a hybrid+rerank row;
  `coderag eval --rerank` and `bench_embedders.py --rerank`.
- Tests via a deterministic fake reranker (offline).

Measured (this repo, 24 cases): the generic ms-marco reranker gave no lift /
a marginal regression. The benchmark is saturated (hybrid already R@5~1.0)
and ms-marco is web-trained, not code. Documented in docs/eval.md: the
critical path is now a larger/harder benchmark, after which a code-aware
reranker should be re-tested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
…README

- Add chunking.languages.extensions_for() as the single canonical reverse
  lookup, and use it from the CLI and the git dataset miner instead of two
  separate hardcoded copies of the language->suffix table.
- Surface `coderag eval` in the README CLI list and link the eval + strategy
  docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Comment thread coderag/eval/harness.py

if TYPE_CHECKING:
    from coderag.api import CodeRAG
    from coderag.retrieval.rerank import Reranker
"""Scores how well each document answers the query (higher = more relevant)."""

@property
def model_id(self) -> str: ...
@Neverdecel Neverdecel merged commit d8a97ce into master Jun 17, 2026
13 checks passed
@Neverdecel Neverdecel deleted the claude/coderag-vs-agentic-tools-358h97 branch June 18, 2026 08:01