Symbol-level eval: non-saturated benchmark + model comparison findings by Neverdecel · Pull Request #38 · Neverdecel/CodeRAG

Neverdecel · 2026-06-17T13:12:12Z

Why

PR #37 shipped the eval harness but flagged a blocker: file-level eval on this small repo saturates (hybrid hits Hit@10 = 1.0), so it couldn't measure any retrieval improvement — both the embedder swap and the reranker came back flat. This PR makes the benchmark discriminating and then uses it to get real answers.

What's in here

1. Symbol-level dataset mining (the enabler)

Finding the right function/class (not just file) is far harder, so the benchmark stops saturating (Hit@10 ≈ 0.5 instead of 1.0).

build_from_git(symbols=True) / coderag eval --build --level symbol: maps each commit's changed lines (zero-context diff hunks) to the symbols they touch — parsed from the file content at that commit via CodeRAG's own chunker, then intersected with HEAD symbols so every ground-truth symbol is retrievable. Off by default.
coderag/eval/datasets/coderag_self_symbols.jsonl: 22 hand-verified natural-language → function/method cases (less noisy than the git-mined set).
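
The hunk-to-symbol mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Symbol`, `changed_lines`, and `touched_symbols` are hypothetical names, and the symbol spans are assumed to come from whatever the chunker returns for the file at that commit. It assumes a zero-context diff (e.g. `git diff -U0`), so each hunk header covers exactly the changed lines.

```python
import re
from dataclasses import dataclass

# Matches a unified-diff hunk header such as "@@ -12,0 +13,2 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass(frozen=True)
class Symbol:
    """Hypothetical stand-in for one chunker result (function, method, class)."""
    name: str
    start: int  # first line, 1-based, inclusive
    end: int    # last line, 1-based, inclusive


def changed_lines(diff_text: str) -> set[int]:
    """Collect the new-side line numbers covered by each hunk of a -U0 diff."""
    lines: set[int] = set()
    for row in diff_text.splitlines():
        match = HUNK_RE.match(row)
        if match is None:
            continue
        start = int(match.group(3))
        # An omitted count means the hunk covers exactly one line.
        count = int(match.group(4)) if match.group(4) is not None else 1
        # A pure deletion has count 0; attribute it to the line it anchors on.
        lines.update(range(start, start + max(count, 1)))
    return lines


def touched_symbols(
    diff_text: str, commit_symbols: list[Symbol], head_names: set[str]
) -> list[str]:
    """Names of symbols that overlap a changed line and still exist at HEAD."""
    hit = changed_lines(diff_text)
    return sorted(
        sym.name
        for sym in commit_symbols
        if sym.name in head_names
        and any(sym.start <= line <= sym.end for line in hit)
    )
```

Intersecting with `head_names` mirrors the "retrievable at HEAD" constraint: a symbol that was edited in a commit but later renamed or deleted cannot be found in the current index, so it is dropped from the ground truth.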

2. Tooling for model comparison

RECOMMENDED_RERANKERS registry + coderag eval --list-models now lists local cross-encoder rerankers (MiniLM, bge-reranker-base, jina-reranker-v2).
scripts/bench_embedders.py --rerank-models: one hybrid+rerank row per named reranker, on a fixed index.

3. Findings (measured, honest)

The reranker is validated once there's headroom. On 10 git-mined symbol cases it lifted R@1 0.183 → 0.283 (+55%); on the 22 curated cases, R@1 0.364 → 0.409 (+12%) — the earlier file-level "no lift" was a saturation artifact, not a property of the technique.

Curated 22-case symbol-level comparison:

mode                                   MRR    R@1    R@5    nDCG@10  Hit@10
bge-small-en-v1.5 · dense              0.675  0.591  0.818  0.720    0.864
bge-small-en-v1.5 · bm25               0.427  0.318  0.636  0.498    0.727
bge-small-en-v1.5 · hybrid             0.573  0.364  0.864  0.647    0.864
bge-small-en-v1.5 · hybrid+rerank      0.580  0.409  0.864  0.651    0.864
jina-embeddings-v2-base-code · dense   0.483  0.318  0.682  0.554    0.773
jina-embeddings-v2-base-code · hybrid  0.604  0.455  0.818  0.668    0.864
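
For readers unfamiliar with the column headers, the following sketch gives the textbook per-query definitions of these metrics with binary relevance. It is an assumption that the harness uses these exact variants; the PR does not spell them out. Each reported number would then be the mean over all cases. The fractional R@1 values are consistent with some cases having more than one ground-truth symbol.

```python
import math


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0
```

Under these definitions Hit@10 saturates first (one relevant result anywhere in the top 10 is enough), while R@1 and MRR stay sensitive to the exact position of the right symbol, which is why they are the columns that separate the modes in the table.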

A bigger/code-specific embedder is not the win. bge-small · dense (MRR 0.675) clearly beats jina-code · dense (0.483) on NL→symbol queries — a good general text embedder is stronger for "where is X" retrieval; jina-v2-base-code is older/code↔code-tuned.
Equal-weight hybrid is not universally better, and fusion weighting is the single biggest lever found. For the strong bge-small, dense alone (MRR 0.675) beats 1:1 hybrid (0.573): weak BM25 on NL queries drags the dense ranking down via RRF. For the weaker jina-code, BM25 helps (dense 0.483 vs hybrid 0.604). Conclusion: fusion weighting should be query-type-aware (dense-up for NL, BM25-up for identifiers), not fixed 1:1.
Reranking reliably sharpens top-1 (+12–55% R@1 across both sets) with the tiny ms-marco model.
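
To make the fusion-weighting finding concrete, here is a minimal sketch of weighted reciprocal rank fusion with a query-type switch. It is an illustration only, not the implementation planned for the follow-up: `weighted_rrf`, `fusion_weights`, the identifier regex, and the weight values are all hypothetical and unmeasured. `k=60` is the constant commonly used for RRF.

```python
import re

# Hypothetical heuristic: snake_case, camelCase, or a call like "foo()".
IDENTIFIER_RE = re.compile(r"\w_\w|[a-z][A-Z]|\(\)")


def weighted_rrf(
    rankings: dict[str, list[str]], weights: dict[str, float], k: int = 60
) -> list[str]:
    """Fuse ranked lists: each retriever adds weight / (k + rank) per document."""
    scores: dict[str, float] = {}
    for retriever, ranked in rankings.items():
        weight = weights.get(retriever, 1.0)
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    # Highest fused score first; ties are broken by name for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


def fusion_weights(query: str) -> dict[str, float]:
    """Pick per-retriever weights from the query type (illustrative values)."""
    if IDENTIFIER_RE.search(query):
        # Identifier-like query: exact token matches matter, so favor BM25.
        return {"dense": 1.0, "bm25": 1.5}
    # Natural-language query: favor the dense retriever.
    return {"dense": 1.5, "bm25": 0.5}


rankings = {"dense": ["a", "b", "c"], "bm25": ["c", "b", "a"]}
fused = weighted_rrf(rankings, fusion_weights("where is the index rebuilt"))
```

With equal weights the two retrievers cancel each other out when they disagree; weighting the dense side up lets its top result survive fusion for natural-language queries, which is the behavior the dense-vs-hybrid rows above argue for.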

Net: the largest lever found is query-type-aware fusion weighting, then reranking for top-1 — not a bigger embedding model. Larger code-aware rerankers are registered but are ~1 GB and slow to rerank on CPU; test them on GPU / a smaller pool, ideally on a larger external repo.

Testing

New offline tests for symbol mining (only the changed function is reported; default-off) and the reranker registry. Full pytest -m "not integration" green; ruff + mypy clean on new code.

Follow-ups

Implement query-type-aware fusion weighting (finding #2) and measure it.
Validate on a larger external repo (the real generalization test).
Code-aware reranker on GPU; CodeRankEmbed ONNX export; MCP surface.

🤖 Generated with Claude Code


File-level eval on a small repo saturates (hybrid hits Hit@10=1.0), so it could not measure retrieval improvements. Symbol-level localization (find the right function/class, not just the file) has real headroom and discriminates.

- build_from_git(symbols=True) / `coderag eval --build --level symbol`: maps each commit's changed lines (zero-context diff hunks) to the symbols they touch, parsed from the file content *at that commit* via CodeRAG's own chunker, then intersected with the symbols present at HEAD so every ground-truth symbol is retrievable from the index. Off by default.
- Tests cover symbol extraction (only the changed function is reported) and the default-off behavior.

Result (10 symbol-level cases, this repo): the benchmark stops saturating (Hit@10 ~0.5), and the previously-flat cross-encoder reranker now shows the predicted lift: R@1 0.183->0.283 (+55%), MRR 0.420->0.514, nDCG@10 0.369->0.448. This validates move #2 and confirms the file-level null result was a benchmark artifact.

Documented in docs/eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Tooling for the symbol-level model comparisons:

- RECOMMENDED_RERANKERS registry + `coderag eval --list-models` now lists local cross-encoder rerankers (MiniLM, bge-reranker-base, jina-reranker-v2) with size/notes, so code-aware rerankers are discoverable.
- scripts/bench_embedders.py --rerank-models: score one hybrid+rerank row per named reranker, to compare reranker models on a fixed index.
- coderag/eval/datasets/coderag_self_symbols.jsonl: 22 curated natural-language -> function/method cases (verified symbol names) for a trustworthy symbol-level eval, less noisy than the git-mined set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Measured on the curated 22-case symbol dataset (bge-small vs jina-code, with the ms-marco reranker):

1. The code-specific jina-code-v2 does NOT beat bge-small on NL->symbol queries (dense MRR 0.483 vs 0.675); a good general text embedder wins for natural-language "where is X" retrieval.
2. Equal-weight hybrid is not universally better: for the strong bge-small retriever, dense alone (0.675) beats 1:1 hybrid (0.573) because weak BM25 drags it down via RRF. Fusion weighting should be query-type-aware (dense-up for NL, BM25-up for identifiers); this is the biggest lever found.
3. Reranking lifts top-1 precision (R@1 0.364->0.409, +12%), consistent with the git-mined result.

Documents these in docs/eval.md and elevates query-type fusion weighting in the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

claude added 3 commits June 17, 2026 08:37

Neverdecel mentioned this pull request Jun 17, 2026

Query-type-aware adaptive fusion weighting (validated symbol-level lever) #39

Merged

Neverdecel merged commit bc65f63 into master Jun 17, 2026
12 checks passed

Neverdecel mentioned this pull request Jun 17, 2026

External-repo validation: single-repo retrieval findings do not generalize #40

Merged

Neverdecel deleted the claude/eval-harder-benchmark branch June 18, 2026 08:01

Neverdecel merged 3 commits into master from claude/eval-harder-benchmark
