Symbol-level eval: non-saturated benchmark + model comparison findings#38

Merged
Neverdecel merged 3 commits into master from claude/eval-harder-benchmark
Jun 17, 2026

Conversation

@Neverdecel

Owner

Why

PR #37 shipped the eval harness but flagged a blocker: file-level eval on this small repo saturates (hybrid hits Hit@10 = 1.0), so it couldn't measure any retrieval improvement — both the embedder swap and the reranker came back flat. This PR makes the benchmark discriminating and then uses it to get real answers.

What's in here

1. Symbol-level dataset mining (the enabler)

Finding the right function/class (not just file) is far harder, so the benchmark stops saturating (Hit@10 ≈ 0.5 instead of 1.0).

  • build_from_git(symbols=True) / coderag eval --build --level symbol: maps each commit's changed lines (zero-context diff hunks) to the symbols they touch — parsed from the file content at that commit via CodeRAG's own chunker, then intersected with HEAD symbols so every ground-truth symbol is retrievable. Off by default.
  • coderag/eval/datasets/coderag_self_symbols.jsonl: 22 hand-verified natural-language → function/method cases (less noisy than the git-mined set).
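The hunk-to-symbol mapping described above can be sketched roughly as follows. This is an illustration, not CodeRAG's implementation: `Symbol`, `changed_ranges`, and `touched_symbols` are hypothetical names, the symbol ranges are assumed to come from whatever chunker parses the file at that commit, and the diff is assumed to be produced with zero context lines (for example `git diff -U0`).

```python
import re
from typing import NamedTuple

# Matches a unified-diff hunk header such as "@@ -12,0 +13,2 @@".
# Only the new-file side (+start,count) is captured.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@", re.MULTILINE)


class Symbol(NamedTuple):
    """Hypothetical symbol record: a name plus its 1-indexed inclusive line span."""
    name: str
    start: int
    end: int


def changed_ranges(diff_text: str) -> list[tuple[int, int]]:
    """Return the (start, end) new-file line range touched by each hunk."""
    ranges = []
    for match in HUNK_RE.finditer(diff_text):
        start = int(match.group(1))
        # An omitted count means the hunk covers exactly one line.
        count = int(match.group(2)) if match.group(2) is not None else 1
        # A pure deletion has count 0; approximate it as touching line `start`.
        end = start + max(count, 1) - 1
        ranges.append((start, end))
    return ranges


def touched_symbols(
    diff_text: str,
    commit_symbols: list[Symbol],
    head_symbol_names: set[str],
) -> list[str]:
    """Symbols overlapping any changed range, restricted to names still at HEAD."""
    ranges = changed_ranges(diff_text)
    touched = {
        sym.name
        for sym in commit_symbols
        if any(sym.start <= end and start <= sym.end for start, end in ranges)
    }
    # Intersect with HEAD so every ground-truth symbol is retrievable.
    return sorted(touched & head_symbol_names)
```

Because the diff carries no context lines, each hunk range covers only lines that actually changed, which is what keeps unrelated neighbouring functions out of the ground truth.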

2. Tooling for model comparison

  • RECOMMENDED_RERANKERS registry + coderag eval --list-models now lists local cross-encoder rerankers (MiniLM, bge-reranker-base, jina-reranker-v2).
  • scripts/bench_embedders.py --rerank-models: one hybrid+rerank row per named reranker, on a fixed index.

3. Findings (measured, honest)

The reranker is validated once there's headroom. On 10 git-mined symbol cases it lifted R@1 0.183 → 0.283 (+55%); on the 22 curated cases, R@1 0.364 → 0.409 (+12%) — the earlier file-level "no lift" was a saturation artifact, not a property of the technique.
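For readers checking the numbers, here is a minimal sketch of the per-query metrics and the relative-lift arithmetic. It assumes the standard definitions (R@k as recall over the ground-truth set, Hit@k as "at least one relevant item in the top k") and that a case may have several ground-truth symbols; the harness may compute these differently, and the function names are illustrative.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0


def relative_lift(before: float, after: float) -> float:
    """Relative change, e.g. 0.183 -> 0.283 is about +0.55 (+55%)."""
    return (after - before) / before


# The lifts quoted above, recomputed from the reported R@1 values.
git_mined_lift = relative_lift(0.183, 0.283)  # about 0.546
curated_lift = relative_lift(0.364, 0.409)    # about 0.124
```

MRR and the other table columns are then the mean of the per-query values over all cases in the dataset.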

Curated 22-case symbol-level comparison:

mode                                   MRR    R@1    R@5    nDCG@10  Hit@10
bge-small-en-v1.5 · dense              0.675  0.591  0.818  0.720    0.864
bge-small-en-v1.5 · bm25               0.427  0.318  0.636  0.498    0.727
bge-small-en-v1.5 · hybrid             0.573  0.364  0.864  0.647    0.864
bge-small-en-v1.5 · hybrid+rerank      0.580  0.409  0.864  0.651    0.864
jina-embeddings-v2-base-code · dense   0.483  0.318  0.682  0.554    0.773
jina-embeddings-v2-base-code · hybrid  0.604  0.455  0.818  0.668    0.864
  1. A bigger/code-specific embedder is not the win. bge-small · dense (MRR 0.675) clearly beats jina-code · dense (0.483) on NL→symbol queries — a good general text embedder is stronger for "where is X" retrieval; jina-v2-base-code is older/code↔code-tuned.
  2. Equal-weight hybrid is not universally better, and this is the single biggest lever found. For the strong bge-small, dense alone (MRR 0.675) beats 1:1 hybrid (0.573): weak BM25 on NL queries drags the dense ranking down via RRF. For the weaker jina-code, BM25 helps (hybrid 0.604 vs dense 0.483). → Fusion weighting should be query-type-aware (dense-up for NL, BM25-up for identifiers), not fixed 1:1.
  3. Reranking sharpens top-1 on both sets (R@1 +55% on the 10 git-mined cases, +12% on the 22 curated cases) with the tiny ms-marco model.
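Finding 2 suggests weighting each retriever's contribution to reciprocal rank fusion by query type. The sketch below shows one way that could look. It is not what this PR implements: `weighted_rrf`, `fusion_weights`, the identifier heuristic, and the 2:1 weights are all placeholder assumptions that would need to be measured on the benchmark. Only the RRF formula itself (score = sum of weight / (k + rank), with k = 60 as the conventional constant) is standard.

```python
import re
from collections import defaultdict

# Crude hint that a token is a code identifier: snake_case, camelCase, or "()".
IDENTIFIER_HINT = re.compile(r"_|[a-z][A-Z]|\(\)")


def weighted_rrf(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[str]:
    """Fuse ranked lists with weighted reciprocal rank fusion.

    Each document scores sum(weight / (k + rank)) over the lists it appears in.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for retriever, ranked in rankings.items():
        weight = weights.get(retriever, 1.0)
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += weight / (k + rank)
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


def fusion_weights(query: str) -> dict[str, float]:
    """Pick placeholder weights: BM25-up for short identifier-like queries,
    dense-up for everything else (treated as natural language)."""
    tokens = query.split()
    identifier_like = (
        0 < len(tokens) <= 3
        and any(IDENTIFIER_HINT.search(tok) for tok in tokens)
    )
    if identifier_like:
        return {"dense": 1.0, "bm25": 2.0}
    return {"dense": 2.0, "bm25": 1.0}
```

With equal weights this reduces to the fixed 1:1 fusion measured in the table, so the same function can serve as both the baseline and the experiment.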

Net: the largest lever found is query-type-aware fusion weighting, then reranking for top-1 — not a bigger embedding model. Larger code-aware rerankers are registered but are ~1 GB and slow to rerank on CPU; test them on GPU / a smaller pool, ideally on a larger external repo.

Testing

New offline tests for symbol mining (only the changed function is reported; default-off) and the reranker registry. Full pytest -m "not integration" green; ruff + mypy clean on new code.

Follow-ups

  • Implement query-type-aware fusion weighting (finding #2) and measure it.
  • Validate on a larger external repo (the real generalization test).
  • Code-aware reranker on GPU; CodeRankEmbed ONNX export; MCP surface.

🤖 Generated with Claude Code



claude added 3 commits June 17, 2026 08:37
File-level eval on a small repo saturates (hybrid hits Hit@10=1.0), so it
could not measure retrieval improvements. Symbol-level localization — find
the right function/class, not just file — has real headroom and discriminates.

- build_from_git(symbols=True) / `coderag eval --build --level symbol`:
  maps each commit's changed lines (zero-context diff hunks) to the symbols
  they touch, parsed from the file content *at that commit* via CodeRAG's own
  chunker, then intersected with the symbols present at HEAD so every
  ground-truth symbol is retrievable from the index. Off by default.
- Tests cover symbol extraction (only the changed function is reported) and
  the default-off behavior.

Result (10 symbol-level cases, this repo): the benchmark stops saturating
(Hit@10 ~0.5), and the previously-flat cross-encoder reranker now shows the
predicted lift — R@1 0.183->0.283 (+55%), MRR 0.420->0.514, nDCG@10
0.369->0.448. This validates move #2 and confirms the file-level null result
was a benchmark artifact. Documented in docs/eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Tooling for the symbol-level model comparisons:
- RECOMMENDED_RERANKERS registry + `coderag eval --list-models` now lists
  local cross-encoder rerankers (MiniLM, bge-reranker-base, jina-reranker-v2)
  with size/notes, so code-aware rerankers are discoverable.
- scripts/bench_embedders.py --rerank-models: score one hybrid+rerank row per
  named reranker, to compare reranker models on a fixed index.
- coderag/eval/datasets/coderag_self_symbols.jsonl: 22 curated
  natural-language -> function/method cases (verified symbol names) for a
  trustworthy symbol-level eval, less noisy than the git-mined set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Measured on the curated 22-case symbol dataset (bge-small vs jina-code, with
the ms-marco reranker):

1. The code-specific jina-code-v2 does NOT beat bge-small on NL->symbol
   queries (dense MRR 0.483 vs 0.675); a good general text embedder wins for
   natural-language "where is X" retrieval.
2. Equal-weight hybrid is not universally better: for the strong bge-small
   retriever, dense alone (0.675) beats 1:1 hybrid (0.573) because weak BM25
   drags it down via RRF. Fusion weighting should be query-type-aware
   (dense-up for NL, BM25-up for identifiers) -- the biggest lever found.
3. Reranking lifts top-1 precision (R@1 0.364->0.409, +12%), consistent with
   the git-mined result.

Documents these in docs/eval.md and elevates query-type fusion weighting in
the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
@Neverdecel Neverdecel merged commit bc65f63 into master Jun 17, 2026
12 checks passed
@Neverdecel Neverdecel deleted the claude/eval-harder-benchmark branch June 18, 2026 08:01