External-repo validation: single-repo retrieval findings do not generalize by Neverdecel · Pull Request #40 · Neverdecel/CodeRAG

Neverdecel · 2026-06-17T15:01:24Z

Why

The symbol-level experiments (PRs #38/#39) were all measured on CodeRAG-on-itself — 93 files. That's a weak basis for default changes. This validates the "levers" on a larger external repo (pydantic, 4 155-chunk corpus), and the headline is that the biggest single-repo lever reversed sign. External validation paid for itself.

Setup

pydantic/pydantic: 161 files / 4 155 chunks indexed (7× CodeRAG). A full index was CPU-bound, so the index is partial and the dataset is filtered to the indexed files. Embedding model bge-small, symbol level.
30 symbol-level cases mined from commit history (queries = commit subjects, which often embed API names).

Results (30 cases)

mode           MRR    R@1    Hit@10
dense          0.150  0.067  0.300
bm25           0.384  0.317  0.533   ← best
hybrid         0.361  0.283  0.433   ← robust
hybrid+rerank  0.353  0.253  0.467
adaptive       0.286  0.183  0.400   ← hurt
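
For readers who want to reproduce the columns above, here is a minimal sketch of the three metrics. The PR does not state the exact definitions, so the standard ones are assumed: MRR from the rank of the first relevant symbol, R@1 as the fraction of a case's relevant symbols found at rank 1, and Hit@10 as whether any relevant symbol appears in the top 10. The function names and the `(ranked, relevant)` input shape are hypothetical, not CodeRAG's API.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Case = Tuple[Sequence[str], Set[str]]  # (ranked symbol ids, relevant symbol ids)


def first_hit_rank(ranked: Sequence[str], relevant: Set[str]) -> Optional[int]:
    """Return the 1-based rank of the first relevant item, or None if absent."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return rank
    return None


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def evaluate(cases: List[Case], k: int = 10) -> Dict[str, float]:
    """Average MRR, R@1 and Hit@k over all cases (assumed standard definitions)."""
    n = len(cases)
    ranks = [first_hit_rank(ranked, relevant) for ranked, relevant in cases]
    return {
        "MRR": sum(1.0 / r for r in ranks if r is not None) / n,
        "R@1": sum(recall_at_k(ranked, relevant, 1) for ranked, relevant in cases) / n,
        f"Hit@{k}": sum(1 for r in ranks if r is not None and r <= k) / n,
    }


if __name__ == "__main__":
    toy_cases = [
        (["a", "b", "c"], {"a"}),       # relevant symbol at rank 1
        (["x", "y", "z"], {"y", "q"}),  # first relevant symbol at rank 2
        (["m", "n"], {"zz"}),           # relevant symbol never retrieved
    ]
    print(evaluate(toy_cases))
```

With only 30 cases, one case moving in or out of the top 10 shifts Hit@10 by about 0.033, which is worth keeping in mind when comparing rows that differ by a few hundredths.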

Findings (3 of 4 overturn or qualify the single-repo conclusions)

Dense-vs-BM25 ranking flips by repo/query style. Dense crushed BM25 on CodeRAG's curated NL queries (0.675 vs 0.427); BM25 crushes dense on pydantic's commit-message queries (0.384 vs 0.150). No modality wins universally.
Adaptive fusion did not generalize — it hurt (0.286 vs hybrid 0.361). "Lean dense for NL" is backwards when prose queries embed exact symbol names; the looks_like_identifier classifier is fooled by identifier-laden sentences. Confirms shipping it OFF by default.
Reranking (ms-marco) did not help either (0.353 ≈ 0.361). A web-trained cross-encoder isn't a reliable code reranker.
Fixed 1:1 hybrid is the robust default — never best, never worst, within ~6% of the winner on both repos. Validates CodeRAG's existing defaults.
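
The difference between a fixed 1:1 hybrid and a query-dependent weighting can be sketched in a few lines. The PR does not say how CodeRAG fuses the dense and BM25 rankings, so this uses weighted reciprocal rank fusion as a stand-in; `fuse_rankings`, its weights, and the constant `k=60` are assumptions for illustration only.

```python
from typing import Dict, List, Sequence


def fuse_rankings(
    dense: Sequence[str],
    bm25: Sequence[str],
    w_dense: float = 1.0,
    w_bm25: float = 1.0,
    k: int = 60,
) -> List[str]:
    """Weighted reciprocal rank fusion of two ranked lists (assumed method).

    Each list contributes weight / (k + rank) per item. Equal weights give a
    fixed 1:1 hybrid; unequal weights model an adaptive lean toward one side.
    """
    scores: Dict[str, float] = {}
    for weight, ranking in ((w_dense, dense), (w_bm25, bm25)):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]
    bm25_hits = ["c", "a", "d"]
    print(fuse_rankings(dense_hits, bm25_hits))              # ['a', 'c', 'b', 'd']
    print(fuse_rankings(dense_hits, bm25_hits, w_bm25=3.0))  # ['c', 'a', 'd', 'b']
```

Under this sketch, a misrouted query is costly because the weights amplify whichever ranker the classifier picked, which is consistent with adaptive scoring below the fixed hybrid on pydantic.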

Outcome

No defaults change — the validation confirms the current ones (1:1 hybrid, adaptive off, rerank opt-in) are well-chosen, and catches the single-repo overfitting before any of it shipped as a default. Docs-only PR: adds docs/research/external-validation.md and cross-links the caveat from eval.md and the strategy doc.

Genuine next steps (none change a default)

Make looks_like_identifier detect identifiers embedded in prose (so "Fix AliasGenerator.generate_aliases" routes BM25-up) — could make adaptive a net win across query styles.
Test a code-aware reranker at scale on GPU — the one lever not yet fairly evaluated.
Build a multi-repo eval set so future tuning is judged on generalization.
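
One way the first next step could look, as a rough sketch: scan the whole query for identifier-shaped tokens instead of classifying the query as a whole. This is not CodeRAG's `looks_like_identifier`; the helper names, regex patterns, and weight values below are hypothetical and would need tuning against a real eval set.

```python
import re
from typing import List, Tuple

# Token shapes that rarely occur in plain English prose (heuristic, assumed).
_IDENTIFIER_PATTERNS = [
    re.compile(r"\b[A-Za-z_]\w+(?:\.[A-Za-z_]\w+)+\b"),    # dotted path: Foo.bar_baz
    re.compile(r"\b[a-z0-9]+(?:_[a-z0-9]+)+\b"),            # snake_case
    re.compile(r"\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b"),   # CamelCase, two or more humps
    re.compile(r"\b[A-Za-z_]\w*\(\)"),                       # call syntax: foo()
]


def find_identifiers(query: str) -> List[str]:
    """Return identifier-like tokens found anywhere in the query, in pattern order."""
    found: List[str] = []
    for pattern in _IDENTIFIER_PATTERNS:
        for match in pattern.findall(query):
            if match not in found:
                found.append(match)
    return found


def fusion_weights(query: str) -> Tuple[float, float]:
    """Return (dense_weight, bm25_weight); the numbers are illustrative only."""
    if find_identifiers(query):
        return (1.0, 2.0)  # identifiers present: lean toward exact lexical match
    return (2.0, 1.0)      # pure natural language: lean toward dense retrieval


if __name__ == "__main__":
    for q in ("Fix AliasGenerator.generate_aliases", "how does the retry logic work"):
        print(q, "->", find_identifiers(q), fusion_weights(q))
```

The patterns are deliberately conservative (for example, dotted segments need at least two characters so "e.g." is not flagged), but names like "GitHub" would still count as CamelCase, so false positives are expected.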

🤖 Generated with Claude Code


Validated the symbol-level "levers" on a larger external repo (pydantic, 4155-chunk corpus, 30 symbol cases from commit messages). The headline: the biggest single-repo lever reversed sign.

- dense-vs-BM25 ranking FLIPS by repo/query style: dense won on CodeRAG's curated NL queries; BM25 wins on pydantic's commit-message queries (which embed exact API names). MRR: bm25 0.384 > hybrid 0.361 > adaptive 0.286 > dense 0.150.
- adaptive fusion (lean-dense-for-NL) HURT here (0.286 vs hybrid 0.361) — the classifier is fooled by identifier-laden prose. Confirms shipping it OFF by default.
- reranking (ms-marco) was neutral/slightly negative — web-trained reranker isn't a reliable code reranker.
- fixed 1:1 hybrid is the robust default (never best, never worst, within ~6% of the winner on both repos) — validates the existing defaults.

No defaults change; the validation confirms the current ones are well-chosen. Adds docs/research/external-validation.md and cross-links the caveat from eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Neverdecel merged commit 4927143 into master Jun 17, 2026
12 checks passed

This was referenced Jun 17, 2026

Detect identifiers embedded in prose — adaptive fusion now generalizes #42

Merged

Multi-repo evaluation: macro-averaged generalization view #43

Merged

Neverdecel deleted the claude/external-repo-validation branch June 18, 2026 08:01

Neverdecel merged 1 commit into master from claude/external-repo-validation
