External-repo validation: single-repo retrieval findings do not generalize #40
Merged
Validated the symbol-level "levers" on a larger external repo (pydantic, 4155-chunk corpus, 30 symbol cases from commit messages). The headline: the biggest single-repo lever reversed sign.

- dense-vs-BM25 ranking FLIPS by repo/query style: dense won on CodeRAG's curated NL queries; BM25 wins on pydantic's commit-message queries (which embed exact API names). MRR: bm25 0.384 > hybrid 0.361 > adaptive 0.286 > dense 0.150.
- adaptive fusion (lean-dense-for-NL) HURT here (0.286 vs hybrid 0.361): the classifier is fooled by identifier-laden prose. Confirms shipping it OFF by default.
- reranking (ms-marco) was neutral/slightly negative: a web-trained reranker isn't a reliable code reranker.
- fixed 1:1 hybrid is the robust default (never best, never worst, within ~6% of the winner on both repos), which validates the existing defaults.

No defaults change; the validation confirms the current ones are well-chosen.

Adds docs/research/external-validation.md and cross-links the caveat from eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
This was referenced Jun 17, 2026
Why
The symbol-level experiments (PRs #38/#39) were all measured on CodeRAG-on-itself (93 files). That's a weak basis for default changes. This validates the "levers" on a larger external repo (pydantic, 4155-chunk corpus), and the headline is that the biggest single-repo lever reversed sign. External validation paid for itself.
Setup
pydantic/pydantic, 161 files / 4155 chunks indexed (7× CodeRAG; the full index was CPU-bound, so a partial index was built and the dataset filtered to it), bge-small, symbol level.
Results (30 cases)
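As a rough check on the "within ~6% of the winner" claim, the sketch below recomputes each strategy's relative gap to the best MRR, using only the four pydantic MRR values quoted in the commit message. The helper names `mean_reciprocal_rank` and `relative_gap_to_best` are hypothetical and not part of CodeRAG; the first is included only to state the standard MRR definition assumed here (mean of 1/rank of the first relevant hit, with a miss counting as 0).

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank of the first relevant hit per case (1-based); a miss (None) adds 0."""
    if not first_hit_ranks:
        return 0.0
    total = sum(1.0 / rank if rank else 0.0 for rank in first_hit_ranks)
    return total / len(first_hit_ranks)


def relative_gap_to_best(mrr_by_strategy: dict[str, float]) -> dict[str, float]:
    """Fraction by which each strategy's MRR falls short of the best strategy's MRR."""
    best = max(mrr_by_strategy.values())
    return {name: (best - mrr) / best for name, mrr in mrr_by_strategy.items()}


# MRR values as reported for the pydantic run (30 cases).
PYDANTIC_MRR = {"bm25": 0.384, "hybrid": 0.361, "adaptive": 0.286, "dense": 0.150}

gaps = relative_gap_to_best(PYDANTIC_MRR)
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:8s} {gap:.1%} below the best strategy")
```

On these numbers hybrid lands roughly 6% below bm25, while adaptive and dense trail by much more, which is the sense in which fixed 1:1 hybrid is "never best, never worst" on this repo.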
Findings (3 of 4 overturn or qualify the single-repo conclusions)
- adaptive fusion HURT here (0.286 vs hybrid 0.361): the looks_like_identifier classifier is fooled by identifier-laden sentences. Confirms shipping it OFF by default.
Outcome
No defaults change: the validation confirms the current ones (1:1 hybrid, adaptive off, rerank opt-in) are well-chosen, and catches the single-repo overfitting before any of it shipped as a default. Docs-only PR: adds docs/research/external-validation.md and cross-links the caveat from eval.md and the strategy doc.
Genuine next steps (none change a default)
- Make looks_like_identifier detect identifiers embedded in prose (so "Fix AliasGenerator.generate_aliases" routes BM25-up); this could make adaptive a net win across query styles.
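One way the prose-aware detection described in this next step might look is sketched below. Everything here is an assumption for illustration: the regex, the helper names `embedded_identifiers` and `looks_like_identifier_in_prose`, and in particular `whole_query_is_identifier`, which is only a stand-in for a classifier that judges the query as a whole. It is not CodeRAG's actual `looks_like_identifier`.

```python
import re

# Hypothetical patterns for code-like tokens inside a natural-language query.
_IDENT_TOKEN = re.compile(
    r"""
      \b[A-Za-z_]\w+(?:\.[A-Za-z_]\w+)+\b      # dotted path, e.g. AliasGenerator.generate_aliases
    | \b[A-Za-z]+(?:_[A-Za-z0-9]+)+\b          # snake_case, e.g. model_dump
    | \b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b    # CamelCase, e.g. BaseModel
    """,
    re.VERBOSE,
)


def embedded_identifiers(query: str) -> list[str]:
    """Return every code-like token found anywhere in the query."""
    return _IDENT_TOKEN.findall(query)


def whole_query_is_identifier(query: str) -> bool:
    """Stand-in for a whole-string check: true only if the entire query is one code-like token."""
    return _IDENT_TOKEN.fullmatch(query.strip()) is not None


def looks_like_identifier_in_prose(query: str) -> bool:
    """Prose-aware variant: true if at least one code-like token is embedded in the query."""
    return bool(embedded_identifiers(query))


query = "Fix AliasGenerator.generate_aliases"
print(whole_query_is_identifier(query))        # whole-string check misses the identifier
print(looks_like_identifier_in_prose(query))   # token-level check finds it
print(embedded_identifiers(query))
```

A token-level check like this would let a commit-message-style query route BM25-up while plain NL questions still lean dense. The patterns are heuristic and would need tuning against both query sets before any claim about net benefit.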
Generated by Claude Code