External-repo validation: single-repo retrieval findings do not generalize #40

Merged: Neverdecel merged 1 commit into master from claude/external-repo-validation on Jun 17, 2026

@Neverdecel

Owner

Why

The symbol-level experiments (PRs #38/#39) were all measured on CodeRAG indexing itself (93 files). That's a weak basis for changing defaults. This PR validates the same "levers" on a larger external repo (pydantic, 4,155-chunk corpus), and the headline is that the biggest single-repo lever reversed sign. External validation paid for itself.

Setup

  • pydantic/pydantic, 161 files / 4,155 chunks indexed (7× CodeRAG). A full index was CPU-bound, so only part of the repo was indexed and the eval dataset was filtered to the indexed files. Embeddings: bge-small; retrieval at symbol level.
  • 30 symbol-level cases mined from commit history (queries = commit subjects, which often embed API names).

Results (30 cases)

mode           MRR    R@1    Hit@10
dense          0.150  0.067  0.300
bm25           0.384  0.317  0.533   ← best
hybrid         0.361  0.283  0.433   ← robust
hybrid+rerank  0.353  0.253  0.467
adaptive       0.286  0.183  0.400   ← hurt
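
For readers unfamiliar with the three columns, here is a minimal sketch of how MRR, R@1 and Hit@10 are conventionally computed from per-case ranks. It is not CodeRAG's eval harness: the function name `retrieval_metrics` is hypothetical, and it assumes each case has a single relevant target whose 1-based rank (or None if it was not retrieved) is already known. With that assumption R@1 is simply the fraction of cases ranked first; a harness with several gold symbols per case would compute recall differently.

```python
from typing import Dict, Optional, Sequence


def retrieval_metrics(ranks: Sequence[Optional[int]], k: int = 10) -> Dict[str, float]:
    """Aggregate per-case ranks into MRR, R@1 and Hit@k.

    ranks[i] is the 1-based rank of the relevant item for case i,
    or None when the item was not retrieved at all.
    Assumption: exactly one relevant item per case.
    """
    n = len(ranks)
    if n == 0:
        raise ValueError("need at least one case")
    # Mean reciprocal rank: a miss contributes 0.
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    # Fraction of cases where the relevant item is the top result.
    r_at_1 = sum(1 for r in ranks if r == 1) / n
    # Fraction of cases where the relevant item appears in the top k.
    hit_at_k = sum(1 for r in ranks if r is not None and r <= k) / n
    return {"MRR": mrr, "R@1": r_at_1, f"Hit@{k}": hit_at_k}


if __name__ == "__main__":
    # Toy ranks for four cases, not data from this PR.
    print(retrieval_metrics([1, 2, None, 4]))
```

Because every miss contributes zero to all three metrics, a mode that retrieves few targets at all (dense here, Hit@10 of 0.300) is capped on MRR regardless of how well it orders its hits.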

Findings (3 of 4 overturn or qualify the single-repo conclusions)

  1. Dense-vs-BM25 ranking flips by repo/query style. Dense crushed BM25 on CodeRAG's curated NL queries (0.675 vs 0.427); BM25 crushes dense on pydantic's commit-message queries (0.384 vs 0.150). No modality wins universally.
  2. Adaptive fusion did not generalize — it hurt (0.286 vs hybrid 0.361). "Lean dense for NL" is backwards when prose queries embed exact symbol names; the looks_like_identifier classifier is fooled by identifier-laden sentences. Confirms shipping it OFF by default.
  3. Reranking (ms-marco) did not help either (hybrid+rerank 0.353 ≈ hybrid 0.361). A web-trained cross-encoder isn't a reliable code reranker.
  4. Fixed 1:1 hybrid is the robust default — never best, never worst, within ~6% of the winner on both repos. Validates CodeRAG's existing defaults.
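
To make the difference between "fixed 1:1 hybrid" and "adaptive fusion" concrete, here is a small sketch of weighted rank fusion. The PR does not say how CodeRAG combines the dense and BM25 lists, so this uses reciprocal rank fusion (RRF) purely as a stand-in; `fuse_rankings`, its parameters, and the constant `k = 60` are assumptions for illustration. Equal weights correspond to the 1:1 default, and an adaptive scheme would amount to choosing the weights per query.

```python
from collections import defaultdict
from typing import Dict, List


def fuse_rankings(
    dense: List[str],
    bm25: List[str],
    w_dense: float = 1.0,
    w_bm25: float = 1.0,
    k: int = 60,
) -> List[str]:
    """Weighted reciprocal rank fusion of two ranked lists of chunk ids.

    Assumption: RRF stands in for whatever fusion CodeRAG actually uses.
    Each list contributes weight / (k + rank) to the ids it contains.
    """
    scores: Dict[str, float] = defaultdict(float)
    for weight, ranking in ((w_dense, dense), (w_bm25, bm25)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]
    bm25_hits = ["c", "a", "d"]
    print(fuse_rankings(dense_hits, bm25_hits))            # fixed 1:1
    print(fuse_rankings(dense_hits, bm25_hits, 0.0, 1.0))  # BM25 only
```

With equal weights a result needs support from both lists to rise to the top, which is one plausible reading of why the fixed hybrid is never the best mode but also never the worst.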

Outcome

No defaults change — the validation confirms the current ones (1:1 hybrid, adaptive off, rerank opt-in) are well-chosen, and catches the single-repo overfitting before any of it shipped as a default. Docs-only PR: adds docs/research/external-validation.md and cross-links the caveat from eval.md and the strategy doc.

Genuine next steps (none change a default)

  • Make looks_like_identifier detect identifiers embedded in prose (so "Fix AliasGenerator.generate_aliases" routes BM25-up) — could make adaptive a net win across query styles.
  • Test a code-aware reranker at scale on GPU — the one lever not yet fairly evaluated.
  • Build a multi-repo eval set so future tuning is judged on generalization.
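
As an illustration of the first next step, here is one way identifiers embedded in prose could be detected. This is a sketch only: the real looks_like_identifier implementation is not shown in this PR, and the helper names `find_embedded_identifiers` and `contains_embedded_identifier` as well as the regex heuristics are hypothetical.

```python
import re
from typing import List

# Heuristic patterns for code-like tokens inside a natural-language query.
# Assumption: these shapes (dotted paths, snake_case, CamelCase) are good
# enough proxies for "exact symbol name"; real queries may need more.
_IDENT_PATTERN = re.compile(
    r"""
      \b[A-Za-z_]\w+(?:\.[A-Za-z_]\w+)+\b        # dotted path: Foo.bar_baz
    | \b[a-z]+(?:_[a-z0-9]+)+\b                  # snake_case
    | \b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b      # CamelCase
    | \b[a-z]+(?:[A-Z][a-z0-9]+)+\b              # lowerCamelCase
    """,
    re.VERBOSE,
)


def find_embedded_identifiers(query: str) -> List[str]:
    """Return the code-like tokens found anywhere in the query."""
    return [m.group(0) for m in _IDENT_PATTERN.finditer(query)]


def contains_embedded_identifier(query: str) -> bool:
    """True when the query mentions at least one code-like token."""
    return _IDENT_PATTERN.search(query) is not None


if __name__ == "__main__":
    print(find_embedded_identifiers("Fix AliasGenerator.generate_aliases"))
    print(contains_embedded_identifier("how do I validate nested models"))
```

A router built on this could weight BM25 up whenever `contains_embedded_identifier` returns True, even if the query as a whole reads as prose. Whether that turns adaptive fusion into a net win would still need to be measured on both repos.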

🤖 Generated with Claude Code

Validated the symbol-level "levers" on a larger external repo (pydantic,
4155-chunk corpus, 30 symbol cases from commit messages). The headline:
the biggest single-repo lever reversed sign.

- dense-vs-BM25 ranking FLIPS by repo/query style: dense won on CodeRAG's
  curated NL queries; BM25 wins on pydantic's commit-message queries (which
  embed exact API names). MRR: bm25 0.384 > hybrid 0.361 > adaptive 0.286
  > dense 0.150.
- adaptive fusion (lean-dense-for-NL) HURT here (0.286 vs hybrid 0.361) —
  the classifier is fooled by identifier-laden prose. Confirms shipping it
  OFF by default.
- reranking (ms-marco) was neutral/slightly negative — web-trained reranker
  isn't a reliable code reranker.
- fixed 1:1 hybrid is the robust default (never best, never worst, within
  ~6% of the winner on both repos) — validates the existing defaults.

No defaults change; the validation confirms the current ones are well-chosen.
Adds docs/research/external-validation.md and cross-links the caveat from
eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
@Neverdecel Neverdecel merged commit 4927143 into master Jun 17, 2026
12 checks passed
@Neverdecel Neverdecel deleted the claude/external-repo-validation branch June 18, 2026 08:01