Query-type-aware adaptive fusion weighting (validated symbol-level lever) by Neverdecel · Pull Request #39 · Neverdecel/CodeRAG

Neverdecel · 2026-06-17T14:04:16Z

Stacked on #38 (claude/eval-harder-benchmark) — review/merge that first; this PR's base will need retargeting to master once #38 lands.

Why

PR #38's symbol-level eval surfaced the single biggest lever: fixed 1:1 dense/BM25 fusion is a compromise, not an optimum. On natural-language queries the dense retriever is much stronger, and equal-weight RRF drags it down with weak BM25 (bge-small: dense MRR 0.675 vs hybrid 0.573). This PR implements query-type-aware fusion weighting to fix that.
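The drag effect described above can be reproduced with a toy weighted-RRF sketch. It is an illustration under stated assumptions, not CodeRAG's implementation: `weighted_rrf`, the RRF constant `k=60`, the 3:1 dense weight, and the synthetic rankings are all hypothetical, since the PR does not state the constant or the weight values.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum_r weight_r / (k + rank_r(d))."""
    scores = defaultdict(float)
    for name, ranked in rankings.items():
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += weights[name] / (k + rank)
    # Sort by descending score; break ties by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Synthetic NL query: dense puts the relevant doc first; BM25 misses it entirely
# and instead promotes a doc that dense ranks last (rank 50).
rankings = {
    "dense": ["target"] + [f"d{i}" for i in range(2, 50)] + ["noise"],
    "bm25": ["noise"] + [f"b{i}" for i in range(2, 51)],
}

equal = weighted_rrf(rankings, {"dense": 1.0, "bm25": 1.0})
dense_leaning = weighted_rrf(rankings, {"dense": 3.0, "bm25": 1.0})

print(equal[:2])         # ['noise', 'target']: equal weights let BM25's top hit win
print(dense_leaning[0])  # 'target': leaning dense restores the dense top-1
```

In this toy case the reciprocal rank of the relevant document goes from 0.5 to 1.0 purely by changing the fusion weights, which is the mechanism the PR targets for natural-language queries.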

What

coderag/retrieval/query_type.py: looks_like_identifier() (a cheap, conservative local heuristic) + fusion_weights(). NL queries lean dense; identifier-like queries stay neutral.
config.adaptive_fusion (off by default) + nl_/code_ weight pairs, all env-configurable (CODERAG_ADAPTIVE_FUSION, …).
HybridSearcher routes weights per query; compare_modes gains an adaptive row and isolates the fixed modes from it; coderag eval --adaptive, bench_embedders.py --adaptive; status() reports it.
coderag/eval/datasets/coderag_self_identifiers.jsonl: 22 identifier-query cases (derived from the symbol set) to validate the code-side routing — not just the NL side.
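A minimal sketch of how the classifier and the weight routing listed above could fit together. Only the names `looks_like_identifier`, `fusion_weights`, and `adaptive_fusion` come from this PR; the regex, the `FusionConfig` dataclass, its field names, the `(dense, lexical)` return shape, and the 2:1 NL weights are assumptions made for illustration, and the real code in coderag/retrieval/query_type.py may differ.

```python
import re
from dataclasses import dataclass

# Assumed pattern: a single token such as snake_case, CamelCase, a dotted or
# "::"-separated path, optionally followed by "()".
_IDENTIFIER = re.compile(
    r"[A-Za-z_][A-Za-z0-9_]*(?:(?:\.|::)[A-Za-z_][A-Za-z0-9_]*)*(?:\(\))?"
)


def looks_like_identifier(query: str) -> bool:
    """Conservative check: exactly one token, and it is shaped like a code symbol."""
    tokens = query.split()
    return len(tokens) == 1 and _IDENTIFIER.fullmatch(tokens[0]) is not None


@dataclass(frozen=True)
class FusionConfig:  # hypothetical container for the nl_/code_ weight pairs
    adaptive_fusion: bool = False   # off by default, as in the PR
    nl_dense_weight: float = 2.0    # illustrative value, not the validated default
    nl_lexical_weight: float = 1.0
    code_dense_weight: float = 1.0  # neutral 1:1 on the code side, as in the PR
    code_lexical_weight: float = 1.0


def fusion_weights(query: str, cfg: FusionConfig) -> tuple[float, float]:
    """Return (dense_weight, lexical_weight) for this query."""
    if not cfg.adaptive_fusion:
        return (1.0, 1.0)  # fixed 1:1 hybrid
    if looks_like_identifier(query):
        return (cfg.code_dense_weight, cfg.code_lexical_weight)
    return (cfg.nl_dense_weight, cfg.nl_lexical_weight)


cfg = FusionConfig(adaptive_fusion=True)
print(fusion_weights("HybridSearcher", cfg))                      # (1.0, 1.0)
print(fusion_weights("how are fusion weights chosen", cfg))       # (2.0, 1.0)
```

Treating anything with more than one token as natural language keeps the heuristic cheap and biased toward the dense-leaning path only when the query is clearly prose.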

Validated (bge-small, symbol level) — a Pareto improvement over fixed 1:1 hybrid

NL queries (coderag_self_symbols.jsonl)     MRR    R@1    nDCG@10
  hybrid (fixed 1:1)                         0.604  0.455  0.669
  adaptive                                   0.674  0.545  0.722   ← +0.070 MRR, +20% relative R@1 (+0.090 absolute)

identifier queries (coderag_self_identifiers.jsonl)
  hybrid (fixed 1:1)                         0.685  0.545  0.741
  adaptive                                   0.685  0.545  0.741   ← unchanged (no regression)
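For readers unfamiliar with the three columns, here is a sketch of the per-query metrics under their textbook binary-relevance definitions. The eval harness's exact definitions are not shown in this PR, so the function names and the single-relevant-symbol example are assumptions; the table values would be the mean of each metric over all queries.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-gain nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# One query whose single relevant symbol is retrieved at rank 2.
ranked, relevant = ["a", "b", "c"], {"b"}
print(reciprocal_rank(ranked, relevant))      # 0.5
print(recall_at_k(ranked, relevant, 1))       # 0.0
print(round(ndcg_at_k(ranked, relevant), 3))  # 0.631

# Deltas quoted for the NL row of the table above.
print(round(0.674 - 0.604, 3))      # 0.07  (absolute MRR gain)
print(round(0.545 / 0.455 - 1, 3))  # 0.198 (relative R@1 gain, about 20%)
```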

Honest correction baked into the defaults

The literature's "BM25-up for code" intuition was refuted by the data: up-weighting BM25 for identifier queries hurt (MRR 0.685 → 0.613), because short/common identifiers (search, index) are lexically ambiguous and the embedder already matches them well. So the code-side default is neutral (1:1), not BM25-leaning — and BM25-leaning is left configurable (CODERAG_CODE_LEXICAL_WEIGHT) for larger repos where exact-string recall matters more. This is the harness preventing a plausible-but-wrong default.

Off by default pending larger-repo validation; enable with CODERAG_ADAPTIVE_FUSION=1.
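A sketch of how the two environment variables named in this PR might be read. The variable names `CODERAG_ADAPTIVE_FUSION` and `CODERAG_CODE_LEXICAL_WEIGHT` are from the PR; the helpers `env_flag` and `env_float`, the accepted truthy spellings, and the fallback behavior are assumptions, and CodeRAG's actual config loader may parse them differently.

```python
import os


def env_flag(name, default=False, environ=os.environ):
    """Parse a boolean env var; the accepted truthy spellings are an assumption."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_float(name, default, environ=os.environ):
    """Parse a float env var, falling back to the default when unset or malformed."""
    raw = environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default


# Simulated environment: adaptive fusion on, BM25 up-weighted for identifier queries
# (the configuration the PR leaves available for larger repos; 1.5 is illustrative).
env = {"CODERAG_ADAPTIVE_FUSION": "1", "CODERAG_CODE_LEXICAL_WEIGHT": "1.5"}

adaptive = env_flag("CODERAG_ADAPTIVE_FUSION", environ=env)
code_lexical = env_float("CODERAG_CODE_LEXICAL_WEIGHT", 1.0, environ=env)
print(adaptive, code_lexical)  # True 1.5
```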

Testing

New offline tests for the classifier (identifier vs NL), the weight routing (incl. validated defaults), and end-to-end adaptive search. Full pytest -m "not integration" green; ruff + mypy clean on new code.

Follow-ups

Validate (and likely default-on) on a larger external repo.
Reranking on a dense-weighted (adaptive) candidate pool, to stack the two levers.

🤖 Generated with Claude Code


Implements the symbol-level finding that fixed 1:1 dense/BM25 fusion is a compromise: on natural-language queries weak BM25 drags a strong dense ranking down via RRF. A cheap local heuristic (looks_like_identifier) routes the fusion weights by query type.

- coderag/retrieval/query_type.py: looks_like_identifier() + fusion_weights().
- config: adaptive_fusion (off by default) + nl_/code_ weight pairs (+ env).
- HybridSearcher uses fusion_weights(); compare_modes gains an `adaptive` row (and isolates fixed modes from adaptive); `coderag eval --adaptive`, bench `--adaptive`; status() reports adaptive_fusion.
- coderag/eval/datasets/coderag_self_identifiers.jsonl: 22 identifier-query cases (derived from the symbol set) to validate the code-side routing.

Validated (bge-small, symbol level) vs fixed 1:1 hybrid — a Pareto improvement:
  NL queries:         hybrid MRR 0.604 -> adaptive 0.674 (+0.070, R@1 +20%)
  identifier queries: hybrid MRR 0.685 -> adaptive 0.685 (unchanged)

Honest correction baked into the defaults: the "BM25-up for identifiers" half of the hypothesis was refuted by the data (up-weighting BM25 hurt, MRR 0.685->0.613, because short/common identifiers are lexically ambiguous and the embedder already matches them), so the code-side default is neutral, not BM25-leaning. Documented in docs/eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Neverdecel changed the base branch from claude/eval-harder-benchmark to master June 17, 2026 14:19

Neverdecel force-pushed the claude/adaptive-fusion-weighting branch from a8fd84d to ceaa4a4 Compare June 17, 2026 14:21

Neverdecel merged commit e6d7096 into master Jun 17, 2026
12 checks passed

This was referenced Jun 17, 2026

External-repo validation: single-repo retrieval findings do not generalize #40

Merged

Detect identifiers embedded in prose — adaptive fusion now generalizes #42

Merged

Neverdecel deleted the claude/adaptive-fusion-weighting branch June 18, 2026 08:01

Neverdecel merged 1 commit into master from claude/adaptive-fusion-weighting
