
Query-type-aware adaptive fusion weighting (validated symbol-level lever)#39

Merged
Neverdecel merged 1 commit into master from claude/adaptive-fusion-weighting
Jun 17, 2026



@Neverdecel

Owner

Stacked on #38 (claude/eval-harder-benchmark) — review/merge that first; this PR's base will need retargeting to master once #38 lands.

Why

PR #38's symbol-level eval surfaced the single biggest lever: fixed 1:1 dense/BM25 fusion is a compromise, not an optimum. On natural-language queries the dense retriever is much stronger, and equal-weight RRF drags it down with weak BM25 (bge-small: dense MRR 0.675 vs hybrid 0.573). This PR implements query-type-aware fusion weighting to fix that.
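To make the "weak BM25 drags dense down" effect concrete, here is a minimal sketch of weighted reciprocal rank fusion. It is an illustration under assumptions, not the project's implementation: the function name `weighted_rrf`, the constant `k=60`, the toy rankings, and the 4:1 weights are all hypothetical and are not the defaults shipped in this PR.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Fuse ranked lists with score(d) = sum_i w_i / (k + rank_i(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for retriever, ranked_ids in rankings.items():
        weight = weights[retriever]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by descending score; break ties by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy NL query: dense ranks the correct symbol "a" first, BM25 ranks it last.
rankings = {
    "dense": ["a", "b", "c", "d", "e"],
    "bm25": ["b", "c", "d", "e", "a"],
}

fused_equal = weighted_rrf(rankings, {"dense": 1.0, "bm25": 1.0})
fused_dense = weighted_rrf(rankings, {"dense": 4.0, "bm25": 1.0})

print(fused_equal)  # "a" is pushed below "b" and "c" by the weak BM25 ranking
print(fused_dense)  # "a" is back at rank 1 once dense carries more weight
```

Because RRF scores are nearly flat for large `k`, a modest weight ratio may not be enough to restore the dense order in a toy case like this one, which is one reason weights are better tuned against an eval set than chosen by intuition.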

What

  • coderag/retrieval/query_type.py: looks_like_identifier() (a cheap, conservative local heuristic) + fusion_weights(). NL queries lean dense; identifier-like queries stay neutral.
  • config.adaptive_fusion (off by default) + nl_/code_ weight pairs, all env-configurable (CODERAG_ADAPTIVE_FUSION, …).
  • HybridSearcher routes weights per query; compare_modes gains an adaptive row and isolates the fixed modes from it; coderag eval --adaptive, bench_embedders.py --adaptive; status() reports it.
  • coderag/eval/datasets/coderag_self_identifiers.jsonl: 22 identifier-query cases (derived from the symbol set) to validate the code-side routing — not just the NL side.
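The routing described in the first bullet could look roughly like the sketch below. Only the names `looks_like_identifier` and `fusion_weights` come from this PR; the regular expression, the function signatures, and the NL weight pair `(2.0, 1.0)` are assumptions made for illustration. The neutral code-side pair `(1.0, 1.0)` follows the defaults described in this PR.

```python
import re
from typing import Tuple

# Hypothetical pattern: a single token, optionally dotted or "::"-qualified,
# optionally ending in "()". Anything containing whitespace is treated as NL.
_IDENT_RE = re.compile(r"[A-Za-z_]\w*(?:(?:\.|::)[A-Za-z_]\w*)*(?:\(\))?")

# (dense_weight, lexical_weight) pairs. The NL values are placeholders.
NL_WEIGHTS: Tuple[float, float] = (2.0, 1.0)
CODE_WEIGHTS: Tuple[float, float] = (1.0, 1.0)
NEUTRAL: Tuple[float, float] = (1.0, 1.0)


def looks_like_identifier(query: str) -> bool:
    """Conservative check: True only when the whole query is one identifier-like token."""
    return _IDENT_RE.fullmatch(query.strip()) is not None


def fusion_weights(query: str, adaptive: bool = False) -> Tuple[float, float]:
    """Return (dense_weight, lexical_weight) for this query."""
    if not adaptive:
        return NEUTRAL  # fixed 1:1 hybrid, the behaviour when the flag is off
    return CODE_WEIGHTS if looks_like_identifier(query) else NL_WEIGHTS


print(looks_like_identifier("HybridSearcher"))                       # True
print(looks_like_identifier("how are dense and BM25 results fused"))  # False
print(fusion_weights("compare_modes", adaptive=True))                 # (1.0, 1.0)
```

With a neutral code-side pair, a one-word natural-language query that happens to match the pattern simply falls back to fixed 1:1 fusion, so a false positive from the classifier costs nothing relative to the non-adaptive baseline.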

Validated (bge-small, symbol level) — a Pareto improvement over fixed 1:1 hybrid

NL queries (coderag_self_symbols.jsonl)     MRR    R@1    nDCG@10
  hybrid (fixed 1:1)                         0.604  0.455  0.669
  adaptive                                   0.674  0.545  0.722   ← +0.070 MRR, +0.090 R@1 (+20% relative)

identifier queries (coderag_self_identifiers.jsonl)
  hybrid (fixed 1:1)                         0.685  0.545  0.741
  adaptive                                   0.685  0.545  0.741   ← unchanged (no regression)
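For readers unfamiliar with the three columns, the sketch below shows one common way to compute MRR, R@1, and nDCG@10 under binary relevance. The harness's own definitions are not shown in this PR, so the function names and the assumption that each query has a set of relevant ids are illustrative only.

```python
import math
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Two toy queries as (ranked results, relevant ids) pairs.
runs: List[Tuple[List[str], Set[str]]] = [
    (["a", "x", "y"], {"a"}),  # hit at rank 1
    (["x", "b", "y"], {"b"}),  # hit at rank 2
]
mrr = mean([reciprocal_rank(r, rel) for r, rel in runs])
r_at_1 = mean([recall_at_k(r, rel, 1) for r, rel in runs])
ndcg10 = mean([ndcg_at_k(r, rel, 10) for r, rel in runs])
print(mrr, r_at_1, ndcg10)

# The R@1 gain in the table is relative: (0.545 - 0.455) / 0.455 is about 0.198.
relative_r1_gain = (0.545 - 0.455) / 0.455
```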

Honest correction baked into the defaults

The literature's "BM25-up for code" intuition was refuted by the data: up-weighting BM25 for identifier queries hurt (MRR 0.685 → 0.613), because short/common identifiers (search, index) are lexically ambiguous and the embedder already matches them well. So the code-side default is neutral (1:1), not BM25-leaning — and BM25-leaning is left configurable (CODERAG_CODE_LEXICAL_WEIGHT) for larger repos where exact-string recall matters more. This is the harness preventing a plausible-but-wrong default.

Off by default pending larger-repo validation; enable with CODERAG_ADAPTIVE_FUSION=1.
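As a rough picture of how an opt-in flag like this is typically read, here is a small sketch that parses the two environment variables named in this PR. The `FusionConfig` dataclass, the `load_fusion_config` helper, and the set of accepted truthy strings are hypothetical; only the variable names `CODERAG_ADAPTIVE_FUSION` and `CODERAG_CODE_LEXICAL_WEIGHT`, the off-by-default behaviour, and the neutral code-side weight are taken from the description above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: which strings count as "enabled". The PR only documents "1".
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class FusionConfig:
    adaptive_fusion: bool = False      # off by default
    code_lexical_weight: float = 1.0   # neutral code-side BM25 weight


def load_fusion_config(env: Optional[Mapping[str, str]] = None) -> FusionConfig:
    """Build a config from environment variables, falling back to the defaults."""
    source = os.environ if env is None else env
    flag = source.get("CODERAG_ADAPTIVE_FUSION", "0").strip().lower()
    weight = float(source.get("CODERAG_CODE_LEXICAL_WEIGHT", "1.0"))
    return FusionConfig(adaptive_fusion=flag in _TRUTHY, code_lexical_weight=weight)


# Passing a plain dict keeps the example independent of the real environment.
print(load_fusion_config({}))
print(load_fusion_config({"CODERAG_ADAPTIVE_FUSION": "1"}))
print(load_fusion_config({"CODERAG_ADAPTIVE_FUSION": "1", "CODERAG_CODE_LEXICAL_WEIGHT": "1.5"}))
```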

Testing

New offline tests for the classifier (identifier vs NL), the weight routing (incl. validated defaults), and end-to-end adaptive search. Full pytest -m "not integration" green; ruff + mypy clean on new code.

Follow-ups

  • Validate (and likely default-on) on a larger external repo.
  • Reranking on a dense-weighted (adaptive) candidate pool, to stack the two levers.

🤖 Generated with Claude Code



@Neverdecel Neverdecel changed the base branch from claude/eval-harder-benchmark to master June 17, 2026 14:19
Implements the symbol-level finding that fixed 1:1 dense/BM25 fusion is a
compromise: on natural-language queries weak BM25 drags a strong dense ranking
down via RRF. A cheap local heuristic (looks_like_identifier) routes the
fusion weights by query type.

- coderag/retrieval/query_type.py: looks_like_identifier() + fusion_weights().
- config: adaptive_fusion (off by default) + nl_/code_ weight pairs (+ env).
- HybridSearcher uses fusion_weights(); compare_modes gains an `adaptive` row
  (and isolates fixed modes from adaptive); `coderag eval --adaptive`,
  bench `--adaptive`; status() reports adaptive_fusion.
- coderag/eval/datasets/coderag_self_identifiers.jsonl: 22 identifier-query
  cases (derived from the symbol set) to validate the code-side routing.

Validated (bge-small, symbol level) vs fixed 1:1 hybrid — a Pareto improvement:
  NL queries:         hybrid MRR 0.604 -> adaptive 0.674 (+0.070, R@1 +20%)
  identifier queries: hybrid MRR 0.685 -> adaptive 0.685 (unchanged)

Honest correction baked into the defaults: the "BM25-up for identifiers" half
of the hypothesis was refuted by the data (up-weighting BM25 hurt, MRR
0.685->0.613, because short/common identifiers are lexically ambiguous and the
embedder already matches them), so the code-side default is neutral, not
BM25-leaning. Documented in docs/eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
@Neverdecel Neverdecel force-pushed the claude/adaptive-fusion-weighting branch from a8fd84d to ceaa4a4 Compare June 17, 2026 14:21
@Neverdecel Neverdecel merged commit e6d7096 into master Jun 17, 2026
12 checks passed
@Neverdecel Neverdecel deleted the claude/adaptive-fusion-weighting branch June 18, 2026 08:01