An end-to-end experiment comparing naive RAG against structure-aware RAG on a multi-document CMS Medicare policy corpus. Given a free-text clinical case narrative, the system retrieves the relevant policy blocks and produces an advisory decision (approve / deny / needs_more_info) with a rationale citing the retrieved sources.
The central question: on a corpus where logically-required evidence is semantically distant from the query — buried in exclusion clauses, cross-document definitions, or procedural prerequisites — does structural retrieval outperform naive fixed-size chunking?
Six real, public CMS Medicare coverage documents (data/raw/):
| Document | Role |
|---|---|
| L33822 — Glucose Monitors LCD | Primary coverage criteria (CGM/BGM, exclusions, appendix definitions) |
| L33794 — External Infusion Pumps LCD | Insulin-pump criteria; cross-doc AND with L33822 for integrated pump+CGM |
| A52464 — Glucose Monitor Policy Article | Payment/coding companion to L33822 (modifiers, pathways, smartphone exclusion) |
| A55426 — Standard Documentation Requirements | Shared procedural prerequisites (SWO, WOPD, face-to-face) referenced by all LCDs |
| L33370 — Nebulizers LCD | Distance distractor (respiratory; no real cross-reference to glucose) |
| L33797 — Oxygen and Oxygen Equipment LCD | Distance distractor (respiratory; no real cross-reference to glucose) |
The corpus parses into 57 structural blocks across the 6 documents. Eval set: 14 cases spanning single-doc, cross-doc AND, exclusion-flips, definitions, procedural prerequisites, and distractor cases (eval/cases/eval_cases.json).
Case narrative
│
▼
┌────────────────────────────────────────────────────────┐
│ Retrieval layer (two strategies, same embeddings) │
│ │
│ Naive: fixed-size word windows (w200) │
│ └─ all-MiniLM-L6-v2 + FAISS IndexFlatIP │
│ top-K cosine similarity │
│ │
│ Structure-aware: │
│ 1. Structural blocks (57, with auto-metadata) │
│ indexed as 60 units (L33794-b04 split into │
│ 4 sub-criteria children for finer matching) │
│ 2. FAISS cosine top-K_sim similarity │
│ 3. Parent-document promotion │
│ (child hit → return full parent block) │
│ 4. Completion rules (optional, ablated) │
│ (exclusion / definition blocks for matched docs) │
│ 5. Cross-encoder reranking (Stage 7) │
└────────────────────────────────────────────────────────┘
│
▼ retrieved blocks (with doc_id, retrieval_source)
┌────────────────────────────────────────────────────────┐
│ Reasoning layer │
│ Single Gemini 2.5 Flash call: │
│ context = retrieved blocks + case narrative │
│ output = decision label + rationale citing blocks │
│ confidence = logprob margin over {approve, deny, │
│ needs_more_info} at decision sub-token │
└────────────────────────────────────────────────────────┘
Both retrieval systems use the same embedding model and corpus. The only variables are the chunking strategy and the retrieval logic. Final block counts are reported so structure-aware cannot win by bulk-returning the corpus.
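For reference, the naive strategy's fixed-size word windows can be sketched as below. This is a minimal illustration, not the repository's chunker: the name `window_chunks`, whitespace tokenisation, and non-overlapping windows are all assumptions.

```python
def window_chunks(text: str, window: int = 200) -> list[str]:
    """Split text into consecutive, non-overlapping windows of `window` words.

    Assumption: words are whitespace-separated and windows do not overlap.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    words = text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]


# A 450-word toy document yields windows of 200, 200 and 50 words.
toy = " ".join(f"w{i}" for i in range(450))
print([len(chunk.split()) for chunk in window_chunks(toy, window=200)])
```

Because the windows ignore section boundaries, a criterion and its exclusion clause can land in different chunks, which is the behaviour the structure-aware retriever is compared against.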
Parent-document retrieval is the only mechanism that recovers logically-necessary but semantically-distant blocks.
C6 (case 6; see eval/cases/eval_cases.json) requires a cross-document AND: the L33794 pump criteria and the L33822 CGM criteria must both be satisfied. The pump block (L33794-b04) is 688 words, 3.5× the corpus median. After it is split into four sub-criteria children for indexing, child D (the C-peptide criterion) matches by similarity. Parent-document promotion then returns the full parent block, giving the LLM the complete AND structure. This holds across all 8 ablation configs.
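The promotion step can be sketched as a lookup from child unit to parent block, followed by order-preserving deduplication. The child ids (`L33794-b04-A` to `-D`), the id `L33822-b02`, and the function name are hypothetical; only the parent id L33794-b04 comes from the corpus description above.

```python
def promote_to_parents(hits: list[str], parent_of: dict[str, str]) -> list[str]:
    """Replace child hits with their parent block, keeping first-seen order.

    hits:      unit ids ordered by descending similarity.
    parent_of: maps a child unit id to its parent block id; units missing
               from the mapping are treated as full blocks already.
    """
    promoted: list[str] = []
    seen: set[str] = set()
    for unit_id in hits:
        block_id = parent_of.get(unit_id, unit_id)
        if block_id not in seen:
            seen.add(block_id)
            promoted.append(block_id)
    return promoted


# Hypothetical child ids for the four sub-criteria of L33794-b04.
PARENT_OF = {f"L33794-b04-{suffix}": "L33794-b04" for suffix in "ABCD"}

# Child D matches by similarity; the full parent block is what gets returned.
hits = ["L33822-b02", "L33794-b04-D", "L33794-b04-A"]
print(promote_to_parents(hits, PARENT_OF))  # ['L33822-b02', 'L33794-b04']
```

Two children of the same parent collapse into a single returned block, so promotion does not inflate the final block count.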
Reranking correctly demotes noise blocks. The cross-encoder discriminates L33822-b06 (supply exclusion clause) as relevant for C4/C11 (co-billing cases) and irrelevant for C10 (respiratory distractor) and C1 (straightforward approve), where the bi-encoder assigned nearly identical similarity scores to all four.
Completion rules add noise, not signal. Adding every exclusion/definition block from any matched document was intended to recover C4 (exclusion-flips) and C2 (definition-lookup) cases. In practice, at K_sim=5 those blocks are already returned by similarity. Completion adds them to distractor cases too, occasionally flipping previously-correct decisions. Net correction: zero.
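A minimal sketch of the completion rule as described here: for every document that already has a hit, append all of its exclusion and definition blocks. The `doc_id` and `block_type` fields follow the metadata named in this README; the toy block table and every block id other than L33822-b06 are made up for illustration.

```python
def apply_completion(retrieved, blocks, types=("exclusion", "definition")):
    """Append every exclusion/definition block from any already-matched document."""
    matched_docs = {blocks[block_id]["doc_id"] for block_id in retrieved}
    completed = list(retrieved)
    for block_id, meta in blocks.items():
        if (
            meta["doc_id"] in matched_docs
            and meta["block_type"] in types
            and block_id not in completed
        ):
            completed.append(block_id)
    return completed


# Toy metadata table (hypothetical ids apart from L33822-b06).
BLOCKS = {
    "L33822-b01": {"doc_id": "L33822", "block_type": "criteria"},
    "L33822-b06": {"doc_id": "L33822", "block_type": "exclusion"},
    "L33822-b09": {"doc_id": "L33822", "block_type": "definition"},
    "L33370-b01": {"doc_id": "L33370", "block_type": "criteria"},
    "L33370-b05": {"doc_id": "L33370", "block_type": "exclusion"},
}

# The exclusion block was already found by similarity; completion only adds b09.
print(apply_completion(["L33822-b01", "L33822-b06"], BLOCKS))
# A distractor hit pulls in that document's exclusion block as well.
print(apply_completion(["L33370-b01"], BLOCKS))
```

The rule is unconditional on the case, which is why it fires on distractor cases as readily as on exclusion-flip cases.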
A55426 (procedural prerequisites) never reaches the final context in any single-pass config. The SWO requirement referenced by all LCDs lives in A55426, a procedurally written document with no clinical vocabulary, so its language does not surface-match glucose-monitoring queries. At K_sim=10 it enters the candidate pool (sim≈0.54), but the cross-encoder immediately demotes it (rerank≈−8 to −9). C8's deny requires A55426, so this is a real coverage gap that neither completion rules nor reranking closes.
C8's correct deny under K5+both is LLM temperature noise, not structural recovery. A52464-b02 contains a pointer to A55426 ("must meet the documentation requirements in Article A55426..."). The LLM infers the deny from this pointer in one config but outputs needs_more_info under other configs that retrieve the identical block set. A55426's content never enters the context window in any of these configs.
Reranking can undo what structural retrieval preserved. C6's pump block (L33794-b04) is correctly returned by parent-document promotion in Stage 4, then demoted by the cross-encoder in Stage 7 (rerank=−4.74, rank 7 of 12, below the top-5 cutoff). The cross-encoder has no representation of cross-document conjunction; its local relevance score is rational, the failure is architectural.
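The cutoff effect is easy to reproduce with a toy pool. In the sketch below only the score −4.74 for L33794-b04 comes from the run described above; the eleven other ids and their scores are placeholders chosen so that the pump block lands at rank 7 of 12.

```python
def rerank(candidates, scores):
    """Order candidate block ids by descending cross-encoder score."""
    return sorted(candidates, key=lambda block_id: scores[block_id], reverse=True)


# Placeholder pool: six blocks scoring 5.0 .. 0.0, the pump block, five blocks at -6 .. -10.
scores = {f"other-{i:02d}": 5.0 - i for i in range(6)}
scores["L33794-b04"] = -4.74
scores.update({f"other-{i:02d}": float(-i) for i in range(6, 11)})

ranked = rerank(list(scores), scores)
top5 = ranked[:5]

print(ranked.index("L33794-b04") + 1, "of", len(ranked))  # 7 of 12
print("L33794-b04" in top5)                               # False
```

Each block is scored against the query in isolation, so nothing in the ranking step knows that the pump block is one half of a conjunction.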
Logprob confidence margin provides no discriminative signal. The scorer initially had an alias-collision bug: multiple surface forms of the same token (' NEEDS', ' needs', ' Needs') were written to the same dict key, and last-write-wins produced wildly wrong probability estimates. With that fixed, the result is unambiguous: Gemini 2.5 Flash outputs margin ≈1.0 for essentially all decisions, whether correct, wrong, certain, or genuinely ambiguous. The signal cannot separate cases that should be uncertain from cases that should be confident.
The needs_more_info decision class itself remains a useful signal: routing on the label triggers an RFI workflow regardless of margin. High-confidence wrong decisions (C8 type) require retrieval-side intervention, not a better confidence scorer.
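The alias-collision class of bug, and the margin computation itself, can be illustrated as follows. This is a sketch under assumptions, not the repository's scorer: the mapping from first sub-token to label, the function name, and the `sum_aliases` switch are hypothetical, and the margin is taken here as top-1 minus top-2 probability after renormalising over the three labels.

```python
import math
from collections import defaultdict

# Assumed first sub-token of each decision label after strip + lowercase.
FIRST_SUBTOKEN = {"approve": "approve", "deny": "deny", "needs": "needs_more_info"}


def decision_margin(top_logprobs, sum_aliases=True):
    """Return (label -> probability, top-1 minus top-2 margin).

    top_logprobs: (token, logprob) pairs at the decision position.
    sum_aliases:  True adds the mass of every surface form of a label;
                  False overwrites it, reproducing last-write-wins.
    """
    mass = defaultdict(float)
    for token, logprob in top_logprobs:
        label = FIRST_SUBTOKEN.get(token.strip().lower())
        if label is None:
            continue  # token is not the start of any decision label
        p = math.exp(logprob)
        mass[label] = mass[label] + p if sum_aliases else p
    total = sum(mass.values())
    if total == 0.0:
        return {}, 0.0
    probs = {label: mass.get(label, 0.0) / total for label in FIRST_SUBTOKEN.values()}
    top1, top2 = sorted(probs.values(), reverse=True)[:2]
    return probs, top1 - top2


# Made-up logprobs: two surface forms of "needs" plus one of "deny".
top = [(" needs", math.log(0.5)), (" NEEDS", math.log(0.3)), (" deny", math.log(0.2))]
print(round(decision_margin(top)[1], 3))                     # 0.6
print(round(decision_margin(top, sum_aliases=False)[1], 3))  # 0.2
```

In this toy input the overwrite variant discards the 0.5 mass on ' needs' and reports a margin of 0.2 instead of 0.6, which is the kind of distortion the fix removed.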
The ReAct agent's intermediate reasoning step is not a neutral wrapper. A hand-written 2-hop agent (Stage 8) introduced a sufficiency-check LLM call between retrieval and the final decision. That call changed decisions on the same context compared to the standard prompt: three regressions (C1, C5, C13), all in cases that never even triggered a second hop. The framing "can you decide or do you need more?" made genuinely ambiguous cases overconfident (needs_more_info → deny) and a simple approval case uncertain. Any multi-hop architecture must treat intermediate reasoning as a potential decision-altering intervention.
C8's first genuine fix points to multi-query retrieval as the simpler solution. The ReAct agent's NEED query used procedural vocabulary ("Standard Written Order documentation requirements") that the clinical query could not produce — and that query surfaced A55426-b02 at cosine score=0.4447 where the clinical query had failed across all prior stages. The failure was query generation, not retrieval capability. Multi-query retrieval — generate diverse query variants upfront (clinical + procedural), retrieve in parallel, take the union, make one decision — would get exactly this vocabulary-bridging benefit without the intermediate reasoning step that caused the regressions. A full ReAct loop is warranted when the follow-up query is unknowable before seeing hop-1 results; for procedural prerequisite documents like A55426, a targeted query is predictable from the document structure.
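The multi-query variant proposed here was not built in this project; a minimal sketch of the union step, with the retriever passed in as a callable, might look like this. The function name, the canned toy retriever, and all scores except the 0.4447 for A55426-b02 are assumptions, and `L33822-b01` is a placeholder id.

```python
from typing import Callable, Iterable


def multi_query_retrieve(
    queries: Iterable[str],
    retrieve: Callable[[str, int], list[tuple[str, float]]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Run every query variant and union the hits, keeping each block's best score."""
    best: dict[str, float] = {}
    for query in queries:
        for block_id, score in retrieve(query, k):
            if score > best.get(block_id, float("-inf")):
                best[block_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy retriever with canned results, standing in for the FAISS-backed one.
CANNED = {
    "clinical": [("L33822-b01", 0.61), ("A52464-b02", 0.55)],
    "procedural": [("A52464-b02", 0.50), ("A55426-b02", 0.4447)],
}


def toy_retrieve(query: str, k: int) -> list[tuple[str, float]]:
    return CANNED[query][:k]


merged = multi_query_retrieve(["clinical", "procedural"], toy_retrieve)
print([block_id for block_id, _ in merged])
# ['L33822-b01', 'A52464-b02', 'A55426-b02']
```

The union feeds a single decision call, so no intermediate sufficiency prompt sits between retrieval and the decision.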
| Config | Correct / 14 | Notes |
|---|---|---|
| Naive w200 / K=8 | 14 / 14 | Tuned canonical; K is the dominant lever |
| Structure-aware K_sim=5 +both | 14 / 14 | C8 pass is LLM noise; A55426 never retrieved |
| Naive reranked (K=20 → top-5) | 11 / 14 | Stricter window drops some correct cases |
| Struct reranked (K_sim=10 → top-5) | 12 / 14 | C6 regresses; C8 now needs_more_info (more honest) |
| ReAct agent (K_sim=5 +both, max 2 hops) | 11 / 14 | C8 first genuine fix (A55426 actually retrieved); C1/C5/C13 regress from sufficiency-check prompt framing |
Accuracy is not the primary metric here. The meaningful comparison is the composition of retrieved blocks and the failure mode each config produces — see FINDINGS.md for the case-by-case breakdown.
| Stage | What | Key learning |
|---|---|---|
| 0 | Corpus ingestion + 14 eval cases | Corpus design: distractors, cross-doc, procedural prerequisites |
| 1 | Structural parse → 57 blocks with auto-metadata | block_type heuristic (criteria/definition/exclusion/general) from regex; no hand-authored tree |
| 2 | Naive baseline grid: w100/w200 × K3/5/8 | K is the dominant lever; w200/K8 = 14/14 |
| 3 | Eval harness (doc-level retrieval recall) | Document-level recall measurable; block-level GT deferred |
| 4 | Structure-aware 8-config ablation | Parent-doc retrieval works (C6); completion rules net neutral |
| 5 | Decision accuracy, both retrievers | Both 14/14; C8 instability exposed by cross-config comparison |
| 6 | Logprob confidence scoring | Alias-collision bug found; corrected result: uniform overconfidence |
| 7 | Cross-encoder reranking | Noise reduction confirmed; C6 regresses; C8 more honest |
| 8 | Hand-written ReAct agent (max 2 hops) | C8 first genuine fix: LLM-generated procedural query surfaces A55426; prompt-framing regressions reveal intermediate reasoning step is not neutral |
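The document-level recall used in Stage 3 can be computed per case as the fraction of required documents that appear among the retrieved blocks' documents. The sketch below is an assumption about the metric's shape, not the harness in eval/metrics/; in particular, returning 1.0 when a case requires no documents is a convention chosen here.

```python
def doc_recall(retrieved_doc_ids, required_doc_ids):
    """Fraction of required documents with at least one retrieved block."""
    required = set(required_doc_ids)
    if not required:
        return 1.0  # assumed convention for cases with no required documents
    return len(required & set(retrieved_doc_ids)) / len(required)


# C6 needs both L33794 and L33822; retrieving only one of them scores 0.5.
print(doc_recall(["L33822", "A52464", "L33822"], ["L33794", "L33822"]))  # 0.5
print(doc_recall(["L33794", "L33822"], ["L33794", "L33822"]))            # 1.0
```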
Full research observations: FINDINGS.md
Design rationale and build plan: PROJECT_PLAN_v2.md
Requirements: Python 3.12, conda, Google Cloud SDK with ADC credentials.
conda create -n sarag python=3.12
conda activate sarag
pip install -r requirements.txt
Vertex AI authentication:
gcloud auth application-default login
# Project: structureawarerag Region: us-central1 Model: gemini-2.5-flash
The processed corpus artifacts (data/processed/) are already included so you can inspect the parsed block structure without re-running the ingestion step. To regenerate:
python parse_corpus.py # → data/processed/blocks.json (57 blocks)
Each script accepts --stub to run the full pipeline with a local stub LLM (no API calls, for verifying the plumbing):
# Stage 2: naive baseline
python run_naive_baseline.py --stub # pipeline check
python run_naive_baseline.py # real Vertex run
# Stage 2: chunk-size × K grid sweep
python run_naive_grid.py
# Stage 4+5: structure-aware ablation (8 configs × 14 cases)
python run_structure_ablation.py --stub
python run_structure_ablation.py
# Stage 6: confidence scoring from logprobs
python run_confidence_analysis.py --stub
python run_confidence_analysis.py
# Stage 7: cross-encoder reranking
python run_rerank_eval.py --stub
python run_rerank_eval.py
Raw LLM outputs are persisted to results/ (gitignored) for reproducibility. Each JSON file includes the full retrieved context, raw model response, and logprob arrays.
.
├── data/
│ ├── raw/ # 6 CMS corpus files (source of truth)
│ └── processed/ # parsed blocks + chunk sets (generated)
│ ├── blocks.json # 57 structural blocks with auto-metadata
│ ├── chunks_w200.json # naive chunks, 200-word windows
│ └── chunks_w100.json # naive chunks, 100-word windows
├── eval/
│ ├── cases/eval_cases.json # 14 eval cases with GT labels
│ └── metrics/ # retrieval recall + decision accuracy harnesses
├── src/
│ ├── ingest/ # corpus loading + structural parser
│ ├── chunking/ # naive fixed-size chunker
│ ├── retrieval/ # NaiveRetriever, StructureAwareRetriever, CrossEncoder reranker
│ ├── reasoning/ # LLM decision + rationale; prompts/
│ ├── confidence/ # logprob-based confidence scorer
│ └── llm/ # VertexAIClient + StubClient
├── run_naive_baseline.py # Stage 2 runner
├── run_naive_grid.py # Stage 2 grid sweep
├── run_structure_ablation.py # Stage 4+5 runner
├── run_confidence_analysis.py # Stage 6 runner
├── run_rerank_eval.py # Stage 7 runner
├── parse_corpus.py # Stage 1 ingestion
├── FINDINGS.md # research observations (all stages)
└── PROJECT_PLAN_v2.md # design rationale and build plan
- Embeddings: all-MiniLM-L6-v2 (sentence-transformers)
- Vector index: FAISS IndexFlatIP, exact inner-product search over L2-normalised vectors; equivalent to cosine similarity
- Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
- LLM: Gemini 2.5 Flash via Vertex AI (ADC, us-central1)
- Confidence: token-level logprobs (top-5 per position) → 3-way softmax over decision tokens
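The equivalence between inner product and cosine similarity on L2-normalised vectors can be checked in a few lines of plain Python. This is a standalone illustration with made-up vectors; it does not call FAISS, and the helper names are chosen here.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


query, doc = [3.0, 4.0], [1.0, 2.0]
ip_on_unit_vectors = inner_product(l2_normalise(query), l2_normalise(doc))
print(math.isclose(ip_on_unit_vectors, cosine(query, doc)))  # True
```

An inner-product index therefore returns the same ranking as cosine similarity provided both the stored vectors and the query vector are normalised before search.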
n=14 eval cases is a mechanism check, not a statistical claim. All numerical findings should be read qualitatively. The primary deliverable is a clear, honest account of what each retrieval mechanism does and does not recover.