Structure-Aware RAG for Policy Decision Support

An end-to-end experiment comparing naive RAG against structure-aware RAG on a multi-document CMS Medicare policy corpus. Given a free-text clinical case narrative, the system retrieves the relevant policy blocks and produces an advisory decision (approve / deny / needs_more_info) with a rationale citing the retrieved sources.

The central question: on a corpus where logically-required evidence is semantically distant from the query — buried in exclusion clauses, cross-document definitions, or procedural prerequisites — does structural retrieval outperform naive fixed-size chunking?

Corpus

Six real, public CMS Medicare coverage documents (data/raw/):

Document	Role
L33822 — Glucose Monitors LCD	Primary coverage criteria (CGM/BGM, exclusions, appendix definitions)
L33794 — External Infusion Pumps LCD	Insulin-pump criteria; cross-doc AND with L33822 for integrated pump+CGM
A52464 — Glucose Monitor Policy Article	Payment/coding companion to L33822 (modifiers, pathways, smartphone exclusion)
A55426 — Standard Documentation Requirements	Shared procedural prerequisites (SWO, WOPD, face-to-face) referenced by all LCDs
L33370 — Nebulizers LCD	Distance distractor (respiratory; no real cross-reference to glucose)
L33797 — Oxygen and Oxygen Equipment LCD	Distance distractor (respiratory; no real cross-reference to glucose)

The corpus parses into 57 structural blocks across the 6 documents. Eval set: 14 cases spanning single-doc, cross-doc AND, exclusion-flips, definitions, procedural prerequisites, and distractor cases (eval/cases/eval_cases.json).

Architecture

Case narrative
     │
     ▼
┌────────────────────────────────────────────────────────┐
│  Retrieval layer (two strategies, same embeddings)     │
│                                                        │
│  Naive:    fixed-size word windows (w200)              │
│            └─ all-MiniLM-L6-v2 + FAISS IndexFlatIP     │
│               top-K cosine similarity                  │
│                                                        │
│  Structure-aware:                                      │
│    1. Structural blocks (57, with auto-metadata)       │
│       indexed as 60 units (L33794-b04 split into       │
│       4 sub-criteria children for finer matching)      │
│    2. FAISS cosine top-K_sim similarity                │
│    3. Parent-document promotion                        │
│       (child hit → return full parent block)           │
│    4. Completion rules (optional, ablated)             │
│       (exclusion / definition blocks for matched docs) │
│    5. Cross-encoder reranking (Stage 7)                │
└────────────────────────────────────────────────────────┘
     │
     ▼  retrieved blocks (with doc_id, retrieval_source)
┌────────────────────────────────────────────────────────┐
│  Reasoning layer                                       │
│  Single Gemini 2.5 Flash call:                         │
│  context = retrieved blocks + case narrative           │
│  output  = decision label + rationale citing blocks    │
│  confidence = logprob margin over {approve, deny,      │
│               needs_more_info} at decision sub-token   │
└────────────────────────────────────────────────────────┘

Both retrieval systems use the same embedding model and corpus. The only things that vary are the chunking strategy and the retrieval logic. Final block counts are reported so that the structure-aware system cannot win by bulk-returning the corpus.
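
For orientation, the naive side's fixed-size word windows can be sketched roughly as follows. The function name `word_windows` and the non-overlapping stride are assumptions made for illustration; the actual chunker in src/chunking/ may differ (for example, by using overlap).

```python
def word_windows(text: str, size: int = 200) -> list:
    """Split text into consecutive, non-overlapping windows of `size` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# A 450-word toy document yields two full windows and one 50-word remainder.
sample = " ".join(f"w{i}" for i in range(450))
chunks = word_windows(sample, size=200)
print([len(c.split()) for c in chunks])  # [200, 200, 50]
```

Windows cut at word counts ignore section boundaries, which is why a clause and the criteria it modifies can land in different chunks.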

Key Findings

What works

Parent-document retrieval is the only mechanism that recovers logically-necessary but semantically-distant blocks.

C6 (Case 6; see eval/cases/eval_cases.json) requires a cross-document AND: the L33794 pump criteria and the L33822 CGM criteria must both be satisfied. The pump block (L33794-b04) is 688 words, 3.5× the corpus median. After it is split into four sub-criteria children for indexing, child D (the C-peptide criterion) matches by similarity. Parent-document promotion then returns the full parent block, giving the LLM the complete AND structure. This holds across all 8 ablation configs.
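
A minimal sketch of the promotion step, under stated assumptions: the `Unit` type, the `promote_to_parents` name, the child ID suffixes (`-A`, `-D`), and the block ID `L33822-b03` are hypothetical and chosen only for illustration. The idea is that a child hit is replaced by its parent block, and duplicate parents are collapsed while rank order is preserved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Unit:
    unit_id: str                 # indexed unit: a whole block or a child of one
    parent_id: Optional[str]     # None when the unit is itself a full block


def promote_to_parents(hits: list) -> list:
    """Map ranked hits to block IDs, replacing each child by its parent.

    Order follows first appearance in the ranked hits; duplicates are dropped.
    """
    seen = set()
    blocks = []
    for hit in hits:
        block_id = hit.parent_id or hit.unit_id
        if block_id not in seen:
            seen.add(block_id)
            blocks.append(block_id)
    return blocks


hits = [
    Unit("L33822-b03", None),             # hypothetical whole-block hit
    Unit("L33794-b04-D", "L33794-b04"),   # child D matched by similarity
    Unit("L33794-b04-A", "L33794-b04"),   # second child of the same parent
]
promoted = promote_to_parents(hits)
print(promoted)  # ['L33822-b03', 'L33794-b04']
```

Matching on the small child and returning the large parent is what lets a single sub-criterion pull the whole AND structure into context.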

Reranking correctly demotes noise blocks. The cross-encoder scores L33822-b06 (supply exclusion clause) as relevant for C4/C11 (co-billing cases) and irrelevant for C10 (respiratory distractor) and C1 (straightforward approve), whereas the bi-encoder assigned nearly identical similarity scores in all four cases.

What doesn't work

Completion rules add noise, not signal. Adding every exclusion/definition block from any matched document was intended to recover C4 (exclusion-flips) and C2 (definition-lookup) cases. In practice, at K_sim=5 those blocks are already returned by similarity. Completion adds them to distractor cases too, occasionally flipping previously-correct decisions. Net correction: zero.
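
The completion rule described above can be sketched as follows. This is an illustration under assumptions, not the repo's code: the function name, the metadata dict shape, the `retrieval_source` values ("similarity", "completion"), and every block ID other than L33822-b06 are hypothetical.

```python
def apply_completion(retrieved_ids, blocks, types=("exclusion", "definition")):
    """Append every exclusion/definition block from any already-matched document.

    Returns (block_id, retrieval_source) pairs so added blocks stay identifiable.
    """
    matched_docs = {blocks[b]["doc_id"] for b in retrieved_ids}
    out = [(b, "similarity") for b in retrieved_ids]
    present = set(retrieved_ids)
    for block_id, meta in blocks.items():
        if (meta["doc_id"] in matched_docs
                and meta["block_type"] in types
                and block_id not in present):
            out.append((block_id, "completion"))
            present.add(block_id)
    return out


# Toy metadata; block_type values follow the parser's four categories.
BLOCKS = {
    "L33822-b02": {"doc_id": "L33822", "block_type": "criteria"},
    "L33822-b06": {"doc_id": "L33822", "block_type": "exclusion"},
    "L33822-b09": {"doc_id": "L33822", "block_type": "definition"},
    "L33370-b01": {"doc_id": "L33370", "block_type": "criteria"},
    "L33370-b05": {"doc_id": "L33370", "block_type": "exclusion"},
}

completed = apply_completion(["L33822-b02"], BLOCKS)
print(completed)
```

Because the rule keys on doc_id alone, any query that touches a document picks up all of its exclusion and definition blocks, which is consistent with the noise seen on distractor cases.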

A55426 (procedural prerequisites) never reaches the final context in any single-query config (Stages 4 to 7). The SWO requirement referenced by all LCDs lives in A55426, a procedurally written document with no clinical vocabulary, so its language does not surface-match glucose monitoring queries. At K_sim=10 it enters the candidate pool (sim≈0.54), but the cross-encoder immediately demotes it (rerank≈−8 to −9). C8's deny requires A55426, so this is a real and unclosed coverage gap. Neither completion rules nor reranking addresses it.

C8's correct deny under K5+both is LLM temperature noise, not structural recovery. A52464-b02 contains a pointer to A55426 ("must meet the documentation requirements in Article A55426..."). The LLM infers the deny from this pointer in one config but outputs needs_more_info under configs that retrieve the identical block set. A55426's content never enters the context window in any config.

Reranking can undo what structural retrieval preserved. C6's pump block (L33794-b04) is correctly returned by parent-document promotion in Stage 4, then demoted by the cross-encoder in Stage 7 (rerank=−4.74, rank 7 of 12, below the top-5 cutoff). The cross-encoder has no representation of cross-document conjunction. Its local relevance score is rational; the failure is architectural.

Logprob confidence margin provides no discriminative signal. The scorer originally had an alias-collision bug: multiple surface forms of the same token (' NEEDS', ' needs', ' Needs') collided in a dict, and last-write-wins produced wildly wrong probability estimates. With that fixed, the result is unambiguous: Gemini 2.5 Flash outputs margin ≈1.0 for essentially all decisions, whether correct, wrong, certain, or genuinely ambiguous. The signal cannot separate cases that should be uncertain from cases that should be confident.
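
The alias-collision bug and its fix can be illustrated with a small sketch. Everything here is an assumption for illustration: the prefix-based `canonical` mapping, the function names, the definition of margin as top-1 minus top-2 after renormalising over the three labels, and the toy logprob values (which are invented, not taken from any run).

```python
import math
from typing import Optional

LABELS = ("approve", "deny", "needs_more_info")


def canonical(token: str) -> Optional[str]:
    """Map a surface form such as ' NEEDS' to its label, or None if none matches."""
    t = token.strip().lower()
    if not t:
        return None
    for label in LABELS:
        if label.startswith(t):
            return label
    return None


def label_probs_buggy(top_logprobs):
    """Last write wins: a later alias overwrites an earlier, larger one."""
    probs = {}
    for token, logprob in top_logprobs:
        label = canonical(token)
        if label is not None:
            probs[label] = math.exp(logprob)
    return probs


def label_probs_fixed(top_logprobs):
    """Sum probability mass over all aliases of the same label."""
    probs = {label: 0.0 for label in LABELS}
    for token, logprob in top_logprobs:
        label = canonical(token)
        if label is not None:
            probs[label] += math.exp(logprob)
    return probs


def margin(probs):
    """Top-1 minus top-2 probability after renormalising over the three labels."""
    total = sum(probs.values())
    if total == 0:
        return 0.0
    ranked = sorted((probs.get(l, 0.0) / total for l in LABELS), reverse=True)
    return ranked[0] - ranked[1]


# Invented top-5 alternatives at the decision position.
top5 = [(" needs", math.log(0.70)), (" deny", math.log(0.20)),
        (" Needs", math.log(0.05)), (" NEEDS", math.log(0.01)),
        (" approve", math.log(0.04))]

buggy = label_probs_buggy(top5)
fixed = label_probs_fixed(top5)
print(max(buggy, key=buggy.get))   # deny
print(max(fixed, key=fixed.get))   # needs_more_info
```

In this toy input the buggy version keeps only the 0.01 mass of the last alias for needs_more_info and so ranks deny first, while the fixed version sums the aliases to about 0.76.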

The needs_more_info decision class itself remains a useful signal: routing on the label triggers an RFI workflow regardless of margin. High-confidence wrong decisions (C8 type) require retrieval-side intervention, not a better confidence scorer.

The ReAct agent's intermediate reasoning step is not a neutral wrapper. A hand-written 2-hop agent (Stage 8) introduced a sufficiency-check LLM call between retrieval and the final decision. That call changed decisions on the same context compared to the standard prompt: three regressions (C1, C5, C13), all in cases that never even triggered a second hop. The framing "can you decide or do you need more?" made genuinely ambiguous cases overconfident (needs_more_info → deny) and a simple approval case uncertain. Any multi-hop architecture must treat intermediate reasoning as a potential decision-altering intervention.

C8's first genuine fix points to multi-query retrieval as the simpler solution. The ReAct agent's NEED query used procedural vocabulary ("Standard Written Order documentation requirements") that the clinical query could not produce — and that query surfaced A55426-b02 at cosine score=0.4447 where the clinical query had failed across all prior stages. The failure was query generation, not retrieval capability. Multi-query retrieval — generate diverse query variants upfront (clinical + procedural), retrieve in parallel, take the union, make one decision — would get exactly this vocabulary-bridging benefit without the intermediate reasoning step that caused the regressions. A full ReAct loop is warranted when the follow-up query is unknowable before seeing hop-1 results; for procedural prerequisite documents like A55426, a targeted query is predictable from the document structure.
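
A minimal sketch of the multi-query idea, assuming a retriever callable of the form `retrieve(query, k) -> [(block_id, score), ...]`. The names `multi_query_retrieve`, `toy_retrieve`, and `TOY_INDEX` are hypothetical, the clinical query string is invented, and all scores are invented except the 0.4447 reported above for A55426-b02. In practice the query variants would be generated upfront rather than hard-coded.

```python
def multi_query_retrieve(queries, retrieve, k=5):
    """Run each query independently and return the union of hits.

    Each block keeps its best score across queries; output is sorted by score.
    """
    best = {}
    for query in queries:
        for block_id, score in retrieve(query, k):
            if score > best.get(block_id, float("-inf")):
                best[block_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Stand-in for a FAISS-backed retriever: a fixed lookup table per query.
TOY_INDEX = {
    "CGM coverage for insulin-treated diabetes": [
        ("L33822-b02", 0.71), ("A52464-b02", 0.63),
    ],
    "Standard Written Order documentation requirements": [
        ("A55426-b02", 0.4447), ("A52464-b02", 0.41),
    ],
}


def toy_retrieve(query, k):
    return TOY_INDEX.get(query, [])[:k]


union = multi_query_retrieve(list(TOY_INDEX), toy_retrieve, k=5)
print([block_id for block_id, _ in union])
# ['L33822-b02', 'A52464-b02', 'A55426-b02']
```

The union is then passed to a single decision call, so there is no intermediate sufficiency prompt to perturb the outcome.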

Results Summary

Config	Correct / 14	Notes
Naive w200 / K=8	14 / 14	Tuned canonical; K is the dominant lever
Structure-aware K_sim=5 +both	14 / 14	C8 pass is LLM noise; A55426 never retrieved
Naive reranked (K=20 → top-5)	11 / 14	Stricter window drops some correct cases
Struct reranked (K_sim=10 → top-5)	12 / 14	C6 regresses; C8 now `needs_more_info` (more honest)
ReAct agent (K_sim=5 +both, max 2 hops)	11 / 14	C8 first genuine fix (A55426 actually retrieved); C1/C5/C13 regress from sufficiency-check prompt framing

Accuracy is not the primary metric here. The meaningful comparison is the composition of retrieved blocks and the failure mode each config produces — see FINDINGS.md for the case-by-case breakdown.

Stage-by-Stage Summary

Stage	What	Key learning
0	Corpus ingestion + 14 eval cases	Corpus design: distractors, cross-doc, procedural prerequisites
1	Structural parse → 57 blocks with auto-metadata	`block_type` heuristic (criteria/definition/exclusion/general) from regex; no hand-authored tree
2	Naive baseline grid: w100/w200 × K3/5/8	K is the dominant lever; w200/K8 = 14/14
3	Eval harness (doc-level retrieval recall)	Document-level recall measurable; block-level GT deferred
4	Structure-aware 8-config ablation	Parent-doc retrieval works (C6); completion rules net neutral
5	Decision accuracy, both retrievers	Both 14/14; C8 instability exposed by cross-config comparison
6	Logprob confidence scoring	Alias-collision bug found; corrected result: uniform overconfidence
7	Cross-encoder reranking	Noise reduction confirmed; C6 regresses; C8 more honest
8	Hand-written ReAct agent (max 2 hops)	C8 first genuine fix: LLM-generated procedural query surfaces A55426; prompt-framing regressions reveal intermediate reasoning step is not neutral
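
Stage 3's document-level recall can be sketched as below. The function name and the convention of returning 1.0 when a case requires no documents are assumptions; the harness in eval/metrics/ may define these differently.

```python
def doc_recall(retrieved_doc_ids, required_doc_ids):
    """Fraction of required documents that appear among the retrieved blocks' docs."""
    required = set(required_doc_ids)
    if not required:
        return 1.0  # assumed convention: nothing required means nothing missed
    return len(required & set(retrieved_doc_ids)) / len(required)


# A C6-style cross-document case: both L33794 and L33822 are required.
print(doc_recall(["L33822", "A52464", "L33822"], ["L33794", "L33822"]))  # 0.5
print(doc_recall(["L33822", "L33794"], ["L33794", "L33822"]))            # 1.0
```

Document-level recall cannot tell whether the right block within a document was retrieved, which is why block-level ground truth is listed as deferred.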

Full research observations: FINDINGS.md
Design rationale and build plan: PROJECT_PLAN_v2.md

Setup

Requirements: Python 3.12, conda, Google Cloud SDK with ADC credentials.

conda create -n sarag python=3.12
conda activate sarag
pip install -r requirements.txt

Vertex AI authentication:

gcloud auth application-default login
# Project: structureawarerag  Region: us-central1  Model: gemini-2.5-flash

The processed corpus artifacts (data/processed/) are already included so you can inspect the parsed block structure without re-running the ingestion step. To regenerate:

python parse_corpus.py          # → data/processed/blocks.json (57 blocks)

Running the Experiments

Each script accepts --stub to run the full pipeline with a local stub LLM (no API calls, for verifying the plumbing):

# Stage 2: naive baseline
python run_naive_baseline.py --stub        # pipeline check
python run_naive_baseline.py               # real Vertex run

# Stage 2: chunk-size × K grid sweep
python run_naive_grid.py

# Stage 4+5: structure-aware ablation (8 configs × 14 cases)
python run_structure_ablation.py --stub
python run_structure_ablation.py

# Stage 6: confidence scoring from logprobs
python run_confidence_analysis.py --stub
python run_confidence_analysis.py

# Stage 7: cross-encoder reranking
python run_rerank_eval.py --stub
python run_rerank_eval.py

Raw LLM outputs are persisted to results/ (gitignored) for reproducibility. Each JSON file includes the full retrieved context, raw model response, and logprob arrays.

Repo Layout

.
├── data/
│   ├── raw/                    # 6 CMS corpus files (source of truth)
│   └── processed/              # parsed blocks + chunk sets (generated)
│       ├── blocks.json         # 57 structural blocks with auto-metadata
│       ├── chunks_w200.json    # naive chunks, 200-word windows
│       └── chunks_w100.json    # naive chunks, 100-word windows
├── eval/
│   ├── cases/eval_cases.json   # 14 eval cases with GT labels
│   └── metrics/                # retrieval recall + decision accuracy harnesses
├── src/
│   ├── ingest/                 # corpus loading + structural parser
│   ├── chunking/               # naive fixed-size chunker
│   ├── retrieval/              # NaiveRetriever, StructureAwareRetriever, CrossEncoder reranker
│   ├── reasoning/              # LLM decision + rationale; prompts/
│   ├── confidence/             # logprob-based confidence scorer
│   └── llm/                    # VertexAIClient + StubClient
├── run_naive_baseline.py       # Stage 2 runner
├── run_naive_grid.py           # Stage 2 grid sweep
├── run_structure_ablation.py   # Stage 4+5 runner
├── run_confidence_analysis.py  # Stage 6 runner
├── run_rerank_eval.py          # Stage 7 runner
├── parse_corpus.py             # Stage 1 ingestion
├── FINDINGS.md                 # research observations (all stages)
└── PROJECT_PLAN_v2.md          # design rationale and build plan

Technical Stack

Embeddings: all-MiniLM-L6-v2 (sentence-transformers)
Vector index: FAISS IndexFlatIP — exact inner-product search over L2-normalised vectors; equivalent to cosine similarity
Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
LLM: Gemini 2.5 Flash via Vertex AI (ADC, us-central1)
Confidence: token-level logprobs (top-5 per position) → 3-way softmax over decision tokens
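
The equivalence between inner product on L2-normalised vectors and cosine similarity can be checked with a few lines of plain Python. This is a standalone illustration with hypothetical helper names; it does not use FAISS or the repo's code.

```python
import math


def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalise(v):
    norm = math.sqrt(inner(v, v))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))


q = [3.0, 4.0]
d = [4.0, 3.0]

# Inner product of the normalised vectors equals cosine of the raw vectors.
ip = inner(l2_normalise(q), l2_normalise(d))
print(round(ip, 2), round(cosine(q, d), 2))  # 0.96 0.96
```

This is why an exact inner-product index yields a cosine ranking as long as vectors are normalised before being added and before being queried.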

n=14 eval cases is a mechanism check, not a statistical claim. All numerical findings should be read qualitatively. The primary deliverable is a clear, honest account of what each retrieval mechanism does and does not recover.
