JingeW/Structure_Aware_RAG

Structure-Aware RAG for Policy Decision Support

An end-to-end experiment comparing naive RAG against structure-aware RAG on a multi-document CMS Medicare policy corpus. Given a free-text clinical case narrative, the system retrieves the relevant policy blocks and produces an advisory decision (approve / deny / needs_more_info) with a rationale citing the retrieved sources.

The central question: on a corpus where logically-required evidence is semantically distant from the query — buried in exclusion clauses, cross-document definitions, or procedural prerequisites — does structural retrieval outperform naive fixed-size chunking?


Corpus

Six real, public CMS Medicare coverage documents (data/raw/):

| Document | Title | Role |
| --- | --- | --- |
| L33822 | Glucose Monitors LCD | Primary coverage criteria (CGM/BGM, exclusions, appendix definitions) |
| L33794 | External Infusion Pumps LCD | Insulin-pump criteria; cross-doc AND with L33822 for integrated pump+CGM |
| A52464 | Glucose Monitor Policy Article | Payment/coding companion to L33822 (modifiers, pathways, smartphone exclusion) |
| A55426 | Standard Documentation Requirements | Shared procedural prerequisites (SWO, WOPD, face-to-face) referenced by all LCDs |
| L33370 | Nebulizers LCD | Distance distractor (respiratory; no real cross-reference to glucose) |
| L33797 | Oxygen and Oxygen Equipment LCD | Distance distractor (respiratory; no real cross-reference to glucose) |

The corpus parses into 57 structural blocks across the 6 documents. Eval set: 14 cases spanning single-doc, cross-doc AND, exclusion-flips, definitions, procedural prerequisites, and distractor cases (eval/cases/eval_cases.json).


Architecture

Case narrative
     │
     ▼
┌────────────────────────────────────────────────────────┐
│  Retrieval layer (two strategies, same embeddings)     │
│                                                        │
│  Naive:    fixed-size word windows (w200)              │
│            └─ all-MiniLM-L6-v2 + FAISS IndexFlatIP     │
│               top-K cosine similarity                  │
│                                                        │
│  Structure-aware:                                      │
│    1. Structural blocks (57, with auto-metadata)       │
│       indexed as 60 units (L33794-b04 split into       │
│       4 sub-criteria children for finer matching)      │
│    2. FAISS cosine top-K_sim similarity                │
│    3. Parent-document promotion                        │
│       (child hit → return full parent block)           │
│    4. Completion rules (optional, ablated)             │
│       (exclusion / definition blocks for matched docs) │
│    5. Cross-encoder reranking (Stage 7)                │
└────────────────────────────────────────────────────────┘
     │
     ▼  retrieved blocks (with doc_id, retrieval_source)
┌────────────────────────────────────────────────────────┐
│  Reasoning layer                                       │
│  Single Gemini 2.5 Flash call:                         │
│  context = retrieved blocks + case narrative           │
│  output  = decision label + rationale citing blocks    │
│  confidence = logprob margin over {approve, deny,      │
│               needs_more_info} at decision sub-token   │
└────────────────────────────────────────────────────────┘

Both retrieval systems use the same embedding model and corpus. Only the chunking strategy and the retrieval logic differ. Final block counts are reported so structure-aware cannot win by bulk-returning the corpus.
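
As a rough illustration of the naive side, the fixed-size word windows (w200, w100) can be sketched as below. This is a minimal sketch and not the repo's chunker: `word_windows` is a hypothetical name, and non-overlapping windows are an assumption, since the window overlap is not stated here.

```python
def word_windows(text: str, size: int = 200) -> list[str]:
    """Split text into consecutive windows of `size` words.

    Assumption: windows do not overlap; the last window may be shorter.
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Toy document of 450 "words" to show how the tail window is handled.
doc = " ".join(f"w{i}" for i in range(450))
chunks = word_windows(doc, size=200)
print([len(c.split()) for c in chunks])  # [200, 200, 50]
```

A window boundary can fall in the middle of a criteria list or an exclusion clause, which is the behaviour the structural blocks are meant to avoid.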


Key Findings

What works

Parent-document retrieval is the only mechanism that recovers logically-necessary but semantically-distant blocks.

C6 (Case 6; see the cases in eval/cases/) requires a cross-document AND: L33794 pump criteria AND L33822 CGM criteria must both be satisfied. The pump block (L33794-b04) is 688 words, 3.5× the corpus median. After splitting it into four sub-criteria children for indexing, child D (C-peptide criterion) matches by similarity. Parent-document promotion then returns the full parent block, giving the LLM the complete AND structure. This works stably across all 8 ablation configs.
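
The promotion step can be sketched as a mapping from ranked child hits back to their parent blocks. This is a hedged sketch, not the code in src/retrieval/: `promote_to_parents`, the `parent_of` mapping, the child id format (`L33794-b04-D`), and the hit order are all illustrative assumptions.

```python
def promote_to_parents(hits: list[str], parent_of: dict[str, str]) -> list[str]:
    """Replace child hits with their parent block id.

    Keeps first-seen rank order and drops duplicates, so several children of
    the same parent yield a single returned block.
    """
    seen: set[str] = set()
    promoted: list[str] = []
    for unit_id in hits:
        block_id = parent_of.get(unit_id, unit_id)  # unsplit blocks map to themselves
        if block_id not in seen:
            seen.add(block_id)
            promoted.append(block_id)
    return promoted


# Hypothetical ids for the four sub-criteria children of L33794-b04.
parent_of = {f"L33794-b04-{c}": "L33794-b04" for c in "ABCD"}
hits = ["L33822-b06", "L33794-b04-D", "L33794-b04-A", "A52464-b02"]
promoted = promote_to_parents(hits, parent_of)
print(promoted)  # ['L33822-b06', 'L33794-b04', 'A52464-b02']
```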

Reranking correctly demotes noise blocks. The cross-encoder discriminates L33822-b06 (supply exclusion clause) as relevant for C4/C11 (co-billing cases) and irrelevant for C10 (respiratory distractor) and C1 (straightforward approve), where the bi-encoder assigned nearly identical similarity scores to all four.

What doesn't work

Completion rules add noise, not signal. Adding every exclusion/definition block from any matched document was intended to recover C4 (exclusion-flips) and C2 (definition-lookup) cases. In practice, at K_sim=5 those blocks are already returned by similarity. Completion adds them to distractor cases too, occasionally flipping previously-correct decisions. Net correction: zero.
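
For reference, the completion rule described above amounts to something like the following. It is a sketch under assumptions: `apply_completion` and the `block_id` field are hypothetical, while `doc_id`, `block_type`, and `retrieval_source` follow the names used elsewhere in this README. The toy corpus is invented.

```python
def apply_completion(retrieved, all_blocks, types=("exclusion", "definition")):
    """Append every block of the given types from any already-matched document."""
    matched_docs = {b["doc_id"] for b in retrieved}
    have = {b["block_id"] for b in retrieved}
    extra = [
        {**b, "retrieval_source": "completion"}
        for b in all_blocks
        if b["doc_id"] in matched_docs
        and b["block_type"] in types
        and b["block_id"] not in have
    ]
    return retrieved + extra


# Invented three-block corpus: only D1 was matched by similarity.
corpus = [
    {"block_id": "D1-b01", "doc_id": "D1", "block_type": "criteria"},
    {"block_id": "D1-b02", "doc_id": "D1", "block_type": "exclusion"},
    {"block_id": "D2-b01", "doc_id": "D2", "block_type": "definition"},
]
retrieved = [{**corpus[0], "retrieval_source": "similarity"}]
completed = apply_completion(retrieved, corpus)
print([b["block_id"] for b in completed])  # ['D1-b01', 'D1-b02']
```

The rule fires for any matched document, independent of the case, which is consistent with the observation that it also adds blocks to distractor cases.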

A55426 (procedural prerequisites) is never retrieved in any config. The SWO requirement referenced by all LCDs lives in A55426, a procedurally-written document with no clinical vocabulary. Its language does not surface-match glucose monitoring queries. At K_sim=10 it enters the pool (sim≈0.54) but the cross-encoder immediately demotes it (rerank≈−8 to −9). C8's deny requires A55426; it is a real and unclosed coverage gap. Neither completion rules nor reranking addresses it.

C8's correct deny under K5+both is LLM temperature noise, not structural recovery. A52464-b02 contains a pointer to A55426 ("must meet the documentation requirements in Article A55426..."). The LLM infers the deny from this pointer in one config but outputs needs_more_info under configs that retrieve the identical block set. A55426's content never enters the context window in any config.

Reranking can undo what structural retrieval preserved. C6's pump block (L33794-b04) is correctly returned by parent-document promotion in Stage 4, then demoted by the cross-encoder in Stage 7 (rerank=−4.74, rank 7 of 12, below the top-5 cutoff). The cross-encoder has no representation of cross-document conjunction. Its local relevance score is rational, so the failure is architectural rather than a scoring error.

Logprob confidence margin provides no discriminative signal. The scorer originally had an alias-collision bug: multiple surface forms of the same token (' NEEDS', ' needs', ' Needs') collided in a dict, and last-write-wins produced wildly wrong probability estimates. With that fixed, the result is unambiguous: Gemini 2.5 Flash outputs margin ≈1.0 for essentially all decisions, whether correct, wrong, certain, or genuinely ambiguous. The signal cannot separate cases that should be uncertain from cases that should be confident.
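
The alias-collision bug and its fix can be illustrated with a small sketch. Everything here is an assumption for illustration: the function names, the prefix-based alias matching, the definition of margin as top probability minus runner-up, and the logprob values, which are invented and not taken from any run.

```python
import math

LABELS = ("approve", "deny", "needs_more_info")


def canonical(token: str):
    """Map a surface form such as ' NEEDS' or ' needs' to a decision label, if any."""
    t = token.strip().lower()
    for label in LABELS:
        if t and label.startswith(t):
            return label
    return None


def label_probs(top_logprobs):
    """Sum probability mass over all surface forms of a label, then renormalise."""
    mass = {label: 0.0 for label in LABELS}
    for token, logprob in top_logprobs:
        label = canonical(token)
        if label is not None:
            mass[label] += math.exp(logprob)  # accumulate instead of overwrite
    total = sum(mass.values())
    return {label: m / total for label, m in mass.items()}


def margin(probs):
    top, second = sorted(probs.values(), reverse=True)[:2]
    return top - second


# Invented top-k entries at the decision position.
top = [(" needs", math.log(0.5)), (" NEEDS", math.log(0.2)),
       (" deny", math.log(0.2)), (" approve", math.log(0.1))]

# Buggy variant: aliases share one dict key, so the last write wins.
buggy = {}
for token, logprob in top:
    buggy[canonical(token)] = math.exp(logprob)

probs = label_probs(top)
print(round(buggy["needs_more_info"], 3))  # 0.2 (mass from ' needs' was lost)
print(round(probs["needs_more_info"], 3))  # 0.7
print(round(margin(probs), 3))             # 0.5
```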

The needs_more_info decision class itself remains a useful signal: routing on the label triggers an RFI workflow regardless of margin. High-confidence wrong decisions (C8 type) require retrieval-side intervention, not a better confidence scorer.

The ReAct agent's intermediate reasoning step is not a neutral wrapper. A hand-written 2-hop agent (Stage 8) introduced a sufficiency-check LLM call between retrieval and the final decision. That call changed decisions on the same context compared to the standard prompt: three regressions (C1, C5, C13), all in cases that never even triggered a second hop. The framing "can you decide or do you need more?" made genuinely ambiguous cases overconfident (needs_more_info → deny) and a simple approval case uncertain. Any multi-hop architecture must treat intermediate reasoning as a potential decision-altering intervention.

C8's first genuine fix points to multi-query retrieval as the simpler solution. The ReAct agent's NEED query used procedural vocabulary ("Standard Written Order documentation requirements") that the clinical query could not produce — and that query surfaced A55426-b02 at cosine score=0.4447 where the clinical query had failed across all prior stages. The failure was query generation, not retrieval capability. Multi-query retrieval — generate diverse query variants upfront (clinical + procedural), retrieve in parallel, take the union, make one decision — would get exactly this vocabulary-bridging benefit without the intermediate reasoning step that caused the regressions. A full ReAct loop is warranted when the follow-up query is unknowable before seeing hop-1 results; for procedural prerequisite documents like A55426, a targeted query is predictable from the document structure.
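
The multi-query idea above can be sketched as follows. This is not implemented in the repo as far as this README states; `multi_query_retrieve`, the `retrieve(query, k)` callable, and the toy index are hypothetical, and the scores are invented placeholders.

```python
def multi_query_retrieve(queries, retrieve, k=5):
    """Run every query variant, take the union of hits, keep each block's best score."""
    best = {}
    for query in queries:
        for block_id, score in retrieve(query, k):
            if score > best.get(block_id, float("-inf")):
                best[block_id] = score
    # Highest-scoring blocks first; the caller makes one decision over this union.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Toy stand-in for an embedding search, keyed by query variant.
toy_index = {
    "clinical": [("L33822-b06", 0.61), ("A52464-b02", 0.58)],
    "procedural": [("A52464-b02", 0.50), ("A55426-b02", 0.44)],
}


def toy_retrieve(query, k):
    return toy_index[query][:k]


merged = multi_query_retrieve(["clinical", "procedural"], toy_retrieve)
print([block_id for block_id, _ in merged])
# ['L33822-b06', 'A52464-b02', 'A55426-b02']
```

Keeping the maximum score per block is one possible merge rule; because the union is formed before any LLM call, no intermediate reasoning step sits between retrieval and the decision.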


Results Summary

| Config | Correct / 14 | Notes |
| --- | --- | --- |
| Naive w200 / K=8 | 14 / 14 | Tuned canonical; K is the dominant lever |
| Structure-aware K_sim=5 +both | 14 / 14 | C8 pass is LLM noise; A55426 never retrieved |
| Naive reranked (K=20 → top-5) | 11 / 14 | Stricter window drops some correct cases |
| Struct reranked (K_sim=10 → top-5) | 12 / 14 | C6 regresses; C8 now needs_more_info (more honest) |
| ReAct agent (K_sim=5 +both, max 2 hops) | 11 / 14 | C8 first genuine fix (A55426 actually retrieved); C1/C5/C13 regress from sufficiency-check prompt framing |

Accuracy is not the primary metric here. The meaningful comparison is the composition of retrieved blocks and the failure mode each config produces — see FINDINGS.md for the case-by-case breakdown.


Stage-by-Stage Summary

| Stage | What | Key learning |
| --- | --- | --- |
| 0 | Corpus ingestion + 14 eval cases | Corpus design: distractors, cross-doc, procedural prerequisites |
| 1 | Structural parse → 57 blocks with auto-metadata | block_type heuristic (criteria/definition/exclusion/general) from regex; no hand-authored tree |
| 2 | Naive baseline grid: w100/w200 × K3/5/8 | K is the dominant lever; w200/K8 = 14/14 |
| 3 | Eval harness (doc-level retrieval recall) | Document-level recall measurable; block-level GT deferred |
| 4 | Structure-aware 8-config ablation | Parent-doc retrieval works (C6); completion rules net neutral |
| 5 | Decision accuracy, both retrievers | Both 14/14; C8 instability exposed by cross-config comparison |
| 6 | Logprob confidence scoring | Alias-collision bug found; corrected result: uniform overconfidence |
| 7 | Cross-encoder reranking | Noise reduction confirmed; C6 regresses; C8 more honest |
| 8 | Hand-written ReAct agent (max 2 hops) | C8 first genuine fix: LLM-generated procedural query surfaces A55426; prompt-framing regressions reveal intermediate reasoning step is not neutral |
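
The document-level recall measured in Stage 3 can be sketched as the fraction of required documents that appear among the retrieved blocks' `doc_id` values. This is a minimal sketch: `doc_recall` is a hypothetical name, the ground-truth format is an assumption, and the harness in eval/metrics/ may compute it differently.

```python
def doc_recall(retrieved_doc_ids, required_doc_ids) -> float:
    """Fraction of required documents with at least one retrieved block."""
    required = set(required_doc_ids)
    if not required:
        return 1.0  # assumption: nothing required counts as fully recalled
    return len(required & set(retrieved_doc_ids)) / len(required)


# C6 needs both L33794 and L33822 (the cross-document AND).
required_c6 = ["L33794", "L33822"]
print(doc_recall(["L33822", "A52464"], required_c6))            # 0.5
print(doc_recall(["L33822", "L33794", "A52464"], required_c6))  # 1.0
```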

Full research observations: FINDINGS.md
Design rationale and build plan: PROJECT_PLAN_v2.md


Setup

Requirements: Python 3.12, conda, Google Cloud SDK with ADC credentials.

conda create -n sarag python=3.12
conda activate sarag
pip install -r requirements.txt

Vertex AI authentication:

gcloud auth application-default login
# Project: structureawarerag  Region: us-central1  Model: gemini-2.5-flash

The processed corpus artifacts (data/processed/) are already included so you can inspect the parsed block structure without re-running the ingestion step. To regenerate:

python parse_corpus.py          # → data/processed/blocks.json (57 blocks)

Running the Experiments

Each script accepts --stub to run the full pipeline with a local stub LLM (no API calls, for verifying the plumbing):

# Stage 2: naive baseline
python run_naive_baseline.py --stub        # pipeline check
python run_naive_baseline.py               # real Vertex run

# Stage 2: chunk-size × K grid sweep
python run_naive_grid.py

# Stage 4+5: structure-aware ablation (8 configs × 14 cases)
python run_structure_ablation.py --stub
python run_structure_ablation.py

# Stage 6: confidence scoring from logprobs
python run_confidence_analysis.py --stub
python run_confidence_analysis.py

# Stage 7: cross-encoder reranking
python run_rerank_eval.py --stub
python run_rerank_eval.py

Raw LLM outputs are persisted to results/ (gitignored) for reproducibility. Each JSON file includes the full retrieved context, raw model response, and logprob arrays.


Repo Layout

.
├── data/
│   ├── raw/                    # 6 CMS corpus files (source of truth)
│   └── processed/              # parsed blocks + chunk sets (generated)
│       ├── blocks.json         # 57 structural blocks with auto-metadata
│       ├── chunks_w200.json    # naive chunks, 200-word windows
│       └── chunks_w100.json    # naive chunks, 100-word windows
├── eval/
│   ├── cases/eval_cases.json   # 14 eval cases with GT labels
│   └── metrics/                # retrieval recall + decision accuracy harnesses
├── src/
│   ├── ingest/                 # corpus loading + structural parser
│   ├── chunking/               # naive fixed-size chunker
│   ├── retrieval/              # NaiveRetriever, StructureAwareRetriever, CrossEncoder reranker
│   ├── reasoning/              # LLM decision + rationale; prompts/
│   ├── confidence/             # logprob-based confidence scorer
│   └── llm/                    # VertexAIClient + StubClient
├── run_naive_baseline.py       # Stage 2 runner
├── run_naive_grid.py           # Stage 2 grid sweep
├── run_structure_ablation.py   # Stage 4+5 runner
├── run_confidence_analysis.py  # Stage 6 runner
├── run_rerank_eval.py          # Stage 7 runner
├── parse_corpus.py             # Stage 1 ingestion
├── FINDINGS.md                 # research observations (all stages)
└── PROJECT_PLAN_v2.md          # design rationale and build plan

Technical Stack

  • Embeddings: all-MiniLM-L6-v2 (sentence-transformers)
  • Vector index: FAISS IndexFlatIP — exact inner-product search over L2-normalised vectors; equivalent to cosine similarity
  • Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2
  • LLM: Gemini 2.5 Flash via Vertex AI (ADC, us-central1)
  • Confidence: token-level logprobs (top-5 per position) → 3-way softmax over decision tokens
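
The claim that inner-product search over L2-normalised vectors is equivalent to cosine similarity can be checked with a few lines of plain Python. This sketch deliberately avoids FAISS and sentence-transformers so it runs anywhere; the vectors are arbitrary toy values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
inner_of_normalized = dot(l2_normalize(a), l2_normalize(b))
# Both expressions compute the same quantity up to floating-point error.
print(abs(inner_of_normalized - cosine(a, b)) < 1e-12)  # True
```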

n=14 eval cases is a mechanism check, not a statistical claim. All numerical findings should be read qualitatively. The primary deliverable is a clear, honest account of what each retrieval mechanism does and does not recover.
