Commit 28d87a0

unamedkr and claude committed
docs: add "Beyond RAG: Document-Level Context" concept
New section in README and guide page explaining how KV compression enables a complementary approach to traditional chunk-level RAG:

- RAG decides WHICH documents to look at
- Long-context decides HOW DEEPLY to understand them
- Pre-computed .kv library pattern: process once, query forever
- Honest framing: complementary, not competitive

Key insight: chunk-level RAG loses cross-page relationships. Document-level context (enabled by 6.4x KV compression) preserves full document understanding. Together they're stronger than either alone.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0fa8cec commit 28d87a0

File tree

2 files changed: +646 −297 lines changed

README.md

Lines changed: 44 additions & 1 deletion
@@ -135,7 +135,50 @@ Pre-built wheels: Linux x86_64/aarch64, macOS arm64 (Python 3.9–3.13). Others

## Why quant.cpp?

- When AI models have long conversations, they need memory called the **KV cache**. This memory grows with every message and often exceeds the model itself. quant.cpp compresses it 3x — so the same laptop can handle 3x longer conversations.
+ When AI models have long conversations, they need memory called the **KV cache**. This memory grows with every message and often exceeds the model itself. quant.cpp compresses it **6.4x** and prunes unimportant tokens — so the same laptop can handle **6x longer conversations at 59% lower attention cost**.
---

## Beyond RAG: Document-Level Context

Traditional RAG splits documents into small chunks (512 tokens), embeds them, and retrieves fragments. This works for large corpora but has fundamental limitations:

- **Chunking destroys relationships** — information spanning pages 3, 47, and 103 can't be found by any single chunk search
- **Retrieval can fail** — if the question uses different words than the document ("employee retention" vs. "turnover rate")
- **No multi-hop reasoning** — connecting A → B → C across chunks is impossible when each chunk is retrieved independently
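The first limitation can be seen in a toy experiment. This is an illustration only (plain Python, not quant.cpp code, and the document text and chunk size are made up): a fact whose two halves sit on different pages is invisible to any single-chunk search.

```python
# Toy illustration: a fact split across two "pages" of one document
# can never co-occur inside a single fixed-size chunk.
doc = (
    "Page 3: Project Atlas was led by Dr. Ramos. "
    + "filler text. " * 200
    + "Page 47: The Atlas project lead later moved to the Berlin office."
)

# Split into fixed-size chunks (words stand in for the 512-token chunks above).
words = doc.split()
chunk_size = 100
chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Answering "which office did Ramos move to?" needs BOTH pages at once.
needed = {"Ramos", "Berlin"}
hits = [c for c in chunks if needed <= {w.strip(".,:") for w in c.split()}]
print(f"chunks: {len(chunks)}, chunks containing both facts: {len(hits)}")
```

Each fact is retrievable on its own, but no chunk links them — which is exactly the relationship a whole-document context preserves.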
**Long-context KV compression offers a complementary approach:**

```
Chunk-Level RAG:    100K docs → chunk(512) → embed → search → 5 chunks → LLM(4K)
                                   ↑ information loss here

Document-Level RAG: 100K docs → doc-level index → search → 2-3 full docs → LLM(64K-128K)
                                                                ↑ KV compression makes this fit
```

RAG decides **which documents** to look at. Long-context decides **how deeply** to understand them. Each does what it's best at.
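The two-stage split above can be sketched in a few lines. All names here are illustrative, not quant.cpp API, and the keyword-overlap scorer is a stand-in for a real embedding index: stage 1 ranks whole documents; stage 2 hands the top documents, in full, to the long-context model.

```python
# Sketch of the hybrid pattern: cheap doc-level retrieval picks WHICH
# documents, then the long-context model reads them in full.

def doc_score(query: str, doc: str) -> float:
    """Keyword-overlap score; a real system would use embeddings."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def select_documents(query: str, corpus: dict, k: int = 2) -> list:
    """Stage 1 (RAG): rank whole documents, keep the top-k."""
    ranked = sorted(corpus, key=lambda name: doc_score(query, corpus[name]), reverse=True)
    return ranked[:k]

corpus = {  # hypothetical three-document corpus
    "ops_manual": "expense reimbursement process: file form E-7 within 30 days",
    "hr_handbook": "employee turnover rate and retention policies",
    "release_notes": "version 2.1 fixes a crash in the scheduler",
}

query = "what is the expense reimbursement process"
selected = select_documents(query, corpus)

# Stage 2 (long context): concatenate the FULL selected documents for the
# model, instead of 512-token fragments.
context = "\n\n".join(corpus[name] for name in selected)
print(selected)
```

Retrieval stays coarse and cheap; understanding happens over complete documents, so cross-page links survive.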
| | Chunk-RAG alone | Long-Context alone | **RAG + Long-Context** |
|--|----------------|-------------------|----------------------|
| 100K documents | only option | impossible | **RAG selects** |
| Cross-page reasoning | fails | works | **works** |
| Multi-hop Q&A | limited | works | **works** |
| Exact recall | depends on retrieval | depends on model size | **best of both** |
| Infrastructure | vector DB + 4 systems | LLM + .kv file | **practical hybrid** |
**Pre-computed KV library** — process once, query forever:

```python
# m: an already-loaded quant.cpp model (initialization omitted)

# Overnight (GPU or batch): process each document once
m.ask(open("operations_manual.txt").read())
m.save_context("ops_manual.kv")  # 1.5 GB, compressed

# Anytime (laptop, offline): instant load + unlimited questions
m.load_context("ops_manual.kv")  # 0.5 seconds
m.ask("What's the expense reimbursement process?")  # instant
```
Without 6.4x KV compression, loading a full 50K-token document into a 3B model needs ~17 GB of KV memory (impossible on a 16 GB Mac). With compression: ~2.7 GB (fits easily).
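A back-of-envelope check roughly reproduces these numbers. The architecture figures below (36 layers, hidden size 2304, fp16, full multi-head attention with no GQA) are assumptions chosen to represent a generic ~3B model, not any specific one:

```python
# Back-of-envelope KV-cache sizing under assumed 3B-model dimensions.
tokens = 50_000      # full document, as in the text above
layers = 36          # assumed decoder layer count for a ~3B model
hidden = 2_304       # assumed hidden size; K and V each store `hidden` values per layer
bytes_fp16 = 2       # fp16 storage

# 2x for K plus V, across all layers, for every cached token:
kv_bytes = 2 * layers * hidden * bytes_fp16 * tokens
full_gb = kv_bytes / 1e9
compressed_gb = full_gb / 6.4   # quant.cpp's stated compression ratio

print(f"uncompressed: {full_gb:.1f} GB, compressed: {compressed_gb:.1f} GB")
```

This lands near the ~17 GB / ~2.7 GB figures quoted above; models using grouped-query attention would start from a smaller uncompressed cache.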
<details>
<summary><b>Technical detail: The KV cache problem</b></summary>
