New section in README and guide page explaining how KV compression
enables a complementary approach to traditional chunk-level RAG:
- RAG decides WHICH documents to look at
- Long-context decides HOW DEEPLY to understand them
- Pre-computed .kv library pattern: process once, query forever
- Honest framing: complementary, not competitive
Key insight: chunk-level RAG loses cross-page relationships.
Document-level context (enabled by 6.4x KV compression) preserves
full document understanding. Together they're stronger than either alone.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When AI models hold long conversations, they rely on a working memory called the **KV cache**. This memory grows with every message and often exceeds the size of the model itself. quant.cpp compresses it **6.4x** and prunes unimportant tokens — so the same laptop can handle **6x longer conversations at 59% lower attention cost**.
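A quick back-of-envelope sketch of the headline claim: with a fixed memory budget, the maximum context length scales linearly with the compression ratio. The per-1K-token KV cost below is an assumed figure for a 3B-class model at fp16, not a measured number.

```python
budget_gb = 8.0          # assumed KV memory budget on a laptop
gb_per_1k_tokens = 0.33  # assumed fp16 KV cost per 1K tokens (3B-class model)

def max_context(compression_ratio):
    # With a fixed budget, max context grows linearly with compression.
    return int(budget_gb / gb_per_1k_tokens * 1000 * compression_ratio)

print(max_context(1.0))  # uncompressed fp16 baseline
print(max_context(6.4))  # 6.4x compression -> ~6.4x longer context
```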
---
## Beyond RAG: Document-Level Context
Traditional RAG splits documents into small chunks (typically ~512 tokens), embeds them, and retrieves the best-matching fragments. This works for large corpora but has fundamental limitations:
- **Chunking destroys relationships** — information spanning pages 3, 47, and 103 can't be found by any single chunk search
- **Retrieval can fail** — when the question uses different words than the document ("employee retention" vs. "turnover rate")
- **No multi-hop reasoning** — connecting A → B → C across chunks is impossible when each is retrieved independently
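The vocabulary-mismatch failure above is easy to see with a toy lexical retriever. This is an illustrative sketch (the scoring helper is hypothetical, not part of quant.cpp): a chunk that answers the question scores zero because it shares no words with the query.

```python
def overlap_score(query, chunk):
    # Toy lexical retrieval: score a chunk by raw word overlap with the query.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c)

chunk = "our turnover rate fell to 4% after the new benefits program"
print(overlap_score("employee retention statistics", chunk))  # 0: no shared words
print(overlap_score("turnover rate", chunk))                  # matches only on exact wording
```

Dense embeddings soften this, but the failure mode persists whenever the retriever and the document phrase the same concept differently.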
**Long-context KV compression offers a complementary approach:**

```python
m.ask("What's the expense reimbursement process?")  # instant
```
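The commit describes a pre-computed `.kv` library pattern: process each document once, then answer queries against the saved cache forever. Below is a generic sketch of that caching discipline in plain Python — the function names, the `pickle` serialization, and the `.kv` file handling are all assumptions for illustration, not the quant.cpp API.

```python
import os
import pickle

def get_kv(doc_path, kv_path, build_kv):
    # Hypothetical "process once, query forever" helper:
    # reuse a saved KV cache if present, otherwise build and persist it.
    if os.path.exists(kv_path):
        with open(kv_path, "rb") as f:
            return pickle.load(f)  # instant: skip re-reading the document
    kv = build_kv(doc_path)        # one-time long-context prefill (expensive)
    with open(kv_path, "wb") as f:
        pickle.dump(kv, f)
    return kv
```

The first call pays the full prefill cost; every later call loads the compressed cache from disk, which is why repeated `m.ask(...)` queries can be near-instant.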
Without 6.4x KV compression, loading a full 50K-token document into a 3B model needs ~17 GB of KV memory (impossible on a 16 GB Mac). With compression, it needs only ~2.7 GB (which fits easily).
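The arithmetic behind those figures can be sketched directly. The model shape below (32 layers, 32 KV heads of dimension 80, fp16) is an assumed 3B-class configuration chosen for illustration; it lands close to the ~17 GB and ~2.7 GB numbers above.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values each store n_layers * n_kv_heads * head_dim
    # elements per token, hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

full = kv_cache_bytes(50_000, 32, 32, 80)  # fp16, assumed 3B-class shape
print(round(full / 1e9, 1))        # uncompressed KV size in GB
print(round(full / 6.4 / 1e9, 1))  # after 6.4x compression
```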
<details>
<summary><b>Technical detail: The KV cache problem</b></summary>