|
| 1 | +# notcurses RAG — C++ Library Expert via llama.cpp |
| 2 | + |
| 3 | +A self-contained RAG (Retrieval-Augmented Generation) pipeline in C++17 |
| 4 | +that turns a GGUF inference model into a focused expert on any C/C++ library. |
| 5 | +Demonstrated here with [notcurses](https://github.com/dankamongmen/notcurses) |
| 6 | +but works with any header-based library. |
| 7 | + |
| 8 | +No fixed limits on chunk count, chunk length, or embedding dimension. |
| 9 | +No Python, no vector database daemon, no external dependencies beyond llama.cpp. |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## How it works |
| 14 | + |
| 15 | +``` |
| 16 | +INDEXING (one-time offline) |
| 17 | +──────────────────────────────────────────────────────────────── |
| 18 | +notcurses headers |
| 19 | + │ |
| 20 | + ▼ |
| 21 | +chunk_headers ← semantic chunker, outputs chunks.jsonl |
| 22 | + │ |
| 23 | + ▼ |
| 24 | +rag_index ← embeds each chunk via nomic-embed-text GGUF |
| 25 | + │ |
| 26 | + ▼ |
| 27 | +notcurses.db ← binary vector store (embeddings + text) |
| 28 | +
|
| 29 | +
|
| 30 | +RUNTIME (each query) |
| 31 | +──────────────────────────────────────────────────────────────── |
| 32 | +user query |
| 33 | + │ |
| 34 | + ▼ |
| 35 | +rag_retrieve() ← embeds query, cosine similarity against db |
| 36 | + │ ← skips chunks already seen this session |
| 37 | + ▼ |
| 38 | +new top-k chunks ← most relevant unseen API fragments |
| 39 | + │ |
| 40 | + ▼ |
| 41 | +prompt assembly ← system + prior history + new context + query |
| 42 | + │ |
| 43 | + ▼ |
| 44 | +Qwen3 inference ← <|think|> reasoning + final answer |
| 45 | + │ |
| 46 | + ▼ |
| 47 | +history ← appended for next turn (KV cache intact) |
| 48 | +``` |
| 49 | + |
| 50 | +--- |
| 51 | + |
| 52 | +## Files |
| 53 | + |
| 54 | +| File | Purpose | |
| 55 | +|---|---| |
| 56 | +| `chunk_headers.cpp` | Parses C/C++ headers into semantic chunks, outputs `.jsonl` | |
| 57 | +| `rag_index.cpp` | Reads `.jsonl`, embeds each chunk, saves binary `.db` | |
| 58 | +| `rag.hpp` | Single-header C++17 runtime — load db, session, retrieve | |
| 59 | +| `example.cpp` | Full pipeline wired together, multi-turn query loop | |
| 60 | + |
| 61 | +--- |
| 62 | + |
| 63 | +## Dependencies |
| 64 | + |
| 65 | +- [llama.cpp](https://github.com/ggerganov/llama.cpp) — `libllama` + `llama.h` |
| 66 | +- A GGUF **inference model** — tested with `Qwen3.5-9B-Q4_K_M.gguf` |
| 67 | +- A GGUF **embedding model** — |
| 68 | + `nomic-embed-text-v1.5.Q4_K_M.gguf` |
| 69 | + ([download](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF)) |
| 70 | +- C++17 compiler (gcc 8+, clang 7+, MSVC 2019+) |
| 71 | + |
| 72 | +--- |
| 73 | + |
| 74 | +## Build |
| 75 | + |
| 76 | +```bash |
| 77 | +c++ -std=c++17 -o chunk_headers chunk_headers.cpp |
| 78 | +c++ -std=c++17 -o rag_index rag_index.cpp -lllama -lm |
| 79 | +c++ -std=c++17 -o example example.cpp -lllama -lm |
| 80 | +``` |
| 81 | + |
| 82 | +If llama.cpp is not on your system library path: |
| 83 | + |
| 84 | +```bash |
| 85 | +c++ -std=c++17 -o rag_index rag_index.cpp \ |
| 86 | + -I/path/to/llama.cpp/include \ |
| 87 | + -L/path/to/llama.cpp/build -lllama -lm |
| 88 | +``` |
| 89 | + |
| 90 | +--- |
| 91 | + |
| 92 | +## Usage |
| 93 | + |
| 94 | +### Step 1 — Chunk the headers (one-time) |
| 95 | + |
| 96 | +```bash |
| 97 | +./chunk_headers notcurses/include/notcurses/ > chunks.jsonl |
| 98 | +``` |
| 99 | + |
| 100 | +Accepts a single file or a directory (walked recursively). |
| 101 | +Multiple paths can be given: |
| 102 | + |
| 103 | +```bash |
| 104 | +./chunk_headers include/foo.h include/bar.h src/examples/ > chunks.jsonl |
| 105 | +``` |
| 106 | + |
| 107 | +Handles `.h`, `.hpp`, `.c`, `.cpp`. Inspect before indexing: |
| 108 | + |
| 109 | +```bash |
| 110 | +head -5 chunks.jsonl | python3 -m json.tool |
| 111 | +``` |
| 112 | + |
| 113 | +### Step 2 — Embed and index (one-time) |
| 114 | + |
| 115 | +```bash |
| 116 | +./rag_index \ |
| 117 | + --model nomic-embed-text-v1.5.Q4_K_M.gguf \ |
| 118 | + --input chunks.jsonl \ |
| 119 | + --output notcurses.db |
| 120 | +``` |
| 121 | + |
| 122 | +Takes a few minutes for a large corpus. The `.db` is reusable |
| 123 | +until the library changes. |
| 124 | + |
| 125 | +### Step 3 — Run |
| 126 | + |
| 127 | +```bash |
| 128 | +./example \ |
| 129 | + --model Qwen3.5-9B-Q4_K_M.gguf \ |
| 130 | + --embed nomic-embed-text-v1.5.Q4_K_M.gguf \ |
| 131 | + --db notcurses.db |
| 132 | +``` |
| 133 | + |
| 134 | +``` |
| 135 | +notcurses expert ready. ctrl+d to quit. |
| 136 | +
|
| 137 | +you: how do I create a plane and render text into it? |
| 138 | +assistant: ... |
| 139 | +
|
| 140 | +you: what options does it take? ← follow-up; no repeated context |
| 141 | +assistant: ... |
| 142 | +``` |
| 143 | + |
| 144 | +--- |
| 145 | + |
| 146 | +## Using rag.hpp in your own project |
| 147 | + |
| 148 | +Single-header, stb-style. In **one** `.cpp` file: |
| 149 | + |
| 150 | +```cpp |
| 151 | +#define RAG_IMPLEMENTATION |
| 152 | +#include "rag.hpp" |
| 153 | +``` |
| 154 | + |
| 155 | +All other files that need the types: |
| 156 | + |
| 157 | +```cpp |
| 158 | +#include "rag.hpp" |
| 159 | +``` |
| 160 | + |
| 161 | +### Minimal integration |
| 162 | + |
| 163 | +```cpp |
| 164 | +// startup |
| 165 | +RagDB db; |
| 166 | +rag_load(db, "notcurses.db"); |
| 167 | + |
| 168 | +RagSession session; |
| 169 | +session.init(db.size(), 8192); // n_chunks, your n_ctx |
| 170 | +session.score_threshold = 0.60f; |
| 171 | + |
| 172 | +// each turn |
| 173 | +std::string context = rag_retrieve(db, embed_ctx, embed_model, |
| 174 | + user_query, 5, session); |
| 175 | +// context is empty string if nothing new/relevant was found |
| 176 | +// build prompt with context and hand to your inference context |
| 177 | +``` |
| 178 | +
|
| 179 | +### Stateless retrieval (no deduplication) |
| 180 | +
|
| 181 | +```cpp |
| 182 | +std::string context = rag_retrieve(db, embed_ctx, embed_model, |
| 183 | + user_query, 5); |
| 184 | +``` |
| 185 | + |
| 186 | +### API |
| 187 | + |
| 188 | +```cpp |
| 189 | +// Load .db file (version 2). Returns true on success. |
| 190 | +bool rag_load(RagDB &db, const std::string &path); |
| 191 | + |
| 192 | +// Retrieve with session deduplication + token budget. |
| 193 | +// Returns context string ready to inject into prompt. |
| 194 | +// Empty string if nothing new or relevant was found. |
| 195 | +std::string rag_retrieve(const RagDB &db, |
| 196 | + llama_context *embed_ctx, |
| 197 | + llama_model *embed_model, |
| 198 | + const std::string &query, |
| 199 | + int top_k, |
| 200 | + RagSession &session); |
| 201 | + |
| 202 | +// Stateless overload — no deduplication. |
| 203 | +std::string rag_retrieve(const RagDB &db, |
| 204 | + llama_context *embed_ctx, |
| 205 | + llama_model *embed_model, |
| 206 | + const std::string &query, |
| 207 | + int top_k); |
| 208 | +``` |
| 209 | +
|
| 210 | +### RagSession fields |
| 211 | +
|
| 212 | +```cpp |
| 213 | +struct RagSession { |
| 214 | + std::vector<bool> seen; // one bit per chunk, sized to db |
| 215 | + int tokens_used = 0; // running token estimate |
| 216 | + int tokens_max = 0; // your n_ctx ceiling |
| 217 | + float score_threshold = 0.60f; // skip weak matches |
| 218 | +
|
| 219 | + void init(int n_chunks, int ctx_size); |
| 220 | + void reset(); // start a fresh conversation |
| 221 | +}; |
| 222 | +``` |
| 223 | + |
| 224 | +--- |
| 225 | + |
| 226 | +## Chunking strategy |
| 227 | + |
| 228 | +`chunk_headers` uses a state machine that keeps each **semantic unit** |
| 229 | +together as one chunk: |
| 230 | + |
| 231 | +- Block comment (`/* ... */`) + following declaration |
| 232 | +- `//` line comments + following declaration |
| 233 | +- `typedef struct` / `typedef enum` entire body |
| 234 | +- Consecutive `#define` macro groups |
| 235 | +- Multi-line function signatures |
| 236 | + |
| 237 | +Example — this stays as one chunk: |
| 238 | + |
| 239 | +```c |
| 240 | +// ncplane_create() - create a new plane as a child of 'n'. |
| 241 | +// 'nopts' may be NULL for defaults. Returns NULL on error. |
| 242 | +struct ncplane* ncplane_create(struct ncplane *n, |
| 243 | + const struct ncplane_options *nopts); |
| 244 | +``` |
| 245 | +
|
| 246 | +--- |
| 247 | +
|
| 248 | +## Session deduplication |
| 249 | +
|
| 250 | +The KV cache is not cleared between turns, so the model already has |
| 251 | +earlier chunks in memory. `RagSession` tracks which chunks have been |
| 252 | +injected and skips them on subsequent turns: |
| 253 | +
|
| 254 | +``` |
| 255 | +Turn 1: retrieved chunks [42, 17, 83] → all new → inject all |
| 256 | +Turn 2: retrieved chunks [42, 55, 17] → 42,17 seen → inject only [55] |
| 257 | +Turn 3: retrieved chunks [7, 14, 55] → 55 seen → inject [7, 14] |
| 258 | +``` |
| 259 | +
|
| 260 | +Context window grows efficiently — no repeated API reference, and the |
| 261 | +model remembers everything already seen via the intact KV cache. |
| 262 | +
|
| 263 | +--- |
| 264 | +
|
| 265 | +## Adapting to other libraries |
| 266 | +
|
| 267 | +Change only the input to `chunk_headers`: |
| 268 | +
|
| 269 | +| Library | Input | |
| 270 | +|---|---| |
| 271 | +| stb (stb_image, stb_truetype ...) | single `.h` file | |
| 272 | +| SDL2 / OpenGL / Vulkan | `include/` directory | |
| 273 | +| Your own engine | any `.h` / `.hpp` mix | |
| 274 | +| Spring / Java | extend chunker for Javadoc + `.java` | |
| 275 | +
|
| 276 | +Re-run steps 1 and 2 to produce a new `.db`. Runtime code unchanged. |
| 277 | +Multiple `.db` files can be loaded and queried independently. |
| 278 | +
|
| 279 | +--- |
| 280 | +
|
| 281 | +## .db file format (version 2) |
| 282 | +
|
| 283 | +Variable-length fields — no wasted padding. |
| 284 | +
|
| 285 | +``` |
| 286 | +Header (16 bytes): |
| 287 | + uint32 magic = 0x52414744 ("RAGD") |
| 288 | + uint32 version = 2 |
| 289 | + uint32 n_chunks |
| 290 | + uint32 embed_dim |
| 291 | + |
| 292 | +Per chunk: |
| 293 | + uint32 text_len |
| 294 | + char[] text (text_len bytes, no null) |
| 295 | + uint16 source_len |
| 296 | + char[] source (source_len bytes, no null) |
| 297 | + uint8 type_len |
| 298 | + char[] type (type_len bytes, no null) |
| 299 | + float[] embedding (embed_dim × 4 bytes) |
| 300 | +``` |
| 301 | +
|
| 302 | +--- |
| 303 | +
|
| 304 | +## GPU memory |
| 305 | +
|
| 306 | +On an 8 GB GPU with `Qwen3.5-9B-Q4_K_M`: |
| 307 | +
|
| 308 | +| Component | VRAM | |
| 309 | +|---|---| |
| 310 | +| Inference model (Q4_K_M 9B) | ~5.5 GB | |
| 311 | +| Embedding model (nomic Q4) | ~0.3 GB | |
| 312 | +| KV cache (8k ctx, Q4_0 K/V) | ~0.5 GB | |
| 313 | +| **Total** | **~6.3 GB** | |
| 314 | +
|
| 315 | +--- |
| 316 | +
|
| 317 | +## Qwen3 thinking mode |
| 318 | +
|
| 319 | +The model emits `<|think|>...<|/think|>` before its answer. |
| 320 | +`example.cpp` strips this with `strip_think()` before printing. |
| 321 | +The think block improves RAG quality — the model explicitly reasons |
| 322 | +over injected context chunks before answering. |
| 323 | +
|
| 324 | +To expose reasoning (useful for debugging retrieval quality), remove |
| 325 | +the `strip_think()` call and print `raw` directly. |
0 commit comments