Commit 6c278bd

Author: Chris Warren-Smith

LLAMA: RAG experiment to increase domain knowledge of a particular lib

1 parent 46cafef commit 6c278bd

13 files changed

Lines changed: 1508 additions & 197 deletions

llama/CMakeLists.txt

Lines changed: 22 additions & 0 deletions
@@ -171,6 +171,28 @@ set_target_properties(llm_test PROPERTIES
   RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin
 )
 
+# -----------------------------
+# RAG indexer
+# -----------------------------
+add_executable(rag_index
+  rag_index.cpp
+)
+
+target_include_directories(rag_index PRIVATE
+  ${LLAMA_DIR}/include
+  ${LLAMA_DIR}/ggml/include
+)
+
+target_link_libraries(rag_index PRIVATE
+  llm
+  llama
+  ggml
+)
+
+set_target_properties(rag_index PROPERTIES
+  RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin
+)
+
 # ------------------------------------------------------------------
 # Android native library
 # ------------------------------------------------------------------

llama/RAG.md

Lines changed: 325 additions & 0 deletions
@@ -0,0 +1,325 @@
# notcurses RAG — C++ Library Expert via llama.cpp

A self-contained RAG (Retrieval-Augmented Generation) pipeline in C++17
that turns a GGUF inference model into a focused expert on any C/C++ library.
Demonstrated here with [notcurses](https://github.com/dankamongmen/notcurses)
but works with any header-based library.

No fixed limits on chunk count, chunk length, or embedding dimension.
No Python, no vector database daemon, no external dependencies beyond llama.cpp.

---

## How it works

```
INDEXING (one-time offline)
────────────────────────────────────────────────────────────────
notcurses headers
      │
      ▼
chunk_headers        ← semantic chunker, outputs chunks.jsonl
      │
      ▼
rag_index            ← embeds each chunk via nomic-embed-text GGUF
      │
      ▼
notcurses.db         ← binary vector store (embeddings + text)


RUNTIME (each query)
────────────────────────────────────────────────────────────────
user query
      │
      ▼
rag_retrieve()       ← embeds query, cosine similarity against db
      │              ← skips chunks already seen this session
      ▼
new top-k chunks     ← most relevant unseen API fragments
      │
      ▼
prompt assembly      ← system + prior history + new context + query
      │
      ▼
Qwen3 inference      ← <|think|> reasoning + final answer
      │
      ▼
history              ← appended for next turn (KV cache intact)
```
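
The retrieval step in the diagram (cosine similarity against the db, then top-k) can be sketched as follows. This is a minimal illustration, not the code in `rag.hpp`: the names `cosine_similarity` and `top_k`, the in-memory `rows` layout, and the `min_score` parameter (standing in for a score threshold) are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
float cosine_similarity(const std::vector<float> &a, const std::vector<float> &b) {
  assert(a.size() == b.size());
  double dot = 0.0, na = 0.0, nb = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    dot += double(a[i]) * b[i];
    na += double(a[i]) * a[i];
    nb += double(b[i]) * b[i];
  }
  if (na == 0.0 || nb == 0.0) return 0.0f;
  return float(dot / (std::sqrt(na) * std::sqrt(nb)));
}

// Hypothetical helper: indices of the k best rows scoring at least
// min_score against the query, best first.
std::vector<std::size_t> top_k(const std::vector<std::vector<float>> &rows,
                               const std::vector<float> &query,
                               std::size_t k, float min_score) {
  using Scored = std::pair<float, std::size_t>;
  std::vector<Scored> scored;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    const float s = cosine_similarity(rows[i], query);
    if (s >= min_score) scored.emplace_back(s, i);
  }
  k = std::min(k, scored.size());
  // Only the first k entries need to be in order.
  std::partial_sort(scored.begin(),
                    scored.begin() + static_cast<std::ptrdiff_t>(k),
                    scored.end(),
                    [](const Scored &x, const Scored &y) { return x.first > y.first; });
  std::vector<std::size_t> out;
  for (std::size_t i = 0; i < k; ++i) out.push_back(scored[i].second);
  return out;
}
```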

---

## Files

| File | Purpose |
|---|---|
| `chunk_headers.cpp` | Parses C/C++ headers into semantic chunks, outputs `.jsonl` |
| `rag_index.cpp` | Reads `.jsonl`, embeds each chunk, saves binary `.db` |
| `rag.hpp` | Single-header C++17 runtime — load db, session, retrieve |
| `example.cpp` | Full pipeline wired together, multi-turn query loop |

---

## Dependencies

- [llama.cpp](https://github.com/ggerganov/llama.cpp) (`libllama` + `llama.h`)
- A GGUF **inference model** — tested with `Qwen3.5-9B-Q4_K_M.gguf`
- A GGUF **embedding model**:
  `nomic-embed-text-v1.5.Q4_K_M.gguf`
  ([download](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF))
- C++17 compiler (gcc 8+, clang 7+, MSVC 2019+)

---

## Build

```bash
c++ -std=c++17 -o chunk_headers chunk_headers.cpp
c++ -std=c++17 -o rag_index     rag_index.cpp     -lllama -lm
c++ -std=c++17 -o example       example.cpp       -lllama -lm
```

If llama.cpp is not on your system library path:

```bash
c++ -std=c++17 -o rag_index rag_index.cpp \
    -I/path/to/llama.cpp/include \
    -L/path/to/llama.cpp/build -lllama -lm
```

---

## Usage

### Step 1 — Chunk the headers (one-time)

```bash
./chunk_headers notcurses/include/notcurses/ > chunks.jsonl
```

Accepts a single file or a directory (walked recursively).
Multiple paths can be given:

```bash
./chunk_headers include/foo.h include/bar.h src/examples/ > chunks.jsonl
```

Handles `.h`, `.hpp`, `.c`, `.cpp`. Inspect before indexing:

```bash
head -5 chunks.jsonl | python3 -m json.tool
```

### Step 2 — Embed and index (one-time)

```bash
./rag_index \
    --model  nomic-embed-text-v1.5.Q4_K_M.gguf \
    --input  chunks.jsonl \
    --output notcurses.db
```

Takes a few minutes for a large corpus. The `.db` is reusable
until the library changes.

### Step 3 — Run

```bash
./example \
    --model Qwen3.5-9B-Q4_K_M.gguf \
    --embed nomic-embed-text-v1.5.Q4_K_M.gguf \
    --db    notcurses.db
```

```
notcurses expert ready. ctrl+d to quit.

you: how do I create a plane and render text into it?
assistant: ...

you: what options does it take?    ← follow-up; no repeated context
assistant: ...
```

---

## Using rag.hpp in your own project

Single-header, stb-style. In **one** `.cpp` file:

```cpp
#define RAG_IMPLEMENTATION
#include "rag.hpp"
```

All other files that need the types:

```cpp
#include "rag.hpp"
```

### Minimal integration

```cpp
// startup
RagDB db;
rag_load(db, "notcurses.db");

RagSession session;
session.init(db.size(), 8192);   // n_chunks, your n_ctx
session.score_threshold = 0.60f;

// each turn
std::string context = rag_retrieve(db, embed_ctx, embed_model,
                                   user_query, 5, session);
// context is empty string if nothing new/relevant was found
// build prompt with context and hand to your inference context
```
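
The last comment above leaves prompt construction to the caller. One possible shape for that step is sketched below. `build_user_turn` and the "Relevant API reference:" / "Question:" labels are hypothetical, not part of `rag.hpp` or `example.cpp`, and the chat template your inference model expects still has to be applied around the result.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: text of one user turn. When retrieval returned an
// empty context (nothing new or relevant), the query is passed through
// unchanged so no empty reference section is injected.
std::string build_user_turn(const std::string &context, const std::string &query) {
  assert(!query.empty());  // a turn needs a question
  if (context.empty()) return query;
  std::string turn = "Relevant API reference:\n";
  turn += context;
  if (context.back() != '\n') turn += '\n';
  turn += "\nQuestion: ";
  turn += query;
  return turn;
}
```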

### Stateless retrieval (no deduplication)

```cpp
std::string context = rag_retrieve(db, embed_ctx, embed_model,
                                   user_query, 5);
```

### API

```cpp
// Load .db file (version 2). Returns true on success.
bool rag_load(RagDB &db, const std::string &path);

// Retrieve with session deduplication + token budget.
// Returns context string ready to inject into prompt.
// Empty string if nothing new or relevant was found.
std::string rag_retrieve(const RagDB &db,
                         llama_context *embed_ctx,
                         llama_model *embed_model,
                         const std::string &query,
                         int top_k,
                         RagSession &session);

// Stateless overload — no deduplication.
std::string rag_retrieve(const RagDB &db,
                         llama_context *embed_ctx,
                         llama_model *embed_model,
                         const std::string &query,
                         int top_k);
```

### RagSession fields

```cpp
struct RagSession {
    std::vector<bool> seen;           // one bit per chunk, sized to db
    int tokens_used = 0;              // running token estimate
    int tokens_max = 0;               // your n_ctx ceiling
    float score_threshold = 0.60f;    // skip weak matches

    void init(int n_chunks, int ctx_size);
    void reset();                     // start a fresh conversation
};
```
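
How `tokens_used` and `tokens_max` interact is not spelled out here, so the following is only a sketch of one way a running token budget could work. `TokenBudget`, `estimate`, and `try_add` are hypothetical names, and the four-characters-per-token estimate is an assumption rather than a tokenizer call.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the budget half of RagSession.
struct TokenBudget {
  int tokens_used = 0;  // running estimate of tokens injected so far
  int tokens_max = 0;   // ceiling, e.g. the inference context size

  // Rough estimate: about four characters per token, rounded up.
  static int estimate(const std::string &text) {
    return static_cast<int>((text.size() + 3) / 4);
  }

  // Charge a chunk against the budget if it still fits.
  // Returns false, leaving the counter untouched, when it does not.
  bool try_add(const std::string &text) {
    assert(tokens_max > 0);  // the ceiling must be set before use
    const int cost = estimate(text);
    if (tokens_used + cost > tokens_max) return false;
    tokens_used += cost;
    return true;
  }
};
```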

---

## Chunking strategy

`chunk_headers` uses a state machine that keeps each **semantic unit**
together as one chunk:

- Block comment (`/* ... */`) + following declaration
- `//` line comments + following declaration
- `typedef struct` / `typedef enum` entire body
- Consecutive `#define` macro groups
- Multi-line function signatures

Example — this stays as one chunk:

```c
// ncplane_create() - create a new plane as a child of 'n'.
// 'nopts' may be NULL for defaults. Returns NULL on error.
struct ncplane* ncplane_create(struct ncplane *n,
                               const struct ncplane_options *nopts);
```
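
To make the grouping idea concrete, here is a deliberately simplified sketch that handles only two of the cases above: `//` comment runs attached to the following declaration, and multi-line signatures. It is not the state machine in `chunk_headers.cpp`. The function name `chunk_comment_decl` and its rules (a blank line ends a unit; a non-comment line ending in `;` completes a declaration) are assumptions for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Simplified illustration: a run of `//` comment lines plus the declaration
// that follows (up to the first non-comment line ending in ';') is one chunk.
std::vector<std::string> chunk_comment_decl(const std::string &header) {
  std::vector<std::string> chunks;
  std::string current;
  auto flush = [&]() {
    if (!current.empty()) chunks.push_back(current);
    current.clear();
  };
  std::istringstream in(header);
  std::string line;
  while (std::getline(in, line)) {
    // Trim trailing whitespace (this also drops the '\r' of CRLF files).
    while (!line.empty() &&
           (line.back() == ' ' || line.back() == '\t' || line.back() == '\r'))
      line.pop_back();
    if (line.empty()) {  // a blank line ends the current unit
      flush();
      continue;
    }
    const std::size_t first = line.find_first_not_of(" \t");
    assert(first != std::string::npos);  // holds after the trim above
    const bool is_comment = line.compare(first, 2, "//") == 0;
    current += line;
    current += '\n';
    if (!is_comment && line.back() == ';') flush();  // declaration complete
  }
  flush();
  return chunks;
}
```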

---

## Session deduplication

The KV cache is not cleared between turns, so the model already has
earlier chunks in memory. `RagSession` tracks which chunks have been
injected and skips them on subsequent turns:

```
Turn 1: retrieved chunks [42, 17, 83]  → all new     → inject all
Turn 2: retrieved chunks [42, 55, 17]  → 42,17 seen  → inject only [55]
Turn 3: retrieved chunks [7, 14, 55]   → 55 seen     → inject [7, 14]
```
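
The three turns above can be reproduced with a small sketch built around a `std::vector<bool>` bitmap like the `seen` field of `RagSession`. `SeenFilter` and `take_new` are hypothetical names used only for this example; the real logic lives inside `rag_retrieve()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the `seen` bitmap in RagSession.
struct SeenFilter {
  std::vector<bool> seen;  // one flag per chunk in the db

  explicit SeenFilter(std::size_t n_chunks) : seen(n_chunks, false) {}

  // Keep only the chunk ids that were never injected before, in retrieval
  // order, and mark them as seen for later turns.
  std::vector<std::size_t> take_new(const std::vector<std::size_t> &retrieved) {
    std::vector<std::size_t> fresh;
    for (std::size_t id : retrieved) {
      assert(id < seen.size());  // ids must refer to chunks in the db
      if (seen[id]) continue;
      seen[id] = true;
      fresh.push_back(id);
    }
    return fresh;
  }
};
```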

Context window grows efficiently — no repeated API reference, and the
model remembers everything already seen via the intact KV cache.

---

## Adapting to other libraries

Change only the input to `chunk_headers`:

| Library | Input |
|---|---|
| stb (stb_image, stb_truetype ...) | single `.h` file |
| SDL2 / OpenGL / Vulkan | `include/` directory |
| Your own engine | any `.h` / `.hpp` mix |
| Spring / Java | extend chunker for Javadoc + `.java` |

Re-run steps 1 and 2 to produce a new `.db`. Runtime code unchanged.
Multiple `.db` files can be loaded and queried independently.

---

## .db file format (version 2)

Variable-length fields — no wasted padding.

```
Header (16 bytes):
  uint32  magic     = 0x52414744  ("RAGD")
  uint32  version   = 2
  uint32  n_chunks
  uint32  embed_dim

Per chunk:
  uint32  text_len
  char[]  text       (text_len bytes, no null)
  uint16  source_len
  char[]  source     (source_len bytes, no null)
  uint8   type_len
  char[]  type       (type_len bytes, no null)
  float[] embedding  (embed_dim × 4 bytes)
```
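
As a sketch of how a reader might validate this layout, the code below parses the 16-byte header from a byte buffer and computes the size of one chunk record. The byte order is not stated above, so little-endian is an assumption here. `DbHeader`, `parse_header`, and `record_size` are hypothetical names, not the `rag.hpp` API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DbHeader {
  std::uint32_t magic = 0;
  std::uint32_t version = 0;
  std::uint32_t n_chunks = 0;
  std::uint32_t embed_dim = 0;
};

constexpr std::uint32_t kMagic = 0x52414744u;  // "RAGD"
constexpr std::size_t kHeaderSize = 16;        // four uint32 fields

// Decode a uint32 stored little-endian at p (byte order is an assumption).
inline std::uint32_t read_u32_le(const unsigned char *p) {
  return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
         (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// Parse and validate the header. Returns false on a short buffer,
// a wrong magic, or a version other than 2; `out` is then left untouched.
bool parse_header(const std::vector<unsigned char> &buf, DbHeader &out) {
  if (buf.size() < kHeaderSize) return false;
  DbHeader h;
  h.magic = read_u32_le(buf.data());
  h.version = read_u32_le(buf.data() + 4);
  h.n_chunks = read_u32_le(buf.data() + 8);
  h.embed_dim = read_u32_le(buf.data() + 12);
  if (h.magic != kMagic || h.version != 2) return false;
  out = h;
  return true;
}

// Bytes occupied by one chunk record: three length prefixes (4 + 2 + 1),
// the three strings, and embed_dim 4-byte floats.
std::size_t record_size(std::size_t text_len, std::size_t source_len,
                        std::size_t type_len, std::size_t embed_dim) {
  assert(source_len <= 0xFFFF && type_len <= 0xFF);  // uint16 / uint8 prefixes
  return 4 + text_len + 2 + source_len + 1 + type_len + embed_dim * 4;
}
```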

---

## GPU memory

On an 8 GB GPU with `Qwen3.5-9B-Q4_K_M`:

| Component | VRAM |
|---|---|
| Inference model (Q4_K_M 9B) | ~5.5 GB |
| Embedding model (nomic Q4) | ~0.3 GB |
| KV cache (8k ctx, Q4_0 K/V) | ~0.5 GB |
| **Total** | **~6.3 GB** |

---

## Qwen3 thinking mode

The model emits `<|think|>...<|/think|>` before its answer.
`example.cpp` strips this with `strip_think()` before printing.
The think block improves RAG quality — the model explicitly reasons
over injected context chunks before answering.

To expose reasoning (useful for debugging retrieval quality), remove
the `strip_think()` call and print `raw` directly.
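
The body of `strip_think()` is not shown here, so the following is only a guess at what such a helper could look like. `strip_think_sketch` is a hypothetical name. The tag strings are parameters, defaulting to the spelling used above, in case your model build emits a different pair.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch: remove every open...close block from raw and trim
// the leading whitespace left behind. An unterminated block drops the rest.
std::string strip_think_sketch(const std::string &raw,
                               const std::string &open = "<|think|>",
                               const std::string &close = "<|/think|>") {
  assert(!open.empty() && !close.empty());
  std::string out;
  std::size_t pos = 0;
  while (true) {
    const std::size_t start = raw.find(open, pos);
    if (start == std::string::npos) {
      out.append(raw, pos, std::string::npos);  // no more blocks: keep the tail
      break;
    }
    out.append(raw, pos, start - pos);          // keep text before the block
    const std::size_t end = raw.find(close, start + open.size());
    if (end == std::string::npos) break;        // unterminated: drop the rest
    pos = end + close.size();
  }
  const std::size_t first = out.find_first_not_of(" \t\r\n");
  return first == std::string::npos ? std::string() : out.substr(first);
}
```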
