 # SQLite-Vec C++ Benchmark Results

 **Version**: 0.1.0
-**Date**: 2025-11-02
-**Platform**: x86_64, 48 cores @ 4.0GHz, 32KB L1, 512KB L2, 16MB L3
-**Compiler**: GCC 15.2.0, C++23, Release mode (`-O3`)
-**Library**: Google Benchmark 1.9.1
+**Date**: 2026-01-05
+**Platform**: x86_64, 48 cores @ 3.8GHz, 32KB L1, 512KB L2, 16MB L3 (Windows 11)
+**Compiler**: clang 21.1.6, C++23, Release mode (`-O3`)
+**Library**: Google Benchmark 1.9.4
+

 ---

 ## Executive Summary

-The C++ implementation achieves **3.6M vectors/second sustained throughput** with linear scaling across corpus sizes and embedding dimensions. int8 quantization provides 4x storage reduction at performance parity. HNSW index recommended for >100K vector corpora.
+The C++ implementation achieves **~2.8M vectors/second sustained throughput** with linear scaling across corpus sizes and embedding dimensions. int8 quantization provides 4x storage reduction at near performance parity. HNSW index recommended for >100K vector corpora.
+

 ---

@@ -20,9 +22,10 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit

 | Corpus | Latency | Throughput | QPS (single-thread) |
 |--------|---------|------------|---------------------|
-| 1K | 273 μs | 3.67 M/s | ~3,660 queries/sec |
-| 10K | 2.78 ms | 3.60 M/s | ~360 queries/sec |
-| 100K | 27.9 ms | 3.58 M/s | ~36 queries/sec |
+| 1K | 288 μs | 3.51 M/s | ~3,510 queries/sec |
+| 10K | 3.63 ms | 2.77 M/s | ~277 queries/sec |
+| 100K | 41.0 ms | 2.43 M/s | ~24 queries/sec |
+

 **Scaling**: Linear (10x corpus → 10x latency)
 **Bottleneck**: Compute-bound (memory bandwidth utilization ~5%)
@@ -31,20 +34,22 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit

 | K | Latency | Delta |
 |----|---------|-------|
-| 1 | 2.77 ms | -0.4% |
-| 5 | 2.78 ms | baseline |
-| 10 | 2.78 ms | 0.0% |
-| 50 | 2.77 ms | -0.4% |
+| 1 | 3.92 ms | +7.8% |
+| 5 | 3.63 ms | baseline |
+| 10 | 3.64 ms | +0.2% |
+| 50 | 3.60 ms | -0.9% |
+

 **Conclusion**: Partial sort overhead negligible; K-value has no meaningful impact.

 ### 3. Embedding Dimension Scaling (10K docs, K=5)

 | Dimensions | Latency | Throughput | Scaling Factor |
 |------------|----------|------------|----------------|
-| 384d | 2.78 ms | 3.60 M/s | 1.0x |
-| 768d | 5.74 ms | 1.74 M/s | 2.06x |
-| 1536d | 11.7 ms | 856k/s | 4.21x |
+| 384d | 3.63 ms | 2.77 M/s | 1.0x |
+| 768d | 6.78 ms | 1.53 M/s | 1.87x |
+| 1536d | 13.2 ms | 780k/s | 3.64x |
+

 **Scaling**: Near-linear (2x dim → 2.06x latency, 4x dim → 4.21x latency)
 **Conclusion**: Compute-bound; SIMD efficiency remains high across dimensions.
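
The Top-K table in this hunk attributes the flat latency across K to a cheap partial sort. The sketch below shows why that is plausible: the distance scan is O(N·d), while `std::partial_sort` adds only O(N log K). It is a minimal illustration, not the sqlite-vec-cpp implementation. The squared-L2 metric, the container types, and the function names are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative brute-force top-K (hypothetical; not the sqlite-vec-cpp API).

// Squared L2 distance between two vectors of equal length.
inline float squared_l2(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Returns (distance, index) pairs for the k nearest corpus vectors, nearest first.
inline std::vector<std::pair<float, std::size_t>> top_k(
    const std::vector<std::vector<float>>& corpus,
    const std::vector<float>& query,
    std::size_t k) {
    std::vector<std::pair<float, std::size_t>> scored;
    scored.reserve(corpus.size());
    // O(N * d): the full distance scan dominates the cost.
    for (std::size_t i = 0; i < corpus.size(); ++i) {
        scored.emplace_back(squared_l2(corpus[i], query), i);
    }
    k = std::min(k, scored.size());
    // O(N log k): only the first k positions end up sorted.
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end());
    scored.resize(k);
    return scored;
}
```

Because only the sort step depends on K, changing K from 1 to 50 leaves the dominant scan untouched, which is consistent with the small deltas in the table.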
@@ -53,24 +58,27 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit

 | Type | Latency | Throughput | Storage | Overhead |
 |-------|---------|------------|---------|----------|
-| float | 2.78 ms | 3.60 M/s | 4 bytes | baseline |
-| int8 | 2.74 ms | 3.65 M/s | 1 byte | **-1.4%** |
+| float | 3.63 ms | 2.77 M/s | 4 bytes | baseline |
+| int8 | 3.62 ms | 2.79 M/s | 1 byte | **-0.4%** |
+

 **Conclusion**: int8 quantization is **faster** while reducing storage 4x (memory bandwidth savings).

 ### 5. Multi-Query Throughput (10K docs, 384d)

-- **10 queries**: 27.5 ms total (2.75 ms/query average)
-- **Sustained throughput**: 3.64 M vectors/second
-- **QPS**: ~364 queries/second (single-threaded)
-- **Parallelization potential**: 48 cores → ~17.4K QPS theoretical
+- **10 queries**: 36.2 ms total (3.62 ms/query average)
+- **Sustained throughput**: 2.76 M vectors/second
+- **QPS**: ~276 queries/second (single-threaded)
+- **Parallelization potential**: 48 cores → ~13.2K QPS theoretical
+

 ### 6. Sequential vs Batch (1K docs, 384d, K=5)

 | Method | Latency | Throughput |
 |------------|---------|------------|
-| Sequential | 274 μs | 3.66 M/s |
-| Batch | 273 μs | 3.67 M/s |
+| Sequential | 287 μs | 3.51 M/s |
+| Batch | 288 μs | 3.51 M/s |
+

 **Conclusion**: Batch API provides cleaner code at performance parity (memory-bandwidth bound).

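
The data type comparison in this hunk reports 4 bytes per component for float and 1 byte for int8. The report does not say how vectors are quantized, so the sketch below uses symmetric per-vector scalar quantization purely as an illustration of where the 4x storage reduction comes from. The struct, the function names, and the choice of scheme are assumptions, not the library's method.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types and helpers; not taken from sqlite-vec-cpp.
struct QuantizedVector {
    std::vector<std::int8_t> values;
    float scale;  // multiply a stored value by this to approximate the original
};

// Symmetric scalar quantization: map [-max_abs, max_abs] onto [-127, 127].
inline QuantizedVector quantize_int8(const std::vector<float>& v) {
    float max_abs = 0.0f;
    for (float x : v) max_abs = std::max(max_abs, std::fabs(x));
    const float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;

    QuantizedVector q{{}, scale};
    q.values.reserve(v.size());
    for (float x : v) {
        const float scaled = std::round(x / scale);
        q.values.push_back(static_cast<std::int8_t>(std::clamp(scaled, -127.0f, 127.0f)));
    }
    return q;
}

// Raw payload size, ignoring per-vector metadata such as the scale.
constexpr std::size_t storage_bytes(std::size_t vectors, std::size_t dims,
                                    std::size_t bytes_per_component) {
    return vectors * dims * bytes_per_component;
}
```

For the 10K×384d corpus used above, `storage_bytes(10'000, 384, 4)` is 15,360,000 bytes for float and `storage_bytes(10'000, 384, 1)` is 3,840,000 bytes for int8, which is the 4x reduction stated in the conclusion.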
@@ -82,43 +90,48 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit

 | Scenario | Sequential | Batch | Speedup |
 |----------|------------|-------|---------|
-| 100×384d | 26.7 μs | 26.7 μs | 1.00x |
-| 1K×384d | 268 μs | 269 μs | 1.00x |
+| 100×384d | 27.8 μs | 28.3 μs | 0.98x |
+| 1K×384d | 289 μs | 283 μs | 1.02x |
+

 **Conclusion**: Parity performance; both memory-bandwidth limited.

 ### 2. Memory Layout Optimization

 | Layout | Latency | Throughput | Improvement |
 |-------------|---------|------------|-------------|
-| Scattered | 269 μs | 3.73 M/s | baseline |
-| Contiguous | 267 μs | 3.75 M/s | +0.5% |
+| Scattered | 283 μs | 3.54 M/s | baseline |
+| Contiguous | 283 μs | 3.54 M/s | +0.0% |
+

 **Conclusion**: Marginal improvement; modern CPUs prefetch efficiently.

 ### 3. Top-K Performance (1K×384d, K=10)

-- **Latency**: 268 μs (vs 268 μs full distance computation)
-- **Overhead**: <1% for partial sort
+- **Latency**: 290 μs (vs 287 μs full distance computation)
+- **Overhead**: ~1% for partial sort
 - **Conclusion**: `std::partial_sort` highly optimized; K << N has negligible cost.

+
 ### 4. Large Embeddings (1K×1536d)

-- **Latency**: 1.13 ms
-- **Throughput**: 886k vectors/second
-- **Scaling**: 4.21x slower than 384d (expected 4.0x)
+- **Latency**: 1.18 ms
+- **Throughput**: 833k vectors/second
+- **Scaling**: 4.18x slower than 384d (expected 4.0x)
+

 ---

 ## HNSW Decision Matrix

 | Corpus Size | Brute-Force Latency | Recommendation |
 |-------------|---------------------|----------------|
-| <10K | <3ms | ✅ Brute-force optimal |
-| 10K-100K | 3-30ms | ⚠️ Brute-force acceptable for batch |
-| >100K | >30ms | ❌ HNSW required for real-time (<10ms) |
+| <10K | <4ms | ✅ Brute-force optimal |
+| 10K-100K | 4-40ms | ⚠️ Brute-force acceptable for batch |
+| >100K | >40ms | ❌ HNSW required for real-time (<10ms) |
+
+**HNSW Threshold**: 100K vectors (~41ms → >10ms target requires ANN index)

-**HNSW Threshold**: 100K vectors (27.9ms → >10ms target requires ANN index)

 ---

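
The HNSW decision matrix can be expressed as a small selection function. The sketch below is one way to encode it, with hypothetical names that do not come from sqlite-vec-cpp. The matrix does not say which bucket exactly 10K or 100K vectors falls into, so the boundary handling here is an assumption.

```cpp
#include <cstdint>

// Hypothetical names; the thresholds mirror the decision matrix above.
enum class SearchStrategy { BruteForce, BruteForceBatchOnly, Hnsw };

// Assumption: 10K and 100K themselves are treated as part of the middle bucket.
constexpr SearchStrategy recommend_strategy(std::int64_t corpus_size) {
    if (corpus_size < 10'000) return SearchStrategy::BruteForce;
    if (corpus_size <= 100'000) return SearchStrategy::BruteForceBatchOnly;
    return SearchStrategy::Hnsw;
}

// Estimates brute-force latency from a measured scan rate and checks it
// against a latency budget (for example the 10 ms real-time target).
constexpr bool brute_force_meets_budget(std::int64_t corpus_size,
                                        double vectors_per_second,
                                        double budget_seconds) {
    return static_cast<double>(corpus_size) / vectors_per_second <= budget_seconds;
}
```

With the current measurements, `brute_force_meets_budget(100'000, 2.43e6, 0.010)` is false (about 41 ms), while the same check for 10K vectors at 2.77 M/s passes (about 3.6 ms).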
@@ -142,16 +155,61 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit

 | Metric | Target | Actual | Status |
 |--------|--------|--------|--------|
-| 1K corpus (<1ms) | 1000 μs | 273 μs | ✅ **3.6x better** |
-| 10K corpus (<5ms) | 5000 μs | 2780 μs | ✅ **1.8x better** |
-| 100K corpus (<50ms) | 50000 μs | 27900 μs | ✅ **1.8x better** |
-| int8 overhead (<20%) | 20% | -1.4% | ✅ **Faster** |
-| Dimension scaling | Linear | Linear | ✅ **Perfect** |
+| 1K corpus (<1ms) | 1000 μs | 288 μs | ✅ **3.5x better** |
+| 10K corpus (<5ms) | 5000 μs | 3633 μs | ✅ **1.4x better** |
+| 100K corpus (<50ms) | 50000 μs | 41000 μs | ✅ **1.2x better** |
+| int8 overhead (<20%) | 20% | -0.4% | ✅ **Faster** |
+| Dimension scaling | Linear | Near-linear | ✅ **Good** |
+
+### Comparison to Previous Results (2025-11-02)
+
+| Scenario | Previous | Current | Delta |
+|----------|----------|---------|-------|
+| 1K×384d (K=5) latency | 273 μs | 288 μs | +5.5% |
+| 10K×384d (K=5) latency | 2.78 ms | 3.63 ms | +30.6% |
+| 100K×384d (K=5) latency | 27.9 ms | 41.0 ms | +46.9% |
+| 10K×384d throughput | 3.60 M/s | 2.77 M/s | -23.1% |
+| int8 @10K×384d latency | 2.74 ms | 3.62 ms | +32.1% |
+
+Notes:
+- Previous run header (2025-11-02): Linux (x86_64), GCC 15.2.0, Google Benchmark 1.9.1.
+- Current run header (2026-01-05): Windows 11 (x86_64), clang 21.1.6, Google Benchmark 1.9.4.
+- Treat deltas as environment differences rather than regressions unless measured on the same OS/toolchain.
+
+

 ---

 ## Reproduction

+### Windows (Conan)
+
+```powershell
+# From third_party/sqlite-vec-cpp
+
+# Install dependencies (Conan 2)
+conan profile detect --force
+conan install . -of build_bench_conan -b missing -s build_type=Release -s compiler.cppstd=23 -s compiler.runtime=static
+
+# Make Conan-generated .pc files visible to pkg-config for this shell
+$env:PKG_CONFIG_PATH = (Resolve-Path .\build_bench_conan)
+
+# Configure + build
+meson setup build_bench --wipe -Denable_benchmarks=true -Dbuildtype=release
+ninja -C build_bench benchmarks/rag_pipeline_benchmark.exe benchmarks/batch_distance_benchmark.exe
+
+# Run
+.\build_bench\benchmarks\rag_pipeline_benchmark.exe --benchmark_min_time=0.5s
+.\build_bench\benchmarks\batch_distance_benchmark.exe --benchmark_min_time=0.5s
+
+# JSON output for analysis
+.\build_bench\benchmarks\rag_pipeline_benchmark.exe `
+  --benchmark_out=results.json `
+  --benchmark_out_format=json
+```
+
+### Linux/macOS (system packages)
+
 ```bash
 # Build benchmarks
 cd third_party/sqlite-vec-cpp
@@ -170,4 +228,5 @@ ninja -C build_bench
   --benchmark_out_format=json
 ```

+
 ---