Commit 5f26996

Merge branch 'master' of https://github.com/trvon/sqlite-vec-cpp
2 parents 6752b4f + 4ca6929 commit 5f26996

5 files changed

Lines changed: 126 additions & 42 deletions

.github/workflows/ci.yml

Lines changed: 5 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,8 @@
11
name: ci
22

3+
permissions:
4+
contents: read
5+
36
on:
47
push:
58
pull_request:
@@ -37,6 +40,7 @@ jobs:
3740

3841
- name: Coverage (summary + artifacts)
3942
run: |
43+
mkdir -p coverage
4044
gcovr -r . \
4145
--object-directory build_coverage \
4246
--exclude '.*build.*' \
@@ -46,6 +50,7 @@ jobs:
4650
--html-details coverage/index.html \
4751
--cobertura coverage/coverage.xml
4852
53+
4954
- name: Upload coverage artifacts
5055
uses: actions/upload-artifact@v4
5156
with:

BENCHMARKS.md

Lines changed: 100 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,18 @@
11
# SQLite-Vec C++ Benchmark Results
22

33
**Version**: 0.1.0
4-
**Date**: 2025-11-02
5-
**Platform**: x86_64, 48 cores @ 4.0GHz, 32KB L1, 512KB L2, 16MB L3
6-
**Compiler**: GCC 15.2.0, C++23, Release mode (`-O3`)
7-
**Library**: Google Benchmark 1.9.1
4+
**Date**: 2026-01-05
5+
**Platform**: x86_64, 48 cores @ 3.8GHz, 32KB L1, 512KB L2, 16MB L3 (Windows 11)
6+
**Compiler**: clang 21.1.6, C++23, Release mode (`-O3`)
7+
**Library**: Google Benchmark 1.9.4
8+
89

910
---
1011

1112
## Executive Summary
1213

13-
The C++ implementation achieves **3.6M vectors/second sustained throughput** with linear scaling across corpus sizes and embedding dimensions. int8 quantization provides 4x storage reduction at performance parity. HNSW index recommended for >100K vector corpora.
14+
The C++ implementation achieves **~2.8M vectors/second sustained throughput** with linear scaling across corpus sizes and embedding dimensions. int8 quantization provides 4x storage reduction at near performance parity. HNSW index recommended for >100K vector corpora.
15+
1416

1517
---
1618

@@ -20,9 +22,10 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit
2022

2123
| Corpus | Latency | Throughput | QPS (single-thread) |
2224
|--------|---------|------------|---------------------|
23-
| 1K | 273 μs | 3.67 M/s | ~3,660 queries/sec |
24-
| 10K | 2.78 ms | 3.60 M/s | ~360 queries/sec |
25-
| 100K | 27.9 ms | 3.58 M/s | ~36 queries/sec |
25+
| 1K | 288 μs | 3.51 M/s | ~3,510 queries/sec |
26+
| 10K | 3.63 ms | 2.77 M/s | ~277 queries/sec |
27+
| 100K | 41.0 ms | 2.43 M/s | ~24 queries/sec |
28+
2629

2730
**Scaling**: Near-linear (10x corpus → ~11-13x latency)
2831
**Bottleneck**: Compute-bound (memory bandwidth utilization ~5%)
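
As a rough cross-check of the table above, throughput and single-thread QPS can be derived from latency, assuming throughput = corpus size / latency and QPS = 1 / latency. The helper names below are illustrative and not part of the library, and the reported figures may differ slightly because they presumably come from unrounded timings.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not part of sqlite-vec-cpp): derive throughput and
// QPS from a measured per-query latency given in seconds.
inline double vectors_per_second(std::size_t corpus_size, double latency_s) {
    assert(latency_s > 0.0);
    return static_cast<double>(corpus_size) / latency_s;
}

inline double queries_per_second(double latency_s) {
    assert(latency_s > 0.0);
    return 1.0 / latency_s;
}
```

For the 10K row, `vectors_per_second(10000, 3.63e-3)` gives about 2.75 M/s and `queries_per_second(3.63e-3)` about 275, close to the reported 2.77 M/s and ~277 queries/sec.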
@@ -31,20 +34,22 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit
3134

3235
| K | Latency | Delta |
3336
|----|---------|-------|
34-
| 1 | 2.77 ms | -0.4% |
35-
| 5 | 2.78 ms | baseline |
36-
| 10 | 2.78 ms | 0.0% |
37-
| 50 | 2.77 ms | -0.4% |
37+
| 1 | 3.92 ms | +7.8% |
38+
| 5 | 3.63 ms | baseline |
39+
| 10 | 3.64 ms | +0.2% |
40+
| 50 | 3.60 ms | -0.9% |
41+
3842

3943
**Conclusion**: Partial sort overhead negligible; K-value has no meaningful impact.
4044

4145
### 3. Embedding Dimension Scaling (10K docs, K=5)
4246

4347
| Dimensions | Latency | Throughput | Scaling Factor |
4448
|------------|----------|------------|----------------|
45-
| 384d | 2.78 ms | 3.60 M/s | 1.0x |
46-
| 768d | 5.74 ms | 1.74 M/s | 2.06x |
47-
| 1536d | 11.7 ms | 856k/s | 4.21x |
49+
| 384d | 3.63 ms | 2.77 M/s | 1.0x |
50+
| 768d | 6.78 ms | 1.53 M/s | 1.87x |
51+
| 1536d | 13.2 ms | 780k/s | 3.64x |
52+
4853

4954
**Scaling**: Near-linear (2x dim → 1.87x latency, 4x dim → 3.64x latency)
5055
**Conclusion**: Compute-bound; SIMD efficiency remains high across dimensions.
@@ -53,24 +58,27 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit
5358

5459
| Type | Latency | Throughput | Storage | Overhead |
5560
|-------|---------|------------|---------|----------|
56-
| float | 2.78 ms | 3.60 M/s | 4 bytes | baseline |
57-
| int8 | 2.74 ms | 3.65 M/s | 1 byte | **-1.4%** |
61+
| float | 3.63 ms | 2.77 M/s | 4 bytes | baseline |
62+
| int8 | 3.62 ms | 2.79 M/s | 1 byte | **-0.4%** |
63+
5864

5965
**Conclusion**: int8 quantization matches float latency (0.4% faster, likely within measurement noise) while reducing storage 4x.
6066

6167
### 5. Multi-Query Throughput (10K docs, 384d)
6268

63-
- **10 queries**: 27.5 ms total (2.75 ms/query average)
64-
- **Sustained throughput**: 3.64 M vectors/second
65-
- **QPS**: ~364 queries/second (single-threaded)
66-
- **Parallelization potential**: 48 cores → ~17.4K QPS theoretical
69+
- **10 queries**: 36.2 ms total (3.62 ms/query average)
70+
- **Sustained throughput**: 2.76 M vectors/second
71+
- **QPS**: ~276 queries/second (single-threaded)
72+
- **Parallelization potential**: 48 cores → ~13.2K QPS theoretical
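
The theoretical figure above is a simple upper bound. A minimal sketch of the arithmetic, assuming perfectly linear scaling across cores (the function name is hypothetical):

```cpp
#include <cassert>

// Hypothetical helper: upper bound on aggregate QPS assuming perfectly linear
// scaling across cores (no contention, no shared-cache or bandwidth limits).
inline double theoretical_parallel_qps(double single_thread_qps, unsigned cores) {
    assert(single_thread_qps >= 0.0);
    return single_thread_qps * static_cast<double>(cores);
}
```

With 276 QPS on one thread and 48 cores this yields 13,248, i.e. the ~13.2K quoted above. Real scaling will usually fall short of this bound once cores compete for cache and memory bandwidth.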
73+
6774

6875
### 6. Sequential vs Batch (1K docs, 384d, K=5)
6976

7077
| Method | Latency | Throughput |
7178
|------------|---------|------------|
72-
| Sequential | 274 μs | 3.66 M/s |
73-
| Batch | 273 μs | 3.67 M/s |
79+
| Sequential | 287 μs | 3.51 M/s |
80+
| Batch | 288 μs | 3.51 M/s |
81+
7482

7583
**Conclusion**: Batch API provides cleaner code at performance parity (memory-bandwidth bound).
7684

@@ -82,43 +90,48 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit
8290

8391
| Scenario | Sequential | Batch | Speedup |
8492
|----------|------------|-------|---------|
85-
| 100×384d | 26.7 μs | 26.7 μs | 1.00x |
86-
| 1K×384d | 268 μs | 269 μs | 1.00x |
93+
| 100×384d | 27.8 μs | 28.3 μs | 0.98x |
94+
| 1K×384d | 289 μs | 283 μs | 1.02x |
95+
8796

8897
**Conclusion**: Parity performance; both memory-bandwidth limited.
8998

9099
### 2. Memory Layout Optimization
91100

92101
| Layout | Latency | Throughput | Improvement |
93102
|-------------|---------|------------|-------------|
94-
| Scattered | 269 μs | 3.73 M/s | baseline |
95-
| Contiguous | 267 μs | 3.75 M/s | +0.5% |
103+
| Scattered | 283 μs | 3.54 M/s | baseline |
104+
| Contiguous | 283 μs | 3.54 M/s | +0.0% |
105+
96106

97107
**Conclusion**: Marginal improvement; modern CPUs prefetch efficiently.
98108

99109
### 3. Top-K Performance (1K×384d, K=10)
100110

101-
- **Latency**: 268 μs (vs 268 μs full distance computation)
102-
- **Overhead**: <1% for partial sort
111+
- **Latency**: 290 μs (vs 287 μs full distance computation)
112+
- **Overhead**: ~1% for partial sort
103113
- **Conclusion**: `std::partial_sort` highly optimized; K << N has negligible cost.
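
To illustrate why top-K selection adds little cost, here is a minimal sketch of picking the K nearest results with `std::partial_sort` over precomputed distances. It is an illustration under assumed names (`top_k_indices` is hypothetical), not the library's actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative sketch (not the library's implementation): return the indices
// of the k smallest distances, ordered from nearest to farthest.
inline std::vector<std::size_t> top_k_indices(const std::vector<float>& distances,
                                              std::size_t k) {
    std::vector<std::size_t> idx(distances.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    k = std::min(k, idx.size());  // clamp so k never exceeds the corpus size
    // Only the first k positions end up sorted; the rest stay in unspecified order.
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&](std::size_t a, std::size_t b) { return distances[a] < distances[b]; });
    idx.resize(k);
    return idx;
}
```

`std::partial_sort` performs roughly N log K comparisons, so for K much smaller than N its cost is small next to computing N distances of 384 dimensions each.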
104114

115+
105116
### 4. Large Embeddings (1K×1536d)
106117

107-
- **Latency**: 1.13 ms
108-
- **Throughput**: 886k vectors/second
109-
- **Scaling**: 4.21x slower than 384d (expected 4.0x)
118+
- **Latency**: 1.18 ms
119+
- **Throughput**: 833k vectors/second
120+
- **Scaling**: 4.18x slower than 384d (expected 4.0x)
121+
110122

111123
---
112124

113125
## HNSW Decision Matrix
114126

115127
| Corpus Size | Brute-Force Latency | Recommendation |
116128
|-------------|---------------------|----------------|
117-
| <10K | <3ms | ✅ Brute-force optimal |
118-
| 10K-100K | 3-30ms | ⚠️ Brute-force acceptable for batch |
119-
| >100K | >30ms | ❌ HNSW required for real-time (<10ms) |
129+
| <10K | <4ms | ✅ Brute-force optimal |
130+
| 10K-100K | 4-40ms | ⚠️ Brute-force acceptable for batch |
131+
| >100K | >40ms | ❌ HNSW required for real-time (<10ms) |
132+
133+
**HNSW Threshold**: 100K vectors (brute-force takes ~41 ms, well above the <10 ms real-time target, so an ANN index is required)
120134

121-
**HNSW Threshold**: 100K vectors (27.9ms → >10ms target requires ANN index)
122135

123136
---
124137

@@ -142,16 +155,61 @@ The C++ implementation achieves **3.6M vectors/second sustained throughput** wit
142155

143156
| Metric | Target | Actual | Status |
144157
|--------|--------|--------|--------|
145-
| 1K corpus (<1ms) | 1000 μs | 273 μs |**3.6x better** |
146-
| 10K corpus (<5ms) | 5000 μs | 2780 μs |**1.8x better** |
147-
| 100K corpus (<50ms) | 50000 μs | 27900 μs |**1.8x better** |
148-
| int8 overhead (<20%) | 20% | -1.4% |**Faster** |
149-
| Dimension scaling | Linear | Linear |**Perfect** |
158+
| 1K corpus (<1ms) | 1000 μs | 288 μs |**3.5x better** |
159+
| 10K corpus (<5ms) | 5000 μs | 3633 μs |**1.4x better** |
160+
| 100K corpus (<50ms) | 50000 μs | 41000 μs |**1.2x better** |
161+
| int8 overhead (<20%) | 20% | -0.4% |**Faster** |
162+
| Dimension scaling | Linear | Near-linear |**Good** |
163+
164+
### Comparison to Previous Results (2025-11-02)
165+
166+
| Scenario | Previous | Current | Delta |
167+
|----------|----------|---------|-------|
168+
| 1K×384d (K=5) latency | 273 μs | 288 μs | +5.5% |
169+
| 10K×384d (K=5) latency | 2.78 ms | 3.63 ms | +30.6% |
170+
| 100K×384d (K=5) latency | 27.9 ms | 41.0 ms | +46.9% |
171+
| 10K×384d throughput | 3.60 M/s | 2.77 M/s | -23.1% |
172+
| int8 @10K×384d latency | 2.74 ms | 3.62 ms | +32.1% |
173+
174+
Notes:
175+
- Previous run header (2025-11-02): Linux (x86_64), GCC 15.2.0, Google Benchmark 1.9.1.
176+
- Current run header (2026-01-05): Windows 11 (x86_64), clang 21.1.6, Google Benchmark 1.9.4.
177+
- Treat deltas as environment differences rather than regressions unless measured on the same OS/toolchain.
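
The Delta column in the comparison table can be reproduced from the rounded values with a relative-change calculation. A minimal sketch follows (the helper name is hypothetical); last-digit differences are possible if the table was computed from unrounded timings.

```cpp
#include <cassert>

// Hypothetical helper: relative change from a previous to a current
// measurement, in percent. Positive means the current value is larger.
inline double percent_delta(double previous, double current) {
    assert(previous != 0.0);
    return (current - previous) / previous * 100.0;
}
```

For example, `percent_delta(273.0, 288.0)` is about 5.5 (the 1K latency row) and `percent_delta(3.60, 2.77)` is about -23.1 (the 10K throughput row).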
178+
179+
150180

151181
---
152182

153183
## Reproduction
154184

185+
### Windows (Conan)
186+
187+
```powershell
188+
# From third_party/sqlite-vec-cpp
189+
190+
# Install dependencies (Conan 2)
191+
conan profile detect --force
192+
conan install . -of build_bench_conan -b missing -s build_type=Release -s compiler.cppstd=23 -s compiler.runtime=static
193+
194+
# Make Conan-generated .pc files visible to pkg-config for this shell
195+
$env:PKG_CONFIG_PATH = (Resolve-Path .\build_bench_conan)
196+
197+
# Configure + build
198+
meson setup build_bench --wipe -Denable_benchmarks=true -Dbuildtype=release
199+
ninja -C build_bench benchmarks/rag_pipeline_benchmark.exe benchmarks/batch_distance_benchmark.exe
200+
201+
# Run
202+
.\build_bench\benchmarks\rag_pipeline_benchmark.exe --benchmark_min_time=0.5s
203+
.\build_bench\benchmarks\batch_distance_benchmark.exe --benchmark_min_time=0.5s
204+
205+
# JSON output for analysis
206+
.\build_bench\benchmarks\rag_pipeline_benchmark.exe `
207+
--benchmark_out=results.json `
208+
--benchmark_out_format=json
209+
```
210+
211+
### Linux/macOS (system packages)
212+
155213
```bash
156214
# Build benchmarks
157215
cd third_party/sqlite-vec-cpp
@@ -170,4 +228,5 @@ ninja -C build_bench
170228
--benchmark_out_format=json
171229
```
172230

231+
173232
---

conanfile.txt

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
[requires]
2+
sqlite3/3.46.1
3+
benchmark/1.9.4
4+
5+
[generators]
6+
MesonToolchain
7+
PkgConfigDeps
8+
9+
[options]
10+
benchmark/*:with_libbacktrace=False
11+
benchmark/*:with_libbpf=False
12+
benchmark/*:with_perf_counters=False
13+
benchmark/*:with_pthread=False

meson.build

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -37,7 +37,12 @@ else
3737
endif
3838

3939
# SQLite3 dependency
40-
sqlite3_dep = dependency('sqlite3', version: '>=3.38.0', required: true)
40+
# Prefer pkg-config on Unix-y systems, but on Windows we may not have pkg-config
41+
# available and `find_package(sqlite3)` isn't consistently provided.
42+
#
43+
# Using `method: 'pkg-config'` restricts the lookup to pkg-config; on Windows
44+
# the .pc files can come from Conan's PkgConfigDeps generator.
45+
sqlite3_dep = dependency('sqlite3', version: '>=3.38.0', required: true, method: 'pkg-config')
4146

4247
# Project include directories
4348
inc_dirs = include_directories('include')

tests/test_hnsw.cpp

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,9 +5,11 @@
55
#include <cassert>
66
#include <cmath>
77
#include <iostream>
8+
#include <numeric>
89
#include <random>
910
#include <span>
1011
#include <vector>
12+
1113
#include <sqlite-vec-cpp/distances/l2.hpp>
1214
#include <sqlite-vec-cpp/index/hnsw.hpp>
1315
#include <sqlite-vec-cpp/index/hnsw_persistence.hpp>
