perf(vector): vectorize RaBitQ dist table quantization by BubbleCal · Pull Request #7241 · lance-format/lance

BubbleCal · 2026-06-12T06:00:41Z

What

quantize_dist_table_into / quantize_dist_table_u16_into quantize the per-(query, partition)
dim * 4-entry f32 FastScan distance table into u8 (fast/normal approx modes) or u16 (accurate
mode) LUT entries. Both were scalar: itertools::minmax_by(total_cmp) for the min/max pass
(branchy pairwise compares that never vectorize; minmax_impl alone is 6.0-6.6% of query time in
cpu-clock profiles of IVF_RQ on dbpedia-openai-1M, dim=1536, num_bits=5, nprobes=24) plus a
scalar quantize-write loop.

This moves them into a dedicated module vector/bq/dist_table_quant.rs with the same
runtime-dispatch treatment as the ex-dot kernels (#7205):

- min/max: two-accumulator 16-lane AVX-512 / 8-lane AVX2 folds; elsewhere a portable 16-lane
  fold that LLVM auto-vectorizes (NEON is part of the aarch64 baseline).
- quantize: (d - qmin) * factor, cvtps_epi32 (nearest-even, MXCSR default), then
  unsigned-saturating narrows (vpmovusdb/vpmovusdw on AVX-512; packus + lane-restore
  permutes on AVX2); scalar fallback uses round_ties_even.
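
For readers who want a picture of the portable path, here is a minimal sketch of a 16-lane min/max fold and a scalar quantize loop. It is not the code in dist_table_quant.rs: the names `min_max_fold` and `quantize_u8_into` are hypothetical, the caller is assumed to pass a finite, precomputed `factor`, and inputs are assumed to be finite.

```rust
/// Hypothetical portable min/max fold (illustrative name, not the PR's function).
/// Sixteen independent accumulators keep the inner loop free of cross-lane
/// dependencies, which is the shape auto-vectorizers usually handle well.
/// Assumes finite inputs; NaN and signed-zero behaviour is not modelled here.
pub fn min_max_fold(table: &[f32]) -> Option<(f32, f32)> {
    if table.is_empty() {
        return None;
    }
    const LANES: usize = 16;
    let mut mins = [f32::INFINITY; LANES];
    let mut maxs = [f32::NEG_INFINITY; LANES];
    let mut chunks = table.chunks_exact(LANES);
    for chunk in &mut chunks {
        for i in 0..LANES {
            mins[i] = f32::min(mins[i], chunk[i]);
            maxs[i] = f32::max(maxs[i], chunk[i]);
        }
    }
    // Horizontal reduction of the lane accumulators, then the scalar tail.
    let mut lo = f32::INFINITY;
    let mut hi = f32::NEG_INFINITY;
    for i in 0..LANES {
        lo = f32::min(lo, mins[i]);
        hi = f32::max(hi, maxs[i]);
    }
    for &v in chunks.remainder() {
        lo = f32::min(lo, v);
        hi = f32::max(hi, v);
    }
    Some((lo, hi))
}

/// Hypothetical scalar quantize loop: scale, round half-to-even, saturate to u8.
/// `factor` is assumed to be finite and supplied by the caller.
pub fn quantize_u8_into(table: &[f32], qmin: f32, factor: f32, out: &mut Vec<u8>) {
    out.clear();
    out.extend(table.iter().map(|&d| {
        let scaled = ((d - qmin) * factor).round_ties_even();
        // Explicit clamp mirrors the unsigned-saturating narrow of the SIMD paths.
        scaled.clamp(0.0, 255.0) as u8
    }));
}
```

Calling `min_max_fold` on a table whose length is not a multiple of 16 exercises both the lane accumulators and the scalar tail; `quantize_u8_into` reuses the caller's buffer, in the spirit of the `_into` naming above.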

Numeric semantics

- Rounding changes from f32::round (half away from zero) to half-to-even so the scalar
  fallback and all SIMD paths are bit-exact with each other. Relative to the old code this can
  move a LUT entry by 1 only on exact .5 ties, within the table's inherent quantization error.
- SIMD min/max vs total_cmp differs only on NaN (inputs are finite sums of rotated-query
  components) and the sign of zero, which callers cannot observe (d - qmin and the
  qmin == qmax early-out are unchanged either way).
- Degenerate ranges (table spread below ~255/f32::MAX, which overflows factor to inf, or
  above f32::MAX) now collapse to the zeroed-LUT early-out. Previously these produced garbage
  (NaN -> 0 casts); the SIMD narrows would additionally disagree across kernels on the NaN
  products, so the early-out keeps every path deterministic and bit-exact.
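
The rounding change and the degenerate-range guard can be illustrated with a small sketch. The helper names `scale_factor` and `round_old_and_new` are hypothetical, and defining the factor as `levels / (qmax - qmin)` is an assumption inferred from the description above rather than a quote of the module.

```rust
/// Hypothetical guard: returns the scale factor, or None when the range is
/// degenerate and the caller should emit a zeroed LUT instead.
/// Assumes factor = levels / (qmax - qmin), e.g. levels = 255.0 for u8.
pub fn scale_factor(qmin: f32, qmax: f32, levels: f32) -> Option<f32> {
    let range = qmax - qmin;
    let factor = levels / range;
    // range == 0      -> all entries equal
    // factor == inf   -> spread too small, the division overflowed
    // range == inf    -> spread above f32::MAX, products could become NaN
    if range > 0.0 && range.is_finite() && factor.is_finite() {
        Some(factor)
    } else {
        None
    }
}

/// Returns (old, new) LUT entries for one distance: the old code rounded half
/// away from zero (f32::round), the new code rounds half to even.
pub fn round_old_and_new(d: f32, qmin: f32, factor: f32) -> (u8, u8) {
    let scaled = (d - qmin) * factor;
    (scaled.round() as u8, scaled.round_ties_even() as u8)
}
```

With `factor == 0.5` and an integer table, `d = 5.0` scales to exactly 2.5, which the old rounding maps to 3 and the new rounding maps to 2; `d = 3.0` scales to 1.5 and both agree on 2, so only ties that fall next to an even integer move.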

Benchmarks

GCP c4-standard-16 (AVX-512), taskset -c 4, criterion baseline = main (b6a99cd),
cargo bench -p lance-index --bench rq:

bench (DIM=1536, rows=16384)	before	after	change
binary-only distance_all num_bits=3	130.89 us	117.31 us	-10.3%
binary-only distance_all num_bits=5	127.69 us	110.43 us	-13.6%
binary-only distance_all num_bits=9	127.91 us	106.43 us	-16.4%
full distance_all num_bits=3	413.89 us	399.48 us	-3.9%

The untouched control group (RQ bulk ex kernel loop) moved within ±3% between runs, so the
binary-only wins are far above the machine noise floor. Full-path results for num_bits=5/9 are
within that noise envelope (the binary stage is a small share of those).

On Apple M-series (NEON: portable min/max fold + autovectorized scalar quantize), the same
paired criterion run reduces binary-only distance_all time by 24% / 13% / 8% (num_bits=3/5/9);
the Mac's run-to-run noise is larger than the pinned GCP runs, but all three move well past it.

Tests

- New differential tests run every available kernel (scalar, AVX2, AVX-512 when detected)
  against a straightforward total_cmp + round_ties_even reference and require bit-exact
  agreement: random inputs (lengths 1..6160 including non-multiples of 16/32, scales
  1e-3..1e4), exact .5 ties (integer tables constructed so factor == 0.5), all-equal inputs,
  signed-zero mixes, degenerate ranges, and scratch-buffer reuse.
- cargo test -p lance-index --lib vector::bq: 207 pass on aarch64 (scalar + NEON dispatch)
  and on AVX-512/AVX2 hardware (GCP c4, both debug and release).
- cargo fmt --all and cargo clippy --all --tests --benches -- -D warnings are clean.
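
The differential-test idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's test code: `MinMaxKernel`, `reference_min_max`, `fold_min_max`, `next_f32`, and `first_mismatch` are hypothetical names, the pseudo-random source is a plain xorshift32 chosen to keep the sketch dependency-free, and only the min/max stage is compared.

```rust
/// Hypothetical kernel signature for the min/max stage.
pub type MinMaxKernel = fn(&[f32]) -> (f32, f32);

/// Straightforward reference built on total_cmp. Panics on empty input.
pub fn reference_min_max(t: &[f32]) -> (f32, f32) {
    let lo = t.iter().copied().min_by(|a, b| a.total_cmp(b)).unwrap();
    let hi = t.iter().copied().max_by(|a, b| a.total_cmp(b)).unwrap();
    (lo, hi)
}

/// Stand-in for a kernel under test (a simple fold, not a SIMD kernel).
pub fn fold_min_max(t: &[f32]) -> (f32, f32) {
    t.iter()
        .fold((f32::INFINITY, f32::NEG_INFINITY), |(lo, hi), &v| {
            (lo.min(v), hi.max(v))
        })
}

/// xorshift32 step mapped to a finite f32 in [-1, 1).
pub fn next_f32(state: &mut u32) -> f32 {
    *state ^= *state << 13;
    *state ^= *state >> 17;
    *state ^= *state << 5;
    (*state >> 8) as f32 / (1u32 << 23) as f32 - 1.0
}

/// Runs `kernel` against the reference on one random table per length and
/// returns the first length whose results are not bit-identical.
pub fn first_mismatch(kernel: MinMaxKernel, lengths: &[usize], scale: f32) -> Option<usize> {
    let mut state = 0x9E37_79B9u32;
    for &len in lengths {
        let table: Vec<f32> = (0..len).map(|_| next_f32(&mut state) * scale).collect();
        let (ref_lo, ref_hi) = reference_min_max(&table);
        let (got_lo, got_hi) = kernel(&table);
        // Compare bit patterns so the check is strict bit-exactness, not float equality.
        if ref_lo.to_bits() != got_lo.to_bits() || ref_hi.to_bits() != got_hi.to_bits() {
            return Some(len);
        }
    }
    None
}
```

Comparing `to_bits()` rather than using `==` is what makes the check bit-exact; choosing lengths on both sides of the 16- and 32-element boundaries covers the vector body and the tail.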

🤖 Generated with Claude Code

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov · 2026-06-12T06:40:32Z

Codecov Report

❌ Patch coverage is 78.13121% with 110 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/vector/bq/dist_table_quant.rs	78.13%	105 Missing and 5 partials ⚠️


BubbleCal and others added 2 commits June 12, 2026 12:55

perf(vector): vectorize RaBitQ dist table quantization

853b726

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

perf(vector): handle degenerate dist table ranges in SIMD quantization

1084414

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions Bot added A-index Vector index, linalg, tokenizer performance labels Jun 12, 2026

BubbleCal wants to merge 2 commits into main from yang/rq-dist-table-quant-simd
