perf(vector): vectorize RaBitQ dist table quantization #7241
Open
BubbleCal wants to merge 2 commits into
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
quantize_dist_table_into / quantize_dist_table_u16_into quantize the per-(query, partition) dim * 4-entry f32 FastScan distance table into u8 (fast/normal approx modes) or u16 (accurate mode) LUT entries. Both were scalar:
itertools::minmax_by(total_cmp) for the min/max pass (branchy pairwise compares that never vectorize;
minmax_impl alone is 6.0-6.6% of query time in cpu-clock profiles of IVF_RQ on dbpedia-openai-1M, dim=1536, num_bits=5, nprobes=24) plus a
scalar quantize-write loop.
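As a rough illustration of the min/max pass described above, here is a minimal sketch that contrasts pairwise total_cmp compares with a lane-wise fold. The function names and the LANES = 16 width are hypothetical and are not taken from the PR; the fold is only the kind of shape compilers can often auto-vectorize, not the actual kernel.

```rust
/// Reference min/max using pairwise `total_cmp` compares (std only).
fn minmax_total_cmp(table: &[f32]) -> Option<(f32, f32)> {
    let min = table.iter().copied().min_by(|a, b| a.total_cmp(b))?;
    let max = table.iter().copied().max_by(|a, b| a.total_cmp(b))?;
    Some((min, max))
}

/// Hypothetical lane-wise fold: keep LANES independent running minima and
/// maxima, then reduce them once at the end. Assumes finite inputs.
fn minmax_fold(table: &[f32]) -> Option<(f32, f32)> {
    const LANES: usize = 16; // assumed width, chosen for illustration only
    if table.is_empty() {
        return None;
    }
    let mut mins = [f32::INFINITY; LANES];
    let mut maxs = [f32::NEG_INFINITY; LANES];
    let mut chunks = table.chunks_exact(LANES);
    for chunk in &mut chunks {
        for i in 0..LANES {
            mins[i] = mins[i].min(chunk[i]);
            maxs[i] = maxs[i].max(chunk[i]);
        }
    }
    // Horizontal reduction of the per-lane accumulators.
    let mut lo = mins.iter().copied().fold(f32::INFINITY, f32::min);
    let mut hi = maxs.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Tail elements when the length is not a multiple of LANES.
    for &d in chunks.remainder() {
        lo = lo.min(d);
        hi = hi.max(d);
    }
    Some((lo, hi))
}

fn main() {
    // 37 finite values: deliberately not a multiple of 16.
    let table: Vec<f32> = (0..37).map(|i| (i as f32 * 0.37).sin() * 10.0).collect();
    let reference = minmax_total_cmp(&table);
    let folded = minmax_fold(&table);
    assert_eq!(reference, folded);
    println!("min/max = {:?}", folded);
}
```

On finite inputs the two functions return equal values; they can differ only in how NaN and the sign of zero are ordered.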
This moves them into a dedicated module vector/bq/dist_table_quant.rs with the same runtime-dispatch treatment as the ex-dot kernels (#7205):
fold that LLVM auto-vectorizes (NEON is part of the aarch64 baseline).
(d - qmin) * factor, cvtps_epi32 (nearest-even, MXCSR default), then unsigned-saturating narrows (vpmovusdb/vpmovusdw on AVX-512; packus + lane-restore permutes on AVX2); scalar fallback uses round_ties_even.
Numeric semantics
Rounding changes from f32::round (half away from zero) to half-to-even so the scalar fallback and all SIMD paths are bit-exact with each other. Relative to the old code this can
move a LUT entry by 1 only on exact .5 ties, within the table's inherent quantization error.
total_cmp differs only on NaN (inputs are finite sums of rotated-query components) and the sign of zero, which callers cannot observe (d - qmin and the qmin == qmax early-out are unchanged either way).
255/f32::MAX, which overflows factor to inf, or above f32::MAX) now collapse to the zeroed-LUT early-out. Previously these produced garbage (NaN -> 0 casts); the SIMD narrows would additionally disagree across kernels on the NaN
products, so the early-out keeps every path deterministic and bit-exact.
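The rounding and early-out behavior can be sketched with a small scalar quantizer. This is an illustrative stand-in, not the PR's code: quantize_u8_with is a hypothetical helper, and the exact degenerate-range checks used in the PR are assumed rather than known.

```rust
/// Hypothetical scalar u8 quantizer, parameterized over the rounding function
/// so the old (`f32::round`) and new (`f32::round_ties_even`) behaviors can be
/// compared side by side.
fn quantize_u8_with(table: &[f32], round: fn(f32) -> f32) -> Vec<u8> {
    let (qmin, qmax) = table
        .iter()
        .fold((f32::INFINITY, f32::NEG_INFINITY), |(lo, hi), &d| {
            (lo.min(d), hi.max(d))
        });
    let range = qmax - qmin;
    let factor = 255.0 / range;
    // Assumed early-out: all-equal input, a range that overflows to inf, or a
    // range so small that `factor` overflows to inf all yield a zeroed LUT.
    if range == 0.0 || !range.is_finite() || !factor.is_finite() {
        return vec![0; table.len()];
    }
    table
        .iter()
        // Float-to-int `as` casts saturate in Rust, so results stay in 0..=255.
        .map(|&d| round((d - qmin) * factor) as u8)
        .collect()
}

fn main() {
    // Integer table with range 510, so factor == 0.5 exactly and the scaled
    // values are 0, 0.5, 1.5, 2.5, 255.
    let ties = [0.0_f32, 1.0, 3.0, 5.0, 510.0];
    let even = quantize_u8_with(&ties, f32::round_ties_even);
    let away = quantize_u8_with(&ties, f32::round);
    println!("half-to-even:   {:?}", even); // [0, 0, 2, 2, 255]
    println!("half-away-zero: {:?}", away); // [0, 1, 2, 3, 255]

    // Degenerate ranges collapse to a zeroed LUT under the assumed checks.
    println!("{:?}", quantize_u8_with(&[1.0, 1.0, 1.0], f32::round_ties_even));
    println!("{:?}", quantize_u8_with(&[f32::MAX, -f32::MAX], f32::round_ties_even));
    println!("{:?}", quantize_u8_with(&[0.0, f32::MIN_POSITIVE], f32::round_ties_even));
}
```

The two rounding modes disagree only on the entries whose scaled value is an exact .5 tie (here 0.5 and 2.5), and by exactly 1.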
Benchmarks
GCP c4-standard-16 (AVX-512), taskset -c 4, criterion baseline = main (b6a99cd), cargo bench -p lance-index --bench rq:
The untouched RQ bulk ex kernel loop control group moved within ±3% between runs, so the binary-only wins are far above the machine noise floor. Full-path results for num_bits=5/9 are
within that noise envelope (the binary stage is a small share of those).
On Apple M-series (NEON: portable min/max fold + autovectorized scalar quantize), the same
paired criterion run improves binary-only distance_all by 24% / 13% / 8% (num_bits=3/5/9);
the Mac's run-to-run noise is larger than that of the pinned GCP runs, but all three changes are well beyond it.
Tests
against a straightforward total_cmp + round_ties_even reference and require bit-exact agreement: random inputs (lengths 1..6160 including non-multiples of 16/32, scales
1e-3..1e4), exact .5 ties (integer tables constructed so factor == 0.5), all-equal inputs, signed-zero mixes, degenerate ranges, and scratch-buffer reuse.
cargo test -p lance-index --lib vector::bq: 207 pass on aarch64 (scalar + NEON dispatch) and on AVX-512/AVX2 hardware (GCP c4, both debug and release).
cargo fmt --all and cargo clippy --all --tests --benches -- -D warnings are clean.
🤖 Generated with Claude Code