
perf(vector): vectorize RaBitQ dist table quantization#7241

Open
BubbleCal wants to merge 2 commits into main from yang/rq-dist-table-quant-simd

@BubbleCal

Contributor

What

quantize_dist_table_into / quantize_dist_table_u16_into quantize the per-(query, partition)
dim * 4-entry f32 FastScan distance table into u8 (fast/normal approx modes) or u16 (accurate
mode) LUT entries. Both were scalar: the min/max pass used itertools::minmax_by(total_cmp),
whose branchy pairwise compares never vectorize, followed by a scalar quantize-write loop.
minmax_impl alone accounts for 6.0-6.6% of query time in cpu-clock profiles of IVF_RQ on
dbpedia-openai-1M (dim=1536, num_bits=5, nprobes=24).

This moves them into a dedicated module vector/bq/dist_table_quant.rs with the same
runtime-dispatch treatment as the ex-dot kernels (#7205):

  • min/max: two-accumulator 16-lane AVX-512 / 8-lane AVX2 folds; elsewhere a portable 16-lane
    fold that LLVM auto-vectorizes (NEON is part of the aarch64 baseline).
  • quantize: (d - qmin) * factor, cvtps_epi32 (nearest-even, MXCSR default), then
    unsigned-saturating narrows (vpmovusdb/vpmovusdw on AVX-512; packus + lane-restore
    permutes on AVX2); scalar fallback uses round_ties_even.

Numeric semantics

  • Rounding changes from f32::round (half away from zero) to half-to-even so the scalar
    fallback and all SIMD paths are bit-exact with each other. Relative to the old code this can
    move a LUT entry by 1 only on exact .5 ties, within the table's inherent quantization error.
  • SIMD min/max vs total_cmp differs only on NaN (inputs are finite sums of rotated-query
    components) and the sign of zero, which callers cannot observe (d - qmin and the
    qmin == qmax early-out are unchanged either way).
  • Degenerate ranges (table spread below ~255/f32::MAX, which overflows factor to inf, or
    above f32::MAX) now collapse to the zeroed-LUT early-out. Previously these produced garbage
    (NaN -> 0 casts); the SIMD narrows would additionally disagree across kernels on the NaN
    products, so the early-out keeps every path deterministic and bit-exact.
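
The three points above can be checked with a few lines of plain Rust. This is only a sketch: round_old_new, checked_factor, and zero_orderings are hypothetical helpers written for this note, and the 255-level scale is an assumption matching the u8 case.

```rust
use std::cmp::Ordering;

/// Returns (old rounding, new rounding) for one scaled table entry:
/// f32::round is half away from zero, round_ties_even is half-to-even.
pub fn round_old_new(x: f32) -> (f32, f32) {
    (x.round(), x.round_ties_even())
}

/// Hypothetical degenerate-range check: None means "take the zeroed-LUT early-out".
pub fn checked_factor(qmin: f32, qmax: f32) -> Option<f32> {
    let spread = qmax - qmin; // becomes inf when the true spread exceeds f32::MAX
    let factor = 255.0 / spread; // becomes inf when the spread is tiny
    if spread > 0.0 && spread.is_finite() && factor.is_finite() {
        Some(factor)
    } else {
        None
    }
}

/// total_cmp orders -0.0 below +0.0, while `<` treats the two as equal.
pub fn zero_orderings() -> (Ordering, bool) {
    ((-0.0f32).total_cmp(&0.0), -0.0f32 < 0.0)
}
```

Only exact ties differ between the two rounding modes: round_old_new(2.5) gives (3.0, 2.0), while round_old_new(1.5) gives (2.0, 2.0). A spread of 1e-40 or a range of -f32::MAX..f32::MAX both make checked_factor return None.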

Benchmarks

GCP c4-standard-16 (AVX-512), taskset -c 4, criterion baseline = main (b6a99cd),
cargo bench -p lance-index --bench rq:

| bench (DIM=1536, rows=16384) | before | after | change |
|---|---|---|---|
| binary-only distance_all num_bits=3 | 130.89 us | 117.31 us | -10.3% |
| binary-only distance_all num_bits=5 | 127.69 us | 110.43 us | -13.6% |
| binary-only distance_all num_bits=9 | 127.91 us | 106.43 us | -16.4% |
| full distance_all num_bits=3 | 413.89 us | 399.48 us | -3.9% |

The untouched control group (the RQ bulk ex kernel loop) moved within ±3% between runs, so the
binary-only wins are well above the machine noise floor. Full-path results for num_bits=5/9 are
within that noise envelope (the binary stage is a small share of those).

On Apple M-series (NEON: portable min/max fold + autovectorized scalar quantize), the same
paired criterion run improves binary-only distance_all by 24% / 13% / 8% (num_bits=3/5/9);
the Mac's run-to-run noise is larger than the pinned GCP runs, but all three move well past it.

Tests

  • New differential tests run every available kernel (scalar, AVX2, AVX-512 when detected)
    against a straightforward total_cmp + round_ties_even reference and require bit-exact
    agreement: random inputs (lengths 1..6160 including non-multiples of 16/32, scales
    1e-3..1e4), exact .5 ties (integer tables constructed so factor == 0.5), all-equal inputs,
    signed-zero mixes, degenerate ranges, and scratch-buffer reuse.
  • cargo test -p lance-index --lib vector::bq: 207 pass on aarch64 (scalar + NEON dispatch)
    and on AVX-512/AVX2 hardware (GCP c4, both debug and release).
  • cargo fmt --all and cargo clippy --all --tests --benches -- -D warnings are clean.
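
A differential test of the kind described in the first bullet could be shaped roughly as follows. This is a sketch under assumptions, not the PR's test code: Kernel, write_lut, reference_kernel, fold_kernel, lcg_table, and check_bit_exact are hypothetical names, the inputs come from a small hand-rolled LCG rather than a real RNG crate, and only the u8 case is covered.

```rust
/// A kernel takes the f32 table and fills `out` with u8 LUT entries.
pub type Kernel = fn(&[f32], &mut Vec<u8>);

/// Shared write step: zeroed-LUT early-out for degenerate ranges, else scale and round.
fn write_lut(table: &[f32], qmin: f32, qmax: f32, out: &mut Vec<u8>) {
    out.clear();
    let spread = qmax - qmin;
    let factor = 255.0 / spread;
    if spread <= 0.0 || !spread.is_finite() || !factor.is_finite() {
        out.resize(table.len(), 0);
        return;
    }
    out.extend(table.iter().map(|&d| ((d - qmin) * factor).round_ties_even() as u8));
}

/// Reference: total_cmp min/max plus round_ties_even.
pub fn reference_kernel(table: &[f32], out: &mut Vec<u8>) {
    let lo = table.iter().copied().min_by(|a, b| a.total_cmp(b));
    let hi = table.iter().copied().max_by(|a, b| a.total_cmp(b));
    match (lo, hi) {
        (Some(lo), Some(hi)) => write_lut(table, lo, hi, out),
        _ => out.clear(),
    }
}

/// Stand-in for a kernel under test: compare-based fold (differs on signed zero).
pub fn fold_kernel(table: &[f32], out: &mut Vec<u8>) {
    if table.is_empty() {
        out.clear();
        return;
    }
    let (mut lo, mut hi) = (f32::INFINITY, f32::NEG_INFINITY);
    for &v in table {
        if v < lo {
            lo = v;
        }
        if v > hi {
            hi = v;
        }
    }
    write_lut(table, lo, hi, out);
}

/// Deterministic pseudo-random table in [-scale/2, scale/2) from a 64-bit LCG.
pub fn lcg_table(seed: u64, len: usize, scale: f32) -> Vec<f32> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            let unit = (state >> 40) as f32 / (1u64 << 24) as f32; // in [0, 1)
            (unit - 0.5) * scale
        })
        .collect()
}

/// Runs every kernel against the reference and requires identical bytes.
pub fn check_bit_exact(kernels: &[(&str, Kernel)]) -> Result<(), String> {
    let mut cases: Vec<Vec<f32>> = vec![
        vec![0.0, 510.0, 1.0, 3.0, 5.0], // factor == 0.5, so exact .5 ties
        vec![2.0; 40],                   // all-equal input
        vec![0.0, -0.0, 1.0, -0.0],      // signed-zero mix
    ];
    for (i, &len) in [1usize, 15, 16, 17, 31, 32, 33, 6159].iter().enumerate() {
        for &scale in &[1e-3f32, 1.0, 1e4] {
            cases.push(lcg_table(i as u64 + 1, len, scale));
        }
    }
    let (mut expected, mut got) = (Vec::new(), Vec::new()); // scratch buffers are reused
    for (case_idx, table) in cases.iter().enumerate() {
        reference_kernel(table, &mut expected);
        for &(name, kernel) in kernels {
            kernel(table, &mut got);
            if got != expected {
                return Err(format!("kernel {name} diverged on case {case_idx}"));
            }
        }
    }
    Ok(())
}
```

Real SIMD kernels would be added to the kernels slice only when the corresponding CPU feature is detected at runtime, so the same harness runs unchanged on aarch64 and x86_64.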

🤖 Generated with Claude Code

BubbleCal and others added 2 commits June 12, 2026 12:55
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer performance labels Jun 12, 2026
@codecov

codecov Bot commented Jun 12, 2026


Codecov Report

❌ Patch coverage is 78.13121% with 110 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/vector/bq/dist_table_quant.rs | 78.13% | 105 Missing and 5 partials ⚠️ |
