perf(vector): vectorize RaBitQ top-k lower-bound pruning scan by BubbleCal · Pull Request #7243 · lance-format/lance

BubbleCal · 2026-06-12T06:44:36Z

What

The per-partition top-k scan of multi-bit IVF_RQ search (accumulate_raw_query_multi_bit_topk_with_scratch, Normal/Accurate modes) walked all n rows with a scalar lower-bound computation — per row: 4 bounds-checked loads, ~5 FLOPs, two compares, iterator and Option plumbing — even though ~99.9% of rows are pruned once the heap is tight. On dbpedia-openai-1M (1536d, num_bits=5, nprobes=24, k=10) this loop profiled at ~10–13% of query self time.

This PR vectorizes the classification:

Dense path (accumulate_topk_with_scratch, rows 0..n): new bq::prune kernels evaluate the lower bound and both pruning compares for 16 rows per call, returning bit masks. Mask-zero groups (the common case) are skipped whole; surviving lanes run the existing scalar rerank with live values. The row_id mapping is now only invoked for rows that reach the scalar tail, and no scratch buffer is needed — everything stays in registers.
Sparse path (prefiltered accumulate_filtered_topk_with_scratch): unchanged scalar loop.
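A minimal sketch of the grouped classification described above, not the actual bq::prune kernel. The Factors columns, the lower_bound formula, and the surviving_rows helper are hypothetical stand-ins (the real bound lives in raw_query_lower_bound and is not reproduced here); only the shape is meant to match: classify 16 rows into a bit mask, skip mask-zero groups whole, and visit just the set bits.

```rust
/// Rows classified per kernel call (the PR uses 16).
const LANES: usize = 16;

/// Hypothetical factor columns, one entry per row.
struct Factors<'a> {
    est: &'a [f32],
    scale: &'a [f32],
    add: &'a [f32],
    err: &'a [f32],
}

/// Hypothetical lower bound: four loads, separate multiplies and adds (no FMA).
#[inline]
fn lower_bound(f: &Factors, row: usize, eps: f32) -> f32 {
    (f.est[row] * f.scale[row] + f.add[row]) - f.err[row] * eps
}

/// Portable 16-wide classification: bit i is set when row `base + i` survives,
/// i.e. its lower bound is NOT >= threshold. A NaN bound fails `>=`, so it survives.
fn survivor_mask(f: &Factors, base: usize, eps: f32, threshold: f32) -> u16 {
    let mut mask = 0u16;
    for lane in 0..LANES {
        let pruned = lower_bound(f, base + lane, eps) >= threshold;
        mask |= (!pruned as u16) << lane;
    }
    mask
}

/// Collect the rows that would reach the scalar rerank.
fn surviving_rows(f: &Factors, eps: f32, threshold: f32) -> Vec<usize> {
    let n = f.est.len();
    let mut out = Vec::new();
    let mut base = 0;
    while base + LANES <= n {
        let mut mask = survivor_mask(f, base, eps, threshold);
        // A zero mask skips the whole group; otherwise visit set bits in row order.
        while mask != 0 {
            out.push(base + mask.trailing_zeros() as usize);
            mask &= mask - 1; // clear the lowest set bit
        }
        base += LANES;
    }
    // Scalar tail for the final partial group; `!(x >= t)` is deliberate for NaN.
    for row in base..n {
        if !(lower_bound(f, row, eps) >= threshold) {
            out.push(row);
        }
    }
    out
}
```

In this sketch the threshold is fixed for the whole scan; in the PR it is a snapshot of the heap threshold taken at each group start, and survivors are re-checked against live values.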

Kernels follow the ex_dot dispatch pattern: #[target_feature] AVX-512 and AVX2 implementations behind a LazyLock runtime-dispatched fn pointer, with a portable 16-wide fallback that LLVM auto-vectorizes (NEON is baseline on aarch64).
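The dispatch pattern can be sketched roughly as follows. This is an illustration under assumptions, not the code in prune.rs: the names GeMaskFn, ge_mask, ge_mask_portable, and ge_mask_avx2 are hypothetical, the kernel takes precomputed lower bounds instead of factor columns, and only the AVX2 variant is shown (the AVX-512 one is omitted).

```rust
use std::sync::LazyLock;

/// Hypothetical kernel signature: bit i is set when lb[i] >= threshold (lane is pruned).
type GeMaskFn = fn(&[f32; 16], f32) -> u16;

/// Portable fallback; a fixed 16-iteration loop that the compiler can auto-vectorize.
fn ge_mask_portable(lb: &[f32; 16], threshold: f32) -> u16 {
    let mut mask = 0u16;
    for (i, &v) in lb.iter().enumerate() {
        mask |= ((v >= threshold) as u16) << i;
    }
    mask
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn ge_mask_avx2(lb: &[f32; 16], threshold: f32) -> u16 {
    use std::arch::x86_64::*;
    unsafe {
        let t = _mm256_set1_ps(threshold);
        let lo = _mm256_loadu_ps(lb.as_ptr());
        let hi = _mm256_loadu_ps(lb.as_ptr().add(8));
        // Ordered-quiet GE: a NaN lane compares false, matching scalar `>=`.
        let m_lo = _mm256_movemask_ps(_mm256_cmp_ps::<_CMP_GE_OQ>(lo, t)) as u32;
        let m_hi = _mm256_movemask_ps(_mm256_cmp_ps::<_CMP_GE_OQ>(hi, t)) as u32;
        (m_lo | (m_hi << 8)) as u16
    }
}

/// Safe wrapper so the AVX2 kernel fits the plain `fn` pointer type.
#[cfg(target_arch = "x86_64")]
fn ge_mask_avx2_checked(lb: &[f32; 16], threshold: f32) -> u16 {
    // SAFETY: the dispatcher below installs this only after detecting AVX2 at runtime.
    unsafe { ge_mask_avx2(lb, threshold) }
}

/// Resolved once, on first use.
static GE_MASK: LazyLock<GeMaskFn> = LazyLock::new(|| -> GeMaskFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return ge_mask_avx2_checked;
        }
    }
    ge_mask_portable
});

/// Entry point: one indirect call through the cached pointer.
pub fn ge_mask(lb: &[f32; 16], threshold: f32) -> u16 {
    (*GE_MASK)(lb, threshold)
}
```

Callers only ever see ge_mask; on aarch64 and other targets the x86 items are compiled out and the portable loop is the only candidate.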

accumulate_distances_into_heap (Fast-mode bypass) is left as is: it has no factor arrays and would need a separate kernel shape, and Fast mode doesn't take the gated path.

Correctness

The dense path is bit-identical to the scalar implementation, not just statistically equivalent:

The kernels keep the scalar operation order (multiplies and adds, no FMA), so the lower bounds match raw_query_lower_bound bit for bit, and comparisons use ordered-quiet GE (_CMP_GE_OQ) matching scalar >= (a NaN lower bound is never pruned).
The heap threshold snapshot taken at each 16-row group start can be stale, but the threshold only ever tightens, so the masks can only over-select survivors — and survivors are re-checked per row against live values.
Heap contents, processing order, and the LANCE_RQ_PRUNE_STATS counters are unchanged.
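The stale-snapshot argument can be checked with a small model. The sketch below is illustrative only: TopK is a hypothetical sorted-vector stand-in for the real heap, lower bounds and exact distances are passed in as precomputed slices, and k >= 1 is assumed. It compares a row-by-row scan with a grouped scan that builds its mask from a group-start snapshot and then re-checks each survivor against the live threshold.

```rust
const LANES: usize = 16;

/// Hypothetical top-k container: (distance, row) pairs sorted ascending, at most k.
struct TopK {
    k: usize, // assumed >= 1
    items: Vec<(f32, usize)>,
}

impl TopK {
    fn new(k: usize) -> Self {
        TopK { k, items: Vec::with_capacity(k) }
    }

    /// Pruning threshold: +inf until the container is full, then the worst kept distance.
    fn threshold(&self) -> f32 {
        if self.items.len() < self.k { f32::INFINITY } else { self.items[self.k - 1].0 }
    }

    fn offer(&mut self, dist: f32, row: usize) {
        if self.items.len() == self.k {
            if dist >= self.threshold() {
                return;
            }
            self.items.pop(); // evict the current worst; the threshold can only tighten
        }
        let pos = self.items.partition_point(|&(d, _)| d <= dist);
        self.items.insert(pos, (dist, row));
    }
}

/// Reference: classify every row against the live threshold.
fn scan_scalar(lb: &[f32], dist: &[f32], k: usize) -> (Vec<(f32, usize)>, usize) {
    let mut heap = TopK::new(k);
    let mut reranked = 0;
    for row in 0..lb.len() {
        if lb[row] >= heap.threshold() {
            continue;
        }
        reranked += 1;
        heap.offer(dist[row], row);
    }
    (heap.items, reranked)
}

/// Grouped: mask from a group-start snapshot, then a live re-check per surviving lane.
fn scan_grouped(lb: &[f32], dist: &[f32], k: usize) -> (Vec<(f32, usize)>, usize) {
    let mut heap = TopK::new(k);
    let mut reranked = 0;
    for base in (0..lb.len()).step_by(LANES) {
        let end = (base + LANES).min(lb.len());
        let snapshot = heap.threshold(); // may go stale while this group is processed
        let mut mask = 0u16;
        for row in base..end {
            if !(lb[row] >= snapshot) {
                mask |= 1 << (row - base);
            }
        }
        while mask != 0 {
            let row = base + mask.trailing_zeros() as usize;
            mask &= mask - 1;
            // live threshold <= snapshot, so the mask can only have over-selected
            if lb[row] >= heap.threshold() {
                continue;
            }
            reranked += 1;
            heap.offer(dist[row], row);
        }
    }
    (heap.items, reranked)
}
```

Any row the snapshot prunes would also be pruned by the tighter live threshold, and survivors go through the same live check in the same order, so both the kept items and the rerank count come out equal in this model.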

Tests

vector::bq::prune: every available kernel (portable, AVX2, AVX-512, dispatched) against a per-lane scalar reference on random inputs, exact >= boundary ties, and NaN/±inf semantics.
test_raw_query_multi_bit_topk_dense_matches_sparse: differential test of the dense path against the unchanged sparse scalar path, with crafted factor columns controlling lower bounds and exact distances. It covers n ∈ {1, 15, 16, 17, 100, 4109} × k ∈ {1, 10, n+7} × bounds, with distance orderings descending (constant heap churn), ascending (mass pruning), random, duplicates, and exact ties. A second pass on the shared heap exercises the carried tight-threshold regime. The test asserts identical heap contents (row ids + distance bit patterns) and agreement with the k-smallest-distances reference.
cargo test -p lance-index --lib vector::bq and cargo test -p lance ivf_rq pass on aarch64 (portable kernel) and on x86_64 with AVX-512 (GCP c4-standard-16).

Benchmark

New RQ heap topk bench (binary FastScan + pruning scan + exact rerank; 4096 rows, k=10, DIM=1536, num_bits=5, error factors present so gating is enabled), GCP c4-standard-16 (AVX-512), pinned core:

mode	before	after	change
normal	70.7 µs	64.9 µs	−9.4% (p = 0.00)
accurate	93.4 µs	87.9 µs	−5.7% (p = 0.00)

The binary FastScan portion of the bench is unchanged; the delta is the pruning scan itself, matching the ~5–9% end-to-end win predicted from the profile.

🤖 Generated with Claude Code

The per-partition multi-bit IVF_RQ top-k scan classified every row with a scalar lower-bound computation and two compares; with ~99.9% of rows pruned, the scan itself dominated. Classify 16 rows at a time with dedicated AVX-512/AVX2 kernels (portable auto-vectorized fallback elsewhere) and run the existing scalar rerank only for surviving lanes. The kernels keep the scalar operation order (no FMA), so the lower bounds are bit-identical and heap contents, iteration order, and prune stats are unchanged. The group-start heap threshold can be stale, but it only ever tightens, so the masks can only over-select survivors, which the per-row re-check then prunes with live values. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov · 2026-06-12T07:24:31Z

Codecov Report

❌ Patch coverage is 88.56209% with 70 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/vector/bq/prune.rs	82.67%	58 Missing and 3 partials ⚠️
rust/lance-index/src/vector/bq/storage.rs	96.53%	9 Missing ⚠️


BubbleCal and others added 2 commits June 12, 2026 14:16

test(vector): add RQ heap topk benchmark

feee8f8

Benchmark the gated raw-query multi-bit top-k path (binary FastScan + lower-bound pruning scan + exact rerank) in Normal and Accurate modes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions bot added the A-index (Vector index, linalg, tokenizer) and performance labels on Jun 12, 2026

BubbleCal wants to merge 2 commits into main from yang/rq-prune-scan-simd
