
perf(vector): vectorize RaBitQ top-k lower-bound pruning scan #7243

Open
BubbleCal wants to merge 2 commits into main from yang/rq-prune-scan-simd

Conversation

@BubbleCal

Contributor

What

The per-partition top-k scan of multi-bit IVF_RQ search (accumulate_raw_query_multi_bit_topk_with_scratch, Normal/Accurate modes) walked all n rows with a scalar lower-bound computation — per row: 4 bounds-checked loads, ~5 FLOPs, two compares, iterator and Option plumbing — even though ~99.9% of rows are pruned once the heap is tight. On dbpedia-openai-1M (1536d, num_bits=5, nprobes=24, k=10) this loop profiled at ~10–13% of query self time.

This PR vectorizes the classification:

  • Dense path (accumulate_topk_with_scratch, rows 0..n): new bq::prune kernels evaluate the lower bound and both pruning compares for 16 rows per call, returning bit masks. Mask-zero groups (the common case) are skipped whole; surviving lanes run the existing scalar rerank with live values. The row_id mapping is now only invoked for rows that reach the scalar tail, and no scratch buffer is needed — everything stays in registers.
  • Sparse path (prefiltered accumulate_filtered_topk_with_scratch): unchanged scalar loop.
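As a rough illustration of the dense-path control flow described above, here is a minimal portable sketch. It is not the code in this PR: `lower_bound`, `prune_mask16`, and `surviving_rows` are hypothetical names, the lower-bound formula is a placeholder (the real one is `raw_query_lower_bound`), the threshold is held fixed instead of coming from a live heap, and only one of the two pruning compares is modeled.

```rust
/// Rows classified per kernel call (16, per the description above).
const LANES: usize = 16;

/// HYPOTHETICAL lower bound. The real formula is `raw_query_lower_bound`,
/// which is not shown here; only the shape of the loop matters.
#[inline]
fn lower_bound(add: f32, scale: f32, err: f32, q: f32) -> f32 {
    // Separate multiply, add and subtract (no `mul_add`).
    add + scale * q - err
}

/// Bit `i` of the result is set when lane `i` is pruned (`lb >= threshold`).
/// A NaN lower bound compares false and is therefore never pruned.
fn prune_mask16(add: &[f32], scale: &[f32], err: &[f32], q: f32, threshold: f32) -> u16 {
    assert!(add.len() == LANES && scale.len() == LANES && err.len() == LANES);
    let mut mask = 0u16;
    for i in 0..LANES {
        let lb = lower_bound(add[i], scale[i], err[i], q);
        mask |= ((lb >= threshold) as u16) << i;
    }
    mask
}

/// Returns the indices of rows that survive pruning, in ascending order.
fn surviving_rows(add: &[f32], scale: &[f32], err: &[f32], q: f32, threshold: f32) -> Vec<usize> {
    let n = add.len();
    assert!(scale.len() == n && err.len() == n);
    let full = n - n % LANES;
    let mut out = Vec::new();
    for base in (0..full).step_by(LANES) {
        let r = base..base + LANES;
        let pruned = prune_mask16(&add[r.clone()], &scale[r.clone()], &err[r], q, threshold);
        let mut keep = !pruned;
        // A fully pruned group has keep == 0 and is skipped whole.
        while keep != 0 {
            out.push(base + keep.trailing_zeros() as usize);
            keep &= keep - 1; // clear the lowest set bit
        }
    }
    // Scalar tail for the last n % 16 rows.
    for i in full..n {
        if !(lower_bound(add[i], scale[i], err[i], q) >= threshold) {
            out.push(i);
        }
    }
    out
}
```

In the real scan the surviving lanes would go on to the per-row re-check and exact rerank; the sketch only collects their indices.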

Kernels follow the ex_dot dispatch pattern: #[target_feature] AVX-512 and AVX2 implementations behind a LazyLock runtime-dispatched fn pointer, with a portable 16-wide fallback that LLVM auto-vectorizes (NEON is baseline on aarch64).
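For readers unfamiliar with this dispatch pattern, a minimal sketch follows. It is an assumption-laden illustration, not the `bq::prune` code: `PruneFn`, `prune_portable`, `prune_avx2`, `select_kernel`, and `prune16` are hypothetical names, the kernel takes precomputed lower bounds rather than factor columns, and the AVX-512 variant is omitted.

```rust
use std::sync::LazyLock;

/// Hypothetical kernel signature: 16 lower bounds in, prune bit mask out.
type PruneFn = fn(&[f32], f32) -> u16;

/// Portable fallback; a fixed 16-iteration loop that LLVM can vectorize.
fn prune_portable(lb: &[f32], threshold: f32) -> u16 {
    assert!(lb.len() >= 16);
    let mut mask = 0u16;
    for i in 0..16 {
        mask |= ((lb[i] >= threshold) as u16) << i;
    }
    mask
}

#[cfg(target_arch = "x86_64")]
mod x86 {
    use std::arch::x86_64::*;

    #[target_feature(enable = "avx2")]
    unsafe fn prune_avx2_impl(lb: &[f32], threshold: f32) -> u16 {
        let t = _mm256_set1_ps(threshold);
        // Two unaligned 8-lane loads cover the 16 rows.
        let lo = unsafe { _mm256_loadu_ps(lb.as_ptr()) };
        let hi = unsafe { _mm256_loadu_ps(lb.as_ptr().add(8)) };
        // Ordered, quiet >=: false for NaN, like scalar `>=`.
        let m_lo = _mm256_movemask_ps(_mm256_cmp_ps::<_CMP_GE_OQ>(lo, t)) as u16;
        let m_hi = _mm256_movemask_ps(_mm256_cmp_ps::<_CMP_GE_OQ>(hi, t)) as u16;
        m_lo | (m_hi << 8)
    }

    /// Safe shim so the kernel fits a plain `fn` pointer.
    /// Must only be handed out after AVX2 has been detected.
    pub fn prune_avx2(lb: &[f32], threshold: f32) -> u16 {
        assert!(lb.len() >= 16);
        // SAFETY: length checked above; AVX2 presence is checked by the dispatcher.
        unsafe { prune_avx2_impl(lb, threshold) }
    }
}

/// Picks the best kernel once, at first use.
fn select_kernel() -> PruneFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return x86::prune_avx2;
        }
    }
    prune_portable
}

static PRUNE: LazyLock<PruneFn> = LazyLock::new(select_kernel);

/// Entry point: one indirect call through the cached function pointer.
pub fn prune16(lb: &[f32], threshold: f32) -> u16 {
    (*PRUNE)(lb, threshold)
}
```

Resolving the pointer once keeps feature detection out of the hot loop; each call then costs one indirect call per 16-row group.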

accumulate_distances_into_heap (Fast-mode bypass) is left as is: it has no factor arrays and would need a separate kernel shape, and Fast mode doesn't take the gated path.

Correctness

The dense path is bit-identical to the scalar implementation, not just statistically equivalent:

  • The kernels keep the scalar operation order (multiplies and adds, no FMA), so the lower bounds match raw_query_lower_bound bit for bit, and comparisons use ordered-quiet GE (_CMP_GE_OQ) matching scalar >= (a NaN lower bound is never pruned).
  • The heap threshold snapshot taken at each 16-row group start can be stale, but the threshold only ever tightens, so the masks can only over-select survivors — and survivors are re-checked per row against live values.
  • Heap contents, processing order, and the LANCE_RQ_PRUNE_STATS counters are unchanged.
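The stale-snapshot argument can be checked on a toy model. The sketch below is illustrative only: `Cand`, `threshold`, `visit`, `scan_scalar`, and `scan_grouped` are hypothetical names, lower bounds and exact distances are passed in as plain slices, and a `BinaryHeap` stands in for the real top-k heap. It shows that masking each group against a snapshot and then re-checking per row against the live threshold visits the same rows, in the same order, as the scalar loop.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Clone, Copy, Debug)]
struct Cand {
    dist: f32,
    row: usize,
}

impl Ord for Cand {
    fn cmp(&self, other: &Self) -> Ordering {
        self.dist.total_cmp(&other.dist).then(self.row.cmp(&other.row))
    }
}
impl PartialOrd for Cand {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Cand {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Cand {}

/// Live pruning threshold: the worst kept distance once the heap holds k rows.
fn threshold(heap: &BinaryHeap<Cand>, k: usize) -> f32 {
    if heap.len() < k { f32::INFINITY } else { heap.peek().unwrap().dist }
}

/// Per-row step with live values: prune check, then heap update.
fn visit(heap: &mut BinaryHeap<Cand>, k: usize, row: usize, lb: f32, dist: f32) {
    if lb >= threshold(heap, k) {
        return; // pruned against the live threshold
    }
    if heap.len() < k {
        heap.push(Cand { dist, row });
    } else if dist < heap.peek().unwrap().dist {
        heap.pop();
        heap.push(Cand { dist, row });
    }
}

/// Reference: every row is checked against the live threshold.
fn scan_scalar(lb: &[f32], dist: &[f32], k: usize) -> Vec<Cand> {
    assert!(k >= 1 && lb.len() == dist.len());
    let mut heap = BinaryHeap::new();
    for row in 0..lb.len() {
        visit(&mut heap, k, row, lb[row], dist[row]);
    }
    heap.into_sorted_vec()
}

/// Grouped: mask 16 rows against a snapshot, re-check survivors live.
fn scan_grouped(lb: &[f32], dist: &[f32], k: usize) -> Vec<Cand> {
    assert!(k >= 1 && lb.len() == dist.len());
    let mut heap = BinaryHeap::new();
    for base in (0..lb.len()).step_by(16) {
        let end = (base + 16).min(lb.len());
        let snapshot = threshold(&heap, k); // may go stale within the group
        let mut keep = 0u16;
        for i in base..end {
            let survives = !(lb[i] >= snapshot);
            keep |= (survives as u16) << (i - base);
        }
        while keep != 0 {
            let row = base + keep.trailing_zeros() as usize;
            keep &= keep - 1;
            visit(&mut heap, k, row, lb[row], dist[row]);
        }
    }
    heap.into_sorted_vec()
}
```

Because the live threshold is never larger than the snapshot, any row the snapshot prunes would also be pruned live, so the mask can only let extra rows through to `visit`, which rejects them.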

Tests

  • vector::bq::prune: every available kernel (portable, AVX2, AVX-512, dispatched) against a per-lane scalar reference on random inputs, exact >= boundary ties, and NaN/±inf semantics.
  • test_raw_query_multi_bit_topk_dense_matches_sparse: differential test of the dense path against the unchanged sparse scalar path with crafted factor columns controlling lower bounds and exact distances — n ∈ {1, 15, 16, 17, 100, 4109} × k ∈ {1, 10, n+7} × bounds, distance orderings descending (constant heap churn), ascending (mass pruning), random, duplicates, and exact ties, with a second pass on the shared heap (the carried tight-threshold regime). Asserts identical heap contents (row ids + distance bit patterns) and the k-smallest-distances reference.
  • cargo test -p lance-index --lib vector::bq and cargo test -p lance ivf_rq pass on aarch64 (portable kernel) and on x86_64 with AVX-512 (GCP c4-standard-16).

Benchmark

New RQ heap topk bench (binary FastScan + pruning scan + exact rerank; 4096 rows, k=10, DIM=1536, num_bits=5, error factors present so gating is enabled), GCP c4-standard-16 (AVX-512), pinned core:

| mode | before | after | change |
|---|---|---|---|
| normal | 70.7 µs | 64.9 µs | −9.4% (p = 0.00) |
| accurate | 93.4 µs | 87.9 µs | −5.7% (p = 0.00) |

The binary FastScan portion of the bench is unchanged; the delta is the pruning scan itself, matching the ~5–9% end-to-end win predicted from the profile.

🤖 Generated with Claude Code

BubbleCal and others added 2 commits June 12, 2026 14:16
Benchmark the gated raw-query multi-bit top-k path (binary FastScan +
lower-bound pruning scan + exact rerank) in Normal and Accurate modes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The per-partition multi-bit IVF_RQ top-k scan classified every row with
a scalar lower-bound computation and two compares; with ~99.9% of rows
pruned, the scan itself dominated. Classify 16 rows at a time with
dedicated AVX-512/AVX2 kernels (portable auto-vectorized fallback
elsewhere) and run the existing scalar rerank only for surviving lanes.

The kernels keep the scalar operation order (no FMA), so the lower
bounds are bit-identical and heap contents, iteration order, and prune
stats are unchanged. The group-start heap threshold can be stale, but it
only ever tightens, so the masks can only over-select survivors, which
the per-row re-check then prunes with live values.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions (bot) added the A-index (Vector index, linalg, tokenizer) and performance labels on Jun 12, 2026
@codecov

codecov Bot commented Jun 12, 2026


Codecov Report

❌ Patch coverage is 88.56209% with 70 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/vector/bq/prune.rs | 82.67% | 58 Missing and 3 partials ⚠️ |
| rust/lance-index/src/vector/bq/storage.rs | 96.53% | 9 Missing ⚠️ |
