perf(vector): vectorize RaBitQ top-k lower-bound pruning scan #7243
Open
BubbleCal wants to merge 2 commits into
Benchmark the gated raw-query multi-bit top-k path (binary FastScan + lower-bound pruning scan + exact rerank) in Normal and Accurate modes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The per-partition multi-bit IVF_RQ top-k scan classified every row with a scalar lower-bound computation and two compares; with ~99.9% of rows pruned, the scan itself dominated. Classify 16 rows at a time with dedicated AVX-512/AVX2 kernels (portable auto-vectorized fallback elsewhere) and run the existing scalar rerank only for surviving lanes.

The kernels keep the scalar operation order (no FMA), so the lower bounds are bit-identical and heap contents, iteration order, and prune stats are unchanged. The group-start heap threshold can be stale, but it only ever tightens, so the masks can only over-select survivors, which the per-row re-check then prunes with live values.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
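The stale-threshold argument in the commit message above can be illustrated with a minimal portable sketch. This is not the PR's code: `TopK`, `survivor_mask`, and `pruned_scan` are hypothetical names, lower bounds and exact distances are assumed to be precomputed arrays (the real kernels derive the lower bound from factor arrays), and only one of the two pruning compares is modeled.

```rust
/// Rows classified per kernel call (the PR uses groups of 16).
const LANES: usize = 16;

/// Hypothetical stand-in for the heap: the k smallest (distance, row) pairs, kept sorted.
struct TopK {
    k: usize,
    items: Vec<(f32, usize)>,
}

impl TopK {
    fn new(k: usize) -> Self {
        assert!(k >= 1, "sketch assumes k >= 1");
        Self { k, items: Vec::with_capacity(k + 1) }
    }

    /// Pruning threshold: +inf until k rows are held, then the k-th smallest distance.
    fn threshold(&self) -> f32 {
        if self.items.len() < self.k { f32::INFINITY } else { self.items[self.k - 1].0 }
    }

    /// Insert only if the distance beats the threshold, so the threshold never grows.
    fn push(&mut self, dist: f32, row: usize) {
        if dist < self.threshold() {
            let pos = self.items.partition_point(|&(d, _)| d <= dist);
            self.items.insert(pos, (dist, row));
            self.items.truncate(self.k);
        }
    }
}

/// Bit i is set when lane i survives, i.e. its lower bound is NOT >= threshold.
/// Written as `!(lb >= t)` so a NaN lower bound is never pruned, like scalar `>=`.
fn survivor_mask(lower_bounds: &[f32], threshold: f32) -> u16 {
    debug_assert!(lower_bounds.len() <= LANES);
    let mut mask = 0u16;
    for (lane, &lb) in lower_bounds.iter().enumerate() {
        mask |= u16::from(!(lb >= threshold)) << lane;
    }
    mask
}

/// Classifies rows in groups of LANES and returns how many rows were reranked.
fn pruned_scan(lower_bounds: &[f32], exact: &[f32], top: &mut TopK) -> usize {
    assert_eq!(lower_bounds.len(), exact.len());
    let mut reranked = 0;
    for (g, group) in lower_bounds.chunks(LANES).enumerate() {
        // Threshold read once per group: it may be stale (too loose) for later lanes.
        let mut mask = survivor_mask(group, top.threshold());
        while mask != 0 {
            let row = g * LANES + mask.trailing_zeros() as usize;
            mask &= mask - 1; // clear the lowest set bit
            // Re-check against the live threshold before the (here precomputed) rerank.
            if !(lower_bounds[row] >= top.threshold()) {
                top.push(exact[row], row);
                reranked += 1;
            }
        }
    }
    reranked
}
```

Because the threshold only shrinks, a lane pruned by the group-start mask would also be pruned by the live threshold, so the masked scan visits survivors in the same order as a plain per-row loop and ends with the same heap.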
What
The per-partition top-k scan of multi-bit IVF_RQ search (`accumulate_raw_query_multi_bit_topk_with_scratch`, Normal/Accurate modes) walked all `n` rows with a scalar lower-bound computation (per row: 4 bounds-checked loads, ~5 FLOPs, two compares, plus iterator and `Option` plumbing), even though ~99.9% of rows are pruned once the heap is tight. On dbpedia-openai-1M (1536d, num_bits=5, nprobes=24, k=10) this loop profiled at ~10–13% of query self time.

This PR vectorizes the classification:

- Dense path (`accumulate_topk_with_scratch`, rows `0..n`): new `bq::prune` kernels evaluate the lower bound and both pruning compares for 16 rows per call, returning bit masks. Mask-zero groups (the common case) are skipped whole; surviving lanes run the existing scalar rerank with live values. The `row_id` mapping is now only invoked for rows that reach the scalar tail, and no scratch buffer is needed because everything stays in registers.
- Sparse path (`accumulate_filtered_topk_with_scratch`): unchanged scalar loop.

Kernels follow the `ex_dot` dispatch pattern: `#[target_feature]` AVX-512 and AVX2 implementations behind a `LazyLock` runtime-dispatched fn pointer, with a portable 16-wide fallback that LLVM auto-vectorizes (NEON is baseline on aarch64).

`accumulate_distances_into_heap` (Fast-mode bypass) is left as is: it has no factor arrays and would need a separate kernel shape, and Fast mode doesn't take the gated path.

Correctness
The dense path is bit-identical to the scalar implementation, not just statistically equivalent:

- The kernels keep the scalar operation order (no FMA), so lower bounds match `raw_query_lower_bound` bit for bit, and comparisons use ordered-quiet GE (`_CMP_GE_OQ`) matching scalar `>=` (a NaN lower bound is never pruned).
- Heap contents, iteration order, and `LANCE_RQ_PRUNE_STATS` counters are unchanged.

Tests
- `vector::bq::prune`: every available kernel (portable, AVX2, AVX-512, dispatched) against a per-lane scalar reference on random inputs, exact `>=` boundary ties, and NaN/±inf semantics.
- `test_raw_query_multi_bit_topk_dense_matches_sparse`: differential test of the dense path against the unchanged sparse scalar path, with crafted factor columns controlling lower bounds and exact distances: n ∈ {1, 15, 16, 17, 100, 4109} × k ∈ {1, 10, n+7} × bounds, distance orderings descending (constant heap churn), ascending (mass pruning), random, duplicates, and exact ties, with a second pass on the shared heap (the carried tight-threshold regime). Asserts identical heap contents (row ids + distance bit patterns) and the k-smallest-distances reference.
- `cargo test -p lance-index --lib vector::bq` and `cargo test -p lance ivf_rq` pass on aarch64 (portable kernel) and on x86_64 with AVX-512 (GCP c4-standard-16).

Benchmark
New `RQ heap topk` bench (binary FastScan + pruning scan + exact rerank; 4096 rows, k=10, DIM=1536, num_bits=5, error factors present so gating is enabled), GCP c4-standard-16 (AVX-512), pinned core:

The binary FastScan portion of the bench is unchanged; the delta is the pruning scan itself, matching the ~5–9% end-to-end win predicted from the profile.
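Which kernel runs on a given machine is decided by the runtime dispatch described under What. A rough sketch of that pattern follows, assuming a simplified kernel signature that takes precomputed lower bounds. `PruneKernel`, `classify`, `prune_portable`, `prune_avx2`, `select_kernel`, and `prune` are hypothetical names, not the PR's API, and the AVX2 variant here just recompiles the portable loop with the feature enabled instead of using hand-written intrinsics; an AVX-512 variant would be wired in the same way.

```rust
use std::sync::LazyLock;

/// Hypothetical kernel signature: lower bounds for up to 16 rows in, survivor mask out.
type PruneKernel = fn(&[f32], f32) -> u16;

/// Shared loop body; `inline(always)` lets each caller compile it with its own features.
#[inline(always)]
fn classify(lower_bounds: &[f32], threshold: f32) -> u16 {
    debug_assert!(lower_bounds.len() <= 16);
    let mut mask = 0u16;
    for (lane, &lb) in lower_bounds.iter().enumerate() {
        // Survive unless lb >= threshold; NaN compares false, so NaN lanes survive.
        mask |= u16::from(!(lb >= threshold)) << lane;
    }
    mask
}

/// Portable fallback, left to the compiler's auto-vectorizer.
fn prune_portable(lower_bounds: &[f32], threshold: f32) -> u16 {
    classify(lower_bounds, threshold)
}

/// Same loop compiled with AVX2 enabled (a real kernel would use intrinsics).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn prune_avx2_impl(lower_bounds: &[f32], threshold: f32) -> u16 {
    classify(lower_bounds, threshold)
}

/// Safe wrapper so the AVX2 kernel fits the plain `fn` pointer type.
#[cfg(target_arch = "x86_64")]
fn prune_avx2(lower_bounds: &[f32], threshold: f32) -> u16 {
    // SAFETY: `select_kernel` only returns this wrapper after detecting AVX2.
    unsafe { prune_avx2_impl(lower_bounds, threshold) }
}

/// Picks the best available kernel for the current CPU.
fn select_kernel() -> PruneKernel {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return prune_avx2;
        }
    }
    prune_portable
}

/// Resolved once on first use, then every call is a plain indirect call.
static PRUNE: LazyLock<PruneKernel> = LazyLock::new(select_kernel);

/// Public entry point that forwards to the selected kernel.
pub fn prune(lower_bounds: &[f32], threshold: f32) -> u16 {
    (*PRUNE)(lower_bounds, threshold)
}
```

Keeping the feature check inside the `LazyLock` initializer means the detection cost is paid once per process, and every variant can be tested directly against the portable function on the same inputs.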
🤖 Generated with Claude Code