feat(beir-bench): within-query-threaded sign baseline + DistL2 HNSW#244
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Code Review by Qodo
1.
|
PR Summary by QodoBEIR bench: threaded sign→rq2 baseline and safer DistL2 HNSW WalkthroughsDescription• Add sign-rq2-threaded method: within-query parallel sign scan with deterministic exact-m candidates. • Switch HNSW distance to DistL2 and rank by -distance to prevent panics. • Register ordvec-sign-rq2-threaded in beir_eval family metadata for reporting. Diagramgraph TD
A["beir-bench main"] --> B["Method switch"] --> C["run_sign_threaded"] --> D["sign_scan_topm_par"] --> E["RankQuant rerank"] --> F["finalize outputs"] --> H["beir_eval.py family_meta"]
B --> G["run_hnsw (DistL2)"] --> F
High-Level AssessmentThe following are alternative approaches to this PR: 1. Keep DistDot but clamp distance inputs
2. Use cosine distance implementation
3. Hybrid parallelism (query-parallel + within-query stripes)
Recommendation: The PR’s approach is appropriate for the stated goals: switching HNSW to DistL2 is the simplest way to avoid DistDot’s near-duplicate precision panic while preserving neighbor ordering for unit-normalized vectors, and scoring as -distance keeps the evaluator’s “higher-is-better” convention. For the new baseline, the deterministic exact-m selection (count desc, id asc) is the right design choice to isolate latency improvements from quality changes; alternative parallelization strategies are possible, but they increase complexity and can compromise candidate equivalence. File ChangesEnhancement (1)
Other (1)
|
There was a problem hiding this comment.
Code Review
This pull request introduces a within-query-threaded baseline for the sign->rq2 search method in the BEIR benchmarks, and switches the HNSW distance metric from DistDot to DistL2 to prevent panics on near-duplicate vector pairs. The review feedback suggests two key improvements: parallelizing the sequential initialization of codes using Rayon to speed up index build times, and refactoring the manual chunking and merging of filtered candidates in sign_scan_topm_par to use a cleaner, more idiomatic Rayon parallel iterator pipeline.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Add `sign-rq2-threaded`: a within-query-threaded SignBitmap -> RankQuant b=2 method. Doc-major sign codes are scanned in parallel over doc-stripes with hardware VPOPCNTDQ, then an exact-m candidate selection (count desc, id asc tie-break) reproduces the serial `sign-rq2` set byte-for-byte (top-100 overlap 1.0000) -- isolating the within-query threading speedup from any quality change. Harden HNSW to DistL2 (was DistDot). Embeddings are unit-normalized, so min-L2 == max-dot == max-cosine (identical neighbors), but DistL2 avoids anndists' DistDot `1-dot` distance assertion that panics on near-duplicate pairs whose float dot rounds just past 1.0 -- rare at 171K docs, frequent near 1M. Score becomes `-distance` (nearer = higher), preserving nearest-first order. beir_eval: add the family_meta entry for ordvec-sign-rq2-threaded. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
97386e3 to
8091e26
Compare
Address PR #244 bot review (gemini/qodo): - qodo: sign_scan_topm_par stored agreement (= dim - hamming) as u16 and cast silently, which would wrap for dim > 65535 into a wrong tau / candidate set. Add a fail-loud assert on the precondition (BEIR / embedding dims are <= ~4096, far below this) covering both the scalar path and the AVX-512 kernel reached only from this fn. - gemini: parallelize the previously sequential doc-major sign-code build loop with rayon (one wpd-word stripe per doc), matching the adjacent rq.add which is already parallel. Benchmark harness only; core ordvec crate untouched. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
|
Remediation wave for the bot review, all scoped to `benchmarks/` (core crate untouched), pushed in 42ae55d:
Gate: `cargo fmt --check` + `cargo clippy --release --all-targets -- -D warnings` + `cargo build --release` all clean on `benchmarks/beir-bench`. |
|
/gemini review |
|
/review |
There was a problem hiding this comment.
Code Review
This pull request introduces a within-query-threaded sign->rq2 baseline (sign-rq2-threaded) to the BEIR benchmark suite, leveraging parallelized sign-agreement scans with hardware VPOPCNTDQ and AVX-512 optimizations. It also updates the HNSW implementation to use DistL2 instead of DistDot to prevent panics on near-duplicate vector pairs. The review feedback focuses on optimizing the newly added sign_scan_topm_par function to eliminate memory allocation overhead in the benchmark hot path by pre-allocating and reusing buffers for histograms and candidate pools, as well as optimizing Phase B to use a sequential scan.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Code review by qodo was updated up to the latest commit 42ae55d |
Address PR #244 re-review (gemini HIGH + qodo) on 42ae55d: - qodo (u16 truncation): widen the agreement buffer + AVX-512 kernel from u16 to u32, eliminating the (dim - hamming) cast entirely (the prior assert didn't satisfy the static check). u32 is ~free here -- the agreement buffer is a few % of the codes traffic the scan streams. - gemini (per-query allocs in the timed hot path): make sign_scan_topm_par allocation-free by threading caller-owned reusable scratch -- a per-stripe histogram buffer (hists_buf) and the candidate pool (poolv). Histogram tau now sums per-stripe columns directly (no merge buffer); Phase B stays PARALLEL via par_extend into the reused poolv (rejected gemini's sequential-Phase-B suggestion: it would serialize an O(n) pass and hurt the threaded baseline's latency at 1M+ docs). - qodo (rerank buffer-size mismatch): the rerank was called with top_k but buffers are sized out_k = min(top_k, m/candidates, n_docs); pass out_k instead in BOTH the threaded and the serial sign-rq2 baselines, so the length assert can't panic when the candidate budget is below top_k. Candidate-equivalence preserved (select_exact_m imposes a strict total order, so the parallel extract order is irrelevant). Verified on scifact full corpus: sign-rq2 vs sign-rq2-threaded top-100 overlap = 1.0000 (min 1.0000), 300/300 exact-identical ranked lists. Benchmark harness only; core ordvec crate untouched. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
|
Second remediation wave (commit 750e97c), still qodo
gemini (HIGH, per-query allocations)
Equivalence preserved & verified: Gate: |
|
/gemini review |
|
/review |
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
|
Code review by qodo was updated up to the latest commit 750e97c |
qodo re-review (750e97c): guard the hists_buf allocation length with checked_mul so a pathological threads/dim can't wrap usize into a too-small buffer in release (unreachable on 64-bit native, but matches the core crate's util::checked_* convention and fails loud). Benchmark harness only; core ordvec crate untouched. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
|
Final wave (commit f95f294),
Gate clean ( |
Codex stop-gate: the dispatch only checked is_x86_feature_detected!( "avx512vpopcntdq") but scan_agree_avx512 is #[target_feature(enable = "avx512f,avx512vpopcntdq")]. Calling a target_feature fn is only sound when the caller verifies EVERY enabled feature at runtime, so the avx512f check was missing. Guard both, matching the core crate's dispatch convention (lib.rs / multi_bucket.rs). Re-verified candidate-equivalence on scifact (top-100 overlap 1.0000, 300/300 exact). Benchmark harness only; core crate untouched. Signed-off-by: Nelson Spence <nelson@projectnavi.ai>
Summary
Two benchmark-harness improvements, both scoped to
benchmarks/(the coreordveccrate is untouched):1. New method
sign-rq2-threadedA within-query-threaded variant of the
sign-rq2two-stage baseline. Doc-major sign codes are scanned in parallel over doc-stripes with hardware VPOPCNTDQ (parallel agreement histogram → global top-M threshold), then an exact-mcandidate selection (countdesc,idasc tie-break) reproduces the serialsign-rq2candidate set byte-for-byte.This isolates the within-query threading speedup from any quality change — the two methods rerank the identical candidate set, so nDCG/recall are unchanged and only latency differs. The serial
sign-rq2scans single-threaded per query; this version spreads one query across the rayon pool.Verified candidate-equivalence on trec-covid: top-100 overlap = 1.0000 (min 1.0000) vs
sign-rq2.2. Harden HNSW distance to
DistL2run_hnswpreviously usedDistDot, whoseanndistsimplementation asserts on1 - dotand panics on near-duplicate pairs whose unit-vector float dot product rounds just past1.0— rare at 171K docs, frequent approaching ~1M, where it corrupts the graph build.Embeddings are unit-normalized, so min-L2 ≡ max-dot ≡ max-cosine (identical neighbors);
DistL2gives the same ranking without the fragile assertion. Score is now-distance(nearer = smaller L2 = higher score), preserving nearest-first ordering in the eval. HNSW nDCG@10 is unchanged within noise (0.7593 vs prior 0.7584).Test plan
cargo build --manifest-path benchmarks/beir-bench/Cargo.toml --releasecargo clippy --manifest-path benchmarks/beir-bench/Cargo.toml --release --all-targets -- -D warnings(clean)cargo fmt --manifest-path benchmarks/beir-bench/Cargo.toml --check(clean)--methods sign-rq2,sign-rq2-threadedon trec-covid → candidate sets byte-identical (top-100 overlap 1.0000)DistL2build no longer panics; nDCG@10 unchanged within bootstrap noisemake benchmark-beirend-to-end on a clean machine