parquet: SIMD-accelerate Sbbf probe via autovectorization#10011
parquet: SIMD-accelerate Sbbf probe via autovectorization#10011dmatth1 wants to merge 2 commits into
Conversation
|
It looks to me like with one small change to the |
… shim Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`simd_x86::sbbf_{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2. Two details are load-bearing for the autovectorizer: - `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build. - `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`): | Regime | Path | Scalar | Autovec (tf-shim) | Speedup | |-----------|--------|-------:|------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
… shim Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2. Two details are load-bearing for the autovectorizer: - `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build. - `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`): | Regime | Path | Scalar | Autovec (avx2 shim) | Speedup | |-----------|--------|-------:|--------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
|
Great callout. Measured bench and the numbers with autovectorization are better: Same-host, same-session medians (Cascade Lake-class Xeon @ 2.8 GHz), via the public
Changes here: main...dmatth1:arrow-rs:sbbf-autovec-tf |
… shim Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2. Two details are load-bearing for the autovectorizer: - `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build. - `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`): | Regime | Path | Scalar | Autovec (avx2 shim) | Speedup | |-----------|--------|-------:|--------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
… shim Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2. Two details are load-bearing for the autovectorizer: - `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build. - `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`): | Regime | Path | Scalar | Autovec (avx2 shim) | Speedup | |-----------|--------|-------:|--------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
… shim Per @jhorstmann's review on apache#10011: no `_mm256_*` intrinsics. The single probe implementation lives in `Block::{check,insert}` and a thin `#[target_feature(enable = "avx2")]` shim calls into it. Because the shim is compiled with AVX2 on, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline `x86-64` build, no `target-cpu` flag required. The shim is reached after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`). Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the shim. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. The same branchless `Block::check` also autovectorizes to NEON on aarch64 — no shim, no `target_feature` needed (NEON is baseline). On main, the short-circuit form left aarch64 fully scalar. A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit). x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`: | Regime | Path | Scalar | Autovec (avx2 shim) | Speedup | |-----------|--------|-------:|--------------------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1: | Regime | Path | Scalar | Autovec (NEON) | Speedup | |-----------|--------|-------:|---------------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_simd_{check,insert}_matches_scalar` diff the AVX2 shim against the baseline-compiled scalar across 10K random pairs; `test_check_matches_reference_aarch64` diffs the autovec'd check against an inline short-circuit reference for the aarch64 codegen path. All bloom_filter tests pass with and without `-C target-cpu=native`.
Per @jhorstmann's review on apache#10011: no hand-written `_mm256_*` / NEON intrinsics, no runtime dispatch, no `target_feature` shim. `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). This is already the convention for analytical Rust binaries (Polars, DataFusion, Databend distros). A runtime AVX2-detect shim was prototyped and rejected for this PR — it adds `unsafe`, a per-`Sbbf` cached bool, and a dispatch branch in the hot path, in exchange for AVX2 codegen on default-built binaries running on AVX2 hardware. The simplification was preferred. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
Per @jhorstmann's review on apache#10011: no hand-written `_mm256_*` / NEON intrinsics, no runtime dispatch, no `target_feature` shim. `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). This is already the convention for analytical Rust binaries (Polars, DataFusion, Databend distros). A runtime AVX2-detect shim was prototyped and rejected for this PR — it adds `unsafe`, a per-`Sbbf` cached bool, and a dispatch branch in the hot path, in exchange for AVX2 codegen on default-built binaries running on AVX2 hardware. The simplification was preferred. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no target_feature shim, no runtime dispatch. `Block::check` is rewritten as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape) and LLVM autovectorizes it directly to NEON on aarch64 and to AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`). A runtime AVX2-detect shim was prototyped and rejected: the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch) beat the only thing it bought, which was AVX2 codegen for default- built binaries on AVX2 hardware — production deployments that care already set the target-cpu flag. Preconditions: `Block::mask` is `#[inline]` (folds into the call site) and `Block` is `#[repr(C, align(32))]` with size/align asserted (so the 256-bit load/store hits one cache line). A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert ties on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs, on every target. All bloom_filter tests pass.
`Sbbf::{check,insert}` are on the hot path of Parquet row-group
skipping for every reader downstream of `arrow-rs` (DataFusion,
Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit
Parquet block is exactly one AVX2 vector / two NEON `uint32x4_t`
halves; the K=8 lane test is a one-instruction `vptest` on AVX2 and
an equivalent SIMD reduce on NEON. This PR vectorises the probe
without changing the algorithm, hash, salts, or wire format.
Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no
target_feature shim, no runtime dispatch. `Block::check` is rewritten
as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction
shape) and LLVM autovectorizes it directly to NEON on aarch64 and to
AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`,
or `+avx2`). A runtime AVX2-detect shim was prototyped and rejected:
the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch)
beat the only thing it bought, which was AVX2 codegen for default-
built binaries on AVX2 hardware — production deployments that care
already set the target-cpu flag.
Preconditions: `Block::mask` is `#[inline]` (folds into the call
site) and `Block` is `#[repr(C, align(32))]` with size/align
asserted (so the 256-bit load/store hits one cache line).
A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default
profile, same-session medians, ns/op.
x86_64 — Cascade Lake-class Xeon @ 2.8 GHz,
`-C target-cpu=x86-64-v3`:
| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 13.02 | 4.96 | 2.62x |
| S 128 KiB | hit | 13.47 | 4.95 | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41 | 2.15x |
| M 2 MiB | miss | 18.88 | 7.47 | 2.53x |
| M 2 MiB | hit | 18.12 | 7.22 | 2.51x |
| M 2 MiB | insert | 14.99 | 8.45 | 1.77x |
| L 32 MiB | miss | 27.56 | 11.07 | 2.49x |
| L 32 MiB | hit | 26.57 | 11.23 | 2.37x |
| L 32 MiB | insert | 23.53 | 12.77 | 1.84x |
aarch64 — Apple Silicon M1 (NEON via baseline autovec):
| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 4.61 | 3.24 | 1.42x |
| S 128 KiB | hit | 6.84 | 3.17 | 2.16x |
| S 128 KiB | insert | 3.25 | 3.19 | 1.02x |
| M 2 MiB | miss | 5.20 | 3.24 | 1.61x |
| M 2 MiB | hit | 7.16 | 3.26 | 2.20x |
| M 2 MiB | insert | 3.34 | 3.31 | 1.01x |
| L 32 MiB | miss | 6.66 | 5.42 | 1.23x |
| L 32 MiB | hit | 9.72 | 5.25 | 1.85x |
| L 32 MiB | insert | 5.19 | 5.38 | 0.96x |
Insert ties on aarch64 because main's `Block::insert` was already
vectorizer-friendly. The PR's aarch64 win lives in `check`.
Tests: `test_check_matches_reference` diffs the autovec'd
`Block::check` against an inline short-circuit reference across 10K
random pairs, on every target. All bloom_filter tests pass.
Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane test collapses to one `vptest` (`_mm256_testc_si256`). This PR vectorises that loop without changing the algorithm, hash, salts, or wire format. Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime AVX2-detect shim was prototyped but I prefer this simplification. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane test collapses to one `vptest` (`_mm256_testc_si256`). This PR vectorises that loop without changing the algorithm, hash, salts, or wire format. Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime AVX2-detect shim was prototyped but I prefer this simplification. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
|
Tested locally on aarch64 too (Apple Silicon M1, baseline NEON autovec):
Big simplifier. I included details about how autovec reduces/lowers instructions in the new commit message. Going to force-push to use this approach. One thing beyond your suggestion: I prototyped a runtime AVX2-detect shim and dropped it for the simplification (no |
There was a problem hiding this comment.
Thanks @dmatth1, this looks interesting. On my older Alder Lake CPU this does lead to quite a regression. Can we feature gate this?
group no_vec vectorized
----- ------ ----------
check/hit/l_32MiB 1.00 394.6±6.37µs 120.8 MElem/sec 1.27 501.4±7.02µs 95.1 MElem/sec
check/hit/m_2MiB 1.00 399.2±6.42µs 119.4 MElem/sec 1.32 525.1±4.28µs 90.8 MElem/sec
check/hit/s_128KiB 1.00 341.9±2.18µs 139.5 MElem/sec 1.27 434.9±2.58µs 109.7 MElem/sec
check/miss/l_32MiB 1.00 376.6±3.07µs 126.6 MElem/sec 1.33 500.0±5.55µs 95.4 MElem/sec
check/miss/m_2MiB 1.00 373.3±2.89µs 127.7 MElem/sec 1.41 526.0±3.99µs 90.6 MElem/sec
check/miss/s_128KiB 1.00 322.9±6.76µs 147.7 MElem/sec 1.35 436.1±2.63µs 109.3 MElem/sec
Edit: I'm dumb...forgot to compile with target-cpu=native 😅
group native no_vec
----- ------ ------
check/hit/l_32MiB 1.00 200.2±2.79µs 238.1 MElem/sec 1.97 394.6±6.37µs 120.8 MElem/sec
check/hit/m_2MiB 1.00 198.8±1.85µs 239.8 MElem/sec 2.01 399.2±6.42µs 119.4 MElem/sec
check/hit/s_128KiB 1.00 150.0±1.04µs 317.8 MElem/sec 2.28 341.9±2.18µs 139.5 MElem/sec
check/miss/l_32MiB 1.00 199.9±3.97µs 238.6 MElem/sec 1.88 376.6±3.07µs 126.6 MElem/sec
check/miss/m_2MiB 1.00 200.1±1.78µs 238.3 MElem/sec 1.87 373.3±2.89µs 127.7 MElem/sec
check/miss/s_128KiB 1.00 150.0±0.91µs 317.8 MElem/sec 2.15 322.9±6.76µs 147.7 MElem/sec
| } | ||
| } | ||
|
|
||
| impl std::ops::Index<usize> for Block { |
There was a problem hiding this comment.
Why are these impls removed? I think that makes this a breaking API change.
There was a problem hiding this comment.
Block is private type and only used here so I don't think so. I ran cargo public-api diff for this branch vs main and didn't see any differences
There was a problem hiding this comment.
Right you are. Had Sbbf in my head 😅
| group.finish(); | ||
| } | ||
|
|
||
| /// Benchmark `Sbbf::insert` across the same three cache regimes as |
There was a problem hiding this comment.
It would be nice for the bench changes to be a separate PR.
There was a problem hiding this comment.
Mainly for conciseness? Or should we push the bench changes first, then this Sbbf probe change second that way its easy to compare? Otherwise I'd lean towards keeping it in here
There was a problem hiding this comment.
Or should we push the bench changes first, then this Sbbf probe change second that way its easy to compare?
This (see the contributing guide). Thanks!
Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane test collapses to one `vptest` (`_mm256_testc_si256`). This PR vectorises that loop without changing the algorithm, hash, salts, or wire format. Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time: - aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`. - x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`. - Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes. - wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise. Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime AVX2-detect shim was prototyped but I prefer this simplification. Two preconditions for autovec: - `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization. - `Block::mask` is `#[inline]` so it folds into the call site. `Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line. A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`: | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 13.02 | 4.96 | 2.62x | | S 128 KiB | hit | 13.47 | 4.95 | 2.72x | | S 128 KiB | insert | 11.62 | 5.41 | 2.15x | | M 2 MiB | miss | 18.88 | 7.47 | 2.53x | | M 2 MiB | hit | 18.12 | 7.22 | 2.51x | | M 2 MiB | insert | 14.99 | 8.45 | 1.77x | | L 32 MiB | miss | 27.56 | 11.07 | 2.49x | | L 32 MiB | hit | 26.57 | 11.23 | 2.37x | | L 32 MiB | insert | 23.53 | 12.77 | 1.84x | aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build): | Regime | Path | Scalar | Autovec | Speedup | |-----------|--------|-------:|--------:|--------:| | S 128 KiB | miss | 4.61 | 3.24 | 1.42x | | S 128 KiB | hit | 6.84 | 3.17 | 2.16x | | S 128 KiB | insert | 3.25 | 3.19 | 1.02x | | M 2 MiB | miss | 5.20 | 3.24 | 1.61x | | M 2 MiB | hit | 7.16 | 3.26 | 2.20x | | M 2 MiB | insert | 3.34 | 3.31 | 1.01x | | L 32 MiB | miss | 6.66 | 5.42 | 1.23x | | L 32 MiB | hit | 9.72 | 5.25 | 1.85x | | L 32 MiB | insert | 5.19 | 5.38 | 0.96x | Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec. Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
Adds `bench_check` and `bench_insert` benchmarks
for`Sbbf::{check,insert}`. Originally benchmarks were part of #10011 but
were split out to follow Contributing guidelines
# Are these changes tested?
Benchmarks compiled and ran using `cargo bench -p parquet --bench
bloom_filter`.
# Are there any user-facing changes?
No.
|
run benchmark bloom_filter |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing sbbf-simd (5ad8cc0) to 511ad06 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark bloom_filter env:
RUSTFLAGS: -Ctarget-cpu=native |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing sbbf-simd (5ad8cc0) to 511ad06 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Which issue does this PR close?
No tracked issue — opening directly, following the precedent of apache/arrow-go#336 which shipped AVX2/SSE4/NEON SBBF probes in 18.3.0, and paralleling an in-progress
[DISCUSS] thread on
dev@arrow.apache.orgfor the C++ port of the same kernel.Rationale for this change
Sbbf::check/Sbbf::insertare on the hot path of Parquet row-group skipping for every reader downstream ofarrow-rs(DataFusion, Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit Parquet block is exactly one AVX2 vector;the K=8 lane test collapses to one
vptest(_mm256_testc_si256). This PR vectorises that loop on x86_64 without changing the algorithm, hash, salts, or wire format. NEON / aarch64 SIMD support is slated for a follow-up PR.What changes are included in this PR?
simd_x86, dispatched via cachedis_x86_feature_detected!("avx2")(dead-coded when-C target-cpu=native).Block::{check,insert}retained as the production fallback for non-AVX2 x86 / aarch64 / wasm32 / RISC-V / 32-bit / big-endian, and as the correctness reference the AVX2 kernel is diff-tested against.Blockchanged from#[repr(transparent)]to#[repr(C, align(32))]. Byte layout unchanged; alignment is asserted at compile time so the AVX2 aligned load/store contract is load-bearing.Are these changes tested?
Yes. The 31 pre-existing
bloom_filterunit tests continue to pass on x86_64 with and without-C target-cpu=native. Two new diff tests —test_simd_{check,insert}_matches_scalar— assert bit-identical AVX2-vs-scalar output across 10K random(block, hash)pairs each. Benchmark results (Cascade Lake-class Xeon) are in the commit message. Benchmarks obtained with the changes in #10041.Are there any user-facing changes?
No. Public API, MSRV, dependencies, and wire format are all unchanged. The only observable effect is faster
Sbbf::check/Sbbf::inserton x86_64 hosts with AVX2.The SIMD kernel was drafted with AI assistance and reviewed line-by-line; correctness is enforced in CI by the diff tests above.
cargo fmt --all -- --checkandcargo clippy -p parquet --all-targets -- -D warningsboth clean on this branch.