parquet: SIMD-accelerate Sbbf probe via autovectorization by dmatth1 · Pull Request #10011 · apache/arrow-rs

dmatth1 · 2026-05-23T14:16:18Z

Which issue does this PR close?

No tracked issue. This PR is opened directly, following the precedent of apache/arrow-go#336, which shipped AVX2/SSE4/NEON SBBF probes in 18.3.0, and in parallel with an in-progress [DISCUSS] thread on dev@arrow.apache.org about the C++ port of the same kernel.

Rationale for this change

Sbbf::check / Sbbf::insert are on the hot path of Parquet row-group skipping for every reader downstream of arrow-rs (DataFusion, Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit Parquet block is exactly one AVX2 vector, so the K=8 lane test collapses to one vptest (_mm256_testc_si256). This PR vectorises that loop on x86_64 without changing the algorithm, hash, salts, or wire format. NEON / aarch64 SIMD support is slated for a follow-up PR.

What changes are included in this PR?

- AVX2 kernel in simd_x86, dispatched via a cached is_x86_feature_detected!("avx2") check (the check is compiled out when AVX2 is already enabled at build time, e.g. with -C target-cpu=native on an AVX2 host).
- Scalar Block::{check,insert} retained as the production fallback for non-AVX2 x86 / aarch64 / wasm32 / RISC-V / 32-bit / big-endian targets, and as the correctness reference the AVX2 kernel is diff-tested against.
- Block changed from #[repr(transparent)] to #[repr(C, align(32))]. Byte layout is unchanged; alignment is asserted at compile time because the AVX2 kernel relies on 32-byte-aligned loads and stores.
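As a rough illustration of the layout change described above, the compile-time assertion can be sketched as follows. This is a minimal sketch, not the PR's actual code: the tuple-struct shape of `Block`, the helper `straddles_cache_line`, and the 64-byte cache-line size are all assumptions made for the example.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for the Parquet SBBF block: eight 32-bit lanes,
/// 256 bits total, forced to 32-byte alignment.
#[derive(Clone, Copy, Debug)]
#[repr(C, align(32))]
pub struct Block(pub [u32; 8]);

// Compile-time checks: the build fails if the layout ever drifts.
// Raising the alignment must not change the size (no padding is added,
// because 32 bytes of payload is already a multiple of 32).
const _: () = assert!(size_of::<Block>() == 32);
const _: () = assert!(align_of::<Block>() == 32);

/// Returns true if a 32-byte object starting at `addr` would cross a
/// cache-line boundary. The 64-byte line size is an assumption that holds
/// on common x86_64 parts; it is not something the PR states.
pub fn straddles_cache_line(addr: usize) -> bool {
    const LINE: usize = 64; // assumed cache-line size in bytes
    addr / LINE != (addr + size_of::<Block>() - 1) / LINE
}
```

With 32-byte alignment the start address modulo 64 is either 0 or 32, so a 32-byte block can never cross a 64-byte line boundary under that assumption.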

Are these changes tested?

Yes. The 31 pre-existing bloom_filter unit tests continue to pass on x86_64 with and without -C target-cpu=native. Two new diff tests — test_simd_{check,insert}_matches_scalar — assert bit-identical AVX2-vs-scalar output across 10K random (block, hash) pairs each. Benchmark results (Cascade Lake-class Xeon) are in the commit message. Benchmarks obtained with the changes in #10041.
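The diff-test idea can be sketched roughly as below. This is an illustrative harness only: the function names, the SplitMix64 generator (used here to avoid an external crate), and the choice to force a hit on about half the pairs are assumptions, not the PR's test code. `check_candidate` stands in for whichever kernel is under test.

```rust
/// Salt constants from the Parquet split-block Bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// One bit per lane, selected by the top 5 bits of `hash * SALT[i]`.
fn mask(hash: u32) -> [u32; 8] {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (hash.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

/// Reference implementation: short-circuiting lane-by-lane test.
pub fn check_reference(block: &[u32; 8], hash: u32) -> bool {
    let m = mask(hash);
    block.iter().zip(m.iter()).all(|(b, m)| b & m != 0)
}

/// Candidate implementation: branchless OR-accumulator.
pub fn check_candidate(block: &[u32; 8], hash: u32) -> bool {
    let m = mask(hash);
    let mut acc = 0u32;
    for i in 0..8 {
        acc |= !block[i] & m[i];
    }
    acc == 0
}

/// Tiny SplitMix64 generator (an assumption of this sketch).
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Runs both implementations on `pairs` random (block, hash) inputs and
/// returns how many times their answers differ.
pub fn count_mismatches(
    reference: fn(&[u32; 8], u32) -> bool,
    candidate: fn(&[u32; 8], u32) -> bool,
    pairs: usize,
    seed: u64,
) -> usize {
    let mut rng = SplitMix64(seed);
    let mut mismatches = 0;
    for _ in 0..pairs {
        let mut block = [0u32; 8];
        for lane in block.iter_mut() {
            *lane = rng.next_u64() as u32;
        }
        let hash = rng.next_u64() as u32;
        // Purely random blocks rarely hit, so force a hit on about half
        // the pairs to exercise both outcomes.
        if rng.next_u64() & 1 == 1 {
            let m = mask(hash);
            for i in 0..8 {
                block[i] |= m[i];
            }
        }
        if reference(&block, hash) != candidate(&block, hash) {
            mismatches += 1;
        }
    }
    mismatches
}
```

A test would then assert that `count_mismatches(check_reference, check_candidate, 10_000, seed)` is zero for some fixed seed.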

Are there any user-facing changes?

No. Public API, MSRV, dependencies, and wire format are all unchanged. The only observable effect is faster Sbbf::check / Sbbf::insert on x86_64 hosts with AVX2.

The SIMD kernel was drafted with AI assistance and reviewed line-by-line; correctness is enforced in CI by the diff tests above. cargo fmt --all -- --check and cargo clippy -p parquet --all-targets -- -D warnings are both clean on this branch.

jhorstmann · 2026-05-23T15:33:17Z

It looks to me like with one small change to the check function, autovectorization can already generate the same instruction sequences: https://rust.godbolt.org/z/a3zT5Gezr
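To make the suggestion concrete, here is a minimal sketch contrasting the two shapes of `check`. It is not a copy of the linked Godbolt snippet; the `Block` layout and method names are assumptions for illustration, and the branchless form follows the `acc |= !block[i] & mask[i]; acc == 0` shape quoted in the commit messages on this PR.

```rust
/// Salt constants from the Parquet split-block Bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// Hypothetical stand-in for the crate's block type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(C, align(32))]
pub struct Block(pub [u32; 8]);

impl Block {
    /// One bit per lane, chosen by the top 5 bits of `hash * SALT[i]`.
    #[inline]
    fn mask(hash: u32) -> [u32; 8] {
        let mut m = [0u32; 8];
        for i in 0..8 {
            m[i] = 1u32 << (hash.wrapping_mul(SALT[i]) >> 27);
        }
        m
    }

    /// Sets the eight mask bits; a plain lane-wise OR with no branches.
    pub fn insert(&mut self, hash: u32) {
        let m = Self::mask(hash);
        for i in 0..8 {
            self.0[i] |= m[i];
        }
    }

    /// Short-circuit shape: `.all()` may exit after any lane, which gives
    /// the loop a data-dependent early exit.
    pub fn check_short_circuit(&self, hash: u32) -> bool {
        let m = Self::mask(hash);
        (0..8).all(|i| self.0[i] & m[i] != 0)
    }

    /// Branchless shape: accumulate every missing bit, compare once.
    /// `!block & mask` is non-zero exactly when a required bit is absent.
    pub fn check_branchless(&self, hash: u32) -> bool {
        let m = Self::mask(hash);
        let mut acc = 0u32;
        for i in 0..8 {
            acc |= !self.0[i] & m[i];
        }
        acc == 0
    }
}
```

Both functions return the same answer for every input; only the loop shape differs. Whether a given compiler version actually vectorizes the second form is something to confirm by inspecting the generated assembly.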

@jhorstmann

… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`simd_x86::sbbf_{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`, even on a baseline `x86-64` build with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2.

Two details are load-bearing for the autovectorizer:

- `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit).

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`):

| Regime | Path | Scalar | Autovec (tf-shim) | Speedup |
|-----------|--------|-------:|------------------:|--------:|
| S 128 KiB | miss | 13.02 | 4.96 | 2.62x |
| S 128 KiB | hit | 13.47 | 4.95 | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41 | 2.15x |
| M 2 MiB | miss | 18.88 | 7.47 | 2.53x |
| M 2 MiB | hit | 18.12 | 7.22 | 2.51x |
| M 2 MiB | insert | 14.99 | 8.45 | 1.77x |
| L 32 MiB | miss | 27.56 | 11.07 | 2.49x |
| L 32 MiB | hit | 26.57 | 11.23 | 2.37x |
| L 32 MiB | insert | 23.53 | 12.77 | 1.84x |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.
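The shim arrangement described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the commit's code: the function names are hypothetical, the block is modelled as a bare `[u32; 8]`, and the feature check is performed on every call rather than cached on `Sbbf` as the message describes.

```rust
/// Salt constants from the Parquet split-block Bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// `#[inline]` so the body can fold into whichever caller it is compiled in.
#[inline]
fn mask(hash: u32) -> [u32; 8] {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (hash.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

/// The single probe implementation, written in the branchless shape.
#[inline]
pub fn check_generic(block: &[u32; 8], hash: u32) -> bool {
    let m = mask(hash);
    let mut acc = 0u32;
    for i in 0..8 {
        acc |= !block[i] & m[i];
    }
    acc == 0
}

/// Thin shim: same source, but this copy is compiled with AVX2 enabled,
/// so the inlined body may be autovectorized with 256-bit instructions.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn check_avx2(block: &[u32; 8], hash: u32) -> bool {
    check_generic(block, hash)
}

/// Runtime dispatch. The commit caches the detection result on `Sbbf`;
/// this sketch simply asks the standard library each time.
pub fn check_dispatch(block: &[u32; 8], hash: u32) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            return unsafe { check_avx2(block, hash) };
        }
    }
    check_generic(block, hash)
}
```

The `unsafe` call and the extra branch in `check_dispatch` are the costs of this design: they buy AVX2 code on binaries built without any `target-cpu` flag.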

@jhorstmann

… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's review on apache#10011: there are no `_mm256_*` intrinsics here. The single probe implementation lives in `Block::{check,insert}`, written in the vectorizer-friendly shape, and a thin `#[target_feature(enable = "avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`, even on a baseline `x86-64` build with no downstream `target-cpu` flag. The shim is reached only after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`); on the scalar fallback path the same source compiles to SSE2.

Two details are load-bearing for the autovectorizer:

- `Block::check` is the branchless integer OR-accumulator `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction shape), not a short-circuiting `.all()`. The short-circuit form defeats vectorization; a bool-`&=` form fails to vectorize through the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is vectorized with it rather than staying a scalar call.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit).

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build` (no `target-cpu`):

| Regime | Path | Scalar | Autovec (avx2 shim) | Speedup |
|-----------|--------|-------:|--------------------:|--------:|
| S 128 KiB | miss | 13.02 | 4.96 | 2.62x |
| S 128 KiB | hit | 13.47 | 4.95 | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41 | 2.15x |
| M 2 MiB | miss | 18.88 | 7.47 | 2.53x |
| M 2 MiB | hit | 18.12 | 7.22 | 2.51x |
| M 2 MiB | insert | 14.99 | 8.45 | 1.77x |
| L 32 MiB | miss | 27.56 | 11.07 | 2.49x |
| L 32 MiB | hit | 26.57 | 11.23 | 2.37x |
| L 32 MiB | insert | 23.53 | 12.77 | 1.84x |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the AVX2-compiled shim and the baseline-compiled scalar path produce identical output across 10K random `(blocks, hash)` pairs each (guarding against an autovectorizer miscompile). All 35 bloom_filter tests pass with and without `-C target-cpu=native`.

dmatth1 · 2026-05-23T17:08:21Z

Great callout. I benchmarked it, and the numbers with autovectorization are better:

Same-host, same-session medians (Cascade Lake-class Xeon @ 2.8 GHz), via the public Sbbf::{check,insert} API (XXH64 + probe), criterion default profile, ns/op:

Regime	Path	Scalar	Autovec (this)	Hand-written AVX2	Autovec vs scalar	Autovec vs hand-written
S 128 KiB	miss	13.02	4.96	5.14	2.62×	+4%
S 128 KiB	hit	13.47	4.95	5.20	2.72×	+5%
S 128 KiB	insert	11.62	5.41	5.38	2.15×	tied
M 2 MiB	miss	18.88	7.47	8.18	2.53×	+9%
M 2 MiB	hit	18.12	7.22	8.01	2.51×	+11%
M 2 MiB	insert	14.99	8.45	8.59	1.77×	tied
L 32 MiB	miss	27.56	11.07	13.47	2.49×	+18%
L 32 MiB	hit	26.57	11.23	13.40	2.37×	+16%
L 32 MiB	insert	23.53	12.77	12.60	1.84×	tied

Changes here: main...dmatth1:arrow-rs:sbbf-autovec-tf
This also unlocks NEON on aarch64, so I will bundle those numbers into the next revision.

@jhorstmann

… shim

Per @jhorstmann's review on apache#10011: no `_mm256_*` intrinsics. The single probe implementation lives in `Block::{check,insert}` and a thin `#[target_feature(enable = "avx2")]` shim calls into it. Because the shim is compiled with AVX2 on, LLVM autovectorizes the plain Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` on a baseline `x86-64` build, no `target-cpu` flag required. The shim is reached after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`).

Two preconditions for autovec:

- `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization.
- `Block::mask` is `#[inline]` so it folds into the shim.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line.

The same branchless `Block::check` also autovectorizes to NEON on aarch64: no shim, no `target_feature` needed (NEON is baseline). On main, the short-circuit form left aarch64 fully scalar.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"` (removed before this commit).

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`:

| Regime | Path | Scalar | Autovec (avx2 shim) | Speedup |
|-----------|--------|-------:|--------------------:|--------:|
| S 128 KiB | miss | 13.02 | 4.96 | 2.62x |
| S 128 KiB | hit | 13.47 | 4.95 | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41 | 2.15x |
| M 2 MiB | miss | 18.88 | 7.47 | 2.53x |
| M 2 MiB | hit | 18.12 | 7.22 | 2.51x |
| M 2 MiB | insert | 14.99 | 8.45 | 1.77x |
| L 32 MiB | miss | 27.56 | 11.07 | 2.49x |
| L 32 MiB | hit | 26.57 | 11.23 | 2.37x |
| L 32 MiB | insert | 23.53 | 12.77 | 1.84x |

aarch64, Apple Silicon M1:

| Regime | Path | Scalar | Autovec (NEON) | Speedup |
|-----------|--------|-------:|---------------:|--------:|
| S 128 KiB | miss | 4.61 | 3.24 | 1.42x |
| S 128 KiB | hit | 6.84 | 3.17 | 2.16x |
| S 128 KiB | insert | 3.25 | 3.19 | 1.02x |
| M 2 MiB | miss | 5.20 | 3.24 | 1.61x |
| M 2 MiB | hit | 7.16 | 3.26 | 2.20x |
| M 2 MiB | insert | 3.34 | 3.31 | 1.01x |
| L 32 MiB | miss | 6.66 | 5.42 | 1.23x |
| L 32 MiB | hit | 9.72 | 5.25 | 1.85x |
| L 32 MiB | insert | 5.19 | 5.38 | 0.96x |

Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec.

Tests: `test_simd_{check,insert}_matches_scalar` diff the AVX2 shim against the baseline-compiled scalar across 10K random pairs; `test_check_matches_reference_aarch64` diffs the autovec'd check against an inline short-circuit reference for the aarch64 codegen path. All bloom_filter tests pass with and without `-C target-cpu=native`.

@jhorstmann

Per @jhorstmann's review on apache#10011: no hand-written `_mm256_*` / NEON intrinsics, no runtime dispatch, no `target_feature` shim. `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec. `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). This is already the convention for analytical Rust binaries (Polars, DataFusion, Databend distros).

A runtime AVX2-detect shim was prototyped and rejected for this PR: it adds `unsafe`, a per-`Sbbf` cached bool, and a dispatch branch in the hot path, in exchange for AVX2 codegen on default-built binaries running on AVX2 hardware. The simplification was preferred.

Two preconditions for autovec:

- `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op.

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`:

| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 13.02 | 4.96 | 2.62x |
| S 128 KiB | hit | 13.47 | 4.95 | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41 | 2.15x |
| M 2 MiB | miss | 18.88 | 7.47 | 2.53x |
| M 2 MiB | hit | 18.12 | 7.22 | 2.51x |
| M 2 MiB | insert | 14.99 | 8.45 | 1.77x |
| L 32 MiB | miss | 27.56 | 11.07 | 2.49x |
| L 32 MiB | hit | 26.57 | 11.23 | 2.37x |
| L 32 MiB | insert | 23.53 | 12.77 | 1.84x |

aarch64, Apple Silicon M1 (NEON via baseline autovec, default build):

| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 4.61 | 3.24 | 1.42x |
| S 128 KiB | hit | 6.84 | 3.17 | 2.16x |
| S 128 KiB | insert | 3.25 | 3.19 | 1.02x |
| M 2 MiB | miss | 5.20 | 3.24 | 1.61x |
| M 2 MiB | hit | 7.16 | 3.26 | 2.20x |
| M 2 MiB | insert | 3.34 | 3.31 | 1.01x |
| L 32 MiB | miss | 6.66 | 5.42 | 1.23x |
| L 32 MiB | hit | 9.72 | 5.25 | 1.85x |
| L 32 MiB | insert | 5.19 | 5.38 | 0.96x |

Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
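Since this revision ties the generated code to the target features enabled at compile time, a downstream binary may want to confirm which level it was actually built for. The helper below is a hypothetical sketch (the function name and returned labels are not part of arrow-rs); it relies only on the standard `cfg!` macro and the `target_arch` / `target_feature` configuration options.

```rust
/// Reports the widest SIMD level that was statically enabled when this
/// crate was compiled. This reflects build flags such as
/// `-C target-cpu=x86-64-v3`, not what the host CPU supports at runtime.
pub fn compiled_simd_level() -> &'static str {
    if cfg!(all(target_arch = "x86_64", target_feature = "avx2")) {
        // Built with x86-64-v3, native on an AVX2 host, or +avx2.
        "avx2"
    } else if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
        // NEON is part of the default aarch64 feature set.
        "neon"
    } else if cfg!(all(target_arch = "x86_64", target_feature = "sse2")) {
        // Baseline x86-64: SSE2 only.
        "sse2"
    } else {
        "scalar"
    }
}
```

Logging this value once at startup is one way to notice that a deployment was built without the intended `RUSTFLAGS`.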

@jhorstmann

Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no target_feature shim, no runtime dispatch. `Block::check` is rewritten as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape) and LLVM autovectorizes it directly to NEON on aarch64 and to AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`).

A runtime AVX2-detect shim was prototyped and rejected: the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch) beat the only thing it bought, which was AVX2 codegen for default-built binaries on AVX2 hardware. Production deployments that care already set the target-cpu flag.

Preconditions: `Block::mask` is `#[inline]` (folds into the call site) and `Block` is `#[repr(C, align(32))]` with size/align asserted (so the 256-bit load/store hits one cache line).

A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op.

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, `-C target-cpu=x86-64-v3`:

| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 13.02 | 4.96 | 2.62x |
| S 128 KiB | hit | 13.47 | 4.95 | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41 | 2.15x |
| M 2 MiB | miss | 18.88 | 7.47 | 2.53x |
| M 2 MiB | hit | 18.12 | 7.22 | 2.51x |
| M 2 MiB | insert | 14.99 | 8.45 | 1.77x |
| L 32 MiB | miss | 27.56 | 11.07 | 2.49x |
| L 32 MiB | hit | 26.57 | 11.23 | 2.37x |
| L 32 MiB | insert | 23.53 | 12.77 | 1.84x |

aarch64, Apple Silicon M1 (NEON via baseline autovec):

| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 4.61 | 3.24 | 1.42x |
| S 128 KiB | hit | 6.84 | 3.17 | 2.16x |
| S 128 KiB | insert | 3.25 | 3.19 | 1.02x |
| M 2 MiB | miss | 5.20 | 3.24 | 1.61x |
| M 2 MiB | hit | 7.16 | 3.26 | 2.20x |
| M 2 MiB | insert | 3.34 | 3.31 | 1.01x |
| L 32 MiB | miss | 6.66 | 5.42 | 1.23x |
| L 32 MiB | hit | 9.72 | 5.25 | 1.85x |
| L 32 MiB | insert | 5.19 | 5.38 | 0.96x |

Insert ties on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs, on every target. All bloom_filter tests pass.

@jhorstmann

`Sbbf::{check,insert}` are on the hot path of Parquet row-group skipping for every reader downstream of `arrow-rs` (DataFusion, Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit Parquet block is exactly one AVX2 vector / two NEON `uint32x4_t` halves; the K=8 lane test is a one-instruction `vptest` on AVX2 and an equivalent SIMD reduce on NEON. This PR vectorises the probe without changing the algorithm, hash, salts, or wire format.

Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no target_feature shim, no runtime dispatch. `Block::check` is rewritten as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape) and LLVM autovectorizes it directly to NEON on aarch64 and to AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`).

A runtime AVX2-detect shim was prototyped and rejected: the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch) beat the only thing it bought, which was AVX2 codegen for default-built binaries on AVX2 hardware. Production deployments that care already set the target-cpu flag.

Preconditions: `Block::mask` is `#[inline]` (folds into the call site) and `Block` is `#[repr(C, align(32))]` with size/align asserted (so the 256-bit load/store hits one cache line).

A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op.

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, `-C target-cpu=x86-64-v3`:

| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 13.02 | 4.96 | 2.62x |
| S 128 KiB | hit | 13.47 | 4.95 | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41 | 2.15x |
| M 2 MiB | miss | 18.88 | 7.47 | 2.53x |
| M 2 MiB | hit | 18.12 | 7.22 | 2.51x |
| M 2 MiB | insert | 14.99 | 8.45 | 1.77x |
| L 32 MiB | miss | 27.56 | 11.07 | 2.49x |
| L 32 MiB | hit | 26.57 | 11.23 | 2.37x |
| L 32 MiB | insert | 23.53 | 12.77 | 1.84x |

aarch64, Apple Silicon M1 (NEON via baseline autovec):

| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 4.61 | 3.24 | 1.42x |
| S 128 KiB | hit | 6.84 | 3.17 | 2.16x |
| S 128 KiB | insert | 3.25 | 3.19 | 1.02x |
| M 2 MiB | miss | 5.20 | 3.24 | 1.61x |
| M 2 MiB | hit | 7.16 | 3.26 | 2.20x |
| M 2 MiB | insert | 3.34 | 3.31 | 1.01x |
| L 32 MiB | miss | 6.66 | 5.42 | 1.23x |
| L 32 MiB | hit | 9.72 | 5.25 | 1.85x |
| L 32 MiB | insert | 5.19 | 5.38 | 0.96x |

Insert ties on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs, on every target. All bloom_filter tests pass.

@jhorstmann

Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane test collapses to one `vptest` (`_mm256_testc_si256`). This PR vectorises that loop without changing the algorithm, hash, salts, or wire format.

Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec. `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime AVX2-detect shim was prototyped but I prefer this simplification.

Two preconditions for autovec:

- `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op.

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`:

| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 13.02 | 4.96 | 2.62x |
| S 128 KiB | hit | 13.47 | 4.95 | 2.72x |
| S 128 KiB | insert | 11.62 | 5.41 | 2.15x |
| M 2 MiB | miss | 18.88 | 7.47 | 2.53x |
| M 2 MiB | hit | 18.12 | 7.22 | 2.51x |
| M 2 MiB | insert | 14.99 | 8.45 | 1.77x |
| L 32 MiB | miss | 27.56 | 11.07 | 2.49x |
| L 32 MiB | hit | 26.57 | 11.23 | 2.37x |
| L 32 MiB | insert | 23.53 | 12.77 | 1.84x |

aarch64, Apple Silicon M1 (NEON via baseline autovec, default build):

| Regime | Path | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss | 4.61 | 3.24 | 1.42x |
| S 128 KiB | hit | 6.84 | 3.17 | 2.16x |
| S 128 KiB | insert | 3.25 | 3.19 | 1.02x |
| M 2 MiB | miss | 5.20 | 3.24 | 1.61x |
| M 2 MiB | hit | 7.16 | 3.26 | 2.20x |
| M 2 MiB | insert | 3.34 | 3.31 | 1.01x |
| L 32 MiB | miss | 6.66 | 5.42 | 1.23x |
| L 32 MiB | hit | 9.72 | 5.25 | 1.85x |
| L 32 MiB | insert | 5.19 | 5.38 | 0.96x |

Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.

@jhorstmann

Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane test collapses to one `vptest` (`_mm256_testc_si256`). This PR vectorises that loop without changing the algorithm, hash, salts, or wire format.

Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the vectorizer-friendly branchless shape and LLVM autovectorizes it directly to whatever SIMD ISA is enabled at compile time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is mandatory baseline, so the default build autovectorizes to `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`): autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + vptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only): partial SSE2 autovec; `vpsllvd` doesn't exist pre-AVX2, so the per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set `RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime AVX2-detect shim was prototyped but I prefer this simplification.

Two preconditions for autovec:

- `Block::check` is the branchless `acc |= !block & mask; acc == 0` ("testc" reduction shape); a short-circuiting `.all()` defeats vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public `Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile, same-session medians, ns/op.

x86_64, Cascade Lake-class Xeon @ 2.8 GHz, built with `-C target-cpu=x86-64-v3`:

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   |  13.02 |    4.96 |   2.62x |
| S 128 KiB | hit    |  13.47 |    4.95 |   2.72x |
| S 128 KiB | insert |  11.62 |    5.41 |   2.15x |
| M 2 MiB   | miss   |  18.88 |    7.47 |   2.53x |
| M 2 MiB   | hit    |  18.12 |    7.22 |   2.51x |
| M 2 MiB   | insert |  14.99 |    8.45 |   1.77x |
| L 32 MiB  | miss   |  27.56 |   11.07 |   2.49x |
| L 32 MiB  | hit    |  26.57 |   11.23 |   2.37x |
| L 32 MiB  | insert |  23.53 |   12.77 |   1.84x |

aarch64, Apple Silicon M1 (NEON via baseline autovec, default build):

| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   |   4.61 |    3.24 |   1.42x |
| S 128 KiB | hit    |   6.84 |    3.17 |   2.16x |
| S 128 KiB | insert |   3.25 |    3.19 |   1.02x |
| M 2 MiB   | miss   |   5.20 |    3.24 |   1.61x |
| M 2 MiB   | hit    |   7.16 |    3.26 |   2.20x |
| M 2 MiB   | insert |   3.34 |    3.31 |   1.01x |
| L 32 MiB  | miss   |   6.66 |    5.42 |   1.23x |
| L 32 MiB  | hit    |   9.72 |    5.25 |   1.85x |
| L 32 MiB  | insert |   5.19 |    5.38 |   0.96x |

Insert is ~tied on aarch64 because main's `Block::insert` was already vectorizer-friendly. The PR's aarch64 win lives in `check`, where the branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check` against an inline short-circuit reference across 10K random pairs on every target the crate is built for. All bloom_filter tests pass.
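To make the two `check` shapes concrete, here is a minimal, self-contained sketch. It is not the crate's code: the `Block` type below is an illustrative stand-in for the private type, the method names `check_short_circuit` and `check_branchless` are hypothetical, the salt values are the Parquet split-block constants as I recall them and should be verified against the spec, and the xorshift generator is only there to avoid a dependency. The two forms agree because every mask lane has exactly one bit set, so `block & mask != 0` holds in a lane exactly when `!block & mask == 0` does.

```rust
/// Illustrative stand-in for the crate's private `Block` type (assumption).
#[derive(Clone, Copy, Debug)]
#[repr(C, align(32))]
struct Block([u32; 8]);

// Size and alignment are checked at compile time, at module scope.
const _: () = assert!(std::mem::size_of::<Block>() == 32);
const _: () = assert!(std::mem::align_of::<Block>() == 32);

// Split-block Bloom filter salts (recalled from the Parquet spec; verify before reuse).
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

impl Block {
    /// One bit per 32-bit lane, chosen by the top 5 bits of `x * SALT[i]`.
    #[inline]
    fn mask(x: u32) -> Self {
        let mut out = [0u32; 8];
        for i in 0..8 {
            out[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
        }
        Block(out)
    }

    /// Reference form: stops at the first lane whose bit is missing.
    fn check_short_circuit(&self, hash: u32) -> bool {
        let mask = Self::mask(hash);
        (0..8).all(|i| self.0[i] & mask.0[i] != 0)
    }

    /// Branchless "testc" form: accumulate missing bits, compare once at the end.
    fn check_branchless(&self, hash: u32) -> bool {
        let mask = Self::mask(hash);
        let mut acc = 0u32;
        for i in 0..8 {
            acc |= !self.0[i] & mask.0[i];
        }
        acc == 0
    }

    /// Set the mask bit in every lane.
    fn insert(&mut self, hash: u32) {
        let mask = Self::mask(hash);
        for i in 0..8 {
            self.0[i] |= mask.0[i];
        }
    }
}

/// Tiny xorshift PRNG so the sketch needs no external crates.
fn xorshift32(state: &mut u32) -> u32 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    x
}

fn main() {
    let mut rng = 0x9e37_79b9u32;
    for _ in 0..10_000 {
        let mut words = [0u32; 8];
        for w in words.iter_mut() {
            // OR two draws so a reasonable share of probes are hits.
            *w = xorshift32(&mut rng) | xorshift32(&mut rng);
        }
        let mut block = Block(words);
        let hash = xorshift32(&mut rng);
        assert_eq!(block.check_short_circuit(hash), block.check_branchless(hash));
        block.insert(hash);
        assert!(block.check_branchless(hash));
    }
    println!("branchless and short-circuit checks agree");
}
```

Whether the branchless loop actually vectorizes depends on the compiler version and the enabled target features, so inspecting the generated assembly for the build in question is the only reliable confirmation.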

dmatth1 · 2026-05-23T20:43:26Z

Tested locally on aarch64 too (Apple Silicon M1, baseline NEON autovec):

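| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   |   4.61 |    3.24 |   1.42x |
| S 128 KiB | hit    |   6.84 |    3.17 |   2.16x |
| S 128 KiB | insert |   3.25 |    3.19 |   1.02x |
| M 2 MiB   | miss   |   5.20 |    3.24 |   1.61x |
| M 2 MiB   | hit    |   7.16 |    3.26 |   2.20x |
| M 2 MiB   | insert |   3.34 |    3.31 |   1.01x |
| L 32 MiB  | miss   |   6.66 |    5.42 |   1.23x |
| L 32 MiB  | hit    |   9.72 |    5.25 |   1.85x |
| L 32 MiB  | insert |   5.19 |    5.38 |   0.96x |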

Big simplifier. I included details about how autovec reduces/lowers instructions in the new commit message. Going to force-push to use this approach.

One thing beyond your suggestion: I prototyped a runtime AVX2-detect shim and dropped it for the simplification (no unsafe, no Sbbf field, no hot-path branch) since users who care about AVX2 probably already set -C target-cpu=....

etseidl

Thanks @dmatth1, this looks interesting. On my older Alder Lake CPU this does lead to quite a regression. Can we feature gate this?

group                            no_vec                                  vectorized
-----                            ------                                  ----------
check/hit/l_32MiB                1.00    394.6±6.37µs 120.8 MElem/sec    1.27    501.4±7.02µs 95.1 MElem/sec
check/hit/m_2MiB                 1.00    399.2±6.42µs 119.4 MElem/sec    1.32    525.1±4.28µs 90.8 MElem/sec
check/hit/s_128KiB               1.00    341.9±2.18µs 139.5 MElem/sec    1.27    434.9±2.58µs 109.7 MElem/sec
check/miss/l_32MiB               1.00    376.6±3.07µs 126.6 MElem/sec    1.33    500.0±5.55µs 95.4 MElem/sec
check/miss/m_2MiB                1.00    373.3±2.89µs 127.7 MElem/sec    1.41    526.0±3.99µs 90.6 MElem/sec
check/miss/s_128KiB              1.00    322.9±6.76µs 147.7 MElem/sec    1.35    436.1±2.63µs 109.3 MElem/sec

Edit: I'm dumb...forgot to compile with target-cpu=native 😅

group                            native                                  no_vec
-----                            ------                                  ------
check/hit/l_32MiB                1.00    200.2±2.79µs 238.1 MElem/sec    1.97    394.6±6.37µs 120.8 MElem/sec
check/hit/m_2MiB                 1.00    198.8±1.85µs 239.8 MElem/sec    2.01    399.2±6.42µs 119.4 MElem/sec
check/hit/s_128KiB               1.00    150.0±1.04µs 317.8 MElem/sec    2.28    341.9±2.18µs 139.5 MElem/sec
check/miss/l_32MiB               1.00    199.9±3.97µs 238.6 MElem/sec    1.88    376.6±3.07µs 126.6 MElem/sec
check/miss/m_2MiB                1.00    200.1±1.78µs 238.3 MElem/sec    1.87    373.3±2.89µs 127.7 MElem/sec
check/miss/s_128KiB              1.00    150.0±0.91µs 317.8 MElem/sec    2.15    322.9±6.76µs 147.7 MElem/sec
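Since the outcome above hinges on whether the build enabled AVX2, a small check along these lines can help spot the mismatch before benchmarking. This is a sketch only: the function names `built_with_avx2`, `cpu_has_avx2`, and `advice` are hypothetical and not part of arrow-rs. It compares the compile-time target feature (`cfg!`) with what the CPU reports at runtime (`is_x86_feature_detected!`).

```rust
/// True when this binary was compiled with AVX2 enabled on x86_64,
/// e.g. via `-C target-cpu=x86-64-v3` or `-C target-cpu=native` on an AVX2 host.
fn built_with_avx2() -> bool {
    cfg!(all(target_arch = "x86_64", target_feature = "avx2"))
}

/// True when the CPU running this binary reports AVX2 support.
#[cfg(target_arch = "x86_64")]
fn cpu_has_avx2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// Non-x86_64 targets never report AVX2.
#[cfg(not(target_arch = "x86_64"))]
fn cpu_has_avx2() -> bool {
    false
}

/// Human-readable summary of the two flags (hypothetical helper).
fn advice(built: bool, cpu: bool) -> &'static str {
    match (built, cpu) {
        (true, _) => "AVX2 enabled at compile time",
        (false, true) => "CPU has AVX2 but the build targets baseline; consider -C target-cpu=x86-64-v3",
        (false, false) => "AVX2 not available on this CPU or target",
    }
}

fn main() {
    println!("{}", advice(built_with_avx2(), cpu_has_avx2()));
}
```

The middle case corresponds to the first set of numbers in this comment: a default build running on a CPU that could have used the wider instructions.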

etseidl · 2026-05-28T16:27:01Z

-    }
-}
-
-impl std::ops::Index<usize> for Block {


Why are these impls removed? I think that makes this a breaking API change.

`Block` is a private type and is only used here, so I don't think so. I ran `cargo public-api diff` for this branch vs main and didn't see any differences.

Right you are. Had Sbbf in my head 😅

etseidl · 2026-05-28T16:27:48Z

    group.finish();
 }

+/// Benchmark `Sbbf::insert` across the same three cache regimes as


It would be nice for the bench changes to be a separate PR.

Mainly for conciseness? Or should we push the bench changes first, then this Sbbf probe change second, so that it's easy to compare? Otherwise I'd lean towards keeping it in here.

> Or should we push the bench changes first, then this Sbbf probe change second, so that it's easy to compare?

This (see the contributing guide). Thanks!

Done! See #10041

Adds `bench_check` and `bench_insert` benchmarks for `Sbbf::{check,insert}`. Originally benchmarks were part of #10011 but were split out to follow Contributing guidelines

# Are these changes tested?

Benchmarks compiled and ran using `cargo bench -p parquet --bench bloom_filter`.

# Are there any user-facing changes?

No.

etseidl · 2026-05-31T03:26:09Z

run benchmark bloom_filter

adriangbot · 2026-05-31T03:29:56Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4585582765-382-4bwhf 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing sbbf-simd (5ad8cc0) to 511ad06 (merge-base) diff
BENCH_NAME=bloom_filter
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench bloom_filter
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-05-31T03:37:54Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                            main                                    sbbf-simd
-----                            ----                                    ---------
check/hit/l_32MiB                1.38    522.5±2.51µs 91.3 MElem/sec     1.00    378.1±6.45µs 126.1 MElem/sec
check/hit/m_2MiB                 1.57    475.9±5.96µs 100.2 MElem/sec    1.00    303.1±2.72µs 157.3 MElem/sec
check/hit/s_128KiB               1.63    392.2±0.66µs 121.6 MElem/sec    1.00    240.8±0.88µs 198.1 MElem/sec
check/miss/l_32MiB               1.00    360.1±6.16µs 132.4 MElem/sec    1.04    373.8±1.11µs 127.6 MElem/sec
check/miss/m_2MiB                1.04    322.7±2.72µs 147.8 MElem/sec    1.00    310.9±4.59µs 153.4 MElem/sec
check/miss/s_128KiB              1.00    231.0±1.15µs 206.5 MElem/sec    1.04    241.3±0.66µs 197.6 MElem/sec
fold_to_target_fpp/ndv/1000      1.00     71.1±1.55µs 439.7 MElem/sec    1.00     71.0±1.37µs 440.3 MElem/sec
fold_to_target_fpp/ndv/10000     1.00     65.9±1.82µs 474.2 MElem/sec    1.07    70.2±11.00µs 444.9 MElem/sec
fold_to_target_fpp/ndv/100000    1.10    78.5±15.79µs 398.3 MElem/sec    1.00     71.2±8.88µs 438.7 MElem/sec
insert/ins/l_32MiB               1.08    385.1±2.08µs 123.8 MElem/sec    1.00    358.0±6.11µs 133.2 MElem/sec
insert/ins/m_2MiB                1.04    309.0±2.26µs 154.3 MElem/sec    1.00    295.7±4.41µs 161.2 MElem/sec
insert/ins/s_128KiB              1.01    227.5±0.65µs 209.6 MElem/sec    1.00    226.3±0.46µs 210.7 MElem/sec
insert_and_fold/values/1000      1.00     83.4±1.84µs 11.4 MElem/sec     1.06     88.0±0.18µs 10.8 MElem/sec
insert_and_fold/values/10000     1.04    109.4±0.77µs 87.2 MElem/sec     1.00    104.9±3.73µs 90.9 MElem/sec
insert_and_fold/values/100000    1.00    352.9±2.59µs 270.2 MElem/sec    1.00    351.7±2.22µs 271.1 MElem/sec
insert_only/values/1000          1.00     13.6±0.02µs 70.2 MElem/sec     1.00     13.7±0.01µs 69.8 MElem/sec
insert_only/values/10000         1.07     39.5±0.37µs 241.7 MElem/sec    1.00     37.0±0.12µs 257.5 MElem/sec
insert_only/values/100000        1.00    286.5±1.27µs 332.9 MElem/sec    1.00    285.4±1.97µs 334.1 MElem/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	185.0s
Peak memory	4.6 GiB
Avg memory	4.2 GiB
CPU user	165.6s
CPU sys	16.6s
Peak spill	0 B

branch

Metric	Value
Wall time	185.0s
Peak memory	4.5 GiB
Avg memory	4.2 GiB
CPU user	164.1s
CPU sys	18.8s
Peak spill	0 B

etseidl · 2026-05-31T04:43:09Z

run benchmark bloom_filter

env:
  RUSTFLAGS: -Ctarget-cpu=native

adriangbot · 2026-05-31T04:46:40Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4585724751-383-nm4ns 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

Comparing sbbf-simd (5ad8cc0) to 511ad06 (merge-base) diff
BENCH_NAME=bloom_filter
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench bloom_filter
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-05-31T04:55:00Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                            main                                    sbbf-simd
-----                            ----                                    ---------
check/hit/l_32MiB                1.34    521.9±2.11µs 91.4 MElem/sec     1.00    388.6±1.17µs 122.7 MElem/sec
check/hit/m_2MiB                 1.45    484.4±4.52µs 98.4 MElem/sec     1.00    333.9±2.38µs 142.8 MElem/sec
check/hit/s_128KiB               1.62    393.5±0.59µs 121.2 MElem/sec    1.00    242.8±0.13µs 196.4 MElem/sec
check/miss/l_32MiB               1.00    349.2±1.04µs 136.5 MElem/sec    1.09    381.4±2.76µs 125.0 MElem/sec
check/miss/m_2MiB                1.00    316.3±2.91µs 150.8 MElem/sec    1.10    348.6±5.68µs 136.8 MElem/sec
check/miss/s_128KiB              1.00    229.8±0.51µs 207.5 MElem/sec    1.06    243.9±0.12µs 195.5 MElem/sec
fold_to_target_fpp/ndv/1000      1.00     59.3±1.84µs 527.4 MElem/sec    1.03     60.7±1.70µs 514.5 MElem/sec
fold_to_target_fpp/ndv/10000     1.03     55.5±3.36µs 563.3 MElem/sec    1.00     53.9±1.15µs 580.1 MElem/sec
fold_to_target_fpp/ndv/100000    1.00     53.2±1.41µs 587.1 MElem/sec    1.00     53.1±1.25µs 588.3 MElem/sec
insert/ins/l_32MiB               1.03    377.9±1.35µs 126.2 MElem/sec    1.00    365.1±0.94µs 130.6 MElem/sec
insert/ins/m_2MiB                1.00    311.6±4.80µs 153.1 MElem/sec    1.03    321.5±1.97µs 148.3 MElem/sec
insert/ins/s_128KiB              1.01    225.5±0.14µs 211.4 MElem/sec    1.00    224.1±0.16µs 212.8 MElem/sec
insert_and_fold/values/1000      1.00     69.3±0.14µs 13.8 MElem/sec     1.02     70.8±0.13µs 13.5 MElem/sec
insert_and_fold/values/10000     1.00     87.1±0.15µs 109.5 MElem/sec    1.09     95.0±0.35µs 100.4 MElem/sec
insert_and_fold/values/100000    1.00    316.4±0.56µs 301.4 MElem/sec    1.27    403.0±1.44µs 236.7 MElem/sec
insert_only/values/1000          1.00     13.6±0.01µs 70.3 MElem/sec     1.01     13.7±0.01µs 69.8 MElem/sec
insert_only/values/10000         1.00     36.7±0.03µs 260.1 MElem/sec    1.02     37.3±0.18µs 255.6 MElem/sec
insert_only/values/100000        1.00    268.2±0.38µs 355.6 MElem/sec    1.10    296.3±2.71µs 321.8 MElem/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	185.0s
Peak memory	4.6 GiB
Avg memory	4.2 GiB
CPU user	164.3s
CPU sys	16.1s
Peak spill	0 B

branch

Metric	Value
Wall time	180.0s
Peak memory	4.5 GiB
Avg memory	4.1 GiB
CPU user	160.7s
CPU sys	18.4s
Peak spill	0 B

github-actions Bot added the parquet Changes to the parquet crate label May 23, 2026

dmatth1 force-pushed the sbbf-simd branch from 5dd4690 to 4f1e5df Compare May 23, 2026 15:57

dmatth1 force-pushed the sbbf-simd branch from 4f1e5df to 2c47d62 Compare May 23, 2026 20:44

dmatth1 mentioned this pull request May 24, 2026

GH-50026: [C++][Parquet] SIMD-accelerate SBBF probe via branchless autovec apache/arrow#50030

Open

etseidl reviewed May 28, 2026

dmatth1 changed the title ~~parquet: SIMD-accelerate Sbbf probe (AVX2, scalar fallback)~~ parquet: SIMD-accelerate Sbbf probe via autovectorization May 29, 2026

dmatth1 mentioned this pull request May 30, 2026

bench(parquet): add Sbbf check/insert benchmarks #10041

Merged

dmatth1 force-pushed the sbbf-simd branch from 2c47d62 to ffd3c3d Compare May 30, 2026 21:52

etseidl added the performance label May 31, 2026

Merge branch 'main' into sbbf-simd

5ad8cc0
