parquet: SIMD-accelerate Sbbf probe via autovectorization#10011

Open
dmatth1 wants to merge 2 commits into
apache:mainfrom
dmatth1:sbbf-simd

Conversation

@dmatth1
Contributor

@dmatth1 dmatth1 commented May 23, 2026

Which issue does this PR close?

No tracked issue. I am opening this directly, following the precedent of apache/arrow-go#336, which shipped AVX2/SSE4/NEON SBBF probes in 18.3.0, and in parallel with an in-progress [DISCUSS] thread on dev@arrow.apache.org about the C++ port of the same kernel.

Rationale for this change

Sbbf::check / Sbbf::insert are on the hot path of Parquet row-group skipping for every reader downstream of arrow-rs (DataFusion, Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit Parquet block is exactly one AVX2 vector, so the K=8 lane test collapses to a single vptest (_mm256_testc_si256). This PR vectorises that loop on x86_64 without changing the algorithm, hash, salts, or wire format. NEON / aarch64 SIMD support is slated for a follow-up PR.
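
For readers unfamiliar with the probe, here is a minimal sketch of what a hand-written AVX2 check of this shape could look like next to a scalar reference. It is an illustration, not the code in this PR: the names `Block`, `mask_scalar`, `check_scalar`, `check_avx2` and `check` are hypothetical, and the salt constants are the ones listed in the Parquet split-block Bloom filter specification.

```rust
/// The eight odd salt constants from the Parquet split-block Bloom filter spec.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// One 256-bit block: eight 32-bit words, 32-byte aligned (hypothetical type).
#[repr(C, align(32))]
pub struct Block(pub [u32; 8]);

/// Scalar reference mask: one bit per lane, selected by the top 5 bits of `x * SALT[i]`.
pub fn mask_scalar(x: u32) -> [u32; 8] {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

/// Scalar reference check: every lane must contain its mask bit.
pub fn check_scalar(block: &Block, x: u32) -> bool {
    let m = mask_scalar(x);
    (0..8).all(|i| (block.0[i] & m[i]) != 0)
}

/// AVX2 sketch: build the mask in one vector and test all eight lanes at once.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
pub unsafe fn check_avx2(block: &Block, x: u32) -> bool {
    use std::arch::x86_64::*;
    unsafe {
        let salt = _mm256_loadu_si256(SALT.as_ptr() as *const __m256i);
        let prod = _mm256_mullo_epi32(_mm256_set1_epi32(x as i32), salt);
        let shift = _mm256_srli_epi32::<27>(prod);
        let mask = _mm256_sllv_epi32(_mm256_set1_epi32(1), shift);
        // Aligned load is valid because `Block` is `align(32)`.
        let bits = _mm256_load_si256(block.0.as_ptr() as *const __m256i);
        // testc returns 1 exactly when (!bits & mask) == 0.
        _mm256_testc_si256(bits, mask) == 1
    }
}

/// Dispatch: use the AVX2 path when the CPU reports it, else the scalar path.
pub fn check(block: &Block, x: u32) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { check_avx2(block, x) };
        }
    }
    check_scalar(block, x)
}
```

On targets other than x86_64 the sketch simply falls through to `check_scalar`, so both paths can be compared wherever AVX2 is present.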

What changes are included in this PR?

  • AVX2 kernel in simd_x86, dispatched via a cached is_x86_feature_detected!("avx2") result (the check folds to a constant, and the fallback branch is eliminated, when building with -C target-cpu=native on an AVX2 host).
  • Scalar Block::{check,insert} retained as the production fallback for non-AVX2 x86 / aarch64 / wasm32 / RISC-V / 32-bit / big-endian, and as the correctness reference the AVX2 kernel is diff-tested against.
  • Block changed from #[repr(transparent)] to #[repr(C, align(32))]. Byte layout is unchanged; size and alignment are asserted at compile time so the aligned AVX2 load/store can rely on them.
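
As a minimal sketch of the layout change in the last bullet, assuming a tuple-struct `Block` over `[u32; 8]` (the helper `is_aligned` is hypothetical and only there to make the property observable):

```rust
use std::mem::{align_of, size_of};

/// A 256-bit split-block Bloom filter block: eight 32-bit words.
/// `repr(C)` pins the field layout; `align(32)` makes every block start
/// on a 32-byte boundary.
#[repr(C, align(32))]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Block(pub [u32; 8]);

// Evaluated at compile time: the build fails if either property is lost.
const _: () = assert!(size_of::<Block>() == 32);
const _: () = assert!(align_of::<Block>() == 32);

/// True when the block's address is a multiple of 32. A 32-byte value at such
/// an address cannot straddle a 64-byte cache line.
pub fn is_aligned(block: &Block) -> bool {
    (block as *const Block as usize) % 32 == 0
}
```

Because the size stays at 32 bytes, a `Vec<Block>` keeps the same byte layout as a packed array of eight-word blocks; only the allocation's starting alignment changes.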

Are these changes tested?

Yes. The 31 pre-existing bloom_filter unit tests continue to pass on x86_64 with and without -C target-cpu=native. Two new diff tests, test_simd_{check,insert}_matches_scalar, assert bit-identical AVX2-vs-scalar output across 10K random (block, hash) pairs each. Benchmark results (Cascade Lake-class Xeon) are in the commit message; they were obtained with the changes in #10041 applied.

Are there any user-facing changes?

No. Public API, MSRV, dependencies, and wire format are all unchanged. The only observable effect is faster Sbbf::check / Sbbf::insert on x86_64 hosts with AVX2.


The SIMD kernel was drafted with AI assistance and reviewed line-by-line; correctness is enforced in CI by the diff tests above. cargo fmt --all -- --check and cargo clippy -p parquet --all-targets -- -D warnings both pass cleanly on this branch.

@github-actions github-actions Bot added the parquet Changes to the parquet crate label May 23, 2026
@jhorstmann
Contributor

It looks to me like with one small change to the check function, autovectorization can already generate the same instruction sequences: https://rust.godbolt.org/z/a3zT5Gezr

dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's
review on apache#10011: there are no `_mm256_*` intrinsics here. The single
probe implementation lives in `Block::{check,insert}`, written in the
vectorizer-friendly shape, and a thin `#[target_feature(enable =
"avx2")]` shim (`simd_x86::sbbf_{check,insert}_hash`) calls it. Because
the shim is compiled with AVX2 enabled, LLVM autovectorizes the plain
Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` —
on a baseline `x86-64` build, with no downstream `target-cpu` flag.
The shim is reached only after a runtime `is_x86_feature_detected!`
check (cached on `Sbbf`); on the scalar fallback path the same source
compiles to SSE2.
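
A minimal sketch of the shim pattern described above, under stated assumptions: `Probe`, `check_avx2_shim` and the free-standing `Block` are hypothetical names, and whether LLVM actually vectorizes the body depends on the compiler version, so this shows the structure only, not guaranteed codegen.

```rust
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

#[repr(C, align(32))]
pub struct Block(pub [u32; 8]);

impl Block {
    /// `#[inline]` so the body can be folded into the shim below.
    #[inline]
    pub fn mask(x: u32) -> [u32; 8] {
        let mut m = [0u32; 8];
        for i in 0..8 {
            m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
        }
        m
    }

    /// Plain Rust, no intrinsics: the single probe implementation.
    #[inline]
    pub fn check(&self, x: u32) -> bool {
        let mask = Self::mask(x);
        let mut acc = 0u32;
        for i in 0..8 {
            acc |= !self.0[i] & mask[i];
        }
        acc == 0
    }
}

/// The shim: same source, but compiled with AVX2 enabled for this function.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn check_avx2_shim(block: &Block, x: u32) -> bool {
    block.check(x)
}

/// Hypothetical holder for the cached runtime detection result.
pub struct Probe {
    has_avx2: bool,
}

impl Probe {
    pub fn new() -> Self {
        #[cfg(target_arch = "x86_64")]
        let has_avx2 = is_x86_feature_detected!("avx2");
        #[cfg(not(target_arch = "x86_64"))]
        let has_avx2 = false;
        Probe { has_avx2 }
    }

    pub fn has_avx2(&self) -> bool {
        self.has_avx2
    }

    pub fn check(&self, block: &Block, x: u32) -> bool {
        #[cfg(target_arch = "x86_64")]
        {
            if self.has_avx2 {
                // SAFETY: `has_avx2` was set from runtime CPU detection.
                return unsafe { check_avx2_shim(block, x) };
            }
        }
        block.check(x)
    }
}
```

The cost of this design is visible in the sketch: one `unsafe` call site, one cached flag, and one branch per probe.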

Two details are load-bearing for the autovectorizer:
- `Block::check` is the branchless integer OR-accumulator
  `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction
  shape), not a short-circuiting `.all()`. The short-circuit form
  defeats vectorization; a bool-`&=` form fails to vectorize through
  the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is
  vectorized with it rather than staying a scalar call.
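
The two shapes contrasted in the first bullet can be sketched side by side. This is an illustration with hypothetical free functions over `[u32; 8]`, not the crate's `Block` methods; both return the same answer, and only the control flow differs.

```rust
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// One bit per lane, selected by the top 5 bits of `x * SALT[i]`.
#[inline]
pub fn mask(x: u32) -> [u32; 8] {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

/// Short-circuit form: may exit after the first lane that misses,
/// which introduces a data-dependent branch per lane.
pub fn check_short_circuit(block: &[u32; 8], x: u32) -> bool {
    let m = mask(x);
    (0..8).all(|i| (block[i] & m[i]) != 0)
}

/// Branchless form: accumulate every mask bit missing from the block,
/// then compare once. There is no early exit, so all lanes are independent.
pub fn check_branchless(block: &[u32; 8], x: u32) -> bool {
    let m = mask(x);
    let mut acc = 0u32;
    for i in 0..8 {
        acc |= !block[i] & m[i];
    }
    acc == 0
}
```

Since each mask word has exactly one bit set, "the lane shares a bit with the mask" and "no mask bit is missing from the lane" are the same condition, which is why the two forms agree.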

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}`
API (XXH64 + probe), criterion default profile, same-session medians,
ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"`
(removed before this commit).

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`
  (no `target-cpu`):

  | Regime    | Path   | Scalar | Autovec (tf-shim) | Speedup |
  |-----------|--------|-------:|------------------:|--------:|
  | S 128 KiB | miss   |  13.02 |              4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |              4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |              5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |              7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |              7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |              8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |             11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |             11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |             12.77 | 1.84x   |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the
AVX2-compiled shim and the baseline-compiled scalar path produce
identical output across 10K random `(blocks, hash)` pairs each
(guarding against an autovectorizer miscompile). All 35 bloom_filter
tests pass with and without `-C target-cpu=native`.
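
A differential test of the kind described can be sketched as below. The harness name `count_mismatches`, the SplitMix64 generator, and the two free-standing check functions are assumptions made for illustration; the tests in the branch may be structured differently.

```rust
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

fn mask(x: u32) -> [u32; 8] {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

/// Candidate implementation (branchless accumulator).
pub fn check_branchless(block: &[u32; 8], x: u32) -> bool {
    let m = mask(x);
    let mut acc = 0u32;
    for i in 0..8 {
        acc |= !block[i] & m[i];
    }
    acc == 0
}

/// Reference implementation (short-circuit).
pub fn check_reference(block: &[u32; 8], x: u32) -> bool {
    let m = mask(x);
    (0..8).all(|i| (block[i] & m[i]) != 0)
}

/// SplitMix64: a small deterministic generator, so failures are reproducible.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

type CheckFn = fn(&[u32; 8], u32) -> bool;

/// Runs both implementations on `pairs` random (block, hash) pairs and
/// returns how many times their answers differ.
pub fn count_mismatches(candidate: CheckFn, reference: CheckFn, pairs: usize, seed: u64) -> usize {
    let mut state = seed;
    let mut mismatches = 0;
    for _ in 0..pairs {
        let mut block = [0u32; 8];
        for word in block.iter_mut() {
            *word = splitmix64(&mut state) as u32;
        }
        let hash = splitmix64(&mut state) as u32;
        if candidate(&block, hash) != reference(&block, hash) {
            mismatches += 1;
        }
    }
    mismatches
}
```

With uniformly random block words, each lane passes with probability one half, so roughly one pair in 256 is a hit and both outcomes get exercised.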
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
… shim

Alternative to the hand-written AVX2 intrinsics, per @jhorstmann's
review on apache#10011: there are no `_mm256_*` intrinsics here. The single
probe implementation lives in `Block::{check,insert}`, written in the
vectorizer-friendly shape, and a thin `#[target_feature(enable =
"avx2")]` shim (`avx2::{check,insert}_hash`) calls it. Because the
shim is compiled with AVX2 enabled, LLVM autovectorizes the plain
Rust body to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` —
on a baseline `x86-64` build, with no downstream `target-cpu` flag.
The shim is reached only after a runtime `is_x86_feature_detected!`
check (cached on `Sbbf`); on the scalar fallback path the same source
compiles to SSE2.

Two details are load-bearing for the autovectorizer:
- `Block::check` is the branchless integer OR-accumulator
  `acc |= !block[i] & mask[i]; acc == 0` (the "testc" reduction
  shape), not a short-circuiting `.all()`. The short-circuit form
  defeats vectorization; a bool-`&=` form fails to vectorize through
  the target_feature shim on a baseline build.
- `Block::mask` is `#[inline]` so it folds into the shim and is
  vectorized with it rather than staying a scalar call.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the autovectorized 256-bit load/store hits one cache line.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}`
API (XXH64 + probe), criterion default profile, same-session medians,
ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"`
(removed before this commit).

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`
  (no `target-cpu`):

  | Regime    | Path   | Scalar | Autovec (avx2 shim) | Speedup |
  |-----------|--------|-------:|--------------------:|--------:|
  | S 128 KiB | miss   |  13.02 |                4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |                4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |                5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |                7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |                7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |                8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |               11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |               11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |               12.77 | 1.84x   |

Tests: the two `test_simd_*_matches_scalar` diff tests assert the
AVX2-compiled shim and the baseline-compiled scalar path produce
identical output across 10K random `(blocks, hash)` pairs each
(guarding against an autovectorizer miscompile). All 35 bloom_filter
tests pass with and without `-C target-cpu=native`.
@dmatth1
Contributor Author

dmatth1 commented May 23, 2026

Great callout. I benchmarked it, and autovectorization matches or beats the hand-written kernel:

Same-host, same-session medians (Cascade Lake-class Xeon @ 2.8 GHz), via the public Sbbf::{check,insert} API (XXH64 + probe), criterion default profile, ns/op:

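| Regime    | Path   | Scalar | Autovec (this) | Hand-written AVX2 | Autovec vs scalar | Autovec vs hand-written |
|-----------|--------|-------:|---------------:|------------------:|------------------:|------------------------:|
| S 128 KiB | miss   |  13.02 |           4.96 |              5.14 |             2.62× |                     +4% |
| S 128 KiB | hit    |  13.47 |           4.95 |              5.20 |             2.72× |                     +5% |
| S 128 KiB | insert |  11.62 |           5.41 |              5.38 |             2.15× |                    tied |
| M 2 MiB   | miss   |  18.88 |           7.47 |              8.18 |             2.53× |                     +9% |
| M 2 MiB   | hit    |  18.12 |           7.22 |              8.01 |             2.51× |                    +11% |
| M 2 MiB   | insert |  14.99 |           8.45 |              8.59 |             1.77× |                    tied |
| L 32 MiB  | miss   |  27.56 |          11.07 |             13.47 |             2.49× |                    +18% |
| L 32 MiB  | hit    |  26.57 |          11.23 |             13.40 |             2.37× |                    +16% |
| L 32 MiB  | insert |  23.53 |          12.77 |             12.60 |             1.84× |                    tied |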

Changes here: main...dmatth1:arrow-rs:sbbf-autovec-tf
This also unlocks NEON on aarch64, so I will bundle those numbers into the next revision.

dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
… shim

Per @jhorstmann's review on apache#10011: no `_mm256_*` intrinsics. The single
probe implementation lives in `Block::{check,insert}` and a thin
`#[target_feature(enable = "avx2")]` shim calls into it. Because the
shim is compiled with AVX2 on, LLVM autovectorizes the plain Rust body
to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest` — on a baseline
`x86-64` build, no `target-cpu` flag required. The shim is reached
after a runtime `is_x86_feature_detected!` check (cached on `Sbbf`).

Two preconditions for autovec:
- `Block::check` is the branchless `acc |= !block & mask; acc == 0`
  ("testc" reduction shape); a short-circuiting `.all()` defeats
  vectorization.
- `Block::mask` is `#[inline]` so it folds into the shim.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the 256-bit load/store hits one cache line.

The same branchless `Block::check` also autovectorizes to NEON on
aarch64 — no shim, no `target_feature` needed (NEON is baseline). On
main, the short-circuit form left aarch64 fully scalar.

A/B vs the scalar fallback through the public `Sbbf::{check,insert}`
API (XXH64 + probe), criterion default profile, same-session medians,
ns/op. Scalar baseline via `RUSTFLAGS="--cfg sbbf_scalar_baseline"`
(removed before this commit).

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, default `cargo build`:

  | Regime    | Path   | Scalar | Autovec (avx2 shim) | Speedup |
  |-----------|--------|-------:|--------------------:|--------:|
  | S 128 KiB | miss   |  13.02 |                4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |                4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |                5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |                7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |                7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |                8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |               11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |               11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |               12.77 | 1.84x   |

  aarch64 — Apple Silicon M1:

  | Regime    | Path   | Scalar | Autovec (NEON) | Speedup |
  |-----------|--------|-------:|---------------:|--------:|
  | S 128 KiB | miss   |   4.61 |           3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |           3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |           3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |           3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |           3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |           3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |           5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |           5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |           5.38 | 0.96x   |

  Insert is ~tied on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`, where the
  branchless form unlocks NEON autovec.

Tests: `test_simd_{check,insert}_matches_scalar` diff the AVX2 shim
against the baseline-compiled scalar across 10K random pairs;
`test_check_matches_reference_aarch64` diffs the autovec'd check
against an inline short-circuit reference for the aarch64 codegen
path. All bloom_filter tests pass with and without `-C target-cpu=native`.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
Per @jhorstmann's review on apache#10011: no hand-written `_mm256_*` / NEON
intrinsics, no runtime dispatch, no `target_feature` shim. `Block::check`
is rewritten in the vectorizer-friendly branchless shape and LLVM
autovectorizes it directly to whatever SIMD ISA is enabled at compile
time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is
  mandatory baseline, so the default build autovectorizes to the NEON
  operations behind the intrinsic families
  `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `=native`, or `+avx2`):
  autovectorizes to `vpmulld + vpsrld + vpsllvd + vpandn + vpor + ptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only):
  partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the
  per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features
  allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set
`RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). This is already the
convention for analytical Rust binaries (Polars, DataFusion, Databend
distros). A runtime AVX2-detect shim was prototyped and rejected for
this PR — it adds `unsafe`, a per-`Sbbf` cached bool, and a dispatch
branch in the hot path, in exchange for AVX2 codegen on default-built
binaries running on AVX2 hardware. The simplification was preferred.
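
Because this version has no runtime dispatch, the SIMD level is fixed when the crate is compiled. A minimal sketch of how a binary could report which level it was built with, using only the standard `cfg!` macro (the function name and the returned labels are hypothetical):

```rust
/// Reports the SIMD level selected at compile time.
/// Unlike `is_x86_feature_detected!`, `cfg!(target_feature = ...)` reflects
/// the build flags (for example `-C target-cpu=x86-64-v3`), not the host CPU.
pub fn compiled_simd_level() -> &'static str {
    if cfg!(all(target_arch = "x86_64", target_feature = "avx2")) {
        "x86_64 + AVX2"
    } else if cfg!(target_arch = "x86_64") {
        "x86_64 baseline (SSE2)"
    } else if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
        "aarch64 + NEON"
    } else {
        "other / scalar"
    }
}
```

A default `cargo build` on x86_64 would be expected to report the baseline label, while the same source built with `RUSTFLAGS="-C target-cpu=x86-64-v3"` would report AVX2.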

Two preconditions for autovec:
- `Block::check` is the branchless `acc |= !block & mask; acc == 0`
  ("testc" reduction shape); a short-circuiting `.all()` defeats
  vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the 256-bit load/store hits one cache line.

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile,
same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert is ~tied on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`, where the
  branchless form unlocks NEON autovec.
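
  For context, an insert loop of the shape referred to here can be sketched
  as follows. The free functions over `[u32; 8]` are hypothetical stand-ins
  for `Block::insert` and `Block::check`; the point is that insert is a
  straight per-lane OR with no early exit.

```rust
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

#[inline]
fn mask(x: u32) -> [u32; 8] {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

/// Insert: OR one mask bit into each of the eight lanes.
/// Every iteration is independent and unconditional.
pub fn insert(block: &mut [u32; 8], x: u32) {
    let m = mask(x);
    for i in 0..8 {
        block[i] |= m[i];
    }
}

/// Branchless check, included so the round trip can be exercised.
pub fn check(block: &[u32; 8], x: u32) -> bool {
    let m = mask(x);
    let mut acc = 0u32;
    for i in 0..8 {
        acc |= !block[i] & m[i];
    }
    acc == 0
}
```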

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check`
against an inline short-circuit reference across 10K random pairs on
every target the crate is built for. All bloom_filter tests pass.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no
target_feature shim, no runtime dispatch. `Block::check` is rewritten
as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction
shape) and LLVM autovectorizes it directly to NEON on aarch64 and to
AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`,
or `+avx2`). A runtime AVX2-detect shim was prototyped and rejected:
the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch)
beat the only thing it bought, which was AVX2 codegen for default-
built binaries on AVX2 hardware — production deployments that care
already set the target-cpu flag.

Preconditions: `Block::mask` is `#[inline]` (folds into the call
site) and `Block` is `#[repr(C, align(32))]` with size/align
asserted (so the 256-bit load/store hits one cache line).

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default
profile, same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz,
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert ties on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`.

Tests: `test_check_matches_reference` diffs the autovec'd
`Block::check` against an inline short-circuit reference across 10K
random pairs, on every target. All bloom_filter tests pass.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
`Sbbf::{check,insert}` are on the hot path of Parquet row-group
skipping for every reader downstream of `arrow-rs` (DataFusion,
Databend, InfluxDB / IOx, RisingWave, GreptimeDB). Each 256-bit
Parquet block is exactly one AVX2 vector / two NEON `uint32x4_t`
halves; the K=8 lane test is a one-instruction `vptest` on AVX2 and
an equivalent SIMD reduce on NEON. This PR vectorises the probe
without changing the algorithm, hash, salts, or wire format.
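
To place the block probe in context, here is a minimal sketch of a whole
filter operating on an already-computed 64-bit hash. It assumes the scheme
in the Parquet specification: the upper 32 bits pick the block by
multiply-shift, the lower 32 bits drive the per-lane mask. The type and
method names are hypothetical, and XXH64 hashing is left out.

```rust
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

fn mask(x: u32) -> [u32; 8] {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

/// Hypothetical minimal filter: a vector of 256-bit blocks.
pub struct Sbbf {
    blocks: Vec<[u32; 8]>,
}

impl Sbbf {
    pub fn new(num_blocks: usize) -> Self {
        assert!(num_blocks > 0, "filter needs at least one block");
        Sbbf { blocks: vec![[0u32; 8]; num_blocks] }
    }

    /// Maps the upper 32 bits of the hash onto `0..num_blocks` without a modulo.
    pub fn block_index(&self, hash: u64) -> usize {
        ((hash >> 32).saturating_mul(self.blocks.len() as u64) >> 32) as usize
    }

    /// Sets one bit in each lane of the selected block.
    pub fn insert_hash(&mut self, hash: u64) {
        let idx = self.block_index(hash);
        let m = mask(hash as u32);
        for i in 0..8 {
            self.blocks[idx][i] |= m[i];
        }
    }

    /// Branchless probe of the selected block.
    pub fn check_hash(&self, hash: u64) -> bool {
        let block = &self.blocks[self.block_index(hash)];
        let m = mask(hash as u32);
        let mut acc = 0u32;
        for i in 0..8 {
            acc |= !block[i] & m[i];
        }
        acc == 0
    }
}
```

Each probe touches exactly one 32-byte block, which is why the per-block
check dominates once the hash is computed.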

Per @jhorstmann's review on apache#10011: no hand-written intrinsics, no
target_feature shim, no runtime dispatch. `Block::check` is rewritten
as the branchless `acc |= !block & mask; acc == 0` ("testc" reduction
shape) and LLVM autovectorizes it directly to NEON on aarch64 and to
AVX2 on x86_64 built with `-C target-cpu=x86-64-v3` (or `=native`,
or `+avx2`). A runtime AVX2-detect shim was prototyped and rejected:
the simplification (no `unsafe`, no `Sbbf` field, no hot-path branch)
beat the only thing it bought, which was AVX2 codegen for default-
built binaries on AVX2 hardware — production deployments that care
already set the target-cpu flag.

Preconditions: `Block::mask` is `#[inline]` (folds into the call
site) and `Block` is `#[repr(C, align(32))]` with size/align
asserted (so the 256-bit load/store hits one cache line).

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default
profile, same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz,
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert ties on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`.

Tests: `test_check_matches_reference` diffs the autovec'd
`Block::check` against an inline short-circuit reference across 10K
random pairs, on every target. All bloom_filter tests pass.
dmatth1 added a commit to dmatth1/arrow-rs that referenced this pull request May 23, 2026
Each 256-bit Parquet block is exactly one AVX2 vector; the K=8 lane
test collapses to one `vptest` (`_mm256_testc_si256`). This PR
vectorises that loop without changing the algorithm, hash, salts, or
wire format.

Per @jhorstmann's review on apache#10011: `Block::check` is rewritten in the
vectorizer-friendly branchless shape and LLVM autovectorizes it
directly to whatever SIMD ISA is enabled at compile time:

- aarch64 (Apple Silicon, Graviton 2/3/4, Ampere, Cobalt): NEON is
  mandatory baseline, so the default build autovectorizes to
  `vmulq + vshrq + vshlq + vbicq + vorrq + vmaxvq`.
- x86_64 with `-C target-cpu=x86-64-v3` (or `-C target-cpu=native`, or
  `-C target-feature=+avx2`): autovectorizes to
  `vpmulld + vpsrld + vpsllvd + vpandn + vpor + vptest`.
- Default `cargo build` on x86_64 (baseline `x86-64`, SSE2 only):
  partial SSE2 autovec — `vpsllvd` doesn't exist pre-AVX2, so the
  per-lane variable shift in the mask compute partly scalarizes.
- wasm32, RISC-V, 32-bit: whatever the toolchain's target features
  allow; falls back to scalar otherwise.

Production deployments that care about x86 SBBF perf should set
`RUSTFLAGS="-C target-cpu=x86-64-v3"` (or higher). A runtime
AVX2-detect shim was prototyped but I prefer this simplification.
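
As a rough way to confirm which of the cases above a given build falls into, the compile-time target features can be inspected with `cfg!`. The following is a minimal sketch, not code from this PR; the function name `compiled_simd_level` is made up for illustration.

```rust
/// Reports which SIMD baseline this binary was compiled for.
/// Hypothetical helper: it only reflects compile-time target features,
/// not what the CPU it runs on actually supports.
fn compiled_simd_level() -> &'static str {
    if cfg!(all(target_arch = "x86_64", target_feature = "avx2")) {
        // Built with x86-64-v3, target-cpu=native on an AVX2 host, or +avx2.
        "x86_64 with AVX2"
    } else if cfg!(target_arch = "x86_64") {
        // Default x86_64 build: SSE2 baseline only.
        "x86_64 baseline (SSE2)"
    } else if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
        // NEON is part of the default aarch64 baseline.
        "aarch64 with NEON"
    } else {
        "other target"
    }
}
```

Printing this once at startup is one way to catch a deployment that was built without the intended `RUSTFLAGS`.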

Two preconditions for autovec:
- `Block::check` is the branchless `acc |= !block & mask; acc == 0`
  ("testc" reduction shape); a short-circuiting `.all()` defeats
  vectorization.
- `Block::mask` is `#[inline]` so it folds into the call site.
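
The two shapes can be sketched side by side as follows. This is an illustrative reconstruction, not the code in this PR: the function names and the bare `[u32; 8]` block type are assumptions, and the salts are the eight constants from the Parquet split-block Bloom filter spec.

```rust
/// Salts from the Parquet split-block Bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// One bit per 32-bit lane, selected by the top 5 bits of `x * SALT[i]`.
#[inline]
fn mask(x: u32) -> [u32; 8] {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1u32 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

/// Short-circuiting reference: stops at the first lane whose bit is missing.
fn check_short_circuit(block: &[u32; 8], x: u32) -> bool {
    let m = mask(x);
    (0..8).all(|i| block[i] & m[i] != 0)
}

/// Branchless form: OR together every mask bit that is absent from the
/// block, then compare once. There is no early exit, so all eight lanes
/// are independent and the loop is a candidate for autovectorization.
fn check_branchless(block: &[u32; 8], x: u32) -> bool {
    let m = mask(x);
    let mut acc = 0u32;
    for i in 0..8 {
        acc |= !block[i] & m[i];
    }
    acc == 0
}

/// Sets the eight mask bits in the block.
fn insert(block: &mut [u32; 8], x: u32) {
    let m = mask(x);
    for i in 0..8 {
        block[i] |= m[i];
    }
}
```

Both functions return the same answer for every input; only the control flow differs. Whether the compiler actually vectorizes the branchless loop still depends on the enabled target features.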

`Block` is `#[repr(C, align(32))]` (size/align asserted at module
scope) so the 256-bit load/store hits one cache line.
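
A minimal sketch of such a layout with compile-time assertions is shown below. The struct mirrors the description above, but the exact derives and the form of the assertions are assumptions rather than the PR's code.

```rust
use std::mem::{align_of, size_of};

/// Eight 32-bit lanes: 32 bytes total, aligned to a 32-byte boundary.
#[repr(C, align(32))]
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Block([u32; 8]);

// Evaluated at compile time: the build fails if either property changes.
const _: () = assert!(size_of::<Block>() == 32);
const _: () = assert!(align_of::<Block>() == 32);
```

Because a 32-byte object on a 32-byte boundary can never straddle a 64-byte boundary, each block sits entirely within one 64-byte cache line.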

A/B vs scalar (short-circuit `Block::check`) through the public
`Sbbf::{check,insert}` API (XXH64 + probe), criterion default profile,
same-session medians, ns/op.

  x86_64 — Cascade Lake-class Xeon @ 2.8 GHz, built with
  `-C target-cpu=x86-64-v3`:

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |  13.02 |    4.96 | 2.62x   |
  | S 128 KiB | hit    |  13.47 |    4.95 | 2.72x   |
  | S 128 KiB | insert |  11.62 |    5.41 | 2.15x   |
  | M 2 MiB   | miss   |  18.88 |    7.47 | 2.53x   |
  | M 2 MiB   | hit    |  18.12 |    7.22 | 2.51x   |
  | M 2 MiB   | insert |  14.99 |    8.45 | 1.77x   |
  | L 32 MiB  | miss   |  27.56 |   11.07 | 2.49x   |
  | L 32 MiB  | hit    |  26.57 |   11.23 | 2.37x   |
  | L 32 MiB  | insert |  23.53 |   12.77 | 1.84x   |

  aarch64 — Apple Silicon M1 (NEON via baseline autovec, default build):

  | Regime    | Path   | Scalar | Autovec | Speedup |
  |-----------|--------|-------:|--------:|--------:|
  | S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
  | S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
  | S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
  | M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
  | M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
  | M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
  | L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
  | L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
  | L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |

  Insert is ~tied on aarch64 because main's `Block::insert` was already
  vectorizer-friendly. The PR's aarch64 win lives in `check`, where the
  branchless form unlocks NEON autovec.

Tests: `test_check_matches_reference` diffs the autovec'd `Block::check`
against an inline short-circuit reference across 10K random pairs on
every target the crate is built for. All bloom_filter tests pass.
@dmatth1
Contributor Author

dmatth1 commented May 23, 2026

Tested locally on aarch64 too (Apple Silicon M1, baseline NEON autovec):

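| Regime    | Path   | Scalar | Autovec | Speedup |
|-----------|--------|-------:|--------:|--------:|
| S 128 KiB | miss   |   4.61 |    3.24 | 1.42x   |
| S 128 KiB | hit    |   6.84 |    3.17 | 2.16x   |
| S 128 KiB | insert |   3.25 |    3.19 | 1.02x   |
| M 2 MiB   | miss   |   5.20 |    3.24 | 1.61x   |
| M 2 MiB   | hit    |   7.16 |    3.26 | 2.20x   |
| M 2 MiB   | insert |   3.34 |    3.31 | 1.01x   |
| L 32 MiB  | miss   |   6.66 |    5.42 | 1.23x   |
| L 32 MiB  | hit    |   9.72 |    5.25 | 1.85x   |
| L 32 MiB  | insert |   5.19 |    5.38 | 0.96x   |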

This is a big simplification. I included details about how autovec lowers the loop to SIMD instructions in the new commit message, and I'm going to force-push to use this approach.

One thing beyond your suggestion: I prototyped a runtime AVX2-detect shim and dropped it in favor of the simpler version (no `unsafe`, no extra `Sbbf` field, no hot-path branch), since users who care about AVX2 probably already set `-C target-cpu=...`.
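
For context, a runtime-dispatch shim of the kind mentioned above generally looks something like the sketch below. This is not the prototype that was dropped; `detect_avx2`, `Probe`, and `path` are hypothetical names, and the only real API used is `std::arch::is_x86_feature_detected!`. It shows the extra field and the per-call branch that the autovectorized version avoids.

```rust
/// Runtime CPU feature detection (x86_64 only).
#[cfg(target_arch = "x86_64")]
fn detect_avx2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// On every other architecture there is no AVX2 to detect.
#[cfg(not(target_arch = "x86_64"))]
fn detect_avx2() -> bool {
    false
}

/// Hypothetical filter wrapper that caches the detection result.
struct Probe {
    /// Extra state carried by every filter instance.
    use_avx2: bool,
}

impl Probe {
    fn new() -> Self {
        Self { use_avx2: detect_avx2() }
    }

    /// Every probe pays for this branch before reaching a kernel.
    /// A real shim would call an `unsafe` AVX2 kernel in the first arm.
    fn path(&self) -> &'static str {
        if self.use_avx2 { "avx2" } else { "scalar" }
    }
}
```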

Contributor

@etseidl etseidl left a comment

Thanks @dmatth1, this looks interesting. On my older Alder Lake CPU this does lead to quite a regression. Can we feature gate this?

group                            no_vec                                  vectorized
-----                            ------                                  ----------
check/hit/l_32MiB                1.00    394.6±6.37µs 120.8 MElem/sec    1.27    501.4±7.02µs 95.1 MElem/sec
check/hit/m_2MiB                 1.00    399.2±6.42µs 119.4 MElem/sec    1.32    525.1±4.28µs 90.8 MElem/sec
check/hit/s_128KiB               1.00    341.9±2.18µs 139.5 MElem/sec    1.27    434.9±2.58µs 109.7 MElem/sec
check/miss/l_32MiB               1.00    376.6±3.07µs 126.6 MElem/sec    1.33    500.0±5.55µs 95.4 MElem/sec
check/miss/m_2MiB                1.00    373.3±2.89µs 127.7 MElem/sec    1.41    526.0±3.99µs 90.6 MElem/sec
check/miss/s_128KiB              1.00    322.9±6.76µs 147.7 MElem/sec    1.35    436.1±2.63µs 109.3 MElem/sec

Edit: I'm dumb...forgot to compile with target-cpu=native 😅

group                            native                                  no_vec
-----                            ------                                  ------
check/hit/l_32MiB                1.00    200.2±2.79µs 238.1 MElem/sec    1.97    394.6±6.37µs 120.8 MElem/sec
check/hit/m_2MiB                 1.00    198.8±1.85µs 239.8 MElem/sec    2.01    399.2±6.42µs 119.4 MElem/sec
check/hit/s_128KiB               1.00    150.0±1.04µs 317.8 MElem/sec    2.28    341.9±2.18µs 139.5 MElem/sec
check/miss/l_32MiB               1.00    199.9±3.97µs 238.6 MElem/sec    1.88    376.6±3.07µs 126.6 MElem/sec
check/miss/m_2MiB                1.00    200.1±1.78µs 238.3 MElem/sec    1.87    373.3±2.89µs 127.7 MElem/sec
check/miss/s_128KiB              1.00    150.0±0.91µs 317.8 MElem/sec    2.15    322.9±6.76µs 147.7 MElem/sec
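
The ratio columns follow directly from the reported times, and throughput converts to per-probe latency as 1000 divided by MElem/sec, in nanoseconds. A small sketch of that arithmetic, with helper names made up for illustration:

```rust
/// Converts a criterion throughput in MElem/sec to nanoseconds per element.
fn ns_per_element(melem_per_sec: f64) -> f64 {
    1_000.0 / melem_per_sec
}

/// Ratio of two timings in the same unit; values above 1.0 mean the
/// candidate is faster than the baseline.
fn speedup(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}
```

For `check/hit/l_32MiB`, `speedup(394.6, 200.2)` reproduces the 1.97 in the table, and `ns_per_element(238.1)` gives roughly 4.2 ns per probe for the native build.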

}
}

impl std::ops::Index<usize> for Block {
Contributor

Why are these impls removed? I think that makes this a breaking API change.

Contributor Author

`Block` is a private type and is only used here, so I don't think so. I ran `cargo public-api diff` for this branch against main and didn't see any differences.

Contributor

Right you are. Had Sbbf in my head 😅

group.finish();
}

/// Benchmark `Sbbf::insert` across the same three cache regimes as
Contributor

It would be nice for the bench changes to be a separate PR.

Contributor Author

Mainly for conciseness? Or should we push the bench changes first and this Sbbf probe change second, so that it's easy to compare? Otherwise I'd lean towards keeping it in here.

Contributor

> Or should we push the bench changes first, then this Sbbf probe change second that way it's easy to compare?

This (see the contributing guide). Thanks!

Contributor Author

Done! See #10041

@dmatth1 dmatth1 changed the title parquet: SIMD-accelerate Sbbf probe (AVX2, scalar fallback) parquet: SIMD-accelerate Sbbf probe via autovectorization May 29, 2026
etseidl pushed a commit that referenced this pull request May 31, 2026
Adds `bench_check` and `bench_insert` benchmarks for
`Sbbf::{check,insert}`. The benchmarks were originally part of #10011 but
were split out to follow the contributing guidelines.

# Are these changes tested?

Benchmarks compiled and ran using `cargo bench -p parquet --bench
bloom_filter`.

# Are there any user-facing changes?

No.
@etseidl
Contributor

etseidl commented May 31, 2026

run benchmark bloom_filter

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4585582765-382-4bwhf 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing sbbf-simd (5ad8cc0) to 511ad06 (merge-base) diff
BENCH_NAME=bloom_filter
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench bloom_filter
BENCH_FILTER=
Results will be posted here when complete


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


group                            main                                    sbbf-simd
-----                            ----                                    ---------
check/hit/l_32MiB                1.38    522.5±2.51µs 91.3 MElem/sec     1.00    378.1±6.45µs 126.1 MElem/sec
check/hit/m_2MiB                 1.57    475.9±5.96µs 100.2 MElem/sec    1.00    303.1±2.72µs 157.3 MElem/sec
check/hit/s_128KiB               1.63    392.2±0.66µs 121.6 MElem/sec    1.00    240.8±0.88µs 198.1 MElem/sec
check/miss/l_32MiB               1.00    360.1±6.16µs 132.4 MElem/sec    1.04    373.8±1.11µs 127.6 MElem/sec
check/miss/m_2MiB                1.04    322.7±2.72µs 147.8 MElem/sec    1.00    310.9±4.59µs 153.4 MElem/sec
check/miss/s_128KiB              1.00    231.0±1.15µs 206.5 MElem/sec    1.04    241.3±0.66µs 197.6 MElem/sec
fold_to_target_fpp/ndv/1000      1.00     71.1±1.55µs 439.7 MElem/sec    1.00     71.0±1.37µs 440.3 MElem/sec
fold_to_target_fpp/ndv/10000     1.00     65.9±1.82µs 474.2 MElem/sec    1.07    70.2±11.00µs 444.9 MElem/sec
fold_to_target_fpp/ndv/100000    1.10    78.5±15.79µs 398.3 MElem/sec    1.00     71.2±8.88µs 438.7 MElem/sec
insert/ins/l_32MiB               1.08    385.1±2.08µs 123.8 MElem/sec    1.00    358.0±6.11µs 133.2 MElem/sec
insert/ins/m_2MiB                1.04    309.0±2.26µs 154.3 MElem/sec    1.00    295.7±4.41µs 161.2 MElem/sec
insert/ins/s_128KiB              1.01    227.5±0.65µs 209.6 MElem/sec    1.00    226.3±0.46µs 210.7 MElem/sec
insert_and_fold/values/1000      1.00     83.4±1.84µs 11.4 MElem/sec     1.06     88.0±0.18µs 10.8 MElem/sec
insert_and_fold/values/10000     1.04    109.4±0.77µs 87.2 MElem/sec     1.00    104.9±3.73µs 90.9 MElem/sec
insert_and_fold/values/100000    1.00    352.9±2.59µs 270.2 MElem/sec    1.00    351.7±2.22µs 271.1 MElem/sec
insert_only/values/1000          1.00     13.6±0.02µs 70.2 MElem/sec     1.00     13.7±0.01µs 69.8 MElem/sec
insert_only/values/10000         1.07     39.5±0.37µs 241.7 MElem/sec    1.00     37.0±0.12µs 257.5 MElem/sec
insert_only/values/100000        1.00    286.5±1.27µs 332.9 MElem/sec    1.00    285.4±1.97µs 334.1 MElem/sec

Resource Usage

| Metric      | base (merge-base) | branch  |
|-------------|------------------:|--------:|
| Wall time   |            185.0s |  185.0s |
| Peak memory |           4.6 GiB | 4.5 GiB |
| Avg memory  |           4.2 GiB | 4.2 GiB |
| CPU user    |            165.6s |  164.1s |
| CPU sys     |             16.6s |   18.8s |
| Peak spill  |               0 B |     0 B |

@etseidl
Contributor

etseidl commented May 31, 2026

run benchmark bloom_filter

env:
  RUSTFLAGS: -Ctarget-cpu=native

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4585724751-383-nm4ns 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux


Comparing sbbf-simd (5ad8cc0) to 511ad06 (merge-base) diff
BENCH_NAME=bloom_filter
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench bloom_filter
BENCH_FILTER=
Results will be posted here when complete


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


group                            main                                    sbbf-simd
-----                            ----                                    ---------
check/hit/l_32MiB                1.34    521.9±2.11µs 91.4 MElem/sec     1.00    388.6±1.17µs 122.7 MElem/sec
check/hit/m_2MiB                 1.45    484.4±4.52µs 98.4 MElem/sec     1.00    333.9±2.38µs 142.8 MElem/sec
check/hit/s_128KiB               1.62    393.5±0.59µs 121.2 MElem/sec    1.00    242.8±0.13µs 196.4 MElem/sec
check/miss/l_32MiB               1.00    349.2±1.04µs 136.5 MElem/sec    1.09    381.4±2.76µs 125.0 MElem/sec
check/miss/m_2MiB                1.00    316.3±2.91µs 150.8 MElem/sec    1.10    348.6±5.68µs 136.8 MElem/sec
check/miss/s_128KiB              1.00    229.8±0.51µs 207.5 MElem/sec    1.06    243.9±0.12µs 195.5 MElem/sec
fold_to_target_fpp/ndv/1000      1.00     59.3±1.84µs 527.4 MElem/sec    1.03     60.7±1.70µs 514.5 MElem/sec
fold_to_target_fpp/ndv/10000     1.03     55.5±3.36µs 563.3 MElem/sec    1.00     53.9±1.15µs 580.1 MElem/sec
fold_to_target_fpp/ndv/100000    1.00     53.2±1.41µs 587.1 MElem/sec    1.00     53.1±1.25µs 588.3 MElem/sec
insert/ins/l_32MiB               1.03    377.9±1.35µs 126.2 MElem/sec    1.00    365.1±0.94µs 130.6 MElem/sec
insert/ins/m_2MiB                1.00    311.6±4.80µs 153.1 MElem/sec    1.03    321.5±1.97µs 148.3 MElem/sec
insert/ins/s_128KiB              1.01    225.5±0.14µs 211.4 MElem/sec    1.00    224.1±0.16µs 212.8 MElem/sec
insert_and_fold/values/1000      1.00     69.3±0.14µs 13.8 MElem/sec     1.02     70.8±0.13µs 13.5 MElem/sec
insert_and_fold/values/10000     1.00     87.1±0.15µs 109.5 MElem/sec    1.09     95.0±0.35µs 100.4 MElem/sec
insert_and_fold/values/100000    1.00    316.4±0.56µs 301.4 MElem/sec    1.27    403.0±1.44µs 236.7 MElem/sec
insert_only/values/1000          1.00     13.6±0.01µs 70.3 MElem/sec     1.01     13.7±0.01µs 69.8 MElem/sec
insert_only/values/10000         1.00     36.7±0.03µs 260.1 MElem/sec    1.02     37.3±0.18µs 255.6 MElem/sec
insert_only/values/100000        1.00    268.2±0.38µs 355.6 MElem/sec    1.10    296.3±2.71µs 321.8 MElem/sec

Resource Usage

| Metric      | base (merge-base) | branch  |
|-------------|------------------:|--------:|
| Wall time   |            185.0s |  180.0s |
| Peak memory |           4.6 GiB | 4.5 GiB |
| Avg memory  |           4.2 GiB | 4.1 GiB |
| CPU user    |            164.3s |  160.7s |
| CPU sys     |             16.1s |   18.4s |
| Peak spill  |               0 B |     0 B |

Labels: parquet (Changes to the parquet crate), performance
