IN LIST: reinterpret FixedSizeBinary for primitive fast paths #23018
geoffreyclaude wants to merge 11 commits
run benchmark in_list_strategy

🤖 Benchmark completed (GKE). Compared perf/in_list_fixed_size_binary_filter (098e0a6) to c7e9284 (merge-base) using in_list_strategy.

run benchmark in_list

🤖 Benchmark completed (GKE). Compared perf/in_list_fixed_size_binary_filter (098e0a6) to c7e9284 (merge-base) using in_list.
Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
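As a rough illustration of the bitmap idea, here is a minimal sketch. The type name `U8Bitmap` and its methods are hypothetical and are not taken from the PR; the only details drawn from the description are the 32-byte size and the bit-shift membership test.

```rust
/// Hypothetical 256-bit membership bitmap for `u8` values.
/// Four `u64` words give exactly 32 bytes, so the whole set lives on the stack.
#[derive(Clone, Copy, Default)]
struct U8Bitmap {
    words: [u64; 4],
}

impl U8Bitmap {
    /// Sets one bit per distinct value in the IN list.
    fn from_values(values: &[u8]) -> Self {
        let mut bitmap = Self::default();
        for &v in values {
            // The upper 2 bits select the word, the lower 6 bits select the bit.
            bitmap.words[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        bitmap
    }

    /// O(1) membership test: one index, one shift, one mask.
    #[inline]
    fn contains(&self, v: u8) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }
}
```

The same layout scales to UInt16 by using 1024 words (8 KB), which is presumably why that variant moves to the heap.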
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
Introduces zero-copy buffer reinterpretation to allow signed integers and other 1 or 2-byte primitive types (e.g. Float16) to use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
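For readers unfamiliar with the trick, a minimal sketch of zero-copy reinterpretation for one signed type follows. The helper name `as_unsigned` is hypothetical; the PR operates on Arrow buffers rather than plain slices, so this only shows the underlying idea.

```rust
/// Hypothetical helper: views a slice of `i16` as `u16` without copying.
/// Membership tests only need bitwise equality, so the sign is irrelevant.
fn as_unsigned(values: &[i16]) -> &[u16] {
    // SAFETY: `i16` and `u16` have identical size and alignment, and every
    // bit pattern is valid for both, so the same memory can be read as either.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u16>(), values.len()) }
}
```

After this step a value such as `-1i16` is looked up as `u16::MAX`, and the needles must be reinterpreted the same way so both sides agree.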
Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
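A minimal sketch of what a const-generic comparison chain can look like is shown below. `BranchlessFilter` is a hypothetical name, and whether the loop is actually unrolled into branch-free code is up to the compiler; the sketch only shows the shape that makes it possible.

```rust
/// Hypothetical filter over a compile-time-known number of needles.
struct BranchlessFilter<const N: usize> {
    needles: [u32; N],
}

impl<const N: usize> BranchlessFilter<N> {
    #[inline]
    fn contains(&self, v: u32) -> bool {
        // `|=` does not short-circuit, so every comparison is evaluated and
        // the results are OR-ed together instead of branching on each one.
        // Because N is a constant, the compiler is free to unroll the loop.
        let mut hit = false;
        for i in 0..N {
            hit |= self.needles[i] == v;
        }
        hit
    }
}
```

For a handful of needles this does a fixed amount of work per probed value, which is the property the description credits for beating hash lookups on very small lists.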
Implements a fast hash table using open addressing with linear probing and a 25% load factor. Replaces the legacy HashSet for primitives, reducing indirection. Triggers for primitives when list size exceeds branchless thresholds.
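The following is a minimal sketch of open addressing with linear probing at a 25% load factor. `ProbeSet`, its fields, and the multiplicative hash constant are assumptions made for illustration; the PR's actual filter may hash and store values differently.

```rust
/// Hypothetical open-addressing set for `u64` keys.
struct ProbeSet {
    slots: Vec<Option<u64>>,
    mask: usize,
}

impl ProbeSet {
    fn new(values: &[u64]) -> Self {
        // 25% load factor: at least four slots per value, rounded up to a
        // power of two so the index can be computed with a mask.
        let capacity = (values.len().max(1) * 4).next_power_of_two();
        let mut set = Self { slots: vec![None; capacity], mask: capacity - 1 };
        for &v in values {
            set.insert(v);
        }
        set
    }

    /// Multiplicative hash (an assumption; any well-mixing hash would do).
    fn slot_for(&self, v: u64) -> usize {
        (v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize & self.mask
    }

    fn insert(&mut self, v: u64) {
        let mut i = self.slot_for(v);
        loop {
            let slot = self.slots[i];
            match slot {
                Some(existing) if existing == v => return, // duplicate needle
                None => {
                    self.slots[i] = Some(v);
                    return;
                }
                Some(_) => i = (i + 1) & self.mask, // linear probe to next slot
            }
        }
    }

    fn contains(&self, v: u64) -> bool {
        let mut i = self.slot_for(v);
        loop {
            let slot = self.slots[i];
            match slot {
                Some(existing) if existing == v => return true,
                None => return false, // an empty slot ends the probe sequence
                Some(_) => i = (i + 1) & self.mask,
            }
        }
    }
}
```

Because the table is at most a quarter full, probe sequences stay short and every lookup reads a contiguous run of slots rather than following pointers.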
Introduces a two-stage filter for ByteView types. Stage 1 uses a fast DirectProbeFilter on masked views (len + prefix) for quick rejection; Stage 2 performs full verification only for potential long-string matches. Triggers for Utf8View and BinaryView.
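To make the two stages concrete, here is a simplified sketch. It works on plain byte slices and uses `HashSet` for both stages, so it is not the PR's DirectProbeFilter and does not read real Arrow views; `masked_key` and `TwoStageFilter` are hypothetical names. The key packs the length and a 4-byte prefix, loosely mirroring the first 8 bytes of a view.

```rust
use std::collections::HashSet;

/// Packs the length and a zero-padded 4-byte prefix into one `u64`.
fn masked_key(value: &[u8]) -> u64 {
    let mut prefix = [0u8; 4];
    let n = value.len().min(4);
    prefix[..n].copy_from_slice(&value[..n]);
    (value.len() as u32 as u64) | ((u32::from_le_bytes(prefix) as u64) << 32)
}

/// Hypothetical two-stage membership filter.
struct TwoStageFilter {
    keys: HashSet<u64>,     // stage 1: cheap fixed-width keys
    full: HashSet<Vec<u8>>, // stage 2: complete values
}

impl TwoStageFilter {
    fn new(needles: &[&[u8]]) -> Self {
        Self {
            keys: needles.iter().map(|n| masked_key(n)).collect(),
            full: needles.iter().map(|n| n.to_vec()).collect(),
        }
    }

    fn contains(&self, value: &[u8]) -> bool {
        // Stage 1: reject quickly when length or prefix matches no needle.
        if !self.keys.contains(&masked_key(value)) {
            return false;
        }
        // In this sketch, values of up to 4 bytes are fully described by the key.
        if value.len() <= 4 {
            return true;
        }
        // Stage 2: full comparison only for candidates that survived stage 1.
        self.full.contains(value)
    }
}
```

Most non-matching rows differ in length or prefix, so they never reach the more expensive full comparison.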
FixedSizeBinary(N) arrays share the same contiguous buffer layout as primitive arrays, so for power-of-2 widths (1, 2, 4, 8, 16) we can zero-copy reinterpret them and use the optimized primitive filters (bitmap, branchless, hash) instead of falling through to the NestedTypeFilter fallback.
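A minimal sketch of the reinterpretation for width 4 follows, using the standard library's `align_to`. The function name `reinterpret_fsb4` is hypothetical, and the real change works on Arrow buffers; this version takes the contiguous value bytes as a plain slice and falls back (returns `None`) when the buffer cannot be viewed as whole `u32` lanes.

```rust
/// Hypothetical helper: views the value bytes of a FixedSizeBinary(4) array
/// as `u32` lanes without copying.
fn reinterpret_fsb4(values: &[u8]) -> Option<&[u32]> {
    // SAFETY: every 4-byte pattern is a valid `u32`, and `align_to` only
    // places correctly aligned, whole elements in the middle slice.
    let (head, lanes, tail) = unsafe { values.align_to::<u32>() };
    // Any leftover bytes mean the buffer is misaligned or not a multiple of
    // 4 bytes long, so the caller should use the generic path instead.
    (head.is_empty() && tail.is_empty()).then_some(lanes)
}
```

The lanes are read in native byte order, which is fine for IN because only equality matters: two 4-byte values are equal exactly when their `u32` views are equal, provided the needles are reinterpreted the same way.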
Which issue does this PR close?
IN performance with specialized implementations #19390.

Rationale for this change
FixedSizeBinary means every value has the same number of bytes. For widths 1, 2, 4, 8, and 16, those bytes have the same shape as the primitive values optimized earlier in the stack.

That lets DataFusion reuse the existing fast paths without copying the bytes.

For example, a FixedSizeBinary(4) value is four bytes wide, just like a UInt32. The bytes can be checked by the same fixed-width lookup machinery. The value is still treated as binary data; this is only an internal lookup representation.

Other fixed-size binary widths stay on the generic fallback path.

What changes are included in this PR?
- FixedSizeBinary(1) and FixedSizeBinary(2) through the bitmap filters.
- FixedSizeBinary(4), (8), and (16) through branchless or direct-probe filters based on list size.
- FixedSizeBinary needles.
- FixedSizeBinary widths on ArrayStaticFilter.

Are these changes tested?
Yes.
- cargo fmt --all --check
- cargo test -p datafusion-physical-expr fixed_size_binary --lib
- cargo test -p datafusion-physical-expr test_in_list_from_array_type_combinations --lib
- cargo test -p datafusion-physical-expr reinterpreted_ --lib
- cargo test -p datafusion-physical-expr in_list_binary_types --lib
- cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings

Are there any user-facing changes?
No. This is an internal performance optimization only.
Local benchmark snapshot
Benchmark command:
Method: compare adjacent saved baselines using raw Criterion sample minima (min(time / iters)). Lower is better; changes within +/-5% are treated as noise.
Compared baselines: #23016 -> #23018
Relevant scope: FixedSizeBinary rows.
Summary: 8 relevant rows, 8 faster, 0 slower, 0 within +/-5%.
- fixed_size_binary/fsb16/list=10000/match=0%
- fixed_size_binary/fsb16/list=10000/match=50%
- fixed_size_binary/fsb16/list=256/match=0%
- fixed_size_binary/fsb16/list=256/match=50%
- fixed_size_binary/fsb16/list=4/match=0%
- fixed_size_binary/fsb16/list=4/match=50%
- fixed_size_binary/fsb16/list=64/match=0%
- fixed_size_binary/fsb16/list=64/match=50%