IN LIST: reinterpret small-width types for bitmap filters #23013

Draft
geoffreyclaude wants to merge 7 commits into apache:main from geoffreyclaude:perf/in_list_reinterpret_bitmaps

@geoffreyclaude (Contributor) commented Jun 18, 2026

Which issue does this PR close?

Rationale for this change

#23011 and #23012 add bitmap lookups for unsigned 1-byte and 2-byte integers, and #23035 unifies those concrete filters behind one shared bitmap implementation. This PR lets other same-width primitive types reuse those same bitmaps without copying or converting the values.

The key idea is that some types have different meanings but the same physical shape in memory. For example:

  • UInt8 stores one byte.
  • Int8 also stores one byte.
  • UInt16 stores two bytes.
  • Int16 also stores two bytes.

The bitmap only cares about the exact bits. So an Int8 value can be viewed as its one-byte bit pattern and checked with the UInt8 bitmap. No new array is allocated and the underlying Arrow value buffer is shared.

That is what “zero-copy reinterpretation” means here: keep the same bytes, but use a lookup filter whose storage type matches the byte width.
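As an illustration of the idea above, here is a minimal standalone sketch using only the Rust standard library. `ByteBitmap`, `bitmap_from_i8`, and `contains_i8` are hypothetical names made up for this example and are not the types in this PR; the sketch only shows that an `Int8` value and its `UInt8` bit pattern select the same bit in a 256-bit bitmap.

```rust
/// Hypothetical 256-bit membership bitmap: one bit per possible byte value.
/// Four u64 words occupy 32 bytes.
#[derive(Default, Clone, Copy)]
struct ByteBitmap {
    words: [u64; 4],
}

impl ByteBitmap {
    /// Set the bit for byte value `v`.
    fn insert(&mut self, v: u8) {
        self.words[(v >> 6) as usize] |= 1u64 << (v & 63);
    }

    /// Test the bit for byte value `v`.
    fn contains(&self, v: u8) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }
}

/// Build the bitmap from signed needles. An `as` cast between i8 and u8
/// keeps the bit pattern unchanged, so -1 maps to 0xFF and -128 to 0x80.
fn bitmap_from_i8(needles: &[i8]) -> ByteBitmap {
    let mut bitmap = ByteBitmap::default();
    for &n in needles {
        bitmap.insert(n as u8);
    }
    bitmap
}

/// Probe a signed value through its one-byte bit pattern.
fn contains_i8(bitmap: &ByteBitmap, v: i8) -> bool {
    bitmap.contains(v as u8)
}

fn main() {
    let bitmap = bitmap_from_i8(&[-128, -1, 0, 127]);
    for v in [-128i8, -1, 0, 1, 127] {
        println!("{v} -> {}", contains_i8(&bitmap, v));
    }
}
```

The sketch casts one value at a time for clarity; the PR describes viewing the whole value buffer as the unsigned type instead, which avoids any per-array copy.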

What changes are included in this PR?

  • Adds a helper that reinterprets a primitive Arrow array as another primitive type with the same width.
  • Makes the helper slice-aware, so sliced Arrow arrays still start at the correct logical offset.
  • Wraps bitmap filters so signed 1-byte and 2-byte primitive arrays can reuse the unsigned bitmap storage.
  • Validates source and needle widths before using the reinterpreted path.
  • Adds focused coverage for signed boundary values, bit patterns, and sliced arrays.
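The slice-aware reinterpretation and width check listed above could look roughly like the following standard-library sketch. `SlicedI16`, `same_width`, and `reinterpret_as_u16` are hypothetical stand-ins invented for this example; the actual helper operates on Arrow primitive arrays, whose API is not shown here.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a sliced array: a shared value buffer plus the
/// logical window (offset, len) that the slice exposes.
struct SlicedI16<'a> {
    buffer: &'a [i16],
    offset: usize,
    len: usize,
}

/// Width check performed before taking the reinterpreted path.
fn same_width<S, D>() -> bool {
    size_of::<S>() == size_of::<D>()
}

/// View the logical window of `array` as u16 values without copying.
/// Returns None if the source and target widths differ.
fn reinterpret_as_u16<'a>(array: &SlicedI16<'a>) -> Option<&'a [u16]> {
    if !same_width::<i16, u16>() {
        return None;
    }
    let buffer: &'a [i16] = array.buffer;
    // Start at the logical offset, not at the start of the shared buffer.
    let window = &buffer[array.offset..array.offset + array.len];
    // SAFETY: i16 and u16 have the same size and alignment, every bit
    // pattern is a valid u16, and the returned slice borrows the same memory
    // for the same lifetime.
    Some(unsafe { std::slice::from_raw_parts(window.as_ptr().cast::<u16>(), window.len()) })
}

fn main() {
    let buffer = [i16::MIN, -1, 0, 1, i16::MAX];
    let sliced = SlicedI16 { buffer: &buffer, offset: 1, len: 3 };
    let view = reinterpret_as_u16(&sliced).expect("widths match");
    println!("{view:?}");
    // The view points into the original buffer at the slice offset.
    println!("{}", view.as_ptr() as usize == buffer[1..].as_ptr() as usize);
}
```

Ignoring the offset would make the view start at `i16::MIN` instead of `-1`, which is the kind of error the sliced-array tests are meant to catch.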

Are these changes tested?

Yes.

  • cargo fmt --all
  • cargo test -p datafusion-physical-expr reinterpreted_bitmap_handles_signed_boundaries_and_slices --lib
  • cargo test -p datafusion-physical-expr test_in_list_from_array_type_combinations --lib
  • cargo test -p datafusion-physical-expr in_list_int_types --lib
  • cargo test -p datafusion-physical-expr test_in_list_dictionary_types --lib
  • cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings

Are there any user-facing changes?

No. This is an internal performance optimization only.

Local benchmark snapshot

Benchmark command:

cargo bench -p datafusion-physical-expr --profile release-nonlto --bench in_list_strategy -- --save-baseline <name>

Method: compare adjacent saved baselines using raw Criterion sample minima (min(time / iters)). Lower is better; changes within +/-5% are treated as noise. These numbers were not rerun after splitting the behavior-preserving bitmap unification into #23035.

Compared baselines: #23035 -> #23013

Relevant scope: signed 16-bit reinterpretation rows.

Summary: 6 relevant rows, 6 faster, 0 slower, 0 within +/-5%.

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| narrow_integer/i16/list=256/match=0% | 19.15 us | 4.00 us | -79.1% (4.79x faster) |
| narrow_integer/i16/list=256/match=50% | 31.32 us | 4.00 us | -87.2% (7.82x faster) |
| narrow_integer/i16/list=4/match=0% | 16.79 us | 4.01 us | -76.1% (4.18x faster) |
| narrow_integer/i16/list=4/match=50% | 34.80 us | 4.01 us | -88.5% (8.69x faster) |
| narrow_integer/i16/list=64/match=0% | 19.21 us | 4.11 us | -78.6% (4.68x faster) |
| narrow_integer/i16/list=64/match=50% | 34.72 us | 4.01 us | -88.5% (8.66x faster) |
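For readers who want to reproduce the Change column, the comparison method described above (per-sample minimum of time / iters, with a +/-5% noise band) can be sketched as follows. The function names and the sample layout are assumptions made for this example; this is not the script used to produce the numbers, and the table values are rounded, so recomputed ratios can differ in the last digit.

```rust
/// Minimum of time / iters over raw samples, given as (total_time, iters).
fn sample_minimum(samples: &[(f64, u64)]) -> Option<f64> {
    samples
        .iter()
        .map(|&(time, iters)| time / iters as f64)
        .min_by(|a, b| a.total_cmp(b))
}

/// Relative change in percent; negative means the new baseline is faster.
fn change_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Speedup factor of `after` relative to `before`.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

/// Classify a change using the +/-5% noise band.
fn classify(change_pct: f64) -> &'static str {
    if change_pct.abs() <= 5.0 {
        "within noise"
    } else if change_pct < 0.0 {
        "faster"
    } else {
        "slower"
    }
}

fn main() {
    // Rounded values from the first table row (microseconds).
    let (before, after) = (19.15, 4.00);
    let pct = change_percent(before, after);
    println!("{:+.1}% ({:.2}x) {}", pct, speedup(before, after), classify(pct));
}
```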

@github-actions github-actions Bot added the physical-expr Changes to the physical-expr crates label Jun 18, 2026
@geoffreyclaude geoffreyclaude force-pushed the perf/in_list_reinterpret_bitmaps branch 2 times, most recently from cc752d0 to 9925e82 Compare June 18, 2026 08:52
@geoffreyclaude geoffreyclaude changed the title Implement Zero-Copy Reinterpretation and enable Int8/Int16 Bitmaps IN LIST: reinterpret small-width types for bitmap filters Jun 18, 2026
@geoffreyclaude geoffreyclaude force-pushed the perf/in_list_reinterpret_bitmaps branch 2 times, most recently from 65f008f to 08fbe39 Compare June 19, 2026 05:35
Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
Introduces zero-copy buffer reinterpretation to allow signed integers and other 1-byte or 2-byte primitive types (e.g. Float16) to use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
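To make the sizes in these commit descriptions concrete: a bitmap over every `u16` value needs 65,536 bits, which is 8 KB. The sketch below is a hypothetical standalone version using only the standard library; `U16Bitmap` is an assumed name for illustration and not the type added by these commits.

```rust
/// Hypothetical heap-allocated bitmap covering every u16 value:
/// 65,536 bits = 1,024 u64 words = 8 KB.
struct U16Bitmap {
    words: Box<[u64]>,
}

impl U16Bitmap {
    fn new() -> Self {
        Self { words: vec![0u64; 1024].into_boxed_slice() }
    }

    /// Set the bit for `v`: the high 10 bits pick the word, the low 6 the bit.
    fn insert(&mut self, v: u16) {
        self.words[(v >> 6) as usize] |= 1u64 << (v & 63);
    }

    /// O(1) membership test: one index, one shift, one mask.
    fn contains(&self, v: u16) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }
}

fn main() {
    let mut bitmap = U16Bitmap::new();
    // Signed needles are stored by bit pattern: -1i16 is 0xFFFF.
    for needle in [i16::MIN, -1, 0, i16::MAX] {
        bitmap.insert(needle as u16);
    }
    println!("bytes: {}", bitmap.words.len() * std::mem::size_of::<u64>());
    println!("-1 present: {}", bitmap.contains(-1i16 as u16));
    println!("1 present: {}", bitmap.contains(1));
}
```

Any other 2-byte type would be probed the same way, through its 16-bit pattern; for floating-point types this means membership is decided by exact bits rather than by numeric equality.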
@geoffreyclaude geoffreyclaude force-pushed the perf/in_list_reinterpret_bitmaps branch from 08fbe39 to 1db5627 Compare June 19, 2026 05:55
@github-actions

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
error: running 'cargo update' on crate 'datafusion-physical-expr' failed with output:
-----
    Updating crates.io index
error: failed to get `arrow-array` as a dependency of package `arrow v59.0.0`
    ... which satisfies dependency `arrow = "^59.0.0"` of package `datafusion-physical-expr v54.0.0 (/home/runner/work/datafusion/datafusion/datafusion/physical-expr)`
    ... which satisfies path dependency `datafusion-physical-expr` of package `placeholder v0.0.0 (/home/runner/work/datafusion/datafusion/target/semver-checks/local-datafusion_physical_expr-54_0_0-x86_64_unknown_linux_gnu-803dd5f4795bc6a9)`

Caused by:
  failed to load source for dependency `arrow-array`

Caused by:
  unable to update registry `crates-io`

Caused by:
  download of ar/ro/arrow-array failed

Caused by:
  curl failed

Caused by:
  [16] Error in the HTTP2 framing layer

-----
error: failed to update dependencies for crate datafusion-physical-expr v54.0.0
note: this is unlikely to be a bug in cargo-semver-checks,
      and is probably an issue with the crate's Cargo.toml
note: the following command can be used to reproduce the compilation error:
      cargo new --lib example &&
          cd example &&
          echo '[workspace]' >> Cargo.toml &&
          cargo add --path /home/runner/work/datafusion/datafusion/datafusion/physical-expr --features proto,recursive_protection &&
          cargo update

error: aborting due to failure to run 'cargo update' for crate datafusion-physical-expr v54.0.0

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 19, 2026
