
[PoC] Filter pushdown selectivity threshold #9414

Draft
Dandandan wants to merge 7 commits into apache:main from Dandandan:selectivity_threshold
Conversation

@Dandandan
Contributor

@Dandandan Dandandan commented Feb 14, 2026

Which issue does this PR close?

Rationale for this change

It can be better to skip (combined) filters with low effectiveness altogether, since applying them still incurs the overhead of many individual (small) skip/read operations in the Parquet decoder.

What changes are included in this PR?

This adds a simple threshold that skips pushing down the filter if the current selection is not "effective", i.e. when the fraction of rows it removes is below a threshold.
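To make the idea concrete, here is a minimal standalone sketch of such a threshold check. It is not the code in this PR: the function names, the `min_skipped_fraction` parameter, and the interpretation of "effective" as "fraction of rows skipped" are assumptions made for illustration only.

```rust
/// Hypothetical helper: fraction of rows that a selection would skip.
/// Returns 0.0 for an empty input so that nothing is considered "effective".
fn skipped_fraction(selected_rows: usize, total_rows: usize) -> f64 {
    if total_rows == 0 {
        return 0.0;
    }
    1.0 - (selected_rows as f64 / total_rows as f64)
}

/// Hypothetical decision function: apply the selection only if it skips at
/// least `min_skipped_fraction` of the rows; otherwise decode everything and
/// filter afterwards, avoiding many small skip/read steps in the decoder.
fn should_apply_selection(
    selected_rows: usize,
    total_rows: usize,
    min_skipped_fraction: f64,
) -> bool {
    skipped_fraction(selected_rows, total_rows) >= min_skipped_fraction
}
```

With an (arbitrary) threshold of 0.2, a selection that keeps 100 of 1000 rows would be applied, while one that keeps 950 of 1000 rows would be ignored.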

Are these changes tested?

Are there any user-facing changes?

Dandandan and others added 4 commits February 14, 2026 17:05
Previously, every predicate in the RowFilter received the same
ProjectionMask containing ALL filter columns. This caused unnecessary
decoding of expensive string columns when evaluating cheap integer
predicates. Now each predicate receives a mask with only the single
column it needs.
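The difference between the two masking strategies can be sketched without the parquet crate. The model below represents a mask as a set of column indices; the `Predicate` struct and both function names are hypothetical and only stand in for the real `ProjectionMask` handling.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of a single-column predicate.
struct Predicate {
    /// Index of the one column this predicate reads.
    column: usize,
}

/// Previous behavior (as described above): every predicate gets the union
/// of all filter columns, so cheap predicates also decode expensive columns.
fn shared_masks(preds: &[Predicate]) -> Vec<BTreeSet<usize>> {
    let all: BTreeSet<usize> = preds.iter().map(|p| p.column).collect();
    preds.iter().map(|_| all.clone()).collect()
}

/// New behavior: each predicate gets a mask with only its own column.
fn per_predicate_masks(preds: &[Predicate]) -> Vec<BTreeSet<usize>> {
    preds
        .iter()
        .map(|p| BTreeSet::from([p.column]))
        .collect()
}
```

For two predicates on columns 0 and 5, `shared_masks` yields `{0, 5}` for both, whereas `per_predicate_masks` yields `{0}` and `{5}`.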

Key sync improvements (vs baseline):
- Q37: 63.7ms -> 7.3ms  (-88.6%, Title LIKE with CounterID=62 filter)
- Q36: 117ms -> 24ms    (-79.5%, URL <> '' with CounterID=62 filter)
- Q40: 17.9ms -> 5.1ms  (-71.5%, multi-pred with RefererHash eq)
- Q41: 17.3ms -> 5.5ms  (-68.1%, multi-pred with URLHash eq)
- Q22: 303ms -> 127ms   (-58.2%, 3 string predicates)
- Q42: 7.6ms -> 3.9ms   (-48.5%, int-only multi-predicate)
- Q38: 19.1ms -> 12.4ms (-34.9%, 5 int predicates)
- Q21: 159ms -> 98ms     (-38.5%, URL LIKE + SearchPhrase)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use page-level min/max statistics (via StatisticsConverter) to compute
a RowSelection that skips pages where equality predicates cannot match.
For each equality predicate with an integer literal, we check if the
literal falls within each page's [min, max] range and skip pages where
it doesn't.
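A rough standalone model of this page pruning step is shown below. It does not use `StatisticsConverter` or `RowSelection`; the `PageStats` and `Run` types and the `selection_for_eq` function are hypothetical stand-ins that only illustrate the min/max check and the merging of adjacent runs.

```rust
/// Hypothetical per-page statistics for an integer column.
struct PageStats {
    min: i64,
    max: i64,
    row_count: usize,
}

/// Hypothetical run of consecutive rows that are either skipped or selected.
#[derive(Debug, PartialEq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Build a run-length selection for `column = literal`: a page is skipped
/// when the literal lies outside its [min, max] range.
fn selection_for_eq(pages: &[PageStats], literal: i64) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for page in pages {
        let skip = literal < page.min || literal > page.max;
        // Merge with the previous run when it has the same skip/select kind.
        if let Some(last) = runs.last_mut() {
            if last.skip == skip {
                last.row_count += page.row_count;
                continue;
            }
        }
        runs.push(Run { row_count: page.row_count, skip });
    }
    runs
}
```

On sorted or clustered data most pages fall outside the range and collapse into a few long skip runs; on unsorted data almost every page range contains the literal, which matches the modest gains reported here.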

Impact is data-dependent - most effective when data is sorted/clustered
by the filter column. For this particular 100K-row sample file the data
isn't sorted by filter columns, so improvements are modest (~5% for
some CounterID=62 queries). Would show larger gains on sorted datasets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Put the cheapest/most selective predicate first: SearchPhrase <> ''
filters ~87% of rows before expensive LIKE predicates run. This
reduces string column decoding for Title and URL significantly.
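One possible way to express such an ordering is sketched below. The `PredicateEstimate` struct, its fields, and the sort key (cost first, then pass rate) are assumptions for illustration; the commit itself may simply reorder the predicates by hand.

```rust
/// Hypothetical estimate attached to a predicate.
#[derive(Debug, Clone)]
struct PredicateEstimate {
    name: &'static str,
    /// Relative evaluation cost (higher is more expensive). Assumed value.
    relative_cost: u32,
    /// Estimated rows passing the predicate, in parts per thousand.
    pass_permille: u32,
}

/// Order predicates so that cheap ones run first and, among equally cheap
/// ones, the predicate that lets the fewest rows through runs first.
fn order_predicates(mut preds: Vec<PredicateEstimate>) -> Vec<PredicateEstimate> {
    preds.sort_by_key(|p| (p.relative_cost, p.pass_permille));
    preds
}
```

For example, a cheap `SearchPhrase <> ''` predicate that passes about 130 of every 1000 rows (it filters ~87%) would be placed before the more expensive `LIKE` predicates, whatever their own pass rates are.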

Q22 sync: ~6% improvement, Q22 async: ~13% improvement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 14, 2026
@Dandandan
Contributor Author

run benchmark arrow_reader_clickbench

@Dandandan Dandandan changed the title [PoC] Selectivity threshold [PoC] Filter pushdown selectivity threshold Feb 14, 2026
@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing selectivity_threshold (b4275cb) to 39a2b71 diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader_clickbench
BENCH_FILTER=
BENCH_BRANCH_NAME=selectivity_threshold
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                             main                                   selectivity_threshold
-----                                             ----                                   ---------------------
arrow_reader_clickbench/async/Q1                  1.00      2.3±0.03ms        ? ?/sec    1.00      2.3±0.02ms        ? ?/sec
arrow_reader_clickbench/async/Q10                 1.01     11.6±0.40ms        ? ?/sec    1.00     11.5±0.40ms        ? ?/sec
arrow_reader_clickbench/async/Q11                 1.00     13.0±0.27ms        ? ?/sec    1.03     13.4±0.46ms        ? ?/sec
arrow_reader_clickbench/async/Q12                 1.00     23.1±0.38ms        ? ?/sec    1.01     23.5±0.35ms        ? ?/sec
arrow_reader_clickbench/async/Q13                 1.00     28.5±0.28ms        ? ?/sec    1.01     28.9±0.41ms        ? ?/sec
arrow_reader_clickbench/async/Q14                 1.00     25.5±0.19ms        ? ?/sec    1.02     26.0±0.45ms        ? ?/sec
arrow_reader_clickbench/async/Q19                 1.00      5.6±0.10ms        ? ?/sec    1.04      5.8±0.33ms        ? ?/sec
arrow_reader_clickbench/async/Q20                 1.12   128.4±11.83ms        ? ?/sec    1.00    114.8±0.57ms        ? ?/sec
arrow_reader_clickbench/async/Q21                 1.30    166.3±1.04ms        ? ?/sec    1.00    128.1±1.42ms        ? ?/sec
arrow_reader_clickbench/async/Q22                 1.31    241.9±1.35ms        ? ?/sec    1.00    185.3±4.95ms        ? ?/sec
arrow_reader_clickbench/async/Q23                 1.00    412.2±2.41ms        ? ?/sec    1.00    412.5±5.81ms        ? ?/sec
arrow_reader_clickbench/async/Q24                 1.00     32.1±0.70ms        ? ?/sec    1.00     32.1±0.31ms        ? ?/sec
arrow_reader_clickbench/async/Q27                 1.00    100.9±0.68ms        ? ?/sec    1.00    100.6±0.66ms        ? ?/sec
arrow_reader_clickbench/async/Q28                 1.00     98.7±1.00ms        ? ?/sec    1.00     98.3±0.63ms        ? ?/sec
arrow_reader_clickbench/async/Q30                 1.00     28.1±0.32ms        ? ?/sec    1.00     28.1±0.28ms        ? ?/sec
arrow_reader_clickbench/async/Q36                 4.26    118.8±0.61ms        ? ?/sec    1.00     27.9±0.51ms        ? ?/sec
arrow_reader_clickbench/async/Q37                 12.22    93.9±0.68ms        ? ?/sec    1.00      7.7±0.13ms        ? ?/sec
arrow_reader_clickbench/async/Q38                 1.40     35.7±0.81ms        ? ?/sec    1.00     25.5±0.24ms        ? ?/sec
arrow_reader_clickbench/async/Q39                 1.04     46.2±0.32ms        ? ?/sec    1.00     44.5±0.47ms        ? ?/sec
arrow_reader_clickbench/async/Q40                 3.95     40.7±0.51ms        ? ?/sec    1.00     10.3±0.25ms        ? ?/sec
arrow_reader_clickbench/async/Q41                 3.48     29.8±0.44ms        ? ?/sec    1.00      8.6±0.24ms        ? ?/sec
arrow_reader_clickbench/async/Q42                 1.97     11.0±0.07ms        ? ?/sec    1.00      5.6±0.10ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q1     1.00      2.2±0.04ms        ? ?/sec    1.01      2.3±0.05ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q10    1.00     11.2±0.30ms        ? ?/sec    1.00     11.2±0.31ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q11    1.01     12.7±0.37ms        ? ?/sec    1.00     12.6±0.41ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q12    1.00     22.5±0.20ms        ? ?/sec    1.00     22.6±0.30ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q13    1.00     27.6±0.31ms        ? ?/sec    1.00     27.7±0.28ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q14    1.00     25.0±0.27ms        ? ?/sec    1.00     25.1±0.25ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q19    1.00      5.2±0.09ms        ? ?/sec    1.01      5.2±0.11ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q20    1.00    110.9±1.31ms        ? ?/sec    1.00    110.7±1.00ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q21    1.03    126.8±0.89ms        ? ?/sec    1.00    123.3±1.12ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q22    1.52    233.5±1.32ms        ? ?/sec    1.00    153.2±0.99ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q23    1.04   361.3±24.92ms        ? ?/sec    1.00    347.2±1.65ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q24    1.00     30.2±0.29ms        ? ?/sec    1.02     30.8±0.50ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q27    1.00     96.1±0.81ms        ? ?/sec    1.00     95.8±0.62ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q28    1.00     94.2±1.01ms        ? ?/sec    1.00     94.3±0.64ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q30    1.00     26.9±0.25ms        ? ?/sec    1.01     27.1±0.29ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q36    4.53    113.6±0.58ms        ? ?/sec    1.00     25.1±0.26ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q37    12.06    90.0±0.61ms        ? ?/sec    1.00      7.5±0.10ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q38    1.43     32.6±0.67ms        ? ?/sec    1.00     22.8±0.28ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q39    1.05     41.5±0.31ms        ? ?/sec    1.00     39.7±0.37ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q40    4.07     39.0±0.68ms        ? ?/sec    1.00      9.6±0.17ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q41    3.55     28.2±0.30ms        ? ?/sec    1.00      7.9±0.15ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q42    1.98     10.5±0.12ms        ? ?/sec    1.00      5.3±0.04ms        ? ?/sec
arrow_reader_clickbench/sync/Q1                   1.00  1983.5±13.50µs        ? ?/sec    1.01  1993.7±15.84µs        ? ?/sec
arrow_reader_clickbench/sync/Q10                  1.00      7.5±0.08ms        ? ?/sec    1.00      7.6±0.10ms        ? ?/sec
arrow_reader_clickbench/sync/Q11                  1.00      9.0±0.07ms        ? ?/sec    1.00      9.0±0.32ms        ? ?/sec
arrow_reader_clickbench/sync/Q12                  1.00     29.6±1.36ms        ? ?/sec    1.02     30.1±1.19ms        ? ?/sec
arrow_reader_clickbench/sync/Q13                  1.00     34.3±0.67ms        ? ?/sec    1.20     41.3±1.39ms        ? ?/sec
arrow_reader_clickbench/sync/Q14                  1.23     39.1±1.01ms        ? ?/sec    1.00     31.8±0.32ms        ? ?/sec
arrow_reader_clickbench/sync/Q19                  1.00      4.2±0.04ms        ? ?/sec    1.01      4.2±0.08ms        ? ?/sec
arrow_reader_clickbench/sync/Q20                  1.01    179.3±1.24ms        ? ?/sec    1.00    177.6±2.26ms        ? ?/sec
arrow_reader_clickbench/sync/Q21                  1.73    231.3±5.30ms        ? ?/sec    1.00    133.8±1.14ms        ? ?/sec
arrow_reader_clickbench/sync/Q22                  2.30    481.2±3.88ms        ? ?/sec    1.00    208.9±1.42ms        ? ?/sec
arrow_reader_clickbench/sync/Q23                  1.06   448.8±16.62ms        ? ?/sec    1.00    422.6±6.25ms        ? ?/sec
arrow_reader_clickbench/sync/Q24                  1.02     39.9±0.35ms        ? ?/sec    1.00     39.0±0.33ms        ? ?/sec
arrow_reader_clickbench/sync/Q27                  1.00    154.8±1.28ms        ? ?/sec    1.01    156.1±1.15ms        ? ?/sec
arrow_reader_clickbench/sync/Q28                  1.00    149.5±1.60ms        ? ?/sec    1.01    150.4±0.98ms        ? ?/sec
arrow_reader_clickbench/sync/Q30                  1.00     28.3±0.40ms        ? ?/sec    1.00     28.4±0.30ms        ? ?/sec
arrow_reader_clickbench/sync/Q36                  4.77    153.9±1.60ms        ? ?/sec    1.00     32.2±0.49ms        ? ?/sec
arrow_reader_clickbench/sync/Q37                  9.86     85.9±0.63ms        ? ?/sec    1.00      8.7±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q38                  1.53     28.4±0.46ms        ? ?/sec    1.00     18.5±0.24ms        ? ?/sec
arrow_reader_clickbench/sync/Q39                  1.06     34.0±0.45ms        ? ?/sec    1.00     32.0±0.32ms        ? ?/sec
arrow_reader_clickbench/sync/Q40                  3.06     26.6±0.58ms        ? ?/sec    1.00      8.7±0.28ms        ? ?/sec
arrow_reader_clickbench/sync/Q41                  3.92     27.9±0.30ms        ? ?/sec    1.00      7.1±0.09ms        ? ?/sec
arrow_reader_clickbench/sync/Q42                  1.76     11.8±0.10ms        ? ?/sec    1.00      6.7±0.23ms        ? ?/sec

@Dandandan
Contributor Author

Dandandan commented Feb 14, 2026

Nice, this shows some good results @alamb @adriangb. We need to merge #9413 first so we can run the benchmark again and compare (the current crazy difference is due to the projection issue addressed in the other PR).

@Dandandan
Contributor Author

run benchmark arrow_reader_clickbench

Labels

parquet Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

Add selectivity threshold for filter pushdown

2 participants