[WIP] Explore extensible range partitioning for dynamic filters by gene-bordegaray · Pull Request #22002 · apache/datafusion

gene-bordegaray · 2026-05-03T15:50:04Z

Which issue does this PR close?

WIP POC for discussion [DISCUSSION] Extending Partitioning to Support More Variants #21992 and [DISCUSSION] Future of Dynamic Filters Sync #21207.

Rationale for this change

The intention of this POC is to discuss API shape, design, and path forward. Not implementation details. The implementation detail was done by AI but API designed by myself.

This is a proof of concept for representing file/range partitioning instead of advertising it as hash partitioning. The goal is to show how partitioning can be made into an extendable/customizable part of DataFusion rather than the current enum with fixed variants. It also shows how compatibility can support partition-specific optimizations (in this case tighter dynamic filters as discussed in #21207).

What changes are included in this PR?

Adds a custom physical partitioning trait and range partitioning implementation of that trait
File scans advertise range partitioning for compatible file groups
Adds partitioning compatibility checks
Routes dynamic filters by partition index when partition maps are compatible, and hash/global fallbacks

PhysicalPartitioning trait shape

PhysicalPartitioning is an extensible representation of partitioning. Each method has a responsibility to represent a partitioning in a general but correct sense:

name: the name
partition_count: number of output partitions
satisfaction: whether this partitioning satisfies a required Distribution. This is used by distribution planning to decide whether repartitioning is needed
compatibility: whether two partitionings describe the same partition map. This is more restrictive than satisfaction and used for partition-specific behavior (in this case routing dynamic filters by partition index)
project: rewrites partitioning expressions through projection while preserving the same partition map
as_any: downcast implementations to compare partitioning types

An important design note:

  satisfaction  -> can this avoid repartitioning?
  compatibility -> can partition N on both sides be treated as the same key?

Thus, dynamic filters use compatibility before enabling partition-index routing to ensure each side can be trreated as the same key.

Are these changes tested?

Yes:

cargo test -p datafusion-physical-expr -p datafusion-datasource -p datafusion-datasource-parquet -p datafusion-physical-plan -p datafusion-physical-optimizer -p datafusion-proto -> existing and new tests pass
cargo test --test sqllogictests -- preserve_file_partitioning push_down_filter_regression repartition_scan listing_table_partitions aggregate group_by joins -> shows new end-to-end dynamic filter and file partitioning behavior

Are there any user-facing changes?

Yes. This POC is meant to spark conversation around the public API: Partitioning::Custom, PhysicalPartitioning, RangePartitioning, range bounds, and partitioning compatibility.

gene-bordegaray · 2026-05-03T16:02:39Z

+        self.ranges.len()
+    }
+
+    /// Returns how this range partitioning satisfies a hash distribution


this is something we will have to be careful of and raises the question:

Should we allow different types of partitioning to satisfy each other? This may provide more optimizations like skipping repartitions, but can see this getting quite brittle

This is also why the idea of "compatibility" was introduced:

satisfaction - decides whether a partitioning is needed

compatibility - decides whether partition specific behavior is valid (like dynamic filter routing)

github-actions · 2026-05-03T16:05:27Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning origin/main
    Building datafusion-datasource v53.1.0 (current)
       Built [  36.127s] (current)
     Parsing datafusion-datasource v53.1.0 (current)
      Parsed [   0.029s] (current)
    Building datafusion-datasource v53.1.0 (baseline)
       Built [  35.827s] (baseline)
     Parsing datafusion-datasource v53.1.0 (baseline)
      Parsed [   0.029s] (baseline)
    Checking datafusion-datasource v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.267s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  73.516s] datafusion-datasource
    Building datafusion-datasource-parquet v53.1.0 (current)
       Built [  41.464s] (current)
     Parsing datafusion-datasource-parquet v53.1.0 (current)
      Parsed [   0.025s] (current)
    Building datafusion-datasource-parquet v53.1.0 (baseline)
       Built [  41.488s] (baseline)
     Parsing datafusion-datasource-parquet v53.1.0 (baseline)
      Parsed [   0.026s] (baseline)
    Checking datafusion-datasource-parquet v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.137s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  84.586s] datafusion-datasource-parquet
    Building datafusion-ffi v53.1.0 (current)
       Built [  55.945s] (current)
     Parsing datafusion-ffi v53.1.0 (current)
      Parsed [   0.053s] (current)
    Building datafusion-ffi v53.1.0 (baseline)
       Built [  55.550s] (baseline)
     Parsing datafusion-ffi v53.1.0 (baseline)
      Parsed [   0.056s] (baseline)
    Checking datafusion-ffi v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.222s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 113.494s] datafusion-ffi
    Building datafusion-physical-expr v53.1.0 (current)
       Built [  23.979s] (current)
     Parsing datafusion-physical-expr v53.1.0 (current)
      Parsed [   0.043s] (current)
    Building datafusion-physical-expr v53.1.0 (baseline)
       Built [  24.271s] (baseline)
     Parsing datafusion-physical-expr v53.1.0 (baseline)
      Parsed [   0.043s] (baseline)
    Checking datafusion-physical-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.306s] 222 checks: 220 pass, 2 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field Inner.partitioned_exprs in /home/runner/work/datafusion/datafusion/datafusion/physical-expr/src/expressions/dynamic_filters.rs:101

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/enum_variant_added.ron

Failed in:
  variant Partitioning:Custom in /home/runner/work/datafusion/datafusion/datafusion/physical-expr/src/partitioning.rs:123

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  49.582s] datafusion-physical-expr
    Building datafusion-physical-optimizer v53.1.0 (current)
       Built [  36.257s] (current)
     Parsing datafusion-physical-optimizer v53.1.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-physical-optimizer v53.1.0 (baseline)
       Built [  35.875s] (baseline)
     Parsing datafusion-physical-optimizer v53.1.0 (baseline)
      Parsed [   0.022s] (baseline)
    Checking datafusion-physical-optimizer v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.128s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  73.464s] datafusion-physical-optimizer
    Building datafusion-physical-plan v53.1.0 (current)
       Built [  31.874s] (current)
     Parsing datafusion-physical-plan v53.1.0 (current)
      Parsed [   0.130s] (current)
    Building datafusion-physical-plan v53.1.0 (baseline)
       Built [  32.322s] (baseline)
     Parsing datafusion-physical-plan v53.1.0 (baseline)
      Parsed [   0.122s] (baseline)
    Checking datafusion-physical-plan v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.539s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  66.173s] datafusion-physical-plan
    Building datafusion-proto v53.1.0 (current)
       Built [  54.383s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.134s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  54.560s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.133s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.657s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/enum_variant_added.ron

Failed in:
  variant PartitionMethod:Range in /home/runner/work/datafusion/datafusion/datafusion/proto/src/generated/prost.rs:2082
  variant PartitionMethod:Range in /home/runner/work/datafusion/datafusion/datafusion/proto/src/generated/prost.rs:2082

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [ 112.653s] datafusion-proto
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 138.067s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.023s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 139.278s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.022s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.091s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 280.347s] datafusion-sqllogictest

gene-bordegaray · 2026-05-03T16:07:44Z

    }

+    /// Update this filter with a global fallback plus partition-local filters.
+    pub fn update_partitioned(


these are public API changes I don't love and ideally with the work being done by @adriangb in PRs like #21931 we can have this be hidden from the user and only an internal implementation detail

alamb · 2026-05-04T18:40:18Z

+/// non-overlapping ranges that accurately describe the source.
+#[derive(Debug, Clone)]
+pub struct RangePartitioning {
+    exprs: Vec<Arc<dyn PhysicalExpr>>,


In generla I think is is important to define formally what range partitioning means -- specifically given a particular row what partition is it in

As you have this structured I think you could have more than one range expressions be true. For example

{ exprs: [a, b] ranges: [[100-200], [100-200]] }

For a row like (a,b) = (10,20) it would be in both ranges

I think it is more common (e.g. spark) to have something lke

A single Expr

A list of ranges

Explicitly declare that the FIRST range that matches is the partition that the row is placed in

Also, does it make sense to explicitly define that the ranges can't overlap and must cover the whole range?

alamb · 2026-05-04T18:41:38Z

+
+/// A lexicographic range bound.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct RangeBound {


This looks pretty similar to Interval: https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html

gene-bordegaray added 4 commits May 3, 2026 08:32

add range partitioning

30d1d06

Introduce custom partitioning trait

5ff9fa5

Use range partitioning for file groups

62041a0

Route dynamic filters by partition compatibility

14034f3

gene-bordegaray marked this pull request as draft May 3, 2026 15:50

gene-bordegaray commented May 3, 2026

View reviewed changes

gene-bordegaray mentioned this pull request May 3, 2026

[DISCUSSION] Extending Partitioning to Support More Variants #21992

Open

alamb reviewed May 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Explore extensible range partitioning for dynamic filters#22002

[WIP] Explore extensible range partitioning for dynamic filters#22002
gene-bordegaray wants to merge 4 commits intoapache:mainfrom
gene-bordegaray:gene.bordegaray/2026/05/partitioning_trait_range_dynamic_filter_poc

gene-bordegaray commented May 3, 2026 •

edited

Loading

Uh oh!

gene-bordegaray May 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 3, 2026

Uh oh!

gene-bordegaray May 3, 2026

Uh oh!

alamb May 4, 2026

Uh oh!

alamb May 4, 2026

Uh oh!

alamb May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gene-bordegaray commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

PhysicalPartitioning trait shape

Are these changes tested?

Are there any user-facing changes?

Uh oh!

gene-bordegaray May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 3, 2026

Uh oh!

gene-bordegaray May 3, 2026

Choose a reason for hiding this comment

Uh oh!

alamb May 4, 2026

Choose a reason for hiding this comment

Uh oh!

alamb May 4, 2026

Choose a reason for hiding this comment

Uh oh!

alamb May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gene-bordegaray commented May 3, 2026 •

edited

Loading

gene-bordegaray May 3, 2026 •

edited

Loading