From 5f7f97646d17817bf654e6ab7276ad2814776554 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Thu, 14 May 2026 15:11:53 +0800
Subject: [PATCH 01/14] Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip
I/O
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
A walkthrough of the sort pushdown work landed and in flight on Apache
DataFusion. Covers:
- Why SortExec is expensive and what `Exact` / `Inexact` ordering mean
at runtime (static `fetch` vs `TopK` dynamic filter).
- Phase 1 (#19064): the `PushdownSort` rule + reverse row-group case.
- Phase 2 (#21182): statistics-based file sort that upgrades
`Unsupported` to `Exact`, eliminating the `SortExec` on
non-overlapping ASC scans. Includes the BufferExec compensation
(#21426) so the SPM above doesn't lose its implicit memory buffer.
- Reverse scans: today's row-group reverse (Inexact, #18817) and the
community decision to wait for arrow-rs page-level reverse (#9937)
before pursuing Exact reverse, after memory-profile pushback on the
original row-group-level proposal.
- Benchmarks: 2.1×-49× on the ASC-LIMIT sort_pushdown suite.
- What's next: the dynamic / TopK-driven path (#21351 merged, #21733,
#21712, #21956, #21580) including the precise RG-pruning vs
mid-stream-early-return distinction, and the EnsureRequirements
unification (#21976).
- Links into the prior dynamic filters and limit pruning posts so the
series reads as a coherent thread.
---
content/blog/2026-05-11-sort-pushdown.md | 592 ++++++++++++++++++
content/images/sort-pushdown/benchmark.svg | 75 +++
.../sort-pushdown/phase1-file-reorder.svg | 88 +++
.../sort-pushdown/phase2-stats-overlap.svg | 79 +++
content/images/sort-pushdown/plan-diff.svg | 70 +++
content/images/sort-pushdown/reverse-scan.svg | 100 +++
6 files changed, 1004 insertions(+)
create mode 100644 content/blog/2026-05-11-sort-pushdown.md
create mode 100644 content/images/sort-pushdown/benchmark.svg
create mode 100644 content/images/sort-pushdown/phase1-file-reorder.svg
create mode 100644 content/images/sort-pushdown/phase2-stats-overlap.svg
create mode 100644 content/images/sort-pushdown/plan-diff.svg
create mode 100644 content/images/sort-pushdown/reverse-scan.svg
diff --git a/content/blog/2026-05-11-sort-pushdown.md b/content/blog/2026-05-11-sort-pushdown.md
new file mode 100644
index 00000000..d8726038
--- /dev/null
+++ b/content/blog/2026-05-11-sort-pushdown.md
@@ -0,0 +1,592 @@
+---
+layout: post
+title: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O
+date: 2026-05-11
+author: Qi Zhu
+categories: [performance]
+---
+
+
+
+[TOC]
+
+*Qi Zhu, [Massive](https://www.massive.com/)*
+
+Many [Apache Parquet] datasets are already sorted on disk. Time-series
+files are usually written in ingestion-time order. Event logs are sharded
+and sorted by event id. Partitioned tables come with a natural ordering
+implied by the partition key. The information about that ordering is
+sitting right there in the file metadata.
+
+[Apache Parquet]: https://parquet.apache.org/
+
+Until recently, [Apache DataFusion] would still re-sort those files on
+every `ORDER BY` query. Every `SELECT ... ORDER BY ts LIMIT 100` did a
+full external sort across the entire scan, even though the data was
+already in that order. CPU wasted. Memory wasted. Streaming defeated.
+
+[Apache DataFusion]: https://datafusion.apache.org/
+
+This post walks through the **sort pushdown** work that closed that gap.
+It is structured in two phases: first the pushdown API together with
+reverse row-group scans, then statistics-based file reordering with a
+proof of non-overlap. On the `sort_pushdown` benchmark suite this
+yields speedups of **2.1×–49×**. The same machinery extends
+to `ORDER BY ... DESC`, and the page-level reverse primitive we are
+adding upstream in [arrow-rs] will push the gains further still.
+
+[arrow-rs]: https://github.com/apache/arrow-rs
+
+## TL;DR
+
+* DataFusion can now **skip `SortExec` entirely** when input files are
+ already in the requested order.
+* Two phases:
+ * **Phase 1** — establish the `PushdownSort` rule and the
+ `Exact` / `Inexact` / `Unsupported` protocol; ship the reverse
+ row-group case for `ORDER BY ... DESC` (reports `Inexact`).
+ * **Phase 2** — sort files within each partition by Parquet
+ `min/max` statistics and *prove* non-overlap, upgrading
+ `Unsupported` to `Exact` so `PushdownSort` removes the `SortExec`
+ that `EnforceSorting` inserted earlier.
+* Benchmarks on the `sort_pushdown` suite (`--partitions 1`): the two
+ `ORDER BY ... LIMIT` queries get **27×** and **49×** faster; the two
+ full `ORDER BY` scans get **~2×** faster.
+* Reverse scans (`ORDER BY ... DESC`) ride the same machinery: a
+ merged row-group-level reverse returns `Inexact` (the `SortExec` stays, but
+ `TopK` terminates early); the page-level reverse primitive needed
+ for `Exact` reverse — and so for full `SortExec` removal on `DESC`
+ queries — is in flight in arrow-rs.
+
+## Why Sort Pushdown Matters
+
+`SortExec` is one of the most expensive operators in a query plan.
+It is blocking by construction — no row can leave until every input
+row has been seen and compared — so it tends to dominate both latency
+and peak memory. The cost gets paid even when:
+
+* the file is already ordered by the sort key (very common for
+ timestamp columns);
+* the query only needs the top *N* rows (`ORDER BY ts LIMIT 100`), in
+ which case full sort + truncate is wildly wasteful;
+* the next operator (`SortPreservingMergeExec`, `SortMergeJoinExec`,
+ a window function) was going to consume ordered input anyway.
+
+The data DataFusion needs to avoid this work is **already in the file
+metadata**. Parquet writers can record per-column statistics (`min`,
+`max`) at the row-group level. Files written by Spark, DuckDB,
+arrow-rs, and others routinely include them. And explicit `WITH ORDER`
+clauses in DataFusion's SQL `CREATE EXTERNAL TABLE` give the optimizer
+a direct ordering hint. The job of sort pushdown is to **use that
+information**.
+
+## How DataFusion Tracks Ordering
+
+
+
+Each `FileScanConfig` carries an `output_ordering` — the ordering
+that the optimizer is willing to claim for the scan's output. There
+are two flavours:
+
+* **`Exact`** — the optimizer is *certain* the output is in this order.
+ Sort-handling rules treat an `Exact` ordering as a proof and **remove
+ the surrounding `SortExec`**. ([`EnforceSorting`] does this when the
+ scan declares `Exact` from the start; the `PushdownSort` rule covered
+ in this post performs the same removal later in the pipeline, once it
+ has upgraded the scan to `Exact`.)
+* **`Inexact`** — the optimizer *believes* the output is probably
+ ordered, but cannot prove it. Downstream operators like
+ `SortPreservingMergeExec` can still benefit from this hint, but the
+ explicit `SortExec` stays for safety.
+
+[`EnforceSorting`]: https://docs.rs/datafusion-physical-optimizer/latest/datafusion_physical_optimizer/enforce_sorting/struct.EnforceSorting.html
+
+A helper called `validated_output_ordering()` is the gatekeeper. It
+walks the list of files inside a partition, checks whether the
+declared per-file ordering is consistent with the file order on disk,
+and either confirms the ordering or **strips it entirely** if it
+sees something ambiguous (e.g. file `b` comes before file `a` in the
+file list but file `a`'s range comes first).
+
+### `Exact` and `Inexact` at runtime
+
+`Exact` and `Inexact` lead to different runtime behaviour, and
+distinguishing them up front makes the rest of this post easier to
+follow:
+
+* With **`Exact`**, the `SortExec` is removed and the LIMIT becomes
+ a **static fetch** on the source. The reader stops the moment the
+ requested number of rows has been emitted — early termination
+ at batch granularity, no dynamic state needed.
+* With **`Inexact`**, the `SortExec` stays in place. The LIMIT
+ materialises inside the sort as a `TopK` heap of size K. `TopK`
+ exposes a [**dynamic filter**][dyn-filters-blog] — a runtime
+ expression of the form *"only rows that could still beat the
+ current K-th-best value are worth considering"* — and pushes it
+ back to the parquet scanner. As more data is processed and the
+ heap tightens, the filter's threshold tightens with it, and entire
+ row groups can be skipped by checking the live threshold against
+ the row group's min/max statistics. (See the earlier
+ [dynamic filters][dyn-filters-blog] and [limit pruning][limit-pruning-blog]
+ posts for the full background on this mechanism.)
+
+Both paths use the same underlying min/max statistics, but for
+different purposes: `Exact` uses them at plan time to prove
+non-overlap and justify removing the sort; `Inexact` uses them at
+runtime to skip row groups that can no longer improve the heap.
+
+[dyn-filters-blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
+[limit-pruning-blog]: https://datafusion.apache.org/blog/2026/03/20/limit-pruning/
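+
+To make the `Inexact` path concrete, here is a minimal Python sketch
+(not DataFusion's Rust code) of a `TopK` heap whose current K-th-best
+value is checked against row-group min statistics. It assumes an
+ascending `ORDER BY ... LIMIT k`; the function name and the dict keys
+`min` and `values` are hypothetical:
+
+```python
+import heapq
+
+
+def topk_with_dynamic_filter(row_groups, k):
+    """Return the k smallest values and the number of row groups skipped.
+
+    Each row group is a dict with hypothetical keys "min" and "values".
+    """
+    heap = []  # max-heap of the best k values so far (stored negated)
+    skipped = 0
+    for rg in row_groups:
+        # Dynamic filter: once the heap is full, the k-th best value is the
+        # threshold. A row group whose min cannot beat it is never decoded.
+        if len(heap) == k and rg["min"] >= -heap[0]:
+            skipped += 1
+            continue
+        for v in rg["values"]:
+            if len(heap) < k:
+                heapq.heappush(heap, -v)
+            elif v < -heap[0]:
+                heapq.heapreplace(heap, -v)
+    return sorted(-v for v in heap), skipped
+```
+
+The threshold only tightens as better values arrive, which is why
+reading the most promising row groups first lets later ones be skipped.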
+
+The diagram above shows the result we want: the plan after sort
+pushdown loses the `SortExec` node. Everything downstream — the
+`SortPreservingMergeExec`, the `RepartitionExec`, the
+`DataSourceExec` — was already in the plan. We just need the
+optimizer to convince itself that the bottom of the plan is
+producing the order requested.
+
+## Phase 1: The Pushdown API and Reverse Scans
+
+Phase 1 ([#19064]) introduced the **`PushdownSort`** physical
+optimizer rule and a uniform API for asking each `ExecutionPlan` two
+questions:
+
+[#19064]: https://github.com/apache/datafusion/pull/19064
+
+1. "Can you produce output in *this* ordering?"
+2. "If yes, please rearrange yourself so that it actually does."
+
+The protocol uses three results — `Exact`, `Inexact`, `Unsupported` —
+that downstream operators can interpret uniformly. The Parquet
+`FileSource` answers by comparing the requested ordering against the
+per-file declared ordering: if that ordering already satisfies the request,
+it returns `Exact`; if the *reverse* of the declared ordering does,
+it returns `Inexact` and flips on `reverse_row_groups=true` so the
+scan reads row groups from last to first (the row-group-level reverse
+covered later in this post); otherwise it returns `Unsupported`.
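+
+The decision described above can be sketched in a few lines of Python.
+This is an illustration only, not the actual `FileSource`
+implementation: the names `SortPushdown`, `reversed_ordering`, and
+`answer_sort_request` are hypothetical, orderings are modelled as
+`(column, ascending)` pairs, and null ordering is ignored for brevity:
+
+```python
+from enum import Enum
+
+
+class SortPushdown(Enum):
+    EXACT = "Exact"
+    INEXACT = "Inexact"
+    UNSUPPORTED = "Unsupported"
+
+
+def reversed_ordering(ordering):
+    # Flip the direction of every (column, ascending) pair.
+    return [(col, not asc) for col, asc in ordering]
+
+
+def answer_sort_request(declared, requested):
+    """Return (result, reverse_row_groups) for a requested ordering."""
+    n = len(requested)
+    if declared[:n] == requested:
+        # The declared ordering already satisfies the request.
+        return SortPushdown.EXACT, False
+    if reversed_ordering(declared)[:n] == requested:
+        # Only the reverse matches: read row groups last to first.
+        return SortPushdown.INEXACT, True
+    return SortPushdown.UNSUPPORTED, False
+```
+
+For a file declared as `ts ASC`, a request for `ts ASC` yields
+`Exact`, a request for `ts DESC` yields `Inexact` with the reverse
+flag set, and a request on any other column yields `Unsupported`.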
+
+Phase 1's scope was deliberately narrow. It set up the API and
+delivered the reverse-scan case end-to-end, but it did **not** add
+any statistics-based file rearrangement — that came later in Phase 2.
+A finer-grained extension that reorders row groups *within* each file
+by min/max statistics — so the row group with the best sort-key value
+is read first and TopK can tighten its threshold faster — is also
+in progress in [#21956].
+
+Phase 1 also produced a useful side improvement:
+
+* **Reverse-output redesign** ([#19446], [#19557]) extended the same
+ rule to `DESC` queries — picked up again in the reverse-scan
+ section below.
+
+[#19446]: https://github.com/apache/datafusion/pull/19446
+[#19557]: https://github.com/apache/datafusion/pull/19557
+
+## Phase 2: Use Statistics to Prove Non-Overlap
+
+
+
+Phase 1 left a sharp edge that motivated Phase 2 ([#21182]). Consider
+this realistic scenario:
+
+[#21182]: https://github.com/apache/datafusion/pull/21182
+
+* Three files: `a.parquet`, `b.parquet`, `c.parquet`.
+* Each declares `WITH ORDER (ts ASC)`.
+* Internally each file *is* sorted by `ts`.
+* But they were written by different ingestion jobs and end up listed
+ in the **wrong order** on disk (e.g. alphabetical by name, not by
+ time).
+
+`validated_output_ordering()` looks at this, sees that the
+file-internal ordering disagrees with the file-list order, and
+**strips the ordering entirely**. From the optimizer's point of view
+the scan now has no declared ordering, so `EnforceSorting` (which runs
+earlier in the pipeline) inserts a `SortExec`. The data is sorted on
+disk; the optimizer just can't tell.
+
+Phase 2 fixes this in `PushdownSort`, which runs late — after
+`EnforceDistribution` and `EnforceSorting` have already shaped the
+plan. When `PushdownSort` finds a `SortExec` above a file scan whose
+ordering was stripped (a `FileSource` `Unsupported` result), it does
+three things inside `FileScanConfig::try_pushdown_sort`:
+
+1. **Sort the file list by per-file statistics on the sort
+ column(s)** within each file group (the diagram above). The
+ pre-existing [`MinMaxStatistics`] helper (introduced in [#9593])
+ reads each file's `column_statistics[c].min_value` /
+ `.max_value` for each sort column `c`, then sorts the file list by
+ each file's tuple of per-column minimums. Phase 2 wires this helper
+ into the optimizer's `Unsupported` branch:
+ `sort_files_within_groups_by_statistics` does the per-group
+ orchestration and decides whether any group is non-overlapping
+ after the sort.
+2. **Check adjacency within each group**: walk each sorted file group
+ independently and ask whether `file[i].max ≤ file[i+1].min` for
+ every adjacent pair (touching at the boundary is fine — value `v`
+ showing up as the last row of one file and the first row of the
+ next still produces a sorted stream). The check is **per file
+ group**, not across groups; cross-group ordering is the job of
+ `SortPreservingMergeExec` at runtime (more on this below).
+3. **Upgrade `Unsupported` to `Exact`** when adjacency holds, the
+ table has a declared `output_ordering` (from `WITH ORDER` or
+ parquet `sorting_columns`), and the sort columns are null-free —
+ the last condition preserves `NULLS LAST`/`NULLS FIRST` semantics
+ across file boundaries. `PushdownSort` then removes the `SortExec`
+ itself and the plan becomes streamable.
+
+[`MinMaxStatistics`]: https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/statistics.rs
+[#9593]: https://github.com/apache/datafusion/pull/9593
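+
+The three steps above can be sketched as follows. This is a simplified
+Python model, not the code in `FileScanConfig::try_pushdown_sort`: it
+assumes a single sort column, and the function name and the dict keys
+`name`, `min`, `max`, and `null_count` are hypothetical:
+
+```python
+def try_upgrade_to_exact(file_groups, has_declared_ordering):
+    """Sort each group by min statistics and report whether Exact holds.
+
+    Returns (sorted_groups, exact).
+    """
+    exact = has_declared_ordering
+    sorted_groups = []
+    for group in file_groups:
+        # Step 1: order the files of this group by their min statistic.
+        files = sorted(group, key=lambda f: f["min"])
+        sorted_groups.append(files)
+        # Step 2: adjacency check, file[i].max <= file[i+1].min.
+        for prev, nxt in zip(files, files[1:]):
+            if prev["max"] > nxt["min"]:
+                exact = False
+        # Step 3 precondition: the sort column must be null-free.
+        if any(f["null_count"] != 0 for f in files):
+            exact = False
+    return sorted_groups, exact
+```
+
+Touching ranges (one file's max equal to the next file's min) pass the
+check, while any genuine overlap or any null in the sort column keeps
+the result below `Exact`.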
+
+One caveat that comes straight from `MinMaxStatistics`: the stats
+sort only fires when every `ORDER BY` expression is a plain column.
+`ORDER BY date_trunc('hour', ts)` silently skips the upgrade — there
+is no per-file min/max for the function output to compare against.
+Extending sort pushdown across monotonic function wrappers is one of
+the open follow-ups.
+
+
+
+The diagram above contrasts the two cases. On the left, ranges are
+non-overlapping after sort, so we can guarantee that emitting the
+files in min-order produces a globally sorted stream. On the right,
+the ranges overlap, so even after sorting the files by `min(ts)` we
+cannot guarantee global ordering — Phase 2 correctly bails out and
+keeps `SortExec` in place.
+
+The implementation handles a few edge cases worth calling out:
+
+* **Buffering the eliminated `SortExec`.** When a `SortExec` with
+ `preserve_partitioning=true` sat under a `SortPreservingMergeExec`,
+ it wasn't just sorting; it was also acting as an *implicit
+ in-memory buffer* for the SPM above it. The
+ SPM picks rows from each partition stream one at a time; without
+ the upstream `SortExec` holding batches in memory, the SPM would
+ read directly from I/O-bound sources and stall on every pick. Phase
+ 2 compensates by inserting a [`BufferExec`] in the `SortExec`'s
+ place — bounded streaming buffer, same throughput shape, no
+ blocking sort. Capacity is configurable via
+ [`sort_pushdown_buffer_capacity`] ([#21426]).
+* **`fetch` preservation** through `EnforceDistribution`. The
+ distribution rule sometimes strips a `SortExec`'s `fetch` field and
+ re-adds the node later. Phase 2 plumbs `fetch` through so a
+ surviving `LIMIT` is not lost.
+* **Per-group, not global, non-overlap.** Phase 2's adjacency check is
+ scoped to each file group. Two file groups can have *overlapping*
+ ranges and the upgrade still fires, as long as each group is
+ internally non-overlapping. That works because each group already
+ produces an independently ordered stream at runtime, and
+ `SortPreservingMergeExec` then picks rows across streams in value
+ order to produce the final globally sorted output. Phase 2 only has
+ to prove the per-stream property.
+* **Single-partition vs multi-partition execution**. With the default
+ multi-partition setup, `EnforceDistribution` byte-range-splits files
+ into single-file groups, after which `validated_output_ordering()`
+ works correctly on its own. Phase 2 only triggers when files have
+ not been split — typically `--partitions 1` runs, or files small
+ enough that the splitter leaves them alone. In the typical `--partitions
+ 1` case the "per-group" distinction collapses (one group equals the
+ whole table), which is why the example earlier in this section is
+ drawn that way.
+
+[`BufferExec`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/buffer.rs
+[`sort_pushdown_buffer_capacity`]: https://github.com/apache/datafusion/pull/21426
+[#21426]: https://github.com/apache/datafusion/pull/21426
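+
+To illustrate why per-group non-overlap is enough, the Python sketch
+below models each file group as a concatenation of its (sorted,
+non-overlapping) files and uses the standard library's `heapq.merge`
+as a stand-in for `SortPreservingMergeExec`. The function names are
+hypothetical and the data is made up for the example:
+
+```python
+import heapq
+from itertools import chain
+
+
+def group_stream(files):
+    # A group emits its files back to back; this is sorted as long as
+    # the files are internally sorted and non-overlapping in list order.
+    return chain.from_iterable(files)
+
+
+def sort_preserving_merge(groups):
+    # k-way merge across the per-group streams, picking the smallest
+    # head value each time.
+    return list(heapq.merge(*(group_stream(g) for g in groups)))
+
+
+# Two groups whose ranges overlap each other (1..7 and 2..8), while
+# each group is internally non-overlapping.
+groups = [
+    [[1, 3], [5, 7]],
+    [[2, 4], [6, 8]],
+]
+merged = sort_preserving_merge(groups)
+```
+
+The merge only requires each input stream to be sorted, so overlap
+between groups does not affect the correctness of the final order.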
+
+## Benchmarks
+
+
+
+The [`sort_pushdown`] benchmark suite reproduces the
+"wrong-order file list" scenario by generating Parquet files whose
+names are intentionally reversed against their sort-key ranges. Numbers
+below are `--partitions 1`, release build, on the merged Phase 2
+branch versus `main`:
+
+[`sort_pushdown`]: https://github.com/apache/datafusion/tree/main/benchmarks/queries/sort_pushdown
+
+| Query | Before | After | Speedup |
+| ------------------------------------------- | -------:| -------:| -------:|
+| Q1 — `ORDER BY key` (full scan) | 259 ms | 122 ms | **2.1×** |
+| Q2 — `ORDER BY key LIMIT 100` | 80 ms | 3 ms | **27×** |
+| Q3 — `SELECT * ORDER BY key` | 700 ms | 313 ms | **2.2×** |
+| Q4 — `SELECT * ORDER BY key LIMIT 100` | 342 ms | 7 ms | **49×** |
+
+The shape of the speedup is what you would expect once `SortExec` is
+removed:
+
+* **Full-scan queries (Q1, Q3)** still have to push every row through
+ the pipeline, so the gain is "just" the cost of the sort itself,
+ roughly half the original time in these runs (259 ms → 122 ms and
+ 700 ms → 313 ms).
+* **`LIMIT` queries (Q2, Q4)** benefit much more because removing
+ `SortExec` converts the LIMIT into a static `fetch` on the data
+ source — the reader stops the moment K rows have been emitted,
+ instead of reading the full file, sorting, and truncating.
+ This is the "early termination at batch granularity" case from
+ the runtime-difference section above. A 342 ms full-file scan
+ collapses into a 7 ms K-row read.
+
+It is worth saying explicitly what this change does **not** affect.
+The default multi-partition execution path is unchanged: those plans
+already produced correct orderings via byte-range splitting, so
+Phase 2 simply does not trigger. There is no regression and no behavior
+change for the typical multi-threaded query.
+
+## Reverse Scans for `ORDER BY ... DESC`
+
+
+
+`ORDER BY ts DESC` is the same problem in reverse. If a file is sorted
+ascending and the query wants descending, we should be able to skip
+the sort — we just need to read the data in the opposite order.
+
+The first iteration of this was proposed in [#18817] and shipped as
+part of Phase 1 ([#19064]). It operates at the **row group** level:
+it reverses the *iteration order of row groups*
+so the last RG is opened first, but rows within each RG are still
+decoded forward. The resulting stream is "RGs descending × rows
+ascending" — close to the requested order, but not strictly DESC. The
+optimizer therefore reports this as `Inexact` and leaves the
+`SortExec` in place; the win is that `TopK`'s dynamic filter tightens
+much faster, because the very first row groups read already contain
+values near the final answer. A tight threshold means subsequent row
+groups can be skipped via min/max statistics. This ships today and
+is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted files.
+
+[#18817]: https://github.com/apache/datafusion/pull/18817
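+
+A small Python sketch shows why this stream is only `Inexact`. Row
+groups are modelled as plain lists of an ascending-sorted file, and
+the function names are hypothetical:
+
+```python
+def reverse_row_group_scan(row_groups):
+    # Visit row groups last to first; rows inside each one are still
+    # emitted in their forward (ascending) order.
+    for rg in reversed(row_groups):
+        yield from rg
+
+
+def is_descending(values):
+    return all(a >= b for a, b in zip(values, values[1:]))
+
+
+row_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
+stream = list(reverse_row_group_scan(row_groups))
+# stream is [7, 8, 9, 4, 5, 6, 1, 2, 3]: descending across row groups,
+# ascending within each, so a sort is still required for strict DESC.
+```
+
+The first row group read already holds the largest values, which is
+what lets a `TopK` above the scan settle on a tight threshold early.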
+
+To turn this into `Exact` reverse — so the `SortExec` can be removed
+outright — each emitted batch itself has to be in DESC order. The
+straightforward row-group-level approach (decode an entire RG forward,
+materialize all rows, reverse the buffer, then emit) is correct and
+was actually proposed first, in an earlier iteration of this work
+([#18817], later closed and split into smaller pieces). Review
+feedback there, primarily from [@2010YOUY01], flagged the memory
+profile as too aggressive. Caching an entire row group's worth of
+decoded rows before any batch can be emitted costs roughly:
+
+* **Peak buffer of one whole row group** (~128 MB by default), versus
+ the few-MB-per-batch streaming profile readers normally have.
+* **First-batch latency = full last-row-group decode**. For
+ `ORDER BY ts DESC LIMIT 10` that means decoding ~1 million rows to
+ return 10 — defeating the point of the `LIMIT`.
+
+The agreed direction coming out of that discussion was to ship the
+narrower `Inexact` row-group-reverse first (which became Phase 1 in
+[#19064]), and to build `Exact` reverse on a finer-grained primitive
+once `arrow-rs` exposed one.
+
+That primitive is the **page-level** reverse traversal. Parquet's
+`OffsetIndex` already gives us byte-precise locations for every data
+page in a column chunk, so we can `seek` directly to the last page,
+decode it forward, reverse the resulting batch, and emit. Peak buffer
+drops to one page (~1 MB) and first-batch latency drops to the cost
+of one page decode — the row-group-level memory cliff disappears.
+
+We are landing this primitive upstream in arrow-rs as
+[#9937], with the discussion in [#9934]. Early numbers on a 100k-row,
+98-page column chunk show **~50× faster time-to-first-N** for `n ≤ 1
+page` and **~9× faster** for `n` spanning 10 pages, compared with the
+row-group-level Exact reverse described above. The DataFusion-side
+integration that turns this primitive into an `Exact` result is a
+follow-up to #9937 and is gated on its merge.
+
+[@2010YOUY01]: https://github.com/2010YOUY01
+
+[#9937]: https://github.com/apache/arrow-rs/pull/9937
+[#9934]: https://github.com/apache/arrow-rs/issues/9934
+
+One natural question: why not reverse the rows *within* a page
+directly? Because we can't. Parquet's page encodings (RLE, dictionary,
+delta, bit-packing) are all forward streams — you cannot decode the
+last value without decoding every value that came before it. The
+design therefore is: **reverse the page traversal, forward-decode
+each page, reverse the resulting RecordBatch**. This is the algorithm
+shape that the DataFusion-side `RecordBatchReader` integration will
+use once arrow-rs ships the primitive.
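+
+As a rough illustration of that shape, the Python sketch below treats
+each data page as a list of already forward-decoded values. It is not
+the arrow-rs API; `reverse_page_scan` is a hypothetical name and the
+copy stands in for the real page decode:
+
+```python
+def reverse_page_scan(pages):
+    """Yield one descending batch per page, last page first."""
+    for page in reversed(pages):   # reverse the page traversal
+        batch = list(page)         # stand-in for forward-decoding the page
+        batch.reverse()            # reverse the decoded batch
+        yield batch
+
+
+pages = [[1, 2], [3, 4], [5, 6]]   # an ascending column chunk, 3 pages
+batches = list(reverse_page_scan(pages))
+flat = [v for batch in batches for v in batch]
+```
+
+Only one page is held in memory at a time, and the first batch is
+available after decoding a single page.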
+
+The killer use case is **filtered reverse TopK**:
+
+```sql
+SELECT * FROM events
+WHERE user_id = 42
+ORDER BY ts DESC
+LIMIT 10
+```
+
+Here `RowSelection::with_limit` cannot help — you don't know in
+advance which rows match `user_id = 42`, so you can't pre-compute a
+selection of the "last 10 matching rows". The only correct strategy
+is to stream pages backward, evaluate the filter on each, and stop
+when 10 matches are collected. Row-group reverse stops at a
+~128 MB granularity. Page reverse stops at ~1 MB granularity. For a
+selective filter, the saving compounds.
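+
+The strategy can be sketched in Python as follows. Pages are modelled
+as lists of `(ts, user_id)` tuples from an ascending-sorted file; the
+function name and the data are hypothetical, and the sketch counts
+decoded pages to show where the scan stops:
+
+```python
+def filtered_reverse_topk(pages, predicate, n):
+    """Return (matches, pages_decoded), scanning pages last to first."""
+    matches = []
+    pages_decoded = 0
+    for page in reversed(pages):
+        pages_decoded += 1
+        # Walk the decoded page from its last row to its first.
+        for row in reversed(page):
+            if predicate(row):
+                matches.append(row)
+                if len(matches) == n:
+                    # Enough matches: earlier pages are never touched.
+                    return matches, pages_decoded
+    return matches, pages_decoded
+
+
+pages = [
+    [(1, 42), (2, 7)],
+    [(3, 42), (4, 42)],
+    [(5, 7), (6, 42)],
+]
+matches, pages_decoded = filtered_reverse_topk(
+    pages, lambda row: row[1] == 42, 2
+)
+```
+
+Here the two newest matching rows are found after decoding two of the
+three pages; the stopping granularity is one page rather than one row
+group.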
+
+## What's Next
+
+Sort pushdown is a long-running line of work and there is more to do.
+Beyond the `Exact` path described above, there is a complementary
+**dynamic / TopK-driven path** that helps when `Exact` cannot apply —
+e.g. when file ranges genuinely overlap, or when the sort is on a
+function output rather than a plain column. The two directions are
+not alternatives; they compose:
+
+* **`Exact` reverse for `ORDER BY ... DESC`.** Today's row-group
+ reverse returns `Inexact` and the `SortExec` stays on top; the
+ arrow-rs page-level reverse primitive ([#9937]) is what unlocks
+ `Exact` reverse on `DESC` queries (and therefore full `SortExec`
+ elimination on `DESC`). Memory + first-batch latency rule out doing
+ the same thing at the row-group level. Gated on #9937.
+* **Dynamic / TopK-driven path.** When `Exact` cannot fire, `TopK`'s
+ [dynamic filter][dyn-filters-blog] still benefits enormously from
+ reading the *best* data first. This thread also builds on the
+ [limit pruning][limit-pruning-blog] work that turned `LIMIT` into
+ an I/O optimization across the pruning pipeline. The
+ recently-merged morsel-style work scheduling in `FileStream`
+ ([#21351]) gives sibling partitions a *shared work queue* with
+ file-level work-stealing — no CPU sits idle when one partition
+ runs out of files. The proposed [#21733] sorts files in
+ that shared queue by per-file statistics *before* any partition
+ picks, so the first file read is globally optimal and tightens the
+ dynamic filter immediately. Combined with **TopK threshold init from
+ parquet statistics** ([#21712]) and **row-group reorder within each
+ file** ([#21956]), the threshold can be set before reading a single
+ byte. The combined statistics-driven `TopK` pipeline is in flight
+ as [#21580].
+
+ The mechanism here is **RG-level pruning, not mid-stream early
+ return**. With the threshold known up front, the parquet
+ `PruningPredicate` rejects entire row groups against their min/max
+ statistics before any I/O — those row groups are never decoded.
+ The row group(s) the reader *does* open still have their sort
+ column decoded in full to feed the dynamic filter. On the #21580
+ microbenchmark (single file, 61 sorted row groups, `--partitions 1`),
+ **60 of the 61 row groups are skipped** and only one is decoded:
+
+ | Query | Baseline | With pipeline | Speedup |
+ | ------------------------------ | -------: | ------------: | ------: |
+ | `ORDER BY col DESC LIMIT 100` | 28.5 ms | 1.64 ms | **17×** |
+ | `ORDER BY col DESC LIMIT 1000` | 22.2 ms | 0.37 ms | **60×** |
+ | `SELECT * ORDER BY ... LIMIT 100` | 22.5 ms | 0.66 ms | **34×** |
+ | `SELECT * ORDER BY ... LIMIT 1000` | 22.4 ms | 0.61 ms | **37×** |
+
+ The stack reports `Inexact` — the `SortExec` stays on top to
+ enforce correctness across overlapping ranges — so this path
+ cannot do *true* mid-stream early return. Once the parquet reader
+ opens a row group, the sort column has to be decoded all the way
+ through; once a `FileStream` picks up a file from the shared work
+ queue, it has to finish that file. Today's dynamic work scheduling
+ ([#21351]) is **file-granular**: idle partitions stop pulling
+ new files from the queue once a global limit is satisfied, but
+ the partition that's currently inside a file decodes that file's
+ remaining row groups regardless. Mid-file RG-level early return
+ on `TopK` convergence is **not implemented yet** — the work
+ queue holds `PartitionedFile`, not row-group descriptors.
+
+ Closing the tap the moment `TopK` has K confirmed winners therefore
+ needs either:
+
+ * the **`Exact` path**, where the `SortExec` is gone entirely and
+ the data source's own `fetch` becomes a static limit that the
+ reader can honour at batch granularity; or
+ * **finer-grained dynamic scheduling** — having the shared queue
+ hold row-group descriptors instead of whole files, so a partition
+ can release its current file's remaining row groups back to the
+ pool once a global signal says enough TopK winners have been
+ found. This is a natural extension of [#21351] and [#21733] but
+ is not yet on a PR.
+
+ The three mechanisms compose. Stats pruning removes the row groups
+ that *can't* matter (skipped without I/O). The dynamic filter
+ narrows what's decoded inside the row groups the reader does
+ open. `Exact` or finer-grained scheduling is what eventually
+ closes the tap once `TopK` is satisfied.
+* **Phase 3 — filter + sort early termination.** `WHERE filter ORDER
+ BY ts DESC LIMIT N` is the dominant observability query shape and
+ the one where the arrow-rs page-reverse primitive matters most:
+ `RowSelection::with_limit` cannot pre-compute the last `N` matching
+ rows when the filter is selective, so the only correct strategy is
+ to stream pages backward, evaluate the filter, and stop when `N`
+ matches are collected. The DataFusion-side integration is the
+ follow-up to #9937.
+* **Unifying `EnforceDistribution` and `EnforceSorting`** into a
+ single `EnsureRequirements` rule ([#21976]). The two existing rules
+ are coupled through `SortExec.preserve_partitioning`, which makes
+ their composition non-idempotent and has caused a class of
+ production bugs. Other engines (Spark's `EnsureRequirements`,
+ Trino's `AddExchanges`) handle both in a single rule. Merging them
+ also gives future sort-related optimizations a single coherent place
+ to live. In progress.
+* **OFFSET pushdown to parquet** ([#21828]) so `ORDER BY ts LIMIT K
+ OFFSET N` queries can skip the first `N` rows at the row-group level
+ instead of decoding and discarding them. In progress.
+
+[#21976]: https://github.com/apache/datafusion/pull/21976
+[#21956]: https://github.com/apache/datafusion/pull/21956
+[#21712]: https://github.com/apache/datafusion/pull/21712
+[#21580]: https://github.com/apache/datafusion/pull/21580
+[#21828]: https://github.com/apache/datafusion/pull/21828
+[#21351]: https://github.com/apache/datafusion/pull/21351
+[#21733]: https://github.com/apache/datafusion/issues/21733
+
+Good starting points for new contributors:
+
+* [#17348] — the umbrella issue for sort pushdown.
+* [#21317] — sort pushdown: reorder row groups by statistics within
+ each file.
+* [#19394] — add more `ExecutionPlan` impls to support sort pushdown.
+
+[#17348]: https://github.com/apache/datafusion/issues/17348
+[#21317]: https://github.com/apache/datafusion/issues/21317
+[#19394]: https://github.com/apache/datafusion/issues/19394
+
+## Acknowledgements
+
+Thank you to [@alamb], [@adriangb], [@xudong963], [@2010YOUY01], and
+[@Dandandan] for reviewing the design and the patches across many
+iterations. The DataFusion community's willingness to engage deeply
+with optimizer changes — including the ones that touch foundational
+invariants — is what made this work possible.
+
+[@alamb]: https://github.com/alamb
+[@adriangb]: https://github.com/adriangb
+[@xudong963]: https://github.com/xudong963
+[@Dandandan]: https://github.com/Dandandan
+
+## References
+
+Prior posts this work builds on:
+
+* [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries][dyn-filters-blog] — the dynamic filter primitive `TopK` uses.
+* [Turning LIMIT into an I/O Optimization: Inside DataFusion's Multi-Layer Pruning Stack][limit-pruning-blog] — the pruning pipeline this work plugs into.
+
+Issues and PRs:
+
+* Umbrella issue: [apache/datafusion#17348][#17348]
+* `MinMaxStatistics` foundation: [apache/datafusion#9593][#9593]
+* Phase 1: [apache/datafusion#19064][#19064]
+* Phase 2: [apache/datafusion#21182][#21182]
+* `BufferExec` capacity for sort elimination: [apache/datafusion#21426][#21426]
+* Dynamic / TopK-driven path: [apache/datafusion#21351][#21351] (morsel-style work scheduling),
+ [apache/datafusion#21733][#21733] (global file reorder in shared queue)
+* Benchmark suite: [`sort_pushdown`]
+* Row-group reverse scan: [apache/datafusion#18817][#18817]
+* Page-level reverse (arrow-rs): [apache/arrow-rs#9934][#9934],
+ [apache/arrow-rs#9937][#9937]
+* `EnsureRequirements`: [apache/datafusion#21976][#21976]
diff --git a/content/images/sort-pushdown/benchmark.svg b/content/images/sort-pushdown/benchmark.svg
new file mode 100644
index 00000000..30afb7b2
--- /dev/null
+++ b/content/images/sort-pushdown/benchmark.svg
@@ -0,0 +1,75 @@
+
diff --git a/content/images/sort-pushdown/phase1-file-reorder.svg b/content/images/sort-pushdown/phase1-file-reorder.svg
new file mode 100644
index 00000000..9ae798ba
--- /dev/null
+++ b/content/images/sort-pushdown/phase1-file-reorder.svg
@@ -0,0 +1,88 @@
+
diff --git a/content/images/sort-pushdown/phase2-stats-overlap.svg b/content/images/sort-pushdown/phase2-stats-overlap.svg
new file mode 100644
index 00000000..027860ef
--- /dev/null
+++ b/content/images/sort-pushdown/phase2-stats-overlap.svg
@@ -0,0 +1,79 @@
+
diff --git a/content/images/sort-pushdown/plan-diff.svg b/content/images/sort-pushdown/plan-diff.svg
new file mode 100644
index 00000000..a4d08673
--- /dev/null
+++ b/content/images/sort-pushdown/plan-diff.svg
@@ -0,0 +1,70 @@
+
diff --git a/content/images/sort-pushdown/reverse-scan.svg b/content/images/sort-pushdown/reverse-scan.svg
new file mode 100644
index 00000000..443a0a1c
--- /dev/null
+++ b/content/images/sort-pushdown/reverse-scan.svg
@@ -0,0 +1,100 @@
+
From 47e45dd28d7b2d2f07b2ce1e344ce25f82391de9 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Thu, 14 May 2026 15:34:17 +0800
Subject: [PATCH 02/14] Push draft date to 2026-05-25
---
...{2026-05-11-sort-pushdown.md => 2026-05-25-sort-pushdown.md} | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
rename content/blog/{2026-05-11-sort-pushdown.md => 2026-05-25-sort-pushdown.md} (99%)
diff --git a/content/blog/2026-05-11-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
similarity index 99%
rename from content/blog/2026-05-11-sort-pushdown.md
rename to content/blog/2026-05-25-sort-pushdown.md
index d8726038..8dd9d8b8 100644
--- a/content/blog/2026-05-11-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -1,7 +1,7 @@
---
layout: post
title: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O
-date: 2026-05-11
+date: 2026-05-25
author: Qi Zhu
categories: [performance]
---
From dab94fd74037eb6c1f57861574caa0497fcea9ae Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Fri, 15 May 2026 17:40:51 +0800
Subject: [PATCH 03/14] Add Phase 3 section covering #21956 runtime reorder
pipeline
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The post previously only mentioned #21956 in passing. The PR is
landing the full mechanism — `try_pushdown_sort` decision tree,
two flags on `ParquetSource`, three composable runtime steps
(file reorder + RG reorder + reverse), and a `sort_prefix`-
preserving short-circuit — so cover it as a dedicated phase
between Phase 2 and the existing reverse-scan section.
- TL;DR: add a Phase 3 bullet alongside Phase 1 and Phase 2.
- Phase 1: replace the in-flight `#21956` aside with a cross-link
to the new section.
- Phase 2: keep the caveat about function-wrapped sorts but note
that #21956's `Inexact` path now covers them via monotonicity
inference.
- New `## Phase 3` section with two SVG diagrams: a decision tree
for the three protocol outcomes, and a three-step pipeline for
the `Inexact` runtime. Covers the two-flag design, the nested
file/RG layers, when EXPLAIN surfaces each flag, and four
scenarios where Phase 3 does not fire (aggregations, multi-
column secondary keys, function-wrapped sorts without a
declared ordering, source declares a forward prefix of the
request).
- "What's Next": rename the old "Phase 3 — filter + sort" bullet
to "Filtered reverse TopK end-to-end" so the label doesn't
clash with the new section, and add a follow-up bullet
referencing #22198 for multi-column / function-wrapped reorder.
---
content/blog/2026-05-25-sort-pushdown.md | 178 ++++++++++++++++--
.../images/sort-pushdown/pr21956-decision.svg | 66 +++++++
.../pr21956-runtime-pipeline.svg | 69 +++++++
3 files changed, 298 insertions(+), 15 deletions(-)
create mode 100644 content/images/sort-pushdown/pr21956-decision.svg
create mode 100644 content/images/sort-pushdown/pr21956-runtime-pipeline.svg
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 8dd9d8b8..32366246 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -56,8 +56,10 @@ adding upstream in [arrow-rs] will push the gains further still.
## TL;DR
* DataFusion can now **skip `SortExec` entirely** when input files are
- already in the requested order.
-* Two phases:
+ already in the requested order, and **read the most-promising data
+ first** when they aren't — so `TopK` converges fast and the rest
+ gets pruned by statistics.
+* Three phases:
* **Phase 1** — establish the `PushdownSort` rule and the
`Exact` / `Inexact` / `Unsupported` protocol; ship the reverse
row-group case for `ORDER BY ... DESC` (reports `Inexact`).
@@ -65,9 +67,16 @@ adding upstream in [arrow-rs] will push the gains further still.
`min/max` statistics and *prove* non-overlap, upgrading
`Unsupported` to `Exact` so `PushdownSort` removes the `SortExec`
that `EnforceSorting` inserted earlier.
-* Real-world benchmarks on the `sort_pushdown` suite:
- `ORDER BY ... LIMIT` queries get **27× and 49× faster**; full
- `ORDER BY` scans get **~2×** faster.
+ * **Phase 3** ([#21956]) — generalise `Inexact`: whenever the
+ leading sort key is a plain column in the file schema (or the
+ source's reversed declared ordering satisfies the request),
+ `try_pushdown_sort` stamps two flags on the source and the
+ opener runs a three-step runtime pipeline — file-level reorder
+ in the shared morsel queue, row-group reorder by min/max stats,
+ then optional iteration reverse for `DESC` requests.
+* Real-world benchmarks on the `sort_pushdown` suite (Phase 2's
+ `Exact` upgrade): `ORDER BY ... LIMIT` queries get **27× and 49×
+ faster**; full `ORDER BY` scans get **~2×** faster.
* Reverse scans (`ORDER BY ... DESC`) ride the same machinery: a
merged row-group-level reverse returns `Inexact` (Sort stays, but
`TopK` terminates early); the page-level reverse primitive needed
@@ -183,10 +192,9 @@ covered later in this post); otherwise it returns `Unsupported`.
Phase 1's scope was deliberately narrow. It set up the API and
delivered the reverse-scan case end-to-end, but it did **not** add
any statistics-based file rearrangement — that came later in Phase 2.
-A finer-grained extension that reorders row groups *within* each file
-by min/max statistics — so the row group with the best sort-key value
-is read first and TopK can tighten its threshold faster — is also
-in progress in [#21956].
+A finer-grained extension that broadens this `Inexact` path with a
+three-step runtime reorder pipeline lands in [#21956] — covered in
+[Phase 3](#phase-3-runtime-reorder-for-inexact-pushdown) below.
Phase 1 also produced a useful side improvement:
@@ -259,6 +267,12 @@ is no per-file min/max for the function output to compare against.
Extending sort pushdown across monotonic function wrappers is one of
the open follow-ups.
+*(Within #21956's `Inexact` path, `EquivalenceProperties`'s
+monotonicity inference does let function-wrapped sorts benefit from
+row-group iteration reverse when the source declares a compatible
+natural ordering — but stats-based reorder still needs a plain
+column.)*
+
The diagram above contrasts the two cases. On the left, ranges are
@@ -348,6 +362,130 @@ already produced correct orderings via byte-range splitting, so
Phase 2 simply does not trigger. There is no regression and no behavior
change for the typical multi-threaded query.
+## Phase 3: Runtime Reorder for Inexact Pushdown
+
+Phase 2 handles the `Exact` upgrade — strong correctness, sort
+elimination — but only when the table has a declared
+`output_ordering` *and* the files are provably non-overlapping after
+sorting by min. Two large classes of queries fall outside that
+window:
+
+* **Unsorted data** — no `WITH ORDER`, no parquet `sorting_columns`.
+ Phase 2 cannot fire because there is no ordering claim to upgrade.
+* **Overlapping ranges** — files written by different ingestion
+ jobs share time windows. Phase 2 keeps the `SortExec` because the
+ global ordering can't be proven, even though the files often do
+ contain large stretches of in-order data.
+
+For both, a full external `SortExec` is overkill. The parquet
+metadata is right there, and reading the *most-promising* data
+first lets `TopK`'s dynamic filter threshold tighten quickly so the
+rest gets pruned. Phase 3 ([#21956]) wires that up by generalising
+the `Inexact` path Phase 1 introduced.
+
+### `try_pushdown_sort` — one decision, three outcomes
+
+
+
+The `Exact` / `Inexact` / `Unsupported` protocol from Phase 1 stays.
+Phase 3 broadens the **conditions** that route a query into
+`Inexact`:
+
+| Condition | Outcome |
+| --- | --- |
+| `eq_properties.ordering_satisfy(request)` | `Exact` — Phase 1 / 2 sort elimination |
+| Leading sort key is a plain `Column` in the file schema, **or** the source's reversed declared ordering satisfies the request | `Inexact` — Phase 3 runtime pipeline |
+| Neither | `Unsupported` — `SortExec` stays, no source-side optimisation |
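
The table above can be read as a three-way branch. The following is a minimal sketch of that branch only, not the code in #21956: the `Request` struct and its three boolean fields are hypothetical stand-ins for checks the real rule performs on `EquivalenceProperties` and physical sort expressions.

```rust
/// The three answers a source can give to a sort pushdown request.
#[derive(Debug, PartialEq)]
enum Pushdown {
    Exact,
    Inexact,
    Unsupported,
}

/// Hypothetical, simplified view of a request. Each field stands in
/// for one condition from the table; none of these names exist in
/// DataFusion.
struct Request {
    /// The source's declared ordering already satisfies the request.
    ordering_satisfied: bool,
    /// The leading sort key is a plain column in the file schema.
    leading_key_is_file_column: bool,
    /// The source's declared ordering, reversed, satisfies the request.
    reversed_declared_ordering_satisfies: bool,
}

/// Evaluate the conditions in table order: the first match wins.
fn decide(req: &Request) -> Pushdown {
    if req.ordering_satisfied {
        Pushdown::Exact
    } else if req.leading_key_is_file_column
        || req.reversed_declared_ordering_satisfies
    {
        Pushdown::Inexact
    } else {
        Pushdown::Unsupported
    }
}
```

Under these assumptions, a request on an unsorted table whose leading key is a plain column maps to `Pushdown::Inexact`, and a request that matches none of the conditions maps to `Pushdown::Unsupported`.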
+
+The "reversed satisfies" branch is what handles function-wrapped
+sorts (`date_trunc('day', ts) DESC`, `ceil(value) DESC`,
+`CAST(x AS Date) DESC`) — `EquivalenceProperties`'s monotonicity
+reasoning recognises that `f(col) DESC` is satisfied by `col ASC`
+reversed, even though parquet has no stats keyed by `f(col)`
+itself.
+
+### Two flags on `ParquetSource`, three runtime steps
+
+
+
+When `try_pushdown_sort` returns `Inexact`, it stamps two fields on
+the `ParquetSource`:
+
+```rust
+struct ParquetSource {
+ sort_order_for_reorder: Option<LexOrdering>, // what to reorder by
+ reverse_row_groups: bool, // whether to flip iteration
+ // ...
+}
+```
+
+The opener reads them at scan time to drive three composable steps:
+
+1. **File-level reorder.** `FileSource::reorder_files` sits in the
+ shared morsel queue (the [#21351] work-stealing primitive) and
+ sorts the partitioned-file list by `min(col)`. The first file
+ picked across all partitions is globally the most-promising one.
+2. **Row-group-level reorder.** Once a file is opened,
+ `PreparedAccessPlan::reorder_by_statistics` sorts that file's
+ `row_group_indexes` by `min(col)` ASC. The row group most likely
+ to contribute to `TopK` is decoded first.
+3. **Reverse.** For `DESC` requests,
+ `PreparedAccessPlan::reverse` flips the iteration after the
+ stats reorder normalises everything to ASC-by-min. Same
+ primitive Phase 1 introduced for declared reverse scans — Phase
+ 3 just routes more queries through it.
+
+The two layers **nest by construction**: file `i`'s `min(col)` is
+a lower bound on the `min(col)` of every row group inside it, so
+the file opened first also holds the row group with the globally
+smallest `min(col)`. Choosing the same key (`min`) in both layers
+keeps the strategies consistent.
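
Steps 2 and 3 compose as "sort by `min`, then optionally flip". Here is a minimal sketch of that composition, assuming an integer sort column. `RowGroupStats` and `plan_row_groups` are hypothetical names for illustration and are not the `PreparedAccessPlan` API.

```rust
/// Hypothetical per-row-group summary: the row group's position in
/// the file and the sort column's minimum taken from footer statistics.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupStats {
    index: usize,
    min: i64,
}

/// Return the order in which row groups should be read.
///
/// Row groups are first normalised to ascending `min`; a DESC request
/// then reverses that order so the row group with the largest `min`
/// is read first.
fn plan_row_groups(mut groups: Vec<RowGroupStats>, descending: bool) -> Vec<usize> {
    // Stable sort: row groups with equal `min` keep their file order.
    groups.sort_by_key(|g| g.min);
    let mut order: Vec<usize> = groups.iter().map(|g| g.index).collect();
    if descending {
        order.reverse();
    }
    order
}
```

For row groups 0, 1, 2 with minimums 30, 10, 20, this sketch yields `[1, 2, 0]` for an ASC request and `[0, 2, 1]` for a DESC request. The same function applied to per-file minimums would model the file-level step.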
+
+`reverse_row_groups`'s meaning depends on which way `Inexact` was
+reached. When the column-in-schema condition fires, the stats
+reorder produces ASC-by-min, so `reverse_row_groups` simply mirrors
+the request direction. When only the reversed-equivalence
+condition fires (function-wrapped case with a declared source
+ordering), `reverse_row_groups` is `true` unconditionally — there
+is no stats reorder to compose with, just a flip of the file's
+natural order.
+
+Both flags surface on the `DataSourceExec` line in `EXPLAIN` so
+plan inspection and snapshot tests can confirm the pushdown fired:
+
+```text
+DataSourceExec: file_groups=..., file_type=parquet,
+ sort_order_for_reorder=[a@0 ASC], reverse_row_groups=true
+```
+
+Absence of either flag means the corresponding runtime step is a
+no-op.
+
+### When Phase 3 does *not* fire
+
+* **Aggregations on top of the sort key.** `SELECT URL, COUNT(*) AS c
+ FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10` (the ClickBench
+ TopK shape) — the leading sort key (`c`) is an aggregation result
+ and has no per-RG stats in the parquet file, so the
+ column-in-schema check fails. Pushing sort metadata through
+ `AggregateExec` is a separate problem: the aggregated value
+ doesn't exist before aggregation, so even if the metadata reached
+ the scan there'd be nothing actionable to do with it.
+* **Multi-column sort secondary keys.** The reorder currently only
+ uses the leading sort expression — secondary keys are ignored.
+ Tracked as a follow-up in [#22198].
+* **Function-wrapped sort without a source-declared ordering.**
+ Without a declared ordering to invert, the reversed-equivalence
+ branch has nothing to satisfy. Tracked in the same follow-up.
+* **Source declares a forward prefix of the request.** When the
+ source's declared `output_ordering` is a non-empty proper prefix
+ of the request (e.g. source `[a DESC, b ASC]`, request
+ `[a DESC, b ASC, c DESC]`), `try_pushdown_sort` returns
+ `Unsupported` so the surrounding `SortExec` can keep its
+ `sort_prefix` annotation — prefix-aware early termination in
+ `TopK` is strictly better than the Phase 3 reorder on data that
+ is already in prefix order on disk.
+
## Reverse Scans for `ORDER BY ... DESC`
@@ -464,9 +602,11 @@ not alternatives; they compose:
that shared queue by per-file statistics *before* any partition
picks, so the first file read is globally optimal and tightens the
dynamic filter immediately. Combined with **TopK threshold init from
- parquet statistics** ([#21712]) and **row-group reorder within each
- file** ([#21956]), the threshold can be set before reading a single
- byte. The combined statistics-driven `TopK` pipeline is in flight
+ parquet statistics** ([#21712]) and **`try_pushdown_sort` driving
+ runtime row-group / file reorder + reverse** ([#21956], landed),
+ the threshold can be set before reading a single byte. The reorder
+ mechanism applies to any `ORDER BY [LIMIT N]` on
+ parquet, not just TopK queries with a dynamic filter. The combined statistics-driven `TopK` pipeline is in flight
as [#21580].
The mechanism here is **RG-level pruning, not mid-stream early
@@ -516,9 +656,9 @@ not alternatives; they compose:
narrows what's decoded inside the row groups the reader does
open. `Exact` or finer-grained scheduling is what eventually
closes the tap once `TopK` is satisfied.
-* **Phase 3 — filter + sort early termination.** `WHERE filter ORDER
- BY ts DESC LIMIT N` is the dominant observability query shape and
- the one where the arrow-rs page-reverse primitive matters most:
+* **Filtered reverse TopK end-to-end.** `WHERE filter ORDER BY ts
+ DESC LIMIT N` is the dominant observability query shape and the
+ one where the arrow-rs page-reverse primitive matters most:
`RowSelection::with_limit` cannot pre-compute the last `N` matching
rows when the filter is selective, so the only correct strategy is
to stream pages backward, evaluate the filter, and stop when `N`
@@ -535,9 +675,17 @@ not alternatives; they compose:
* **OFFSET pushdown to parquet** ([#21828]) so `ORDER BY ts LIMIT K
OFFSET N` queries can skip the first `N` rows at the row-group level
instead of decoding and discarding them. In progress.
+* **Multi-column and function-wrapped reorder follow-ups** ([#22198]).
+ The reorder mechanism in #21956 currently only uses the leading
+ sort key and only fires on plain columns. Lexicographic multi-key
+ reorder via `arrow::compute::lexsort_to_indices` is low-hanging
+ fruit; extending to monotonic function wrappers via leaf-column
+ extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs a
+ bit more `EquivalenceProperties` integration but is doable.
[#21976]: https://github.com/apache/datafusion/pull/21976
[#21956]: https://github.com/apache/datafusion/pull/21956
+[#22198]: https://github.com/apache/datafusion/issues/22198
[#21712]: https://github.com/apache/datafusion/pull/21712
[#21580]: https://github.com/apache/datafusion/pull/21580
[#21828]: https://github.com/apache/datafusion/pull/21828
diff --git a/content/images/sort-pushdown/pr21956-decision.svg b/content/images/sort-pushdown/pr21956-decision.svg
new file mode 100644
index 00000000..a8203241
--- /dev/null
+++ b/content/images/sort-pushdown/pr21956-decision.svg
@@ -0,0 +1,66 @@
+
diff --git a/content/images/sort-pushdown/pr21956-runtime-pipeline.svg b/content/images/sort-pushdown/pr21956-runtime-pipeline.svg
new file mode 100644
index 00000000..5bb8d678
--- /dev/null
+++ b/content/images/sort-pushdown/pr21956-runtime-pipeline.svg
@@ -0,0 +1,69 @@
+
From d2ad95c7633b31be54408c63f57a82b7e1f46bd9 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sat, 16 May 2026 09:42:52 +0800
Subject: [PATCH 04/14] Add empirical note: why we keep an out-of-tree RG-level
Exact reverse
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Add an ### Empirical note subsection inside "Reverse Scans for
`ORDER BY ... DESC`" that records what we found running an in-house
RG-level `Exact` reverse against upstream `Inexact` + `TopK`:
- `LIMIT N` does not propagate as a static stop signal in the
Inexact path. The dynamic filter pushdown can stats-prune
*subsequent* row groups once the threshold tightens, but inside
the row group `TopK` is currently reading, the sort column has to
be fully decoded so the filter can be evaluated row by row.
`LIMIT 10` on a 1M-row row group is still ~1M sort-column
decodes regardless of N. LIMIT only saves work on non-sort
columns inside that row group and on whole subsequent row
groups the threshold prunes.
- `SortExec` stays on top of `Inexact`, so the final ordering
pass and per-row heap maintenance are both extra costs the
`Exact` path (which deletes `SortExec`) does not pay.
Then explain why we run RG-level `Exact` in production but did
not upstream it: parquet does not allow partial row-group reads,
so any RG-level `Exact` implementation peaks at one whole row
group (~128 MB) of decoded data in memory — the same constraint
that closed `#18817`. Our runtime advantage comes from skipping
heap / filter / `SortExec` overhead, not from decoding less.
Frame the page-level `Exact` reverse work in arrow-rs `#9937` as
the path that keeps the runtime win we measured while bringing
peak memory back into the streaming regime via `OffsetIndex`
page-level seek.
---
content/blog/2026-05-25-sort-pushdown.md | 74 ++++++++++++++++++++++++
1 file changed, 74 insertions(+)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 32366246..96708cd0 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -529,6 +529,80 @@ narrower `Inexact` row-group-reverse first (which became Phase 1 in
[#19064]), and to build `Exact` reverse on a finer-grained primitive
once `arrow-rs` exposed one.
+### Empirical note — runtime cost of `Inexact` + `TopK`
+
+We run an internal row-group-level `Exact` reverse implementation in
+production and tested swapping in upstream's `Inexact` row-group
+reverse + `TopK` on `ORDER BY ts DESC LIMIT N` queries. End-to-end
+latency went **up**, not down. A few cost components stack up on the
+`Inexact` + `TopK` side:
+
+* **`LIMIT N` does not propagate as a static stop signal to the
+ source.** In the `Inexact` path the `SortExec` stays on top and
+ `TopK`'s fetch belongs to `SortExec`, not to the parquet scan. The
+ only mechanism that can cut work below the `SortExec` is the
+ dynamic-filter pushdown: as the heap fills, the filter (`ts >
+ threshold`) is pushed to the source and its threshold tightens
+ with every batch. That filter is enough to **stats-prune
+ subsequent, not-yet-opened row groups** entirely — if a row
+ group's `max(ts) < threshold` it is skipped without decode. But
+ inside the row group the source is currently reading, the
+ filter pushdown cannot act as a stop signal: the sort column has
+ to be **fully decoded** so the filter can be evaluated row by
+ row, the surviving rows feed the heap to tighten the threshold,
+ and only then can the resulting `RowSelection` skip the *other*
+ columns for rows that didn't pass. For
+ `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that is still
+ ~1M sort-column decodes regardless of `N`; the LIMIT only saves
+ work on non-sort columns inside the same row group and on whole
+ *subsequent* row groups that the tightened threshold can prune.
+ The internal `Exact` reverse path, by contrast, deletes the
+ `SortExec` so the LIMIT becomes a static fetch on the source.
+ The source walks pages of the target row group from the back,
+ decodes each batch, reverses the batch row-wise, emits — and
+ stops the moment K rows have been delivered. For
+ `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that is one
+ batch worth of decode work, not 1M. No filter machinery, no
+ heap, no per-row threshold check.
+* **`SortExec` itself adds ordering work on top of `Inexact`.** The
+ reversed-RG stream is not strictly DESC (rows within each RG are
+ still forward), so `Inexact` keeps the surrounding `SortExec`.
+ Even when the heap is settled and the dynamic filter has
+ pruned the tail, the outer operator does its own final ordering
+ pass — overhead that `Exact` (which deletes the `SortExec`)
+ does not pay.
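
The first cost component can be made concrete with a toy model. This is a sketch under simplifying assumptions (integer timestamps, one `max` statistic per row group, row groups already visited in reversed order), not DataFusion's `TopK` operator. `RowGroup` and `top_k_desc` are hypothetical names.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical row group: the `max(ts)` statistic plus the decoded
/// sort-column values.
struct RowGroup {
    max_ts: i64,
    rows: Vec<i64>,
}

/// Find the `k` largest `ts` values.
/// Returns the values in descending order and the number of row
/// groups that had to be decoded.
fn top_k_desc(groups: &[RowGroup], k: usize) -> (Vec<i64>, usize) {
    // Min-heap of the best `k` values seen so far; once full, its
    // smallest element plays the role of the dynamic-filter threshold.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    let mut decoded = 0;
    for g in groups {
        // Statistics pruning: a not-yet-opened row group whose max is
        // below the threshold is skipped without decoding.
        if heap.len() == k && heap.peek().map_or(false, |t| g.max_ts < t.0) {
            continue;
        }
        decoded += 1;
        // Inside an opened row group every sort-column value is
        // visited, no matter how small `k` is.
        for &ts in &g.rows {
            if heap.len() < k {
                heap.push(Reverse(ts));
            } else if heap.peek().map_or(false, |t| ts > t.0) {
                heap.pop();
                heap.push(Reverse(ts));
            }
        }
    }
    let mut out: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
    out.sort_unstable_by(|a, b| b.cmp(a));
    (out, decoded)
}
```

In this model `k` affects only which later row groups are skipped. The inner loop over the opened row group always runs to completion, which is the per-row cost described in the first bullet.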
+
+Why didn't we just upstream the internal `Exact` reverse, then?
+**Memory.** Parquet does not allow reading only part of a row
+group, so any RG-level `Exact` implementation — ours included —
+has to decode the entire row group, reverse the buffer in
+memory, and only then emit. That is the same memory profile that
+`#18817` was rejected for: a peak of one whole row group
+(~128 MB) of decoded data, vs. the few-MB-per-batch streaming
+profile readers normally have. Our runtime advantage over
+`Inexact` + `TopK` does *not* come from decoding less — both
+paths decode the relevant row group's sort column in full — it
+comes from skipping the per-row heap maintenance, the dynamic
+filter evaluation, and the `SortExec` final ordering pass that
+`Inexact` keeps on top. So we end up running our `Exact` reverse
+in-house but cannot land it as the upstream default for the same
+memory reason that closed `#18817`.
+
+**That is the direct motivation behind the page-level `Exact`
+reverse work we are pushing upstream in arrow-rs `#9937`.** It
+shrinks the unit of work from one whole row group down to one
+page (~1 MB): the reader uses parquet's `OffsetIndex` to `seek`
+to the last page of the column chunk, decode it forward, reverse
+the resulting batch, and emit — without ever materialising the
+rest of the row group in memory. The streaming memory profile is
+preserved and the runtime advantage we measured internally is
+kept. Once `#9937` and the DataFusion follow-up land, the
+upstream default for `ORDER BY ts DESC LIMIT N` becomes `Exact`
+reverse at page granularity: `SortExec` removed, static fetch on
+the source, peak memory in the streaming regime, no `TopK` heap
+overhead, with K rows returned after roughly one page's worth of
+decode work.
+
That primitive is the **page-level** reverse traversal. Parquet's
`OffsetIndex` already gives us byte-precise locations for every data
page in a column chunk, so we can `seek` directly to the last page,
From 8921f90d572ab26bfd11dabe40ae3b5c1f0bb672 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sat, 16 May 2026 16:13:48 +0800
Subject: [PATCH 05/14] Correct internal RG-Exact description and trim arrow-rs
#9937 duplication
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Two corrections in the empirical-note / reverse-scans section:
1. The internal RG-level `Exact` reverse path was incorrectly
described as "walks pages from the back, decodes each batch,
reverses row-wise, stops the moment K rows have been delivered."
That is actually the page-level `Exact` shape (arrow-rs #9937),
not the in-house RG-level implementation. Parquet does not allow
partial row-group reads, so the in-house path has to decode the
entire target row group, reverse the buffer in memory, take the
first K rows, and stop — same memory profile as #18817's
proposal. The runtime advantage over `Inexact` + `TopK` comes
from removing the per-row heap maintenance, dynamic-filter
evaluation, and `SortExec` final ordering pass, not from
decoding less sort-column data. Sort-column decode on the target
row group is the same on both paths.
2. The arrow-rs #9937 paragraph I previously added duplicated the
technical detail already present in the long-standing
"That primitive is the page-level reverse traversal..."
paragraph. Replaced with a one-sentence bridge ("The fix that
keeps both the runtime win and a streaming memory profile is
page-level `Exact` reverse via arrow-rs #9937, described next.")
so the existing paragraph carries the explanation without
repetition.
---
content/blog/2026-05-25-sort-pushdown.md | 35 +++++++++---------------
1 file changed, 13 insertions(+), 22 deletions(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 96708cd0..acafec65 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -556,14 +556,16 @@ latency went **up**, not down. A few cost components stack up on the
~1M sort-column decodes regardless of `N`; the LIMIT only saves
work on non-sort columns inside the same row group and on whole
*subsequent* row groups that the tightened threshold can prune.
- The internal `Exact` reverse path, by contrast, deletes the
- `SortExec` so the LIMIT becomes a static fetch on the source.
- The source walks pages of the target row group from the back,
- decodes each batch, reverses the batch row-wise, emits — and
- stops the moment K rows have been delivered. For
- `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that is one
- batch worth of decode work, not 1M. No filter machinery, no
- heap, no per-row threshold check.
+ The internal RG-level `Exact` reverse path, by contrast, deletes
+ the `SortExec` so the LIMIT becomes a static fetch on the source.
+ The source still has to decode the target row group in full —
+ parquet does not allow partial row-group reads, so this part is
+ the same as `Inexact` — but it then reverses the buffer in
+ memory, takes the first K rows, and **stops**. No subsequent row
+ group is opened, no stats check, no filter machinery, no per-row
+ heap maintenance, no `SortExec` final ordering pass. The wins
+ come from removing those per-row and per-RG overheads on top, not
+ from decoding less sort-column data on the target row group.
* **`SortExec` itself adds ordering work on top of `Inexact`.** The
reversed-RG stream is not strictly DESC (rows within each RG are
still forward), so `Inexact` keeps the surrounding `SortExec`.
@@ -588,20 +590,9 @@ filter evaluation, and the `SortExec` final ordering pass that
in-house but cannot land it as the upstream default for the same
memory reason that closed `#18817`.
-**That is the direct motivation behind the page-level `Exact`
-reverse work we are pushing upstream in arrow-rs `#9937`.** It
-shrinks the unit of work from one whole row group down to one
-page (~1 MB): the reader uses parquet's `OffsetIndex` to `seek`
-to the last page of the column chunk, decode it forward, reverse
-the resulting batch, and emit — without ever materialising the
-rest of the row group in memory. The streaming memory profile is
-preserved and the runtime advantage we measured internally is
-kept. Once `#9937` and the DataFusion follow-up land, the
-upstream default for `ORDER BY ts DESC LIMIT N` becomes `Exact`
-reverse at page granularity: `SortExec` removed, static fetch on
-the source, peak memory in the streaming regime, no `TopK` heap
-overhead, with K rows returned after roughly one page's worth of
-decode work.
+**The fix that keeps both the runtime win and a streaming memory
+profile is page-level `Exact` reverse via arrow-rs [#9937]**,
+described next.
That primitive is the **page-level** reverse traversal. Parquet's
`OffsetIndex` already gives us byte-precise locations for every data
From 33951193eb2ec8594ff3038949ce79468f93dc25 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 13:16:47 +0800
Subject: [PATCH 06/14] Mark #21956 as landed (merged via merge queue)
---
content/blog/2026-05-25-sort-pushdown.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index acafec65..082dede7 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -193,7 +193,7 @@ Phase 1's scope was deliberately narrow. It set up the API and
delivered the reverse-scan case end-to-end, but it did **not** add
any statistics-based file rearrangement — that came later in Phase 2.
A finer-grained extension that broadens this `Inexact` path with a
-three-step runtime reorder pipeline lands in [#21956] — covered in
+three-step runtime reorder pipeline landed in [#21956] — covered in
[Phase 3](#phase-3-runtime-reorder-for-inexact-pushdown) below.
Phase 1 also produced a useful side improvement:
From 7c51d24342f42006ec0129ad2bb5674e8b8e542e Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 14:25:09 +0800
Subject: [PATCH 07/14] =?UTF-8?q?Reframe=20blog=20from=20Phase=20N=20?=
=?UTF-8?q?=E2=86=92=20capability-based=20sections?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
All three sort-pushdown PRs have now landed, so the chronological
'Phase 1/2/3' framing is less useful for readers than a capability
breakdown. Sections are now:
- The PushdownSort Rule (#19064)
- Sort Elimination via Statistics (#21182)
- Runtime Reorder for TopK Convergence (#21956)
- Reverse Scans for ORDER BY ... DESC (unchanged)
In-body Phase references replaced with PR numbers or capability names;
anchor links updated; references section restructured.
---
content/blog/2026-05-25-sort-pushdown.md | 194 ++++++++++++-----------
1 file changed, 104 insertions(+), 90 deletions(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 082dede7..205d4807 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -44,12 +44,13 @@ already in that order. CPU wasted. Memory wasted. Streaming defeated.
[Apache DataFusion]: https://datafusion.apache.org/
-This post walks through the **sort pushdown** work that closed that gap.
-It is structured in two phases — file rearrangement first, then a
-statistics-based proof of non-overlap — and lands real benchmark
-speedups of **2.1×–49× on common queries**. The same machinery extends
-to `ORDER BY ... DESC`, and the page-level reverse primitive we are
-adding upstream in [arrow-rs] will push the gains further still.
+This post walks through the **sort pushdown** work that closed that
+gap. It covers two complementary capabilities — sort elimination via
+statistics, and runtime reorder for `TopK` convergence — and lands
+real benchmark speedups of **2.1×–49× on common queries**. The same
+machinery extends to `ORDER BY ... DESC`, and the page-level reverse
+primitive we are adding upstream in [arrow-rs] will push the gains
+further still.
[arrow-rs]: https://github.com/apache/arrow-rs
@@ -59,29 +60,36 @@ adding upstream in [arrow-rs] will push the gains further still.
already in the requested order, and **read the most-promising data
first** when they aren't — so `TopK` converges fast and the rest
gets pruned by statistics.
-* Three phases:
- * **Phase 1** — establish the `PushdownSort` rule and the
- `Exact` / `Inexact` / `Unsupported` protocol; ship the reverse
- row-group case for `ORDER BY ... DESC` (reports `Inexact`).
- * **Phase 2** — sort files within each partition by Parquet
- `min/max` statistics and *prove* non-overlap, upgrading
- `Unsupported` to `Exact` so `PushdownSort` removes the `SortExec`
- that `EnforceSorting` inserted earlier.
- * **Phase 3** ([#21956]) — generalise `Inexact`: whenever the
- leading sort key is a plain column in the file schema (or the
- source's reversed declared ordering satisfies the request),
+* What's supported today:
+ * **The `PushdownSort` rule** ([#19064]) — a physical optimizer
+ rule that asks each `ExecutionPlan` "can you produce output in
+ *this* ordering?" and uses the
+ `Exact` / `Inexact` / `Unsupported` answer to decide whether to
+ delete the surrounding `SortExec`, leave it in place with a
+ hint, or give up.
+ * **Sort elimination via statistics** ([#21182]) — `PushdownSort`
+ sorts files within each partition by Parquet `min/max`
+ statistics and, when the resulting ranges are provably
+ non-overlapping, upgrades the source's ordering claim from
+ `Unsupported` to `Exact` and **removes the `SortExec`** that
+ `EnforceSorting` inserted earlier.
+ * **Runtime reorder for `TopK` convergence** ([#21956]) — whenever
+ the leading sort key is a plain column in the file schema (or
+ the source's reversed declared ordering satisfies the request),
`try_pushdown_sort` stamps two flags on the source and the
opener runs a three-step runtime pipeline — file-level reorder
in the shared morsel queue, row-group reorder by min/max stats,
- then optional iteration reverse for `DESC` requests.
-* Real-world benchmarks on the `sort_pushdown` suite (Phase 2's
- `Exact` upgrade): `ORDER BY ... LIMIT` queries get **27× and 49×
- faster**; full `ORDER BY` scans get **~2×** faster.
-* Reverse scans (`ORDER BY ... DESC`) ride the same machinery: a
- merged row-group-level reverse returns `Inexact` (Sort stays, but
- `TopK` terminates early); the page-level reverse primitive needed
- for `Exact` reverse — and so for full `SortExec` removal on `DESC`
- queries — is in flight in arrow-rs.
+ then optional iteration reverse for `DESC` requests. `SortExec`
+ stays, but `TopK`'s dynamic filter tightens fast on the
+ most-promising data and the rest is pruned.
+ * **Reverse scans for `ORDER BY ... DESC`** ([#19446], [#19557]) —
+ a row-group-level reverse returns `Inexact` (Sort stays, but
+ `TopK` terminates early). The page-level reverse primitive
+ needed for `Exact` reverse — and so for full `SortExec` removal
+ on `DESC` queries — is in flight in arrow-rs ([#9937]).
+* Real-world benchmarks on the `sort_pushdown` suite (`Exact` path):
+ `ORDER BY ... LIMIT` queries get **27× and 49× faster**; full
+ `ORDER BY` scans get **~2×** faster.
## Why Sort Pushdown Matters
@@ -169,11 +177,10 @@ pushdown loses the `SortExec` node. Everything downstream — the
optimizer to convince itself that the bottom of the plan is
producing the order requested.
-## Phase 1: The Pushdown API and Reverse Scans
+## The `PushdownSort` Rule
-Phase 1 ([#19064]) introduced the **`PushdownSort`** physical
-optimizer rule and a uniform API for asking each `ExecutionPlan` two
-questions:
+[#19064] introduced the **`PushdownSort`** physical optimizer rule
+and a uniform API for asking each `ExecutionPlan` two questions:
[#19064]: https://github.com/apache/datafusion/pull/19064
@@ -189,14 +196,17 @@ it returns `Inexact` and flips on `reverse_row_groups=true` so the
scan reads row groups from last to first (the row-group-level reverse
covered later in this post); otherwise it returns `Unsupported`.
-Phase 1's scope was deliberately narrow. It set up the API and
-delivered the reverse-scan case end-to-end, but it did **not** add
-any statistics-based file rearrangement — that came later in Phase 2.
-A finer-grained extension that broadens this `Inexact` path with a
-three-step runtime reorder pipeline landed in [#21956] — covered in
-[Phase 3](#phase-3-runtime-reorder-for-inexact-pushdown) below.
+The initial PR's scope was deliberately narrow. It set up the API
+and delivered the reverse-scan case end-to-end, but did **not** add
+any statistics-based file rearrangement — that came later via
+[#21182], covered in
+[Sort Elimination via Statistics](#sort-elimination-via-statistics)
+below. A finer-grained extension that broadens this `Inexact` path
+with a three-step runtime reorder pipeline landed in [#21956] —
+covered in
+[Runtime Reorder for TopK Convergence](#runtime-reorder-for-topk-convergence).
-Phase 1 also produced a useful side improvement:
+[#19064] also produced a useful side improvement:
* **Reverse-output redesign** ([#19446], [#19557]) extended the same
rule to `DESC` queries — picked up again in the reverse-scan
@@ -205,12 +215,13 @@ Phase 1 also produced a useful side improvement:
[#19446]: https://github.com/apache/datafusion/pull/19446
[#19557]: https://github.com/apache/datafusion/pull/19557
-## Phase 2: Use Statistics to Prove Non-Overlap
+## Sort Elimination via Statistics
-
+
-Phase 1 left a sharp edge that motivated Phase 2 ([#21182]). Consider
-this realistic scenario:
+The initial `Inexact`-only path left a sharp edge that motivated
+stats-based sort elimination ([#21182]). Consider this realistic
+scenario:
[#21182]: https://github.com/apache/datafusion/pull/21182
@@ -228,7 +239,7 @@ the scan now has no declared ordering, so `EnforceSorting` (which runs
earlier in the pipeline) inserts a `SortExec`. The data is sorted on
disk; the optimizer just can't tell.
-Phase 2 fixes this in `PushdownSort`, which runs late — after
+[#21182] fixes this in `PushdownSort`, which runs late — after
`EnforceDistribution` and `EnforceSorting` have already shaped the
plan. When `PushdownSort` finds a `SortExec` above a file scan whose
ordering was stripped (a `FileSource` `Unsupported` result), it does
@@ -239,7 +250,7 @@ three things inside `FileScanConfig::try_pushdown_sort`:
pre-existing [`MinMaxStatistics`] helper (introduced in [#9593])
reads each file's `column_statistics[c].min_value` /
`.max_value` for each sort column `c`, then sorts the file list by
- the min row. Phase 2 wires this helper into the optimizer's
+ the min row. The PR wires this helper into the optimizer's
`Unsupported` branch — `sort_files_within_groups_by_statistics`
does the per-group orchestration and decides whether any group is
non-overlapping after the sort.
@@ -273,14 +284,14 @@ row-group iteration reverse when the source declares a compatible
natural ordering — but stats-based reorder still needs a plain
column.)*
-
+
The diagram above contrasts the two cases. On the left, ranges are
non-overlapping after sort, so we can guarantee that emitting the
files in min-order produces a globally sorted stream. On the right,
the ranges overlap, so even after sorting the files by `min(ts)` we
-cannot guarantee global ordering — Phase 2 correctly bails out and
-keeps `SortExec` in place.
+cannot guarantee global ordering — the upgrade is skipped and
+`SortExec` stays in place.
The implementation handles a few edge cases worth calling out:
@@ -290,32 +301,32 @@ The implementation handles a few edge cases worth calling out:
acting as an *implicit in-memory buffer* for the SPM above it. The
SPM picks rows from each partition stream one at a time; without
the upstream `SortExec` holding batches in memory, the SPM would
- read directly from I/O-bound sources and stall on every pick. Phase
- 2 compensates by inserting a [`BufferExec`] in the `SortExec`'s
+ read directly from I/O-bound sources and stall on every pick. The
+ rule compensates by inserting a [`BufferExec`] in the `SortExec`'s
place — bounded streaming buffer, same throughput shape, no
blocking sort. Capacity is configurable via
[`sort_pushdown_buffer_capacity`] ([#21426]).
* **`fetch` preservation** through `EnforceDistribution`. The
distribution rule sometimes strips a `SortExec`'s `fetch` field and
- re-adds the node later. Phase 2 plumbs `fetch` through so a
+ re-adds the node later. The PR plumbs `fetch` through so a
surviving `LIMIT` is not lost.
-* **Per-group, not global, non-overlap.** Phase 2's adjacency check is
+* **Per-group, not global, non-overlap.** The adjacency check is
scoped to each file group. Two file groups can have *overlapping*
ranges and the upgrade still fires, as long as each group is
internally non-overlapping. That works because each group already
produces an independently ordered stream at runtime, and
`SortPreservingMergeExec` then picks rows across streams in value
- order to produce the final globally sorted output. Phase 2 only has
- to prove the per-stream property.
+ order to produce the final globally sorted output. The rule only
+ has to prove the per-stream property.
* **Single-partition vs multi-partition execution**. With the default
multi-partition setup, `EnforceDistribution` byte-range-splits files
into single-file groups, after which `validated_output_ordering()`
- works correctly on its own. Phase 2 only triggers when files have
- not been split — typically `--partitions 1` runs, or files small
- enough that the splitter leaves them alone. In the typical `--partitions
- 1` case the "per-group" distinction collapses (one group equals the
- whole table), which is why the example earlier in this section is
- drawn that way.
+ works correctly on its own. Stats-based reorder only triggers when
+ files have not been split — typically `--partitions 1` runs, or
+ files small enough that the splitter leaves them alone. In the
+ typical `--partitions 1` case the "per-group" distinction collapses
+ (one group equals the whole table), which is why the example earlier
+ in this section is drawn that way.
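To see why per-group non-overlap is enough, here is a minimal sketch of a sort-preserving merge over plain integers. It is not DataFusion's `SortPreservingMergeExec`; `merge_sorted_streams` is a hypothetical helper that only assumes each input stream is already sorted, which is the per-stream property the rule proves.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges streams that are each sorted ascending into one globally
/// sorted output. The streams may overlap with each other in range.
/// Hypothetical helper for illustration, not a DataFusion API.
fn merge_sorted_streams(streams: Vec<Vec<i64>>) -> Vec<i64> {
    let mut iters: Vec<_> = streams.into_iter().map(|s| s.into_iter()).collect();
    // Min-heap of (next value, index of the stream it came from).
    let mut heap = BinaryHeap::new();
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(v) = it.next() {
            heap.push(Reverse((v, idx)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, idx))) = heap.pop() {
        out.push(v);
        // Refill from the stream that just produced the smallest value.
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}
```

Two groups covering `[1, 4, 7]` and `[2, 3, 9]` overlap in range, yet the merged output is fully ordered because each stream is ordered on its own.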
[`BufferExec`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/buffer.rs
[`sort_pushdown_buffer_capacity`]: https://github.com/apache/datafusion/pull/21426
@@ -323,13 +334,13 @@ The implementation handles a few edge cases worth calling out:
## Benchmarks
-
+
The [`sort_pushdown`] benchmark suite reproduces the
"wrong-order file list" scenario by generating Parquet files whose
names are intentionally reversed against their sort-key ranges. Numbers
-below are `--partitions 1`, release build, on the merged Phase 2
-branch versus `main`:
+below are `--partitions 1`, release build, with stats-based sort
+elimination ([#21182]) enabled, versus `main`:
[`sort_pushdown`]: https://github.com/apache/datafusion/tree/main/benchmarks/queries/sort_pushdown
@@ -359,42 +370,44 @@ removed:
It is worth saying explicitly what this change does **not** affect.
The default multi-partition execution path is unchanged: those plans
already produced correct orderings via byte-range splitting, so
-Phase 2 simply does not trigger. There is no regression and no behavior
-change for the typical multi-threaded query.
+stats-based sort elimination simply does not trigger. There is no
+regression and no behavior change for the typical multi-threaded
+query.
-## Phase 3: Runtime Reorder for Inexact Pushdown
+## Runtime Reorder for TopK Convergence
-Phase 2 handles the `Exact` upgrade — strong correctness, sort
-elimination — but only when the table has a declared
-`output_ordering` *and* the files are provably non-overlapping after
-sorting by min. Two large classes of queries fall outside that
-window:
+Stats-based sort elimination handles the `Exact` upgrade — strong
+correctness, sort elimination — but only when the table has a
+declared `output_ordering` *and* the files are provably
+non-overlapping after sorting by min. Two large classes of queries
+fall outside that window:
* **Unsorted data** — no `WITH ORDER`, no parquet `sorting_columns`.
- Phase 2 cannot fire because there is no ordering claim to upgrade.
+ The `Exact` upgrade cannot fire because there is no ordering
+ claim to upgrade.
* **Overlapping ranges** — files written by different ingestion
- jobs share time windows. Phase 2 keeps the `SortExec` because the
- global ordering can't be proven, even though the files often do
- contain large stretches of in-order data.
+ jobs share time windows. The `Exact` upgrade keeps the `SortExec`
+ because the global ordering can't be proven, even though the
+ files often do contain large stretches of in-order data.
For both, a full external `SortExec` is overkill. The parquet
metadata is right there, and reading the *most-promising* data
first lets `TopK`'s dynamic filter threshold tighten quickly so the
-rest gets pruned. Phase 3 ([#21956]) wires that up by generalising
-the `Inexact` path Phase 1 introduced.
+rest gets pruned. [#21956] wires that up by generalising the
+`Inexact` path that [#19064] introduced.
### `try_pushdown_sort` — one decision, three outcomes
-The `Exact` / `Inexact` / `Unsupported` protocol from Phase 1 stays.
-Phase 3 broadens the **conditions** that route a query into
-`Inexact`:
+The `Exact` / `Inexact` / `Unsupported` protocol from [#19064]
+stays. The new PR broadens the **conditions** that route a query
+into `Inexact`:
| Condition | Outcome |
| --- | --- |
-| `eq_properties.ordering_satisfy(request)` | `Exact` — Phase 1 / 2 sort elimination |
-| Leading sort key is a plain `Column` in the file schema, **or** the source's reversed declared ordering satisfies the request | `Inexact` — Phase 3 runtime pipeline |
+| `eq_properties.ordering_satisfy(request)` | `Exact` — sort elimination |
+| Leading sort key is a plain `Column` in the file schema, **or** the source's reversed declared ordering satisfies the request | `Inexact` — runtime reorder pipeline |
| Neither | `Unsupported` — `SortExec` stays, no source-side optimisation |
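The table can be read as a small decision function. The sketch below is illustrative only: `PushdownOutcome`, `SortRequestFacts`, and `classify` are hypothetical names, the three booleans stand in for the real checks, and the top-to-bottom evaluation order is an assumption taken from the table layout. It also ignores the exceptions listed later in this post.

```rust
/// The three answers a source can give (names chosen for this sketch).
#[derive(Debug, PartialEq)]
enum PushdownOutcome {
    Exact,
    Inexact,
    Unsupported,
}

/// Hypothetical summary of the conditions in the table above.
struct SortRequestFacts {
    /// Stands in for `eq_properties.ordering_satisfy(request)`.
    ordering_satisfied: bool,
    /// The leading sort key is a plain column in the file schema.
    leading_key_is_plain_column: bool,
    /// The source's reversed declared ordering satisfies the request.
    reversed_ordering_satisfies: bool,
}

/// Applies the table rows in order: Exact, then Inexact, then Unsupported.
fn classify(facts: &SortRequestFacts) -> PushdownOutcome {
    if facts.ordering_satisfied {
        PushdownOutcome::Exact
    } else if facts.leading_key_is_plain_column || facts.reversed_ordering_satisfies {
        PushdownOutcome::Inexact
    } else {
        PushdownOutcome::Unsupported
    }
}
```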
The "reversed satisfies" branch is what handles function-wrapped
@@ -406,7 +419,7 @@ itself.
### Two flags on `ParquetSource`, three runtime steps
-
+
When `try_pushdown_sort` returns `Inexact`, it stamps two fields on
the `ParquetSource`:
@@ -432,8 +445,8 @@ The opener reads them at scan time to drive three composable steps:
3. **Reverse.** For `DESC` requests,
`PreparedAccessPlan::reverse` flips the iteration after the
stats reorder normalises everything to ASC-by-min. Same
- primitive Phase 1 introduced for declared reverse scans — Phase
- 3 just routes more queries through it.
+ primitive [#19064] introduced for declared reverse scans —
+ [#21956] just routes more queries through it.
The two layers **nest by construction**: file `i`'s `min(col)` is
a lower bound on every row group inside it, so the file queue's
@@ -461,7 +474,7 @@ DataSourceExec: file_groups=..., file_type=parquet,
Absence of either flag means the corresponding runtime step is a
no-op.
-### When Phase 3 does *not* fire
+### When runtime reorder does *not* fire
* **Aggregations on top of the sort key.** `SELECT URL, COUNT(*) AS c
FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10` (the ClickBench
@@ -483,7 +496,7 @@ no-op.
`[a DESC, b ASC, c DESC]`), `try_pushdown_sort` returns
`Unsupported` so the surrounding `SortExec` can keep its
`sort_prefix` annotation — prefix-aware early termination in
- `TopK` is strictly better than the Phase 3 reorder on data that
+ `TopK` is strictly better than the runtime reorder on data that
is already in prefix order on disk.
## Reverse Scans for `ORDER BY ... DESC`
@@ -525,7 +538,7 @@ decoded rows before any batch can be emitted is roughly:
return 10 — defeating the point of the `LIMIT`.
The agreed direction coming out of that discussion was to ship the
-narrower `Inexact` row-group-reverse first (which became Phase 1 in
+narrower `Inexact` row-group-reverse first (which landed in
[#19064]), and to build `Exact` reverse on a finer-grained primitive
once `arrow-rs` exposed one.
@@ -620,8 +633,8 @@ delta, bit-packing) are all forward streams — you cannot decode the
last value without decoding every value that came before it. The
design therefore is: **reverse the page traversal, forward-decode
each page, reverse the resulting RecordBatch**. This is the algorithm
-shape that DataFusion's Phase-2 `RecordBatchReader` integration will
-use once arrow-rs ships the primitive.
+shape DataFusion's `RecordBatchReader` integration will use once
+arrow-rs ships the primitive.
The killer use case is **filtered reverse TopK**:
@@ -793,8 +806,9 @@ Issues and PRs:
* Umbrella issue: [apache/datafusion#17348][#17348]
* `MinMaxStatistics` foundation: [apache/datafusion#9593][#9593]
-* Phase 1: [apache/datafusion#19064][#19064]
-* Phase 2: [apache/datafusion#21182][#21182]
+* `PushdownSort` rule + row-group reverse: [apache/datafusion#19064][#19064]
+* Sort elimination via statistics: [apache/datafusion#21182][#21182]
+* Runtime reorder for TopK convergence: [apache/datafusion#21956][#21956]
* `BufferExec` capacity for sort elimination: [apache/datafusion#21426][#21426]
* Dynamic / TopK-driven path: [apache/datafusion#21351][#21351] (morsel-style work scheduling),
[apache/datafusion#21733][#21733] (global file reorder in shared queue)
From e246b2c6fd5ff78557ae809cc70745838fc3f049 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 14:49:43 +0800
Subject: [PATCH 08/14] Clarify two-layer composability: same key (min), not
'nest by construction'
The previous wording 'nest by construction' could be read as a code-
enforced property. It's actually a logical consequence of using the
same sort key (min) at both file and row-group level: a file's
min(col) is just the minimum over its row groups' min(col) values, so
the most-promising file contains the most-promising row group. The
rewritten paragraph spells that out and ties it to why TopK's dynamic
filter tightens fast.
---
content/blog/2026-05-25-sort-pushdown.md | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 205d4807..05bf141f 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -448,11 +448,15 @@ The opener reads them at scan time to drive three composable steps:
primitive [#19064] introduced for declared reverse scans —
[#21956] just routes more queries through it.
-The two layers **nest by construction**: file `i`'s `min(col)` is
-a lower bound on every row group inside it, so the file queue's
-order is a natural prefix of the within-file row-group order.
-Choosing the same key (`min`) in both layers keeps the strategies
-consistent.
+The two layers compose naturally because they sort by the same
+key. A file's `min(col)` is the minimum over its row groups'
+`min(col)` values, so the file with the smallest `min` contains
+the row group with the smallest `min`. Sorting files by `min(col)`
+and then sorting row groups by `min(col)` within each file
+produces an approximately min-ordered global stream — the first
+batch comes from the most-promising row group in the
+most-promising file, exactly what `TopK`'s dynamic filter needs
+to tighten its threshold fast.
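A toy model of the two layers, for the ASC case only, may make the "approximately min-ordered" claim concrete. `ParquetFile`, `file_min`, and `read_order` are hypothetical names for this sketch; the real opener works on file and row-group metadata, not on plain vectors.

```rust
/// Hypothetical stand-in for a file and the `min(col)` of each row group.
struct ParquetFile {
    name: &'static str,
    row_group_mins: Vec<i64>,
}

/// A file's min is the minimum over its row groups' mins.
fn file_min(file: &ParquetFile) -> i64 {
    file.row_group_mins.iter().copied().min().unwrap_or(i64::MAX)
}

/// Returns (file name, row-group min) pairs in the order they would be read:
/// files sorted by min, then row groups sorted by min within each file.
fn read_order(mut files: Vec<ParquetFile>) -> Vec<(&'static str, i64)> {
    files.sort_by_key(|f| file_min(f));
    let mut order = Vec::new();
    for file in &mut files {
        file.row_group_mins.sort_unstable();
        for &rg_min in &file.row_group_mins {
            order.push((file.name, rg_min));
        }
    }
    order
}
```

With one file holding row-group mins `[50, 10]` and another holding `[5, 70]`, the read order starts at the global minimum `5`, but `70` is read before `10`. The stream is only approximately ordered, which is why the `SortExec` has to stay.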
`reverse_row_groups`'s meaning depends on which way `Inexact` was
reached. When the column-in-schema condition fires, the stats
From 6c6d58434fd9bedb202afedc9e3a2db2a17e6642 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 15:11:02 +0800
Subject: [PATCH 09/14] Strip inline PR refs from narrative; collect all links
in References
Match the dynamic-filter blog's style: narrative talks about
capabilities/mechanisms, not 'PR #21956 did X / PR #19064 introduced
Y'. The 81 inline PR/issue references in the body were dropping the
reader out of the narrative; they belong in a single Issues-and-PRs
list at the end.
Changes:
- TL;DR: drop 6 inline PR refs from the 4 capability bullets
- Body sections (PushdownSort Rule, Sort Elimination, Runtime Reorder,
Reverse Scans, Empirical Note): drop ~30 inline refs to historical
PRs; replace with capability names or 'the rule' / 'the runtime
reorder path' style descriptions
- What's Next: switch from [#NNNNN] format to named markdown links
(matching dynamic-filter's Future Work style)
- Issues for new contributors: same conversion
- References section: rewrite using full URLs (no link-ref
indirection); split into 'Landed' vs 'In flight / open' for
clarity
Net: ~90 lines removed, all PR/issue numbers now consolidated at
the bottom of the post.
---
content/blog/2026-05-25-sort-pushdown.md | 435 +++++++++++------------
1 file changed, 214 insertions(+), 221 deletions(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 05bf141f..8e12170c 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -61,32 +61,31 @@ further still.
first** when they aren't — so `TopK` converges fast and the rest
gets pruned by statistics.
* What's supported today:
- * **The `PushdownSort` rule** ([#19064]) — a physical optimizer
- rule that asks each `ExecutionPlan` "can you produce output in
- *this* ordering?" and uses the
- `Exact` / `Inexact` / `Unsupported` answer to decide whether to
- delete the surrounding `SortExec`, leave it in place with a
- hint, or give up.
- * **Sort elimination via statistics** ([#21182]) — `PushdownSort`
- sorts files within each partition by Parquet `min/max`
- statistics and, when the resulting ranges are provably
- non-overlapping, upgrades the source's ordering claim from
- `Unsupported` to `Exact` and **removes the `SortExec`** that
- `EnforceSorting` inserted earlier.
- * **Runtime reorder for `TopK` convergence** ([#21956]) — whenever
- the leading sort key is a plain column in the file schema (or
- the source's reversed declared ordering satisfies the request),
+ * **The `PushdownSort` rule** — a physical optimizer rule that
+ asks each `ExecutionPlan` "can you produce output in *this*
+ ordering?" and uses the `Exact` / `Inexact` / `Unsupported`
+ answer to decide whether to delete the surrounding `SortExec`,
+ leave it in place with a hint, or give up.
+ * **Sort elimination via statistics** — `PushdownSort` sorts
+ files within each partition by Parquet `min/max` statistics
+ and, when the resulting ranges are provably non-overlapping,
+ upgrades the source's ordering claim from `Unsupported` to
+ `Exact` and **removes the `SortExec`** that `EnforceSorting`
+ inserted earlier.
+ * **Runtime reorder for `TopK` convergence** — whenever the
+ leading sort key is a plain column in the file schema (or the
+ source's reversed declared ordering satisfies the request),
`try_pushdown_sort` stamps two flags on the source and the
opener runs a three-step runtime pipeline — file-level reorder
in the shared morsel queue, row-group reorder by min/max stats,
then optional iteration reverse for `DESC` requests. `SortExec`
stays, but `TopK`'s dynamic filter tightens fast on the
most-promising data and the rest is pruned.
- * **Reverse scans for `ORDER BY ... DESC`** ([#19446], [#19557]) —
- a row-group-level reverse returns `Inexact` (Sort stays, but
- `TopK` terminates early). The page-level reverse primitive
- needed for `Exact` reverse — and so for full `SortExec` removal
- on `DESC` queries — is in flight in arrow-rs ([#9937]).
+ * **Reverse scans for `ORDER BY ... DESC`** — a row-group-level
+ reverse returns `Inexact` (`SortExec` stays, but `TopK` terminates
+ early). The page-level reverse primitive needed for `Exact`
+ reverse — and so for full `SortExec` removal on `DESC` queries
+ — is in flight in arrow-rs.
* Real-world benchmarks on the `sort_pushdown` suite (`Exact` path):
`ORDER BY ... LIMIT` queries get **27× and 49× faster**; full
`ORDER BY` scans get **~2×** faster.
@@ -179,10 +178,8 @@ producing the order requested.
## The `PushdownSort` Rule
-[#19064] introduced the **`PushdownSort`** physical optimizer rule
-and a uniform API for asking each `ExecutionPlan` two questions:
-
-[#19064]: https://github.com/apache/datafusion/pull/19064
+The **`PushdownSort`** physical optimizer rule defines a uniform
+API for asking each `ExecutionPlan` two questions:
1. "Can you produce output in *this* ordering?"
2. "If yes, please rearrange yourself so that it actually does."
@@ -196,34 +193,24 @@ it returns `Inexact` and flips on `reverse_row_groups=true` so the
scan reads row groups from last to first (the row-group-level reverse
covered later in this post); otherwise it returns `Unsupported`.
-The initial PR's scope was deliberately narrow. It set up the API
-and delivered the reverse-scan case end-to-end, but did **not** add
-any statistics-based file rearrangement — that came later via
-[#21182], covered in
+The rule's initial scope was deliberately narrow. It set up the
+API and delivered the reverse-scan case end-to-end, but did **not**
+add any statistics-based file rearrangement — that came later,
+covered in
[Sort Elimination via Statistics](#sort-elimination-via-statistics)
below. A finer-grained extension that broadens this `Inexact` path
-with a three-step runtime reorder pipeline landed in [#21956] —
-covered in
+with a three-step runtime reorder pipeline is covered in
[Runtime Reorder for TopK Convergence](#runtime-reorder-for-topk-convergence).
-[#19064] also produced a useful side improvement:
-
-* **Reverse-output redesign** ([#19446], [#19557]) extended the same
- rule to `DESC` queries — picked up again in the reverse-scan
- section below.
-
-[#19446]: https://github.com/apache/datafusion/pull/19446
-[#19557]: https://github.com/apache/datafusion/pull/19557
+The same rule also handles **reverse-output** for `DESC` queries —
+picked up again in the reverse-scan section below.
## Sort Elimination via Statistics
The initial `Inexact`-only path left a sharp edge that motivated
-stats-based sort elimination ([#21182]). Consider this realistic
-scenario:
-
-[#21182]: https://github.com/apache/datafusion/pull/21182
+stats-based sort elimination. Consider this realistic scenario:
* Three files: `a.parquet`, `b.parquet`, `c.parquet`.
* Each declares `WITH ORDER (ts ASC)`.
@@ -239,21 +226,21 @@ the scan now has no declared ordering, so `EnforceSorting` (which runs
earlier in the pipeline) inserts a `SortExec`. The data is sorted on
disk; the optimizer just can't tell.
-[#21182] fixes this in `PushdownSort`, which runs late — after
-`EnforceDistribution` and `EnforceSorting` have already shaped the
-plan. When `PushdownSort` finds a `SortExec` above a file scan whose
-ordering was stripped (a `FileSource` `Unsupported` result), it does
-three things inside `FileScanConfig::try_pushdown_sort`:
+Stats-based sort elimination fixes this in `PushdownSort`, which
+runs late — after `EnforceDistribution` and `EnforceSorting` have
+already shaped the plan. When `PushdownSort` finds a `SortExec`
+above a file scan whose ordering was stripped (a `FileSource`
+`Unsupported` result), it does three things inside
+`FileScanConfig::try_pushdown_sort`:
1. **Sort the file list by per-file statistics on the sort
column(s)** within each file group (the diagram above). The
- pre-existing [`MinMaxStatistics`] helper (introduced in [#9593])
- reads each file's `column_statistics[c].min_value` /
- `.max_value` for each sort column `c`, then sorts the file list by
- the min row. The PR wires this helper into the optimizer's
- `Unsupported` branch — `sort_files_within_groups_by_statistics`
- does the per-group orchestration and decides whether any group is
- non-overlapping after the sort.
+ pre-existing [`MinMaxStatistics`] helper reads each file's
+ `column_statistics[c].min_value` / `.max_value` for each sort
+ column `c`, then sorts the file list by the min row.
+ `sort_files_within_groups_by_statistics` does the per-group
+ orchestration and decides whether each group is non-overlapping
+ after the sort.
2. **Check adjacency within each group**: walk each sorted file group
independently and ask whether `file[i].max ≤ file[i+1].min` for
every adjacent pair (touching at the boundary is fine — value `v`
@@ -269,7 +256,6 @@ three things inside `FileScanConfig::try_pushdown_sort`:
itself and the plan becomes streamable.
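The sort-then-check part of these steps can be sketched in a few lines. This is a simplified illustration over integer ranges, not the code in `FileScanConfig::try_pushdown_sort`; `FileRange` and `sort_and_check_non_overlap` are hypothetical names, and real statistics may be missing or inexact, which this sketch ignores.

```rust
/// Hypothetical per-file summary of the sort column's min/max statistics.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: &'static str,
    min: i64,
    max: i64,
}

/// Sorts one file group by `min`, then checks every adjacent pair.
/// Returns true when `file[i].max <= file[i+1].min` holds throughout,
/// so ranges that merely touch at a boundary still count as non-overlapping.
fn sort_and_check_non_overlap(group: &mut [FileRange]) -> bool {
    group.sort_by_key(|f| f.min);
    group.windows(2).all(|pair| pair[0].max <= pair[1].min)
}
```

A group listed as `a` (200 to 300), `b` (100 to 200), `c` (0 to 100) is reordered to `c`, `b`, `a` and passes the check. A group with ranges 0 to 150 and 100 to 200 fails it, so the `SortExec` would stay.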
[`MinMaxStatistics`]: https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/statistics.rs
-[#9593]: https://github.com/apache/datafusion/pull/9593
One caveat that comes straight from `MinMaxStatistics`: the stats
sort only fires when every `ORDER BY` expression is a plain column.
@@ -278,11 +264,11 @@ is no per-file min/max for the function output to compare against.
Extending sort pushdown across monotonic function wrappers is one of
the open follow-ups.
-*(Within #21956's `Inexact` path, `EquivalenceProperties`'s
-monotonicity inference does let function-wrapped sorts benefit from
-row-group iteration reverse when the source declares a compatible
-natural ordering — but stats-based reorder still needs a plain
-column.)*
+*(The runtime reorder path covered later does let function-wrapped
+sorts benefit from row-group iteration reverse via
+`EquivalenceProperties`'s monotonicity inference, when the source
+declares a compatible natural ordering — but stats-based sort
+elimination still needs a plain column.)*
@@ -305,7 +291,7 @@ The implementation handles a few edge cases worth calling out:
rule compensates by inserting a [`BufferExec`] in the `SortExec`'s
place — bounded streaming buffer, same throughput shape, no
blocking sort. Capacity is configurable via
- [`sort_pushdown_buffer_capacity`] ([#21426]).
+ [`sort_pushdown_buffer_capacity`].
* **`fetch` preservation** through `EnforceDistribution`. The
distribution rule sometimes strips a `SortExec`'s `fetch` field and
re-adds the node later. The PR plumbs `fetch` through so a
@@ -330,7 +316,6 @@ The implementation handles a few edge cases worth calling out:
[`BufferExec`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/buffer.rs
[`sort_pushdown_buffer_capacity`]: https://github.com/apache/datafusion/pull/21426
-[#21426]: https://github.com/apache/datafusion/pull/21426
## Benchmarks
@@ -340,7 +325,7 @@ The [`sort_pushdown`] benchmark suite reproduces the
"wrong-order file list" scenario by generating Parquet files whose
names are intentionally reversed against their sort-key ranges. Numbers
below are `--partitions 1`, release build, with stats-based sort
-elimination ([#21182]) enabled, versus `main`:
+elimination enabled, versus `main`:
[`sort_pushdown`]: https://github.com/apache/datafusion/tree/main/benchmarks/queries/sort_pushdown
@@ -393,16 +378,16 @@ fall outside that window:
For both, a full external `SortExec` is overkill. The parquet
metadata is right there, and reading the *most-promising* data
first lets `TopK`'s dynamic filter threshold tighten quickly so the
-rest gets pruned. [#21956] wires that up by generalising the
-`Inexact` path that [#19064] introduced.
+rest gets pruned. Runtime reorder wires that up by generalising
+the `Inexact` path the rule introduced.
### `try_pushdown_sort` — one decision, three outcomes
-The `Exact` / `Inexact` / `Unsupported` protocol from [#19064]
-stays. The new PR broadens the **conditions** that route a query
-into `Inexact`:
+The `Exact` / `Inexact` / `Unsupported` protocol stays. The
+runtime reorder path broadens the **conditions** that route a
+query into `Inexact`:
| Condition | Outcome |
| --- | --- |
@@ -435,9 +420,10 @@ struct ParquetSource {
The opener reads them at scan time to drive three composable steps:
1. **File-level reorder.** `FileSource::reorder_files` sits in the
- shared morsel queue (the [#21351] work-stealing primitive) and
- sorts the partitioned-file list by `min(col)`. The first file
- picked across all partitions is globally the most-promising one.
+ shared morsel queue (a work-stealing primitive that lets sibling
+ partitions share a single file pool) and sorts the
+ partitioned-file list by `min(col)`. The first file picked across
+ all partitions is globally the most-promising one.
2. **Row-group-level reorder.** Once a file is opened,
`PreparedAccessPlan::reorder_by_statistics` sorts that file's
`row_group_indexes` by `min(col)` ASC. The row group most likely
@@ -445,8 +431,9 @@ The opener reads them at scan time to drive three composable steps:
3. **Reverse.** For `DESC` requests,
`PreparedAccessPlan::reverse` flips the iteration after the
stats reorder normalises everything to ASC-by-min. Same
- primitive [#19064] introduced for declared reverse scans —
- [#21956] just routes more queries through it.
+ primitive the rule originally introduced for declared reverse
+ scans — the runtime pipeline just routes more queries through
+ it.
The two layers compose naturally because they sort by the same
key. A file's `min(col)` is the minimum over its row groups'
@@ -490,10 +477,10 @@ no-op.
the scan there'd be nothing actionable to do with it.
* **Multi-column sort secondary keys.** The reorder currently only
uses the leading sort expression — secondary keys are ignored.
- Tracked as a follow-up in [#22198].
+ This remains an open follow-up.
* **Function-wrapped sort without a source-declared ordering.**
Without a declared ordering to invert, the reversed-equivalence
- branch has nothing to satisfy. Tracked in the same follow-up.
+ branch has nothing to satisfy. Covered by the same follow-up.
* **Source declares a forward prefix of the request.** When the
source's declared `output_ordering` is a non-empty proper prefix
of the request (e.g. source `[a DESC, b ASC]`, request
@@ -511,26 +498,25 @@ no-op.
ascending and the query wants descending, we should be able to skip
the sort — we just need to read the data in the opposite order.
-The first iteration of this lives in [#18817] and operates at the
-**row group** level: it reverses the *iteration order of row groups*
-so the last RG is opened first, but rows within each RG are still
-decoded forward. The resulting stream is "RGs descending × rows
-ascending" — close to the requested order, but not strictly DESC. The
-optimizer therefore reports this as `Inexact` and leaves the
-`SortExec` in place; the win is that `TopK`'s dynamic filter tightens
-much faster, because the very first row groups read already contain
-values near the final answer. A tight threshold means subsequent row
-groups can be skipped via min/max statistics. This ships today and
-is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted files.
-
-[#18817]: https://github.com/apache/datafusion/pull/18817
+The first iteration of this operates at the **row group** level:
+it reverses the *iteration order of row groups* so the last RG is
+opened first, but rows within each RG are still decoded forward.
+The resulting stream is "RGs descending × rows ascending" — close
+to the requested order, but not strictly DESC. The optimizer
+therefore reports this as `Inexact` and leaves the `SortExec` in
+place; the win is that `TopK`'s dynamic filter tightens much
+faster, because the very first row groups read already contain
+values near the final answer. A tight threshold means subsequent
+row groups can be skipped via min/max statistics. This ships today
+and is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted
+files.
To turn this into `Exact` reverse — so the `SortExec` can be removed
outright — each emitted batch itself has to be in DESC order. The
straightforward row-group-level approach (decode an entire RG forward,
materialize all rows, reverse the buffer, then emit) is correct and
was actually proposed first, in an earlier iteration of this work
-([#18817], later closed and split into smaller pieces). Review
+that was later closed and split into smaller pieces. Review
feedback there — primarily from [@2010YOUY01] — flagged the memory
profile as too aggressive: caching an entire row group's worth of
decoded rows before any batch can be emitted is roughly:
@@ -542,9 +528,8 @@ decoded rows before any batch can be emitted is roughly:
return 10 — defeating the point of the `LIMIT`.
The agreed direction coming out of that discussion was to ship the
-narrower `Inexact` row-group-reverse first (which landed in
-[#19064]), and to build `Exact` reverse on a finer-grained primitive
-once `arrow-rs` exposed one.
+narrower `Inexact` row-group-reverse first, and to build `Exact`
+reverse on a finer-grained primitive once `arrow-rs` exposed one.
### Empirical note — runtime cost of `Inexact` + `TopK`
@@ -596,20 +581,20 @@ Why didn't we just upstream the internal `Exact` reverse, then?
group, so any RG-level `Exact` implementation — ours included —
has to decode the entire row group, reverse the buffer in
memory, and only then emit. That is the same memory profile that
-`#18817` was rejected for: a peak of one whole row group
-(~128 MB) of decoded data, vs. the few-MB-per-batch streaming
-profile readers normally have. Our runtime advantage over
-`Inexact` + `TopK` does *not* come from decoding less — both
-paths decode the relevant row group's sort column in full — it
-comes from skipping the per-row heap maintenance, the dynamic
+got the earlier RG-level proposal rejected: a peak of one whole
+row group (~128 MB) of decoded data, vs. the few-MB-per-batch
+streaming profile readers normally have. Our runtime advantage
+over `Inexact` + `TopK` does *not* come from decoding less —
+both paths decode the relevant row group's sort column in full —
+it comes from skipping the per-row heap maintenance, the dynamic
filter evaluation, and the `SortExec` final ordering pass that
`Inexact` keeps on top. So we end up running our `Exact` reverse
-in-house but cannot land it as the upstream default for the same
-memory reason that closed `#18817`.
+in-house but cannot land it as the upstream default, for the
+same memory reason that closed the earlier proposal.
**The fix that keeps both the runtime win and a streaming memory
-profile is page-level `Exact` reverse via arrow-rs [#9937]**,
-described next.
+profile is page-level `Exact` reverse via arrow-rs**, described
+next.
That primitive is the **page-level** reverse traversal. Parquet's
`OffsetIndex` already gives us byte-precise locations for every data
@@ -618,19 +603,16 @@ decode it forward, reverse the resulting batch, and emit. Peak buffer
drops to one page (~1 MB) and first-batch latency drops to the cost
of one page decode — the row-group-level memory cliff disappears.
-We are landing this primitive upstream in arrow-rs as
-[#9937], with the discussion in [#9934]. Early numbers on a 100k-row,
-98-page column chunk show **~50× faster time-to-first-N** for `n ≤ 1
-page` and **~9× faster** for `n` spanning 10 pages, compared with the
-row-group-level Exact reverse described above. The DataFusion-side
-integration that turns this primitive into an `Exact` result is a
-follow-up to #9937 and is gated on its merge.
+We are landing this primitive upstream in arrow-rs. Early numbers
+on a 100k-row, 98-page column chunk show **~50× faster
+time-to-first-N** for `n ≤ 1 page` and **~9× faster** for `n`
+spanning 10 pages, compared with the row-group-level Exact reverse
+described above. The DataFusion-side integration that turns this
+primitive into an `Exact` result is a follow-up and is gated on
+the arrow-rs merge.
[@2010YOUY01]: https://github.com/2010YOUY01
-[#9937]: https://github.com/apache/arrow-rs/pull/9937
-[#9934]: https://github.com/apache/arrow-rs/issues/9934
-
One natural question: why not reverse the rows *within* a page
directly? Because we can't. Parquet's page encodings (RLE, dictionary,
delta, bit-packing) are all forward streams — you cannot decode the
@@ -666,39 +648,42 @@ e.g. when file ranges genuinely overlap, or when the sort is on a
function output rather than a plain column. The two directions are
not alternatives; they compose:
-* **`Exact` reverse for `ORDER BY ... DESC`.** Today's row-group
+* [`Exact` reverse for `ORDER BY ... DESC`]. Today's row-group
reverse returns `Inexact` and the `SortExec` stays on top; the
- arrow-rs page-level reverse primitive ([#9937]) is what unlocks
- `Exact` reverse on `DESC` queries (and therefore full `SortExec`
- elimination on `DESC`). Memory + first-batch latency rule out doing
- the same thing at the row-group level. Gated on #9937.
-* **Dynamic / TopK-driven path.** When `Exact` cannot fire, `TopK`'s
- [dynamic filter][dyn-filters-blog] still benefits enormously from
- reading the *best* data first. This thread also builds on the
- [limit pruning][limit-pruning-blog] work that turned `LIMIT` into
- an I/O optimization across the pruning pipeline. The
- recently-merged morsel-style work scheduling in `FileStream`
- ([#21351]) gives sibling partitions a *shared work queue* with
+ arrow-rs page-level reverse primitive is what unlocks `Exact`
+ reverse on `DESC` queries (and therefore full `SortExec`
+ elimination on `DESC`). Memory + first-batch latency rule out
+ doing the same thing at the row-group level. Gated on the
+ arrow-rs side.
+* **Dynamic / TopK-driven path.** When `Exact` cannot fire,
+ `TopK`'s [dynamic filter][dyn-filters-blog] still benefits
+ enormously from reading the *best* data first. This thread also
+ builds on the [limit pruning][limit-pruning-blog] work that
+ turned `LIMIT` into an I/O optimization across the pruning
+ pipeline. The recently-merged [morsel-style work scheduling] in
+ `FileStream` gives sibling partitions a *shared work queue* with
file-level work-stealing — no CPU sits idle when one partition
- runs out of files. The proposed [#21733] sorts files in
- that shared queue by per-file statistics *before* any partition
- picks, so the first file read is globally optimal and tightens the
- dynamic filter immediately. Combined with **TopK threshold init from
- parquet statistics** ([#21712]) and **`try_pushdown_sort` driving
- runtime row-group / file reorder + reverse** ([#21956], landed),
- the threshold can be set before reading a single byte. The reorder
+ runs out of files. The proposed
+ [global file reorder in the shared queue] sorts files in that
+ shared queue by per-file statistics *before* any partition
+ picks, so the first file read is globally optimal and tightens
+ the dynamic filter immediately. Combined with
+ [TopK threshold init from parquet statistics] and the runtime
+ row-group / file reorder + reverse path described above, the
+ threshold can be set before reading a single byte. The reorder
mechanism applies to any `ORDER BY [LIMIT N]` on
- parquet, not just TopK queries with a dynamic filter. The combined statistics-driven `TopK` pipeline is in flight
- as [#21580].
+  parquet, not just `TopK` queries with a dynamic filter. The
+ [combined statistics-driven `TopK` pipeline] is in flight.
The mechanism here is **RG-level pruning, not mid-stream early
return**. With the threshold known up front, the parquet
- `PruningPredicate` rejects entire row groups against their min/max
- statistics before any I/O — those row groups are never decoded.
- The row group(s) the reader *does* open still have their sort
- column decoded in full to feed the dynamic filter. On the #21580
- microbenchmark (single file, 61 sorted row groups, `--partitions 1`),
- **60 of the 61 row groups are skipped** and only one is decoded:
+ `PruningPredicate` rejects entire row groups against their
+ min/max statistics before any I/O — those row groups are never
+ decoded. The row group(s) the reader *does* open still have
+ their sort column decoded in full to feed the dynamic filter.
+ On the in-flight microbenchmark (single file, 61 sorted row
+ groups, `--partitions 1`), **60 of the 61 row groups are
+ skipped** and only one is decoded:
| Query | Baseline | With pipeline | Speedup |
| ------------------------------ | -------: | ------------: | ------: |
@@ -709,81 +694,81 @@ not alternatives; they compose:
The stack reports `Inexact` — the `SortExec` stays on top to
enforce correctness across overlapping ranges — so this path
- cannot do *true* mid-stream early return. Once the parquet reader
- opens a row group, the sort column has to be decoded all the way
- through; once a `FileStream` picks up a file from the shared work
- queue, it has to finish that file. Today's dynamic work scheduling
- ([#21351]) is **file-granular**: idle partitions stop pulling
- new files from the queue once a global limit is satisfied, but
- the partition that's currently inside a file decodes that file's
- remaining row groups regardless. Mid-file RG-level early return
- on `TopK` convergence is **not implemented yet** — the work
- queue holds `PartitionedFile`, not row-group descriptors.
-
- Closing the tap the moment `TopK` has K confirmed winners therefore
- needs either:
-
- * the **`Exact` path**, where the `SortExec` is gone entirely and
- the data source's own `fetch` becomes a static limit that the
- reader can honour at batch granularity; or
+ cannot do *true* mid-stream early return. Once the parquet
+ reader opens a row group, the sort column has to be decoded all
+ the way through; once a `FileStream` picks up a file from the
+ shared work queue, it has to finish that file. Today's dynamic
+ work scheduling is **file-granular**: idle partitions stop
+ pulling new files from the queue once a global limit is
+ satisfied, but the partition that's currently inside a file
+ decodes that file's remaining row groups regardless. Mid-file
+ RG-level early return on `TopK` convergence is **not implemented
+ yet** — the work queue holds `PartitionedFile`, not row-group
+ descriptors.
+
+ Closing the tap the moment `TopK` has K confirmed winners
+ therefore needs either:
+
+ * the **`Exact` path**, where the `SortExec` is gone entirely
+ and the data source's own `fetch` becomes a static limit that
+ the reader can honour at batch granularity; or
* **finer-grained dynamic scheduling** — having the shared queue
- hold row-group descriptors instead of whole files, so a partition
- can release its current file's remaining row groups back to the
- pool once a global signal says enough TopK winners have been
- found. This is a natural extension of [#21351] and [#21733] but
- is not yet on a PR.
-
- The three mechanisms compose. Stats pruning saves the row groups
- that *can't* matter (skipped without I/O). The dynamic filter
- narrows what's decoded inside the row groups the reader does
- open. `Exact` or finer-grained scheduling is what eventually
- closes the tap once `TopK` is satisfied.
-* **Filtered reverse TopK end-to-end.** `WHERE filter ORDER BY ts
- DESC LIMIT N` is the dominant observability query shape and the
- one where the arrow-rs page-reverse primitive matters most:
- `RowSelection::with_limit` cannot pre-compute the last `N` matching
- rows when the filter is selective, so the only correct strategy is
- to stream pages backward, evaluate the filter, and stop when `N`
- matches are collected. The DataFusion-side integration is the
- follow-up to #9937.
-* **Unifying `EnforceDistribution` and `EnforceSorting`** into a
- single `EnsureRequirements` rule ([#21976]). The two existing rules
- are coupled through `SortExec.preserve_partitioning`, which makes
+ hold row-group descriptors instead of whole files, so a
+ partition can release its current file's remaining row groups
+ back to the pool once a global signal says enough TopK
+ winners have been found. A natural extension of the existing
+ morsel work but not yet on a PR.
+
+ The three mechanisms compose. Stats pruning saves the row
+ groups that *can't* matter (skipped without I/O). The dynamic
+ filter narrows what's decoded inside the row groups the reader
+ does open. `Exact` or finer-grained scheduling is what
+ eventually closes the tap once `TopK` is satisfied.
+* **Filtered reverse `TopK` end-to-end.** `WHERE filter ORDER BY
+ ts DESC LIMIT N` is the dominant observability query shape and
+ the one where the arrow-rs page-reverse primitive matters most:
+ `RowSelection::with_limit` cannot pre-compute the last `N`
+ matching rows when the filter is selective, so the only correct
+ strategy is to stream pages backward, evaluate the filter, and
+ stop when `N` matches are collected. The DataFusion-side
+ integration is a follow-up to the arrow-rs primitive.
+* [Unifying `EnforceDistribution` and `EnforceSorting`] into a
+ single `EnsureRequirements` rule. The two existing rules are
+ coupled through `SortExec.preserve_partitioning`, which makes
their composition non-idempotent and has caused a class of
production bugs. Other engines (Spark's `EnsureRequirements`,
- Trino's `AddExchanges`) handle both in a single rule. Merging them
- also gives future sort-related optimizations a single coherent place
- to live. In progress.
-* **OFFSET pushdown to parquet** ([#21828]) so `ORDER BY ts LIMIT K
- OFFSET N` queries can skip the first `N` rows at the row-group level
+ Trino's `AddExchanges`) handle both in a single rule. Merging
+ them also gives future sort-related optimizations a single
+ coherent place to live. In progress.
+* [OFFSET pushdown to parquet] so `ORDER BY ts LIMIT K OFFSET N`
+ queries can skip the first `N` rows at the row-group level
instead of decoding and discarding them. In progress.
-* **Multi-column and function-wrapped reorder follow-ups** ([#22198]).
- The reorder mechanism in #21956 currently only uses the leading
- sort key and only fires on plain columns. Lexicographic multi-key
- reorder via `arrow::compute::lexsort_to_indices` is low-hanging
- fruit; extending to monotonic function wrappers via leaf-column
- extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs a
- bit more `EquivalenceProperties` integration but is doable.
-
-[#21976]: https://github.com/apache/datafusion/pull/21976
-[#21956]: https://github.com/apache/datafusion/pull/21956
-[#22198]: https://github.com/apache/datafusion/issues/22198
-[#21712]: https://github.com/apache/datafusion/pull/21712
-[#21580]: https://github.com/apache/datafusion/pull/21580
-[#21828]: https://github.com/apache/datafusion/pull/21828
-[#21351]: https://github.com/apache/datafusion/pull/21351
-[#21733]: https://github.com/apache/datafusion/issues/21733
+* [Multi-column and function-wrapped reorder follow-ups]. The
+ reorder mechanism currently only uses the leading sort key and
+ only fires on plain columns. Lexicographic multi-key reorder
+ via `arrow::compute::lexsort_to_indices` is low-hanging fruit;
+ extending to monotonic function wrappers via leaf-column
+ extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs
+ a bit more `EquivalenceProperties` integration but is doable.
+
+[`Exact` reverse for `ORDER BY ... DESC`]: https://github.com/apache/arrow-rs/pull/9937
+[morsel-style work scheduling]: https://github.com/apache/datafusion/pull/21351
+[global file reorder in the shared queue]: https://github.com/apache/datafusion/issues/21733
+[TopK threshold init from parquet statistics]: https://github.com/apache/datafusion/pull/21712
+[combined statistics-driven `TopK` pipeline]: https://github.com/apache/datafusion/pull/21580
+[Unifying `EnforceDistribution` and `EnforceSorting`]: https://github.com/apache/datafusion/pull/21976
+[OFFSET pushdown to parquet]: https://github.com/apache/datafusion/pull/21828
+[Multi-column and function-wrapped reorder follow-ups]: https://github.com/apache/datafusion/issues/22198
Concretely useful issues for new contributors:
-* [#17348] — the umbrella issue for sort pushdown.
-* [#21317] — sort pushdown: reorder row groups by statistics within
- each file.
-* [#19394] — add more `ExecutionPlan` impls to support sort pushdown.
+* [Umbrella issue for sort pushdown][umbrella-issue].
+* [Reorder row groups by statistics within each file][rg-reorder-issue].
+* [Add more `ExecutionPlan` impls to support sort pushdown][more-impls-issue].
-[#17348]: https://github.com/apache/datafusion/issues/17348
-[#21317]: https://github.com/apache/datafusion/issues/21317
-[#19394]: https://github.com/apache/datafusion/issues/19394
+[umbrella-issue]: https://github.com/apache/datafusion/issues/17348
+[rg-reorder-issue]: https://github.com/apache/datafusion/issues/21317
+[more-impls-issue]: https://github.com/apache/datafusion/issues/19394
## Acknowledgements
@@ -806,18 +791,26 @@ Prior posts this work builds on:
* [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries][dyn-filters-blog] — the dynamic filter primitive `TopK` uses.
* [Turning LIMIT into an I/O Optimization: Inside DataFusion's Multi-Layer Pruning Stack][limit-pruning-blog] — the pruning pipeline this work plugs into.
-Issues and PRs:
-
-* Umbrella issue: [apache/datafusion#17348][#17348]
-* `MinMaxStatistics` foundation: [apache/datafusion#9593][#9593]
-* `PushdownSort` rule + row-group reverse: [apache/datafusion#19064][#19064]
-* Sort elimination via statistics: [apache/datafusion#21182][#21182]
-* Runtime reorder for TopK convergence: [apache/datafusion#21956][#21956]
-* `BufferExec` capacity for sort elimination: [apache/datafusion#21426][#21426]
-* Dynamic / TopK-driven path: [apache/datafusion#21351][#21351] (morsel-style work scheduling),
- [apache/datafusion#21733][#21733] (global file reorder in shared queue)
-* Benchmark suite: [`sort_pushdown`]
-* Row-group reverse scan: [apache/datafusion#18817][#18817]
-* Page-level reverse (arrow-rs): [apache/arrow-rs#9934][#9934],
- [apache/arrow-rs#9937][#9937]
-* `EnsureRequirements`: [apache/datafusion#21976][#21976]
+Landed PRs that make up this work:
+
+* `MinMaxStatistics` foundation: [apache/datafusion#9593](https://github.com/apache/datafusion/pull/9593)
+* `PushdownSort` rule + row-group reverse: [apache/datafusion#19064](https://github.com/apache/datafusion/pull/19064)
+* Reverse-output redesign: [apache/datafusion#19446](https://github.com/apache/datafusion/pull/19446), [apache/datafusion#19557](https://github.com/apache/datafusion/pull/19557)
+* Sort elimination via statistics: [apache/datafusion#21182](https://github.com/apache/datafusion/pull/21182)
+* `BufferExec` capacity for sort elimination: [apache/datafusion#21426](https://github.com/apache/datafusion/pull/21426)
+* Morsel-style work scheduling: [apache/datafusion#21351](https://github.com/apache/datafusion/pull/21351)
+* Runtime reorder for `TopK` convergence: [apache/datafusion#21956](https://github.com/apache/datafusion/pull/21956)
+* Row-group-level reverse (original proposal, later closed and split into smaller pieces): [apache/datafusion#18817](https://github.com/apache/datafusion/pull/18817)
+
+In flight / open:
+
+* Page-level reverse (arrow-rs): [apache/arrow-rs#9937](https://github.com/apache/arrow-rs/pull/9937), discussion in [apache/arrow-rs#9934](https://github.com/apache/arrow-rs/issues/9934)
+* `EnsureRequirements`: [apache/datafusion#21976](https://github.com/apache/datafusion/pull/21976)
+* OFFSET pushdown to parquet: [apache/datafusion#21828](https://github.com/apache/datafusion/pull/21828)
+* `TopK` threshold init from parquet statistics: [apache/datafusion#21712](https://github.com/apache/datafusion/pull/21712)
+* Combined statistics-driven `TopK` pipeline: [apache/datafusion#21580](https://github.com/apache/datafusion/pull/21580)
+* Global file reorder in shared queue: [apache/datafusion#21733](https://github.com/apache/datafusion/issues/21733)
+* Multi-column / function-wrapped reorder follow-ups: [apache/datafusion#22198](https://github.com/apache/datafusion/issues/22198)
+* Umbrella issue for sort pushdown: [apache/datafusion#17348](https://github.com/apache/datafusion/issues/17348)
+
+Benchmark suite: [`sort_pushdown`]
From 3d069008cfeffcaf103cf39808dde518e1ee6a18 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 15:24:39 +0800
Subject: [PATCH 10/14] Restructure: split into merged-features / bottlenecks /
roadmap
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The previous draft mixed merged work, in-flight work, and runtime-cost
analysis into a single 'Reverse Scans' section and a sprawling 'What's
Next' section. Reorganize so the post answers three clear questions in
sequence:
1. What's merged today? (Sort Elimination via Statistics + benchmark,
Runtime Reorder for TopK Convergence, Reverse Scans for DESC) —
unchanged content, just kept tight.
2. Where do those merged features still leave performance on the
table? New 'Current Bottlenecks' section with three explicitly
numbered bottlenecks: SortExec stays / sort column fully decoded
inside open RG / file-granular scheduling can't close the tap
mid-file. Pulls in the runtime-cost content that used to be buried
in an 'Empirical note' subsection.
3. How does each next-step optimization remove a specific bottleneck?
New 'Roadmap' section maps page-level Exact reverse to bottlenecks
1+2, row-group-level dynamic early termination to bottleneck 3, and
shows the in-flight 17x-60x pipeline benchmark as a preview of what
stacking these mechanisms can deliver.
Smaller follow-ups (EnsureRequirements, OFFSET pushdown, multi-column
reorder) collected at the end of the roadmap section as a short
'Other follow-ups' bullet list.
---
content/blog/2026-05-25-sort-pushdown.md | 365 ++++++++++-------------
1 file changed, 152 insertions(+), 213 deletions(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 8e12170c..7306d21a 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -511,118 +511,107 @@ row groups can be skipped via min/max statistics. This ships today
and is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted
files.
-To turn this into `Exact` reverse — so the `SortExec` can be removed
-outright — each emitted batch itself has to be in DESC order. The
-straightforward row-group-level approach (decode an entire RG forward,
-materialize all rows, reverse the buffer, then emit) is correct and
-was actually proposed first, in an earlier iteration of this work
-that was later closed and split into smaller pieces. Review
-feedback there — primarily from [@2010YOUY01] — flagged the memory
-profile as too aggressive: caching an entire row group's worth of
-decoded rows before any batch can be emitted is roughly:
-
-* **Peak buffer of one whole row group** (~128 MB by default), versus
- the few-MB-per-batch streaming profile readers normally have.
-* **First-batch latency = full last-row-group decode**. For
- `ORDER BY ts DESC LIMIT 10` that means decoding ~1 million rows to
- return 10 — defeating the point of the `LIMIT`.
-
-The agreed direction coming out of that discussion was to ship the
-narrower `Inexact` row-group-reverse first, and to build `Exact`
-reverse on a finer-grained primitive once `arrow-rs` exposed one.
-
-### Empirical note — runtime cost of `Inexact` + `TopK`
-
-We run an internal row-group-level `Exact` reverse implementation in
-production and tested swapping in upstream's `Inexact` row-group
-reverse + `TopK` on `ORDER BY ts DESC LIMIT N` queries. End-to-end
-latency went **up**, not down. A few cost components stack up on the
-`Inexact` + `TopK` side:
-
-* **`LIMIT N` does not propagate as a static stop signal to the
- source.** In the `Inexact` path the `SortExec` stays on top and
- `TopK`'s fetch belongs to `SortExec`, not to the parquet scan. The
- only mechanism that can cut work below the `SortExec` is the
- dynamic-filter pushdown: as the heap fills, the filter (`ts >
- threshold`) is pushed to the source and its threshold tightens
- with every batch. That filter is enough to **stats-prune
- subsequent, not-yet-opened row groups** entirely — if a row
- group's `max(ts) < threshold` it is skipped without decode. But
- inside the row group the source is currently reading, the
- filter pushdown does not unwind to "stop": the sort column has
- to be **fully decoded** so the filter can be evaluated row by
- row, the surviving rows feed the heap to tighten the threshold,
- and only then can the resulting `RowSelection` skip the *other*
- columns for rows that didn't pass. For
- `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that is still
- ~1M sort-column decodes regardless of `N`; the LIMIT only saves
- work on non-sort columns inside the same row group and on whole
- *subsequent* row groups that the tightened threshold can prune.
- The internal RG-level `Exact` reverse path, by contrast, deletes
- the `SortExec` so the LIMIT becomes a static fetch on the source.
- The source still has to decode the target row group in full —
- parquet does not allow partial row-group reads, so this part is
- the same as `Inexact` — but it then reverses the buffer in
- memory, takes the first K rows, and **stops**. No subsequent row
- group is opened, no stats check, no filter machinery, no per-row
- heap maintenance, no `SortExec` final ordering pass. The wins
- come from removing those per-row and per-RG overheads on top, not
- from decoding less sort-column data on the target row group.
-* **`SortExec` itself adds ordering work on top of `Inexact`.** The
- reversed-RG stream is not strictly DESC (rows within each RG are
- still forward), so `Inexact` keeps the surrounding `SortExec`.
- Even when the heap is settled and the dynamic filter has
- pruned the tail, the outer operator does its own final ordering
- pass — overhead that `Exact` (which deletes the `SortExec`)
- does not pay.
-
-Why didn't we just upstream the internal `Exact` reverse, then?
-**Memory.** Parquet does not allow reading only part of a row
-group, so any RG-level `Exact` implementation — ours included —
-has to decode the entire row group, reverse the buffer in
-memory, and only then emit. That is the same memory profile that
-got the earlier RG-level proposal rejected: a peak of one whole
-row group (~128 MB) of decoded data, vs. the few-MB-per-batch
-streaming profile readers normally have. Our runtime advantage
-over `Inexact` + `TopK` does *not* come from decoding less —
-both paths decode the relevant row group's sort column in full —
-it comes from skipping the per-row heap maintenance, the dynamic
-filter evaluation, and the `SortExec` final ordering pass that
-`Inexact` keeps on top. So we end up running our `Exact` reverse
-in-house but cannot land it as the upstream default, for the
-same memory reason that closed the earlier proposal.
-
-**The fix that keeps both the runtime win and a streaming memory
-profile is page-level `Exact` reverse via arrow-rs**, described
-next.
-
-That primitive is the **page-level** reverse traversal. Parquet's
-`OffsetIndex` already gives us byte-precise locations for every data
-page in a column chunk, so we can `seek` directly to the last page,
-decode it forward, reverse the resulting batch, and emit. Peak buffer
-drops to one page (~1 MB) and first-batch latency drops to the cost
-of one page decode — the row-group-level memory cliff disappears.
-
-We are landing this primitive upstream in arrow-rs. Early numbers
-on a 100k-row, 98-page column chunk show **~50× faster
-time-to-first-N** for `n ≤ 1 page` and **~9× faster** for `n`
-spanning 10 pages, compared with the row-group-level Exact reverse
-described above. The DataFusion-side integration that turns this
-primitive into an `Exact` result is a follow-up and is gated on
-the arrow-rs merge.
+Why not full `Exact` reverse, which would delete the `SortExec`
+outright? An earlier row-group-level proposal did that: decode an
+entire row group forward, materialize all rows, reverse the buffer,
+then emit. It was correct, but review feedback (primarily from
+[@2010YOUY01]) flagged its memory profile as prohibitive: a peak of
+one whole row group (~128 MB) of decoded data vs. the few-MB-per-batch
+streaming profile readers normally have. The agreed direction was
+to ship the narrower `Inexact` row-group-reverse first, and to
+build `Exact` reverse on a finer-grained primitive once `arrow-rs`
+exposed one. The bottleneck section below details what that
+`Inexact`-keeps-`SortExec` decision costs at runtime, and the
+roadmap section after it describes how the page-level primitive
+removes the cost.
[@2010YOUY01]: https://github.com/2010YOUY01
-One natural question: why not reverse the rows *within* a page
-directly? Because we can't. Parquet's page encodings (RLE, dictionary,
-delta, bit-packing) are all forward streams — you cannot decode the
-last value without decoding every value that came before it. The
-design therefore is: **reverse the page traversal, forward-decode
-each page, reverse the resulting RecordBatch**. This is the algorithm
-shape DataFusion's `RecordBatchReader` integration will use once
-arrow-rs ships the primitive.
+## Current Bottlenecks
+
+Stats-based **sort elimination** removes the `SortExec` entirely
+when ranges are non-overlapping — there's nothing more to optimize
+on that path. But the `Inexact` paths (**runtime reorder** for
+`TopK`, and **row-group reverse** for `DESC`) leave three concrete
+inefficiencies on the table when `Exact` cannot fire:
+
+### Bottleneck 1: `SortExec` stays on top, so `LIMIT N` does not propagate as a static stop signal
+
+In the `Inexact` path the `SortExec` stays in the plan and
+`TopK`'s fetch belongs to `SortExec`, not to the parquet scan.
+The only thing that can cut work below the `SortExec` is the
+dynamic-filter pushdown: as the heap fills, the filter
+(`ts > threshold`) is pushed to the source and its threshold
+tightens with every batch. That filter does **stats-prune
+subsequent, not-yet-opened row groups** — if a row group's
+`max(ts) < threshold` it is skipped without decode. But the
+`SortExec` keeps pulling batches, and the outer operator does its
+own final ordering pass on the "RGs descending × rows ascending"
+stream even after the heap is settled. We have measured this
+in-house: swapping our internal `Exact` reverse for upstream's
+`Inexact` reverse + `TopK` on `ORDER BY ts DESC LIMIT N` makes
+end-to-end latency go **up**, not down — exactly because the
+`SortExec` final pass and the per-row heap maintenance pile up on
+top.
+
+### Bottleneck 2: Inside the currently-open row group, the sort column is fully decoded
+
+Even with the dynamic filter pushed all the way to parquet, the
+filter has to be evaluated row-by-row inside the open row group:
+the sort column has to be **fully decoded** so each value can be
+compared against the threshold, the surviving rows feed the heap
+to tighten the threshold, and only then can the resulting
+`RowSelection` skip the *other* columns for rows that didn't
+pass. For `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that
+is ~1M sort-column decodes regardless of `N`. Parquet doesn't
+allow partial row-group reads, so even an RG-level `Exact`
+reverse would pay this same cost — the only way to materially
+reduce it is to drop to page granularity.
+
+### Bottleneck 3: File-granular work scheduling can't close the tap mid-file
+
+Once a `FileStream` picks up a file from the shared work queue,
+it has to finish that file. Today's dynamic work scheduling is
+**file-granular**: idle partitions stop pulling new files from
+the queue once a global limit is satisfied, but the partition
+that's currently inside a file decodes that file's remaining row
+groups regardless. The work queue holds `PartitionedFile`, not
+row-group descriptors. So even with a tight threshold and
+aggressive stats pruning of un-opened row groups, the *currently
+open* file gets read to completion.
+
+## Roadmap: Removing the Bottlenecks
+
+### Page-level `Exact` reverse — addresses bottlenecks 1 + 2
+
+Parquet's `OffsetIndex` gives us byte-precise locations for every
+data page in a column chunk, so we can `seek` directly to the last
+page, decode it forward, reverse the resulting batch, and emit.
+Peak buffer drops from ~128 MB (one row group) to ~1 MB (one
+page), and first-batch latency drops to the cost of one page
+decode — the row-group-level memory cliff disappears. With each
+batch already in DESC order, `PushdownSort` can finally return
+`Exact` for `DESC` requests, the `SortExec` is removed, and
+`LIMIT N` becomes a static fetch on the source. The
+`Inexact`-final-ordering-pass overhead from Bottleneck 1 goes
+away outright, and the Bottleneck-2 decode reduces to the rows
+the page-level seek actually pulls in.
+
+Why not reverse the rows *within* a page directly? Because we
+can't. Parquet's page encodings (RLE, dictionary, delta,
+bit-packing) are all forward streams — you cannot decode the last
+value without decoding every value that came before it. The
+design is: **reverse the page traversal, forward-decode each
+page, reverse the resulting `RecordBatch`**.
+
+The primitive is landing upstream in arrow-rs. Early numbers on a
+100k-row, 98-page column chunk show **~50× faster
+time-to-first-N** for `n ≤ 1 page` and **~9× faster** for `n`
+spanning 10 pages, compared with the row-group-level `Exact`
+reverse. The DataFusion-side integration that turns this primitive
+into an `Exact` result is a follow-up gated on the arrow-rs merge.
-The killer use case is **filtered reverse TopK**:
+The killer use case is **filtered reverse `TopK`**:
```sql
SELECT * FROM events
@@ -631,115 +620,66 @@ ORDER BY ts DESC
LIMIT 10
```
-Here `RowSelection::with_limit` cannot help — you don't know in
-advance which rows match `user_id = 42`, so you can't pre-compute a
-selection of the "last 10 matching rows". The only correct strategy
-is to stream pages backward, evaluate the filter on each, and stop
-when 10 matches are collected. Row-group reverse stops at a
-~128 MB granularity. Page reverse stops at ~1 MB granularity. For a
-selective filter, the saving compounds.
-
-## What's Next
-
-Sort pushdown is a long-running line of work and there is more to do.
-Beyond the `Exact` path described above, there is a complementary
-**dynamic / TopK-driven path** that helps when `Exact` cannot apply —
-e.g. when file ranges genuinely overlap, or when the sort is on a
-function output rather than a plain column. The two directions are
-not alternatives; they compose:
-
-* [`Exact` reverse for `ORDER BY ... DESC`]. Today's row-group
- reverse returns `Inexact` and the `SortExec` stays on top; the
- arrow-rs page-level reverse primitive is what unlocks `Exact`
- reverse on `DESC` queries (and therefore full `SortExec`
- elimination on `DESC`). Memory + first-batch latency rule out
- doing the same thing at the row-group level. Gated on the
- arrow-rs side.
-* **Dynamic / TopK-driven path.** When `Exact` cannot fire,
- `TopK`'s [dynamic filter][dyn-filters-blog] still benefits
- enormously from reading the *best* data first. This thread also
- builds on the [limit pruning][limit-pruning-blog] work that
- turned `LIMIT` into an I/O optimization across the pruning
- pipeline. The recently-merged [morsel-style work scheduling] in
- `FileStream` gives sibling partitions a *shared work queue* with
- file-level work-stealing — no CPU sits idle when one partition
- runs out of files. The proposed
- [global file reorder in the shared queue] sorts files in that
- shared queue by per-file statistics *before* any partition
- picks, so the first file read is globally optimal and tightens
- the dynamic filter immediately. Combined with
- [TopK threshold init from parquet statistics] and the runtime
- row-group / file reorder + reverse path described above, the
- threshold can be set before reading a single byte. The reorder
- mechanism applies to any `ORDER BY [LIMIT N]` on
- parquet, not just TopK queries with a dynamic filter. The
- [combined statistics-driven `TopK` pipeline] is in flight.
-
- The mechanism here is **RG-level pruning, not mid-stream early
- return**. With the threshold known up front, the parquet
- `PruningPredicate` rejects entire row groups against their
- min/max statistics before any I/O — those row groups are never
- decoded. The row group(s) the reader *does* open still have
- their sort column decoded in full to feed the dynamic filter.
- On the in-flight microbenchmark (single file, 61 sorted row
- groups, `--partitions 1`), **60 of the 61 row groups are
- skipped** and only one is decoded:
-
- | Query | Baseline | With pipeline | Speedup |
- | ------------------------------ | -------: | ------------: | ------: |
- | `ORDER BY col DESC LIMIT 100` | 28.5 ms | 1.64 ms | **17×** |
- | `ORDER BY col DESC LIMIT 1000` | 22.2 ms | 0.37 ms | **60×** |
- | `SELECT * ORDER BY ... LIMIT 100` | 22.5 ms | 0.66 ms | **34×** |
- | `SELECT * ORDER BY ... LIMIT 1000` | 22.4 ms | 0.61 ms | **37×** |
-
- The stack reports `Inexact` — the `SortExec` stays on top to
- enforce correctness across overlapping ranges — so this path
- cannot do *true* mid-stream early return. Once the parquet
- reader opens a row group, the sort column has to be decoded all
- the way through; once a `FileStream` picks up a file from the
- shared work queue, it has to finish that file. Today's dynamic
- work scheduling is **file-granular**: idle partitions stop
- pulling new files from the queue once a global limit is
- satisfied, but the partition that's currently inside a file
- decodes that file's remaining row groups regardless. Mid-file
- RG-level early return on `TopK` convergence is **not implemented
- yet** — the work queue holds `PartitionedFile`, not row-group
- descriptors.
-
- Closing the tap the moment `TopK` has K confirmed winners
- therefore needs either:
-
- * the **`Exact` path**, where the `SortExec` is gone entirely
- and the data source's own `fetch` becomes a static limit that
- the reader can honour at batch granularity; or
- * **finer-grained dynamic scheduling** — having the shared queue
- hold row-group descriptors instead of whole files, so a
- partition can release its current file's remaining row groups
- back to the pool once a global signal says enough TopK
- winners have been found. A natural extension of the existing
- morsel work but not yet on a PR.
-
- The three mechanisms compose. Stats pruning saves the row
- groups that *can't* matter (skipped without I/O). The dynamic
- filter narrows what's decoded inside the row groups the reader
- does open. `Exact` or finer-grained scheduling is what
- eventually closes the tap once `TopK` is satisfied.
-* **Filtered reverse `TopK` end-to-end.** `WHERE filter ORDER BY
- ts DESC LIMIT N` is the dominant observability query shape and
- the one where the arrow-rs page-reverse primitive matters most:
- `RowSelection::with_limit` cannot pre-compute the last `N`
- matching rows when the filter is selective, so the only correct
- strategy is to stream pages backward, evaluate the filter, and
- stop when `N` matches are collected. The DataFusion-side
- integration is a follow-up to the arrow-rs primitive.
+`RowSelection::with_limit` cannot help here — you don't know in
+advance which rows match `user_id = 42`, so you can't pre-compute
+a selection of the "last 10 matching rows". The only correct
+strategy is to stream pages backward, evaluate the filter on
+each, and stop when 10 matches are collected. Row-group reverse
+stops at a ~128 MB granularity. Page reverse stops at ~1 MB
+granularity. For a selective filter, the saving compounds.
+
+### Row-group-level dynamic early termination — addresses bottleneck 3
+
+The work queue today holds `PartitionedFile`. Switching it to
+hold **row-group descriptors** instead lets a partition release
+its current file's remaining row groups back to the pool the
+moment a global signal says enough `TopK` winners have been
+found. This is a natural extension of the existing morsel-style
+work scheduling, but no PR exists for it yet.
+
+The two roadmap items above are *complementary*, not
+alternatives:
+
+* `Exact` reverse closes the tap for `DESC` queries by removing
+ the `SortExec` entirely.
+* Row-group-level scheduling closes the tap for `Inexact` queries
+ where `Exact` still cannot fire (function-wrapped sorts,
+ overlapping ranges) — the `SortExec` stays, but the scan stops
+ pulling row groups once `TopK` is satisfied.
+
+### Preview: the combined statistics-driven `TopK` pipeline
+
+The [combined statistics-driven `TopK` pipeline] is the in-flight
+work that stacks several of these mechanisms: pre-scan
+[TopK threshold init from parquet statistics],
+[global file reorder in the shared queue], and the runtime
+row-group / file reorder + reverse already merged. On a
+microbenchmark (single file, 61 sorted row groups, `--partitions 1`)
+**60 of the 61 row groups are skipped** and only one is decoded:
+
+| Query | Baseline | With pipeline | Speedup |
+| ------------------------------ | -------: | ------------: | ------: |
+| `ORDER BY col DESC LIMIT 100` | 28.5 ms | 1.64 ms | **17×** |
+| `ORDER BY col DESC LIMIT 1000` | 22.2 ms | 0.37 ms | **60×** |
+| `SELECT * ORDER BY ... LIMIT 100` | 22.5 ms | 0.66 ms | **34×** |
+| `SELECT * ORDER BY ... LIMIT 1000` | 22.4 ms | 0.61 ms | **37×** |
+
+This pipeline still reports `Inexact` — the `SortExec` stays on
+top to enforce correctness across overlapping ranges — so it pays
+the Bottleneck-1 and Bottleneck-3 overheads listed above. The
+17×–60× speedup is what statistics-driven RG-level pruning alone
+can deliver; `Exact` reverse plus row-group-level early
+termination is what pushes it further.
+
+### Other follow-ups
+
* [Unifying `EnforceDistribution` and `EnforceSorting`] into a
single `EnsureRequirements` rule. The two existing rules are
coupled through `SortExec.preserve_partitioning`, which makes
their composition non-idempotent and has caused a class of
production bugs. Other engines (Spark's `EnsureRequirements`,
- Trino's `AddExchanges`) handle both in a single rule. Merging
- them also gives future sort-related optimizations a single
- coherent place to live. In progress.
+ Trino's `AddExchanges`) handle both in a single rule. In
+ progress.
* [OFFSET pushdown to parquet] so `ORDER BY ts LIMIT K OFFSET N`
queries can skip the first `N` rows at the row-group level
instead of decoding and discarding them. In progress.
@@ -751,7 +691,6 @@ not alternatives; they compose:
extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs
a bit more `EquivalenceProperties` integration but is doable.
-[`Exact` reverse for `ORDER BY ... DESC`]: https://github.com/apache/arrow-rs/pull/9937
[morsel-style work scheduling]: https://github.com/apache/datafusion/pull/21351
[global file reorder in the shared queue]: https://github.com/apache/datafusion/issues/21733
[TopK threshold init from parquet statistics]: https://github.com/apache/datafusion/pull/21712
From 2c22ce303efd46dadcedaf5891ec61ef4d2e84ae Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 15:29:27 +0800
Subject: [PATCH 11/14] Tighten verbose paragraphs; drop unrelated
limit-pruning blog reference
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
- 'Why not full Exact reverse' paragraph: cut reviewer attribution and
forward-pointers that were already in the bottleneck/roadmap
sections that follow.
- TL;DR: trim Runtime Reorder + Reverse Scans bullets to capability
and impact; drop implementation mechanics like 'stamps two flags'
and 'three-step pipeline'.
- 'The PushdownSort Rule' section: cut three paragraphs of
'covered in X below' forward-references that were repeating
the section TOC.
- Function-wrapped parenthetical in Sort Elimination: 4 lines to 2.
- Single-partition vs multi-partition edge case: drop the trailing
'which is why the example is drawn that way' tangent.
- 'What this change does not affect' note: trimmed redundant prose.
- Remove all references to the limit-pruning blog (intro mention,
link definition, References section bullet) — that work is about
static LIMIT as an I/O optimization, separate problem from sort
ordering.
---
content/blog/2026-05-25-sort-pushdown.md | 116 ++++++++---------------
1 file changed, 41 insertions(+), 75 deletions(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 7306d21a..d7bbdbff 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -72,20 +72,16 @@ further still.
upgrades the source's ordering claim from `Unsupported` to
`Exact` and **removes the `SortExec`** that `EnforceSorting`
inserted earlier.
- * **Runtime reorder for `TopK` convergence** — whenever the
- leading sort key is a plain column in the file schema (or the
- source's reversed declared ordering satisfies the request),
- `try_pushdown_sort` stamps two flags on the source and the
- opener runs a three-step runtime pipeline — file-level reorder
- in the shared morsel queue, row-group reorder by min/max stats,
- then optional iteration reverse for `DESC` requests. `SortExec`
- stays, but `TopK`'s dynamic filter tightens fast on the
- most-promising data and the rest is pruned.
+ * **Runtime reorder for `TopK` convergence** — when the leading
+ sort key is a plain column (or the reversed source ordering
+ satisfies the request), the scan reorders files and row groups
+ by `min/max` stats so the most-promising data is read first.
+ `SortExec` stays, but `TopK`'s dynamic filter tightens fast
+ and the rest is pruned.
* **Reverse scans for `ORDER BY ... DESC`** — a row-group-level
- reverse returns `Inexact` (Sort stays, but `TopK` terminates
- early). The page-level reverse primitive needed for `Exact`
- reverse — and so for full `SortExec` removal on `DESC` queries
- — is in flight in arrow-rs.
+ reverse returns `Inexact`. Full `SortExec` removal on `DESC`
+ requires a page-level reverse primitive that's in flight in
+ arrow-rs.
* Real-world benchmarks on the `sort_pushdown` suite (`Exact` path):
`ORDER BY ... LIMIT` queries get **27× and 49× faster**; full
`ORDER BY` scans get **~2×** faster.
@@ -158,8 +154,8 @@ follow:
heap tightens, the filter's threshold tightens with it, and entire
row groups can be skipped by checking the live threshold against
the row group's min/max statistics. (See the earlier
- [dynamic filters][dyn-filters-blog] and [limit pruning][limit-pruning-blog]
- posts for the full background on this mechanism.)
+ [dynamic filters][dyn-filters-blog] post for the full background
+ on this mechanism.)
Both paths use the same underlying min/max statistics, but for
different purposes: `Exact` uses them at plan time to prove
@@ -167,7 +163,6 @@ non-overlap and justify removing the sort; `Inexact` uses them at
runtime to skip row groups that can no longer improve the heap.
[dyn-filters-blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
-[limit-pruning-blog]: https://datafusion.apache.org/blog/2026/03/20/limit-pruning/
The diagram above shows the result we want: the plan after sort
pushdown loses the `SortExec` node. Everything downstream — the
@@ -178,32 +173,19 @@ producing the order requested.
## The `PushdownSort` Rule
-The **`PushdownSort`** physical optimizer rule defines a uniform
-API for asking each `ExecutionPlan` two questions:
+The **`PushdownSort`** physical optimizer rule asks each
+`ExecutionPlan` two questions:
1. "Can you produce output in *this* ordering?"
2. "If yes, please rearrange yourself so that it actually does."
-The protocol uses three results — `Exact`, `Inexact`, `Unsupported` —
-that downstream operators can interpret uniformly. The Parquet
-`FileSource` answers by comparing the requested ordering against the
-per-file declared ordering: if natural ordering satisfies the request,
-it returns `Exact`; if the *reverse* of the declared ordering does,
-it returns `Inexact` and flips on `reverse_row_groups=true` so the
-scan reads row groups from last to first (the row-group-level reverse
-covered later in this post); otherwise it returns `Unsupported`.
-
-The rule's initial scope was deliberately narrow. It set up the
-API and delivered the reverse-scan case end-to-end, but did **not**
-add any statistics-based file rearrangement — that came later,
-covered in
-[Sort Elimination via Statistics](#sort-elimination-via-statistics)
-below. A finer-grained extension that broadens this `Inexact` path
-with a three-step runtime reorder pipeline is covered in
-[Runtime Reorder for TopK Convergence](#runtime-reorder-for-topk-convergence).
-
-The same rule also handles **reverse-output** for `DESC` queries —
-picked up again in the reverse-scan section below.
+The answer is one of `Exact`, `Inexact`, `Unsupported`. The Parquet
+`FileSource` answers by comparing the requested ordering against
+the per-file declared ordering: natural ordering satisfies →
+`Exact`; reversed satisfies → `Inexact` (sets
+`reverse_row_groups=true`); otherwise → `Unsupported`. The rest of
+this post is what each merged capability does on top of this
+protocol.
## Sort Elimination via Statistics
@@ -264,11 +246,9 @@ is no per-file min/max for the function output to compare against.
Extending sort pushdown across monotonic function wrappers is one of
the open follow-ups.
-*(The runtime reorder path covered later does let function-wrapped
-sorts benefit from row-group iteration reverse via
-`EquivalenceProperties`'s monotonicity inference, when the source
-declares a compatible natural ordering — but stats-based sort
-elimination still needs a plain column.)*
+(Runtime reorder covered later does handle some function-wrapped
+sorts via monotonicity inference — but stats-based sort elimination
+still needs a plain column.)
@@ -304,15 +284,12 @@ The implementation handles a few edge cases worth calling out:
`SortPreservingMergeExec` then picks rows across streams in value
order to produce the final globally sorted output. The rule only
has to prove the per-stream property.
-* **Single-partition vs multi-partition execution**. With the default
- multi-partition setup, `EnforceDistribution` byte-range-splits files
- into single-file groups, after which `validated_output_ordering()`
- works correctly on its own. Stats-based reorder only triggers when
- files have not been split — typically `--partitions 1` runs, or
- files small enough that the splitter leaves them alone. In the
- typical `--partitions 1` case the "per-group" distinction collapses
- (one group equals the whole table), which is why the example earlier
- in this section is drawn that way.
+* **Single-partition vs multi-partition execution.** The default
+ multi-partition setup byte-range-splits files into single-file
+ groups, after which `validated_output_ordering()` works on its
+ own. Stats-based reorder only fires when files aren't split —
+ typically `--partitions 1` or files small enough that the
+ splitter leaves them alone.
[`BufferExec`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/buffer.rs
[`sort_pushdown_buffer_capacity`]: https://github.com/apache/datafusion/pull/21426
@@ -352,12 +329,11 @@ removed:
the runtime-difference section above. A 342 ms full-file scan
collapses into a 7 ms K-row read.
-It is worth saying explicitly what this change does **not** affect.
-The default multi-partition execution path is unchanged: those plans
-already produced correct orderings via byte-range splitting, so
-stats-based sort elimination simply does not trigger. There is no
-regression and no behavior change for the typical multi-threaded
-query.
+The default multi-partition execution path is unaffected: those
+plans already produce correct orderings via byte-range splitting,
+so stats-based sort elimination simply does not fire there. No
+regression and no behavior change for typical multi-threaded
+queries.
## Runtime Reorder for TopK Convergence
@@ -511,21 +487,12 @@ row groups can be skipped via min/max statistics. This ships today
and is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted
files.
-Why not full `Exact` reverse, which would delete the `SortExec`
-outright? An earlier proposal — primarily reviewed by
-[@2010YOUY01] — that decoded an entire row group forward,
-materialized all rows, reversed the buffer, then emitted was
-correct but had a prohibitive memory profile: a peak of one whole
-row group (~128 MB) of decoded data vs. the few-MB-per-batch
-streaming profile readers normally have. The agreed direction was
-to ship the narrower `Inexact` row-group-reverse first, and to
-build `Exact` reverse on a finer-grained primitive once `arrow-rs`
-exposed one. The bottleneck section below details what that
-`Inexact`-keeps-`SortExec` decision costs at runtime, and the
-roadmap section after it describes how the page-level primitive
-removes the cost.
-
-[@2010YOUY01]: https://github.com/2010YOUY01
+Why not full `Exact` reverse that deletes the `SortExec`?
+Decoding a whole row group forward, reversing the buffer, and
+emitting works — but peaks at ~128 MB vs. the few-MB-per-batch
+streaming profile readers expect. `Exact` reverse waits on a
+page-level primitive that keeps the runtime win on a streaming
+memory budget — see the roadmap below.
## Current Bottlenecks
@@ -725,10 +692,9 @@ invariants — is what made this work possible.
## References
-Prior posts this work builds on:
+Prior post this work builds on:
* [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries][dyn-filters-blog] — the dynamic filter primitive `TopK` uses.
-* [Turning LIMIT into an I/O Optimization: Inside DataFusion's Multi-Layer Pruning Stack][limit-pruning-blog] — the pruning pipeline this work plugs into.
Landed PRs that make up this work:
From e389c3f3561213abdb03136ad5ae20fd975c085d Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 15:56:55 +0800
Subject: [PATCH 12/14] Fold reverse as case of Inexact runtime reorder; make
two-trigger asymmetry explicit
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Per code in datasource-parquet/src/source.rs:849-870, the reversed-
satisfies branch is 'strictly more powerful than the column-in-schema
check' — it runs the request through EquivalenceProperties's full
reasoning machinery and handles function monotonicity, constants from
filters, equivalence relationships, and multi-column composite
orderings. The blog had been treating reverse as just step 3 of the
runtime pipeline, which undersold its standalone reach.
Structural changes:
- Drop the standalone 'Reverse Scans for ORDER BY DESC' H2 section;
reverse is now a case of the Inexact runtime reorder path.
- Rename Runtime Reorder section to 'Runtime Reorder for TopK and
DESC Queries'; intro now lists three classes that fall outside
Exact (unsorted, overlapping, DESC).
- 'try_pushdown_sort' subsection rewritten as 'Two independent
triggers for Inexact', describing column-in-schema vs reversed-
satisfies as separate signals with the latter being strictly more
powerful.
- 'Three runtime steps' subsection: step 3 now explicitly notes when
steps 1-2 are skipped and only the iteration reverse runs.
- New 'ORDER BY DESC in practice' subsection right after the 3-step
pipeline, explaining the RGs-descending-x-rows-ascending stream.
- Move reverse-scan.svg from the deleted Reverse Scans section into
the Roadmap > Page-level Exact reverse subsection where it
illustrates the 128 MB vs 1 MB peak comparison directly.
Accuracy fix:
- 'Multi-column reorder follow-ups' bullet was inaccurate — said the
reorder 'only fires on plain columns'. The reverse path does
handle function-wrapped and multi-column cases via
EquivalenceProperties; only the stats reorder step is restricted.
Updated wording to scope the limitation correctly.
---
content/blog/2026-05-25-sort-pushdown.md | 278 +++++++++++------------
1 file changed, 131 insertions(+), 147 deletions(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index d7bbdbff..7681e77e 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -44,12 +44,14 @@ already in that order. CPU wasted. Memory wasted. Streaming defeated.
[Apache DataFusion]: https://datafusion.apache.org/
-This post walks through the **sort pushdown** work that closed that
-gap. It covers two complementary capabilities — sort elimination via
-statistics, and runtime reorder for `TopK` convergence — and lands
-real benchmark speedups of **2.1×–49× on common queries**. The same
-machinery extends to `ORDER BY ... DESC`, and the page-level reverse
-primitive we are adding upstream in [arrow-rs] will push the gains
+This post walks through the **sort pushdown** work that closed
+that gap. It covers two complementary capabilities — **sort
+elimination via statistics** (the `Exact` path, which deletes the
+`SortExec`) and **runtime reorder** (the `Inexact` path, which
+keeps the `SortExec` but reads the most-promising data first for
+`TopK` and `DESC` queries) — and lands real benchmark speedups of
+**2.1×–49× on common queries**. The page-level reverse primitive
+we are adding upstream in [arrow-rs] will push the `DESC` gains
further still.
[arrow-rs]: https://github.com/apache/arrow-rs
@@ -72,16 +74,15 @@ further still.
upgrades the source's ordering claim from `Unsupported` to
`Exact` and **removes the `SortExec`** that `EnforceSorting`
inserted earlier.
- * **Runtime reorder for `TopK` convergence** — when the leading
- sort key is a plain column (or the reversed source ordering
- satisfies the request), the scan reorders files and row groups
- by `min/max` stats so the most-promising data is read first.
- `SortExec` stays, but `TopK`'s dynamic filter tightens fast
- and the rest is pruned.
- * **Reverse scans for `ORDER BY ... DESC`** — a row-group-level
- reverse returns `Inexact`. Full `SortExec` removal on `DESC`
- requires a page-level reverse primitive that's in flight in
- arrow-rs.
+ * **Runtime reorder for `TopK` and `DESC` queries** — when the
+ leading sort key is a plain column (or the reversed source
+ ordering satisfies the request), the scan reorders files and
+ row groups by `min/max` stats so the most-promising data is
+ read first; for `DESC` requests it additionally flips
+ iteration. The source reports `Inexact`, so `SortExec` stays,
+ but `TopK`'s dynamic filter tightens fast and the rest is
+ pruned. Full `SortExec` removal on `DESC` requires a
+ page-level reverse primitive that's in flight in arrow-rs.
* Real-world benchmarks on the `sort_pushdown` suite (`Exact` path):
`ORDER BY ... LIMIT` queries get **27× and 49× faster**; full
`ORDER BY` scans get **~2×** faster.
@@ -179,13 +180,13 @@ The **`PushdownSort`** physical optimizer rule asks each
1. "Can you produce output in *this* ordering?"
2. "If yes, please rearrange yourself so that it actually does."
-The answer is one of `Exact`, `Inexact`, `Unsupported`. The Parquet
-`FileSource` answers by comparing the requested ordering against
-the per-file declared ordering: natural ordering satisfies →
-`Exact`; reversed satisfies → `Inexact` (sets
-`reverse_row_groups=true`); otherwise → `Unsupported`. The rest of
-this post is what each merged capability does on top of this
-protocol.
+The answer is one of `Exact`, `Inexact`, `Unsupported`. `Exact`
+means the surrounding `SortExec` can be deleted entirely; `Inexact`
+means the source will read the data in a near-sorted order so
+`TopK` and other consumers benefit, but `SortExec` stays for
+strict correctness. The rest of this post is what each merged
+capability does on top of this protocol — first the `Exact` path,
+then the `Inexact` path.
## Sort Elimination via Statistics
@@ -335,12 +336,12 @@ so stats-based sort elimination simply does not fire there. No
regression and no behavior change for typical multi-threaded
queries.
-## Runtime Reorder for TopK Convergence
+## Runtime Reorder for `TopK` and `DESC` Queries
Stats-based sort elimination handles the `Exact` upgrade — strong
correctness, sort elimination — but only when the table has a
declared `output_ordering` *and* the files are provably
-non-overlapping after sorting by min. Two large classes of queries
+non-overlapping after sorting by min. Three classes of queries
fall outside that window:
* **Unsorted data** — no `WITH ORDER`, no parquet `sorting_columns`.
@@ -350,40 +351,58 @@ fall outside that window:
jobs share time windows. The `Exact` upgrade keeps the `SortExec`
because the global ordering can't be proven, even though the
files often do contain large stretches of in-order data.
+* **`ORDER BY ... DESC` on ASC-sorted data** — flipping iteration
+ at the row-group level emits "RGs descending × rows ascending",
+ close to the requested order but not strictly DESC, so the
+ `SortExec` has to stay for correctness.
-For both, a full external `SortExec` is overkill. The parquet
+For all three, a full external `SortExec` is overkill. The parquet
metadata is right there, and reading the *most-promising* data
first lets `TopK`'s dynamic filter threshold tighten quickly so the
rest gets pruned. Runtime reorder wires that up by generalising
the `Inexact` path the rule introduced.
-### `try_pushdown_sort` — one decision, three outcomes
+### Two independent triggers for `Inexact`
-The `Exact` / `Inexact` / `Unsupported` protocol stays. The
-runtime reorder path broadens the **conditions** that route a
-query into `Inexact`:
-
-| Condition | Outcome |
-| --- | --- |
-| `eq_properties.ordering_satisfy(request)` | `Exact` — sort elimination |
-| Leading sort key is a plain `Column` in the file schema, **or** the source's reversed declared ordering satisfies the request | `Inexact` — runtime reorder pipeline |
-| Neither | `Unsupported` — `SortExec` stays, no source-side optimisation |
-
-The "reversed satisfies" branch is what handles function-wrapped
-sorts (`date_trunc('day', ts) DESC`, `ceil(value) DESC`,
-`CAST(x AS Date) DESC`) — `EquivalenceProperties`'s monotonicity
-reasoning recognises that `f(col) DESC` is satisfied by `col ASC`
-reversed, even though parquet has no stats keyed by `f(col)`
-itself.
-
-### Two flags on `ParquetSource`, three runtime steps
+`try_pushdown_sort` first checks whether the natural ordering
+already satisfies the request (→ `Exact`) or whether a non-empty
+*proper prefix* of the request is already satisfied (→
+`Unsupported`, so the outer `SortExec`'s `sort_prefix`
+optimisation can fire instead). Otherwise it looks at two
+**independent** `Inexact` signals. Either one is enough, and they
+compose when both apply:
+
+**Stats-based RG reorder** — fires when the leading sort key is a
+plain `Column` in the file schema. The opener sorts row groups by
+`min(col)` via parquet statistics. Restrictive (plain physical
+column only), but lets the scan globally reorder data so the
+most-promising row group is decoded first.
+
+**Iteration reverse** — fires when the source's declared ordering,
+**reversed**, satisfies the request. This goes through the full
+`EquivalenceProperties` reasoning machinery, so it recognises many
+more expression shapes than the column-in-schema check above, as
+long as the source declares an ordering to invert. It fires for:
+
+* **Function monotonicity** — file declares `ts DESC`, request is
+ `date_trunc('day', ts) ASC` → reversed `ts ASC` satisfies the
+ request via monotonicity even though parquet has no stats keyed
+ by the function. Same for `ceil(value)`, `CAST(x AS Date)`, etc.
+* **Constant columns from filters** — `WHERE region = 'us'` marks
+ `region` as constant in the equivalence class, so a request
+ involving `region` is trivially satisfied.
+* **Equivalence relationships** — `WHERE a = b` transfers a known
+ ordering on `a` to a request on `b`.
+* **Multi-column composite orderings** — the source's declared
+ multi-key ordering reversed satisfies the multi-key request as a
+ whole.
+
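The decision order described above can be sketched in a few lines of Rust. This is a minimal illustration, not DataFusion's implementation: `Pushdown`, `Facts`, and `decide` are hypothetical names, and the four booleans stand in for checks that the real rule performs through `EquivalenceProperties`.

```rust
/// Hypothetical stand-in for the three outcomes named in the post.
#[derive(Debug, PartialEq)]
enum Pushdown {
    Exact,
    /// Records which of the two independent triggers fired.
    Inexact { stats_trigger: bool, reverse_trigger: bool },
    Unsupported,
}

/// Hypothetical summary of what the planner knows about one request.
struct Facts {
    /// The natural ordering already satisfies the whole request.
    natural_satisfies: bool,
    /// A non-empty proper prefix of the request is already satisfied.
    proper_prefix_satisfied: bool,
    /// The leading sort key is a plain column in the file schema.
    leading_key_is_plain_column: bool,
    /// The declared ordering, reversed, satisfies the request.
    reversed_satisfies: bool,
}

/// Applies the checks in the order the text gives them.
fn decide(f: &Facts) -> Pushdown {
    if f.natural_satisfies {
        return Pushdown::Exact;
    }
    if f.proper_prefix_satisfied {
        // Leave room for the outer SortExec's sort_prefix optimisation.
        return Pushdown::Unsupported;
    }
    if f.leading_key_is_plain_column || f.reversed_satisfies {
        return Pushdown::Inexact {
            stats_trigger: f.leading_key_is_plain_column,
            reverse_trigger: f.reversed_satisfies,
        };
    }
    Pushdown::Unsupported
}
```

The sketch only reports which triggers fired. How they map onto the two `ParquetSource` fields is left out on purpose.
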
+### Three runtime steps in the opener
-When `try_pushdown_sort` returns `Inexact`, it stamps two fields on
-the `ParquetSource`:
+The two triggers above set two fields on `ParquetSource`:
```rust
struct ParquetSource {
@@ -393,114 +412,76 @@ struct ParquetSource {
}
```
-The opener reads them at scan time to drive three composable steps:
-
-1. **File-level reorder.** `FileSource::reorder_files` sits in the
- shared morsel queue (a work-stealing primitive that lets sibling
- partitions share a single file pool) and sorts the
- partitioned-file list by `min(col)`. The first file picked across
- all partitions is globally the most-promising one.
-2. **Row-group-level reorder.** Once a file is opened,
- `PreparedAccessPlan::reorder_by_statistics` sorts that file's
- `row_group_indexes` by `min(col)` ASC. The row group most likely
- to contribute to `TopK` is decoded first.
-3. **Reverse.** For `DESC` requests,
- `PreparedAccessPlan::reverse` flips the iteration after the
- stats reorder normalises everything to ASC-by-min. Same
- primitive the rule originally introduced for declared reverse
- scans — the runtime pipeline just routes more queries through
- it.
-
-The two layers compose naturally because they sort by the same
-key. A file's `min(col)` is the minimum over its row groups'
-`min(col)` values, so the file with the smallest `min` contains
-the row group with the smallest `min`. Sorting files by `min(col)`
-and then sorting row groups by `min(col)` within each file
-produces an approximately min-ordered global stream — the first
-batch comes from the most-promising row group in the
-most-promising file, exactly what `TopK`'s dynamic filter needs
-to tighten its threshold fast.
-
-`reverse_row_groups`'s meaning depends on which way `Inexact` was
-reached. When the column-in-schema condition fires, the stats
-reorder produces ASC-by-min, so `reverse_row_groups` simply mirrors
-the request direction. When only the reversed-equivalence
-condition fires (function-wrapped case with a declared source
-ordering), `reverse_row_groups` is `true` unconditionally — there
-is no stats reorder to compose with, just a flip of the file's
-natural order.
-
-Both flags surface on the `DataSourceExec` line in `EXPLAIN` so
-plan inspection and snapshot tests can confirm the pushdown fired:
+The opener consumes them in three composable steps:
+
+1. **File-level reorder** (`FileSource::reorder_files`). The shared
+ morsel queue — a work-stealing primitive that lets sibling
+ partitions share a single file pool — sorts the partitioned-file
+ list by `min(col)`. The first file picked across all partitions
+ is globally the most-promising one. Skipped when the stats
+ reorder trigger didn't fire.
+2. **Row-group-level reorder**
+ (`PreparedAccessPlan::reorder_by_statistics`). Once a file is
+ opened, sort its row groups by `min(col)` ASC so the most-promising
+ row group is decoded first. Same trigger as step 1; the two
+ layers nest because a file's `min(col)` is the minimum over its
+ row groups' `min(col)` values.
+3. **Iteration reverse** (`PreparedAccessPlan::reverse`). Flips the
+ row-group iteration order. For `DESC` requests on a plain
+ column the flip composes with steps 1–2 (ASC-by-min → reverse →
+ DESC-by-min). For the function-wrapped / constants-from-filters /
+ multi-column cases, steps 1–2 are skipped and this is the only
+ step that runs — just a flip of the file's natural order.
+
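A minimal sketch of steps 2 and 3 on plain vectors, assuming each row group is summarised by a hypothetical `RowGroup { index, min }`. The function names are illustrative and do not mirror the real `PreparedAccessPlan` API. `file_min` shows why the file-level and row-group-level keys nest.

```rust
/// Hypothetical summary of one row group: its position in the file and
/// the minimum of the leading sort column taken from parquet statistics.
#[derive(Debug, Clone, PartialEq)]
struct RowGroup {
    index: usize,
    min: i64,
}

/// The file-level key used in step 1: the smallest row-group minimum.
fn file_min(rgs: &[RowGroup]) -> Option<i64> {
    rgs.iter().map(|rg| rg.min).min()
}

/// Steps 2 and 3: optionally sort by ascending `min`, then optionally
/// flip the iteration order. Returns row-group indexes in decode order.
fn decode_order(mut rgs: Vec<RowGroup>, stats_reorder: bool, reverse: bool) -> Vec<usize> {
    if stats_reorder {
        rgs.sort_by_key(|rg| rg.min); // stable sort, ascending by min
    }
    if reverse {
        rgs.reverse(); // flip whatever order is in place at this point
    }
    rgs.into_iter().map(|rg| rg.index).collect()
}
```

With row-group minimums `30, 10, 20`, reorder alone yields indexes `1, 2, 0`, reorder plus reverse yields `0, 2, 1`, and reverse alone yields `2, 1, 0`.
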
+Both flags surface on the `DataSourceExec` line in `EXPLAIN`:
```text
DataSourceExec: file_groups=..., file_type=parquet,
sort_order_for_reorder=[a@0 ASC], reverse_row_groups=true
```
-Absence of either flag means the corresponding runtime step is a
-no-op.
-
-### When runtime reorder does *not* fire
-
-* **Aggregations on top of the sort key.** `SELECT URL, COUNT(*) AS c
- FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10` (the ClickBench
- TopK shape) — the leading sort key (`c`) is an aggregation result
- and has no per-RG stats in the parquet file, so the
- column-in-schema check fails. Pushing sort metadata through
- `AggregateExec` is a separate problem: the aggregated value
- doesn't exist before aggregation, so even if the metadata reached
- the scan there'd be nothing actionable to do with it.
-* **Multi-column sort secondary keys.** The reorder currently only
- uses the leading sort expression — secondary keys are ignored.
- An open follow-up.
-* **Function-wrapped sort without a source-declared ordering.**
- Without a declared ordering to invert, the reversed-equivalence
- branch has nothing to satisfy. Same follow-up.
-* **Source declares a forward prefix of the request.** When the
- source's declared `output_ordering` is a non-empty proper prefix
- of the request (e.g. source `[a DESC, b ASC]`, request
- `[a DESC, b ASC, c DESC]`), `try_pushdown_sort` returns
- `Unsupported` so the surrounding `SortExec` can keep its
- `sort_prefix` annotation — prefix-aware early termination in
- `TopK` is strictly better than the runtime reorder on data that
- is already in prefix order on disk.
-
-## Reverse Scans for `ORDER BY ... DESC`
-
-
-
-`ORDER BY ts DESC` is the same problem in reverse. If a file is sorted
-ascending and the query wants descending, we should be able to skip
-the sort — we just need to read the data in the opposite order.
-
-The first iteration of this operates at the **row group** level:
-it reverses the *iteration order of row groups* so the last RG is
-opened first, but rows within each RG are still decoded forward.
-The resulting stream is "RGs descending × rows ascending" — close
-to the requested order, but not strictly DESC. The optimizer
-therefore reports this as `Inexact` and leaves the `SortExec` in
-place; the win is that `TopK`'s dynamic filter tightens much
-faster, because the very first row groups read already contain
-values near the final answer. A tight threshold means subsequent
-row groups can be skipped via min/max statistics. This ships today
-and is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted
-files.
-
-Why not full `Exact` reverse that deletes the `SortExec`?
-Decoding a whole row group forward, reversing the buffer, and
+### `ORDER BY ... DESC` in practice
+
+A `DESC` request on an ASC-sorted plain column goes through both
+triggers — the stats reorder normalises to ASC-by-min and the
+iteration reverse flips to DESC-by-min. The result is *"RGs
+descending × rows ascending"* — close to the requested order but
+not strictly DESC, hence `Inexact`. The `SortExec` stays for
+correctness, but `TopK`'s dynamic filter tightens fast because the
+first row groups read already contain values near the final
+answer, so subsequent row groups can be skipped via min/max
+statistics. This is what powers fast `ORDER BY ts DESC LIMIT N` on
+ASC-sorted files today.
+
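To see why the reversed stream is only `Inexact`, here is a small sketch that models row groups as vectors of already-ascending values. The helper names are hypothetical and exist only for this illustration.

```rust
/// Emit row groups last-to-first while keeping rows inside each row
/// group in their stored (ascending) order.
fn reversed_rg_stream(row_groups: &[Vec<i64>]) -> Vec<i64> {
    row_groups
        .iter()
        .rev()
        .flat_map(|rg| rg.iter().copied())
        .collect()
}

/// True when every value is greater than or equal to the one after it.
fn is_descending(values: &[i64]) -> bool {
    values.windows(2).all(|w| w[0] >= w[1])
}
```

For row groups `[1, 2, 3]` and `[4, 5, 6]` the stream is `4, 5, 6, 1, 2, 3`. It is not descending, yet the first row group read already holds the largest values, which is what lets the `TopK` threshold tighten early.
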
+Why not full `Exact` reverse that deletes the `SortExec` outright?
+Decoding a whole row group forward, reversing the buffer, then
emitting works — but peaks at ~128 MB vs. the few-MB-per-batch
streaming profile readers expect. `Exact` reverse waits on a
page-level primitive that keeps the runtime win on a streaming
-memory budget — see the roadmap below.
+memory budget — covered in the roadmap below.
+
+### When neither `Inexact` trigger fires
+
+* **Aggregations on the sort key** — `SELECT URL, COUNT(*) AS c FROM
+ hits GROUP BY URL ORDER BY c DESC LIMIT 10` (the ClickBench TopK
+ shape). The leading sort key `c` is an aggregate result with no
+ per-RG stats and no equivalence to a file column, so neither
+ trigger fires. Pushing sort metadata through `AggregateExec` is a
+ separate problem entirely.
+* **Function-wrapped sort with no source-declared ordering** — the
+ reversed-equivalence branch has nothing to invert.
+* **Source declares a forward prefix of the request** —
+ `try_pushdown_sort` returns `Unsupported` so the surrounding
+ `SortExec` can keep its `sort_prefix` annotation; prefix-aware
+ early termination in `TopK` is strictly better than reorder on
+ data that's already in prefix order on disk.
## Current Bottlenecks
-Stats-based **sort elimination** removes the `SortExec` entirely
-when ranges are non-overlapping — there's nothing more to optimize
-on that path. But the `Inexact` paths (**runtime reorder** for
-`TopK`, and **row-group reverse** for `DESC`) leave three concrete
-inefficiencies on the table when `Exact` cannot fire:
+Sort elimination removes the `SortExec` entirely when ranges are
+non-overlapping — there's nothing more to optimize on that path.
+The `Inexact` runtime-reorder path is where the merged work still
+leaves performance on the table. Three concrete inefficiencies:
### Bottleneck 1: `SortExec` stays on top, so `LIMIT N` does not propagate as a static stop signal
@@ -551,6 +532,8 @@ open* file gets read to completion.
### Page-level `Exact` reverse — addresses bottlenecks 1 + 2
+
+
Parquet's `OffsetIndex` gives us byte-precise locations for every
data page in a column chunk, so we can `seek` directly to the last
page, decode it forward, reverse the resulting batch, and emit.
@@ -651,10 +634,11 @@ what pushes it further.
queries can skip the first `N` rows at the row-group level
instead of decoding and discarding them. In progress.
* [Multi-column and function-wrapped reorder follow-ups]. The
- reorder mechanism currently only uses the leading sort key and
- only fires on plain columns. Lexicographic multi-key reorder
- via `arrow::compute::lexsort_to_indices` is low-hanging fruit;
- extending to monotonic function wrappers via leaf-column
+ **stats reorder step** currently only uses the leading sort key
+ on a plain column (reverse handles the rest via
+ `EquivalenceProperties` reasoning). Lexicographic multi-key
+ reorder via `arrow::compute::lexsort_to_indices` is low-hanging
+ fruit; extending to monotonic function wrappers via leaf-column
extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs
a bit more `EquivalenceProperties` integration but is doable.
From 3d7b080c8172478cdfb0ebcd62dc1f37da3a6131 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 15:59:17 +0800
Subject: [PATCH 13/14] =?UTF-8?q?Drop=20EnsureRequirements=20+=20OFFSET=20?=
=?UTF-8?q?pushdown=20=E2=80=94=20both=20unrelated=20to=20sort=20pushdown?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
EnsureRequirements (#21976) is a rule-unification effort for
EnforceDistribution+EnforceSorting. Touches the same area but isn't
a sort pushdown optimization.
OFFSET pushdown (#21828) is about LIMIT/OFFSET pruning. Same kind of
tangent as the limit-pruning blog reference removed earlier — it's
LIMIT optimization, not sort pushdown.
The remaining 'Multi-column and function-wrapped reorder follow-ups'
bullet is actually directly about sort pushdown's reorder step
(#22198), so it stays. With the other two removed, 'Other follow-ups'
collapsed to a single point — promoted to its own subsection
'Extending the stats reorder step' for clarity.
Also dropped the corresponding entries from the References section.
---
content/blog/2026-05-25-sort-pushdown.md | 38 ++++++++----------------
1 file changed, 13 insertions(+), 25 deletions(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 7681e77e..6dbd4821 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -621,34 +621,24 @@ the Bottleneck-1 and Bottleneck-3 overheads listed above. The
deliver; `Exact` reverse + row-group-level early termination is
what pushes it further.
-### Other follow-ups
-
-* [Unifying `EnforceDistribution` and `EnforceSorting`] into a
- single `EnsureRequirements` rule. The two existing rules are
- coupled through `SortExec.preserve_partitioning`, which makes
- their composition non-idempotent and has caused a class of
- production bugs. Other engines (Spark's `EnsureRequirements`,
- Trino's `AddExchanges`) handle both in a single rule. In
- progress.
-* [OFFSET pushdown to parquet] so `ORDER BY ts LIMIT K OFFSET N`
- queries can skip the first `N` rows at the row-group level
- instead of decoding and discarding them. In progress.
-* [Multi-column and function-wrapped reorder follow-ups]. The
- **stats reorder step** currently only uses the leading sort key
- on a plain column (reverse handles the rest via
- `EquivalenceProperties` reasoning). Lexicographic multi-key
- reorder via `arrow::compute::lexsort_to_indices` is low-hanging
- fruit; extending to monotonic function wrappers via leaf-column
- extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs
- a bit more `EquivalenceProperties` integration but is doable.
+### Extending the stats reorder step
+
+Alongside removing the bottlenecks above, the
+[stats reorder step itself has room to grow][stats-reorder-followup].
+Today it uses only the leading sort key, and only when that key is a
+plain column. Iteration reverse already handles function-wrapped and
+multi-column cases via `EquivalenceProperties` reasoning; stats-based
+RG ordering does not. Lexicographic multi-key reorder via
+`arrow::compute::lexsort_to_indices` is low-hanging fruit; extending
+to monotonic function wrappers via leaf-column extraction (e.g.
+`date_trunc('day', ts)` → use `min(ts)`) needs a bit more
+`EquivalenceProperties` integration but is doable.
[morsel-style work scheduling]: https://github.com/apache/datafusion/pull/21351
[global file reorder in the shared queue]: https://github.com/apache/datafusion/issues/21733
[TopK threshold init from parquet statistics]: https://github.com/apache/datafusion/pull/21712
[combined statistics-driven `TopK` pipeline]: https://github.com/apache/datafusion/pull/21580
-[Unifying `EnforceDistribution` and `EnforceSorting`]: https://github.com/apache/datafusion/pull/21976
-[OFFSET pushdown to parquet]: https://github.com/apache/datafusion/pull/21828
-[Multi-column and function-wrapped reorder follow-ups]: https://github.com/apache/datafusion/issues/22198
+[stats-reorder-followup]: https://github.com/apache/datafusion/issues/22198
Concretely useful issues for new contributors:
@@ -694,8 +684,6 @@ Landed PRs that make up this work:
In flight / open:
* Page-level reverse (arrow-rs): [apache/arrow-rs#9937](https://github.com/apache/arrow-rs/pull/9937), discussion in [apache/arrow-rs#9934](https://github.com/apache/arrow-rs/issues/9934)
-* `EnsureRequirements`: [apache/datafusion#21976](https://github.com/apache/datafusion/pull/21976)
-* OFFSET pushdown to parquet: [apache/datafusion#21828](https://github.com/apache/datafusion/pull/21828)
* TopK threshold init from parquet statistics: [apache/datafusion#21712](https://github.com/apache/datafusion/pull/21712)
* Combined statistics-driven `TopK` pipeline: [apache/datafusion#21580](https://github.com/apache/datafusion/pull/21580)
* Global file reorder in shared queue: [apache/datafusion#21733](https://github.com/apache/datafusion/issues/21733)
From 970b6d6f07b586a6c0d44ed968699cef1362961a Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 16:02:07 +0800
Subject: [PATCH 14/14] Roadmap RG-level early termination: split into
non-overlap vs overlap (k-way merge)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The 'switch work queue from PartitionedFile to RG descriptors' fix is
sufficient for non-overlapping ranges (post-reorder), where the first
file globally has the smallest values and subsequent files are
already stats-pruned. For overlapping ranges, the next smallest value
could sit in any of several open files — matching the non-overlap
efficiency requires explicit k-way merge across open files' next-RG
mins. The dynamic filter does this implicitly (for an ascending TopK,
RGs with min > threshold are dropped), but explicit comparison closes
the tap earlier when the filter tightens slowly.
---
content/blog/2026-05-25-sort-pushdown.md | 28 +++++++++++++++++++-----
1 file changed, 23 insertions(+), 5 deletions(-)
diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 6dbd4821..070698d4 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -581,11 +581,29 @@ granularity. For a selective filter, the saving compounds.
### Row-group-level dynamic early termination — addresses bottleneck 3
The work queue today holds `PartitionedFile`. Switching it to
-hold **row-group descriptors** instead lets a partition release
-its current file's remaining row groups back to the pool the
-moment a global signal says enough `TopK` winners have been
-found. A natural extension of the existing morsel-style work
-scheduling but not yet on a PR.
+hold **row-group descriptors** lets a partition stop mid-file the
+moment a global signal says `TopK` has K confirmed winners. The fix
+comes in two flavors, depending on whether file ranges still overlap
+after the stats reorder:
+
+* **Non-overlapping ranges.** The first file globally contains
+ the smallest values, the second contains the next-smallest range,
+ and so on. Once `TopK`'s threshold tightens to file 0's max or
+ below, every subsequent file is already pruned by stats, so the
+ only fix needed is the RG-granular queue that lets the partition
+ currently inside file 0 also stop at the right RG.
+* **Overlapping ranges.** The smallest *next* value could sit in
+ any of several open files. Matching the non-overlap efficiency
+ requires actively comparing each open file's next-RG `min` and
+ pulling from whichever is smallest: a **k-way merge across
+ files** at RG granularity. The dynamic-filter pushdown already
+ approximates this implicitly (for an ascending `TopK`, an RG
+ whose `min > threshold` is dropped), but explicit k-way comparison
+ would close the tap earlier when the filter tightens slowly
+ across overlapping files.
+
+This is a natural extension of the existing morsel-style work
+scheduling, but there is no PR for it yet.
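
The k-way comparison for overlapping ranges can be sketched with a min-heap keyed on each open file's next row-group `min`. This is an illustration under stated assumptions, not a proposed implementation: `kway_rg_order` is a hypothetical name, each file is modelled as a vector of row-group minimums already sorted ascending, and `TopK` early termination is left out.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// `files[f]` holds the row-group minimums of file `f`, ascending.
/// Returns `(file, row_group)` pairs in the order a k-way merge on the
/// next row-group minimum would pull them.
fn kway_rg_order(files: &[Vec<i64>]) -> Vec<(usize, usize)> {
    // `Reverse` turns the max-heap into a min-heap on (min, file, rg).
    let mut heap = BinaryHeap::new();
    for (f, mins) in files.iter().enumerate() {
        if let Some(&m) = mins.first() {
            heap.push(Reverse((m, f, 0usize)));
        }
    }

    let mut order = Vec::new();
    while let Some(Reverse((_, f, rg))) = heap.pop() {
        order.push((f, rg));
        // Expose the same file's next row group to the comparison.
        if let Some(&next) = files[f].get(rg + 1) {
            heap.push(Reverse((next, f, rg + 1)));
        }
    }
    order
}
```

A real scheduler would stop popping as soon as the smallest remaining `min` exceeds the `TopK` threshold, since every row group still in the heap would then be prunable.
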
The two roadmap items above are *complementary*, not
alternatives: