From 5f7f97646d17817bf654e6ab7276ad2814776554 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Thu, 14 May 2026 15:11:53 +0800
Subject: [PATCH 01/14] Add blog: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A walkthrough of the sort pushdown work landed and in flight on
Apache DataFusion. Covers:

- Why SortExec is expensive and what `Exact` / `Inexact` ordering
  mean at runtime (static `fetch` vs `TopK` dynamic filter).
- Phase 1 (#19064): the `PushdownSort` rule + reverse row-group case.
- Phase 2 (#21182): statistics-based file sort that upgrades
  `Unsupported` to `Exact`, eliminating the `SortExec` on
  non-overlapping ASC scans. Includes the BufferExec compensation
  (#21426) so the SPM above doesn't lose its implicit memory buffer.
- Reverse scans: today's row-group reverse (Inexact, #18817) and the
  community decision to wait for arrow-rs page-level reverse (#9937)
  before pursuing Exact reverse, after memory-profile pushback on the
  original row-group-level proposal.
- Benchmarks: 2.1×-49× on the ASC-LIMIT sort_pushdown suite.
- What's next: the dynamic / TopK-driven path (#21351 merged, #21733,
  #21712, #21956, #21580) including the precise RG-pruning vs
  mid-stream-early-return distinction, and the EnsureRequirements
  unification (#21976).
- Links into the prior dynamic filters and limit pruning posts so the
  series reads as a coherent thread.
--- content/blog/2026-05-11-sort-pushdown.md | 592 ++++++++++++++++++ content/images/sort-pushdown/benchmark.svg | 75 +++ .../sort-pushdown/phase1-file-reorder.svg | 88 +++ .../sort-pushdown/phase2-stats-overlap.svg | 79 +++ content/images/sort-pushdown/plan-diff.svg | 70 +++ content/images/sort-pushdown/reverse-scan.svg | 100 +++ 6 files changed, 1004 insertions(+) create mode 100644 content/blog/2026-05-11-sort-pushdown.md create mode 100644 content/images/sort-pushdown/benchmark.svg create mode 100644 content/images/sort-pushdown/phase1-file-reorder.svg create mode 100644 content/images/sort-pushdown/phase2-stats-overlap.svg create mode 100644 content/images/sort-pushdown/plan-diff.svg create mode 100644 content/images/sort-pushdown/reverse-scan.svg diff --git a/content/blog/2026-05-11-sort-pushdown.md b/content/blog/2026-05-11-sort-pushdown.md new file mode 100644 index 00000000..d8726038 --- /dev/null +++ b/content/blog/2026-05-11-sort-pushdown.md @@ -0,0 +1,592 @@ +--- +layout: post +title: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O +date: 2026-05-11 +author: Qi Zhu +categories: [performance] +--- + + + +[TOC] + +*Qi Zhu, [Massive](https://www.massive.com/)* + +Many [Apache Parquet] datasets are already sorted on disk. Time-series +files are usually written in ingestion-time order. Event logs are sharded +and sorted by event id. Partitioned tables come with a natural ordering +implied by the partition key. The information about that ordering is +sitting right there in the file metadata. + +[Apache Parquet]: https://parquet.apache.org/ + +Until recently, [Apache DataFusion] would still re-sort those files on +every `ORDER BY` query. Every `SELECT ... ORDER BY ts LIMIT 100` did a +full external sort across the entire scan, even though the data was +already in that order. CPU wasted. Memory wasted. Streaming defeated. + +[Apache DataFusion]: https://datafusion.apache.org/ + +This post walks through the **sort pushdown** work that closed that gap. 
+It is structured in two phases: first the `PushdownSort` rule and its
+`Exact` / `Inexact` / `Unsupported` protocol, then statistics-based
+file reordering with a proof of non-overlap. On the `sort_pushdown`
+benchmark suite it delivers speedups of **2.1×–49×**. The same
+machinery extends to `ORDER BY ... DESC`, and the page-level reverse
+primitive we are adding upstream in [arrow-rs] will push the gains
+further still.
+
+[arrow-rs]: https://github.com/apache/arrow-rs
+
+## TL;DR
+
+* DataFusion can now **skip `SortExec` entirely** when input files are
+  already in the requested order.
+* Two phases:
+  * **Phase 1**: establish the `PushdownSort` rule and the
+    `Exact` / `Inexact` / `Unsupported` protocol; ship the reverse
+    row-group case for `ORDER BY ... DESC` (reports `Inexact`).
+  * **Phase 2**: sort files within each partition by Parquet
+    `min/max` statistics and *prove* non-overlap, upgrading
+    `Unsupported` to `Exact` so `PushdownSort` removes the `SortExec`
+    that `EnforceSorting` inserted earlier.
+* Benchmarks on the `sort_pushdown` suite (single partition):
+  `ORDER BY ... LIMIT` queries get **27× and 49× faster**; full
+  `ORDER BY` scans get **~2×** faster.
+* Reverse scans (`ORDER BY ... DESC`) ride the same machinery: a
+  merged row-group-level reverse returns `Inexact` (the `SortExec`
+  stays, but `TopK`'s dynamic filter can skip row groups); the
+  page-level reverse primitive needed for `Exact` reverse (and so for
+  full `SortExec` removal on `DESC` queries) is in flight in arrow-rs.
+
+## Why Sort Pushdown Matters
+
+`SortExec` is one of the most expensive operators in a query plan.
+It is blocking by construction (no row can leave until every input
+row has been seen and compared), so it tends to dominate both latency
+and peak memory.
The cost gets paid even when: + +* the file is already ordered by the sort key (very common for + timestamp columns); +* the query only needs the top *N* rows (`ORDER BY ts LIMIT 100`), in + which case full sort + truncate is wildly wasteful; +* the next operator (`SortPreservingMergeExec`, `SortMergeJoinExec`, + a window function) was going to consume ordered input anyway. + +The data DataFusion needs to avoid this work is **already in the file +metadata**. Parquet writers can record per-column statistics (`min`, +`max`) at the row-group level. Files written by Spark, DuckDB, +arrow-rs, and others routinely include them. And explicit `WITH ORDER` +clauses in DataFusion's SQL `CREATE EXTERNAL TABLE` give the optimizer +a direct ordering hint. The job of sort pushdown is to **use that +information**. + +## How DataFusion Tracks Ordering + +EXPLAIN before / after: SortExec eliminated once ordering is Exact + +Each `FileScanConfig` carries an `output_ordering` — the ordering +that the optimizer is willing to claim for the scan's output. There +are two flavours: + +* **`Exact`** — the optimizer is *certain* the output is in this order. + Sort-handling rules treat an `Exact` ordering as a proof and **remove + the surrounding `SortExec`**. ([`EnforceSorting`] does this when the + scan declares `Exact` from the start; the sort pushdown rule covered + in this post does the same upgrade later in the pipeline.) +* **`Inexact`** — the optimizer *believes* the output is probably + ordered, but cannot prove it. Downstream operators like + `SortPreservingMergeExec` can still benefit from this hint, but the + explicit `SortExec` stays for safety. + +[`EnforceSorting`]: https://docs.rs/datafusion-physical-optimizer/latest/datafusion_physical_optimizer/enforce_sorting/struct.EnforceSorting.html + +A helper called `validated_output_ordering()` is the gatekeeper. 
It +walks the list of files inside a partition, checks whether the +declared per-file ordering is consistent with the file order on disk, +and either confirms the ordering or **strips it entirely** if it +sees something ambiguous (e.g. file `b` comes before file `a` in the +file list but file `a`'s range comes first). + +### `Exact` and `Inexact` at runtime + +`Exact` and `Inexact` lead to different runtime behaviour, and +distinguishing them up front makes the rest of this post easier to +follow: + +* With **`Exact`**, the `SortExec` is removed and the LIMIT becomes + a **static fetch** on the source. The reader stops the moment the + requested number of rows has been emitted — early termination + at batch granularity, no dynamic state needed. +* With **`Inexact`**, the `SortExec` stays in place. The LIMIT + materialises inside the sort as a `TopK` heap of size K. `TopK` + exposes a [**dynamic filter**][dyn-filters-blog] — a runtime + expression of the form *"only rows that could still beat the + current K-th-best value are worth considering"* — and pushes it + back to the parquet scanner. As more data is processed and the + heap tightens, the filter's threshold tightens with it, and entire + row groups can be skipped by checking the live threshold against + the row group's min/max statistics. (See the earlier + [dynamic filters][dyn-filters-blog] and [limit pruning][limit-pruning-blog] + posts for the full background on this mechanism.) + +Both paths use the same underlying min/max statistics, but for +different purposes: `Exact` uses them at plan time to prove +non-overlap and justify removing the sort; `Inexact` uses them at +runtime to skip row groups that can no longer improve the heap. + +[dyn-filters-blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/ +[limit-pruning-blog]: https://datafusion.apache.org/blog/2026/03/20/limit-pruning/ + +The diagram above shows the result we want: the plan after sort +pushdown loses the `SortExec` node. 
Everything downstream — the +`SortPreservingMergeExec`, the `RepartitionExec`, the +`DataSourceExec` — was already in the plan. We just need the +optimizer to convince itself that the bottom of the plan is +producing the order requested. + +## Phase 1: The Pushdown API and Reverse Scans + +Phase 1 ([#19064]) introduced the **`PushdownSort`** physical +optimizer rule and a uniform API for asking each `ExecutionPlan` two +questions: + +[#19064]: https://github.com/apache/datafusion/pull/19064 + +1. "Can you produce output in *this* ordering?" +2. "If yes, please rearrange yourself so that it actually does." + +The protocol uses three results — `Exact`, `Inexact`, `Unsupported` — +that downstream operators can interpret uniformly. The Parquet +`FileSource` answers by comparing the requested ordering against the +per-file declared ordering: if natural ordering satisfies the request, +it returns `Exact`; if the *reverse* of the declared ordering does, +it returns `Inexact` and flips on `reverse_row_groups=true` so the +scan reads row groups from last to first (the row-group-level reverse +covered later in this post); otherwise it returns `Unsupported`. + +Phase 1's scope was deliberately narrow. It set up the API and +delivered the reverse-scan case end-to-end, but it did **not** add +any statistics-based file rearrangement — that came later in Phase 2. +A finer-grained extension that reorders row groups *within* each file +by min/max statistics — so the row group with the best sort-key value +is read first and TopK can tighten its threshold faster — is also +in progress in [#21956]. + +Phase 1 also produced a useful side improvement: + +* **Reverse-output redesign** ([#19446], [#19557]) extended the same + rule to `DESC` queries — picked up again in the reverse-scan + section below. 
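The three-way answer described earlier in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DataFusion's actual API: the names `SortKey`, `PushdownAnswer`, and `answer_sort_request` are hypothetical, null ordering is ignored, and "satisfies" is simplified to "the request is a prefix of the declared ordering".

```rust
/// Hypothetical, simplified sort key: a column name plus a direction.
/// (Null ordering is deliberately left out of this sketch.)
#[derive(Debug, Clone, PartialEq, Eq)]
struct SortKey {
    column: String,
    descending: bool,
}

/// Hypothetical three-way answer mirroring the protocol in the text.
#[derive(Debug, PartialEq, Eq)]
enum PushdownAnswer {
    /// The natural order already satisfies the request; the sort can go.
    Exact,
    /// Reading row groups last-to-first gets close; the sort stays.
    Inexact { reverse_row_groups: bool },
    /// The source has nothing useful to offer.
    Unsupported,
}

/// Flip the direction of every key in an ordering.
fn reversed(keys: &[SortKey]) -> Vec<SortKey> {
    keys.iter()
        .map(|k| SortKey {
            column: k.column.clone(),
            descending: !k.descending,
        })
        .collect()
}

/// Assumption for this sketch: a request is satisfied when it is a
/// non-empty prefix of the declared ordering.
fn satisfies(declared: &[SortKey], requested: &[SortKey]) -> bool {
    !requested.is_empty() && declared.starts_with(requested)
}

/// Compare the requested ordering against the declared per-file ordering.
fn answer_sort_request(declared: &[SortKey], requested: &[SortKey]) -> PushdownAnswer {
    if satisfies(declared, requested) {
        PushdownAnswer::Exact
    } else if satisfies(&reversed(declared), requested) {
        PushdownAnswer::Inexact {
            reverse_row_groups: true,
        }
    } else {
        PushdownAnswer::Unsupported
    }
}
```

A file declared `ts ASC` would answer `Exact` for `ORDER BY ts`, `Inexact` for `ORDER BY ts DESC`, and `Unsupported` for an ordering on an unrelated column.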
+ +[#19446]: https://github.com/apache/datafusion/pull/19446 +[#19557]: https://github.com/apache/datafusion/pull/19557 + +## Phase 2: Use Statistics to Prove Non-Overlap + +Phase 2: rearranging files within a partition by min/max statistics so the file list is in range order + +Phase 1 left a sharp edge that motivated Phase 2 ([#21182]). Consider +this realistic scenario: + +[#21182]: https://github.com/apache/datafusion/pull/21182 + +* Three files: `a.parquet`, `b.parquet`, `c.parquet`. +* Each declares `WITH ORDER (ts ASC)`. +* Internally each file *is* sorted by `ts`. +* But they were written by different ingestion jobs and end up listed + in the **wrong order** on disk (e.g. alphabetical by name, not by + time). + +`validated_output_ordering()` looks at this, sees that the +file-internal ordering disagrees with the file-list order, and +**strips the ordering entirely**. From the optimizer's point of view +the scan now has no declared ordering, so `EnforceSorting` (which runs +earlier in the pipeline) inserts a `SortExec`. The data is sorted on +disk; the optimizer just can't tell. + +Phase 2 fixes this in `PushdownSort`, which runs late — after +`EnforceDistribution` and `EnforceSorting` have already shaped the +plan. When `PushdownSort` finds a `SortExec` above a file scan whose +ordering was stripped (a `FileSource` `Unsupported` result), it does +three things inside `FileScanConfig::try_pushdown_sort`: + +1. **Sort the file list by per-file statistics on the sort + column(s)** within each file group (the diagram above). The + pre-existing [`MinMaxStatistics`] helper (introduced in [#9593]) + reads each file's `column_statistics[c].min_value` / + `.max_value` for each sort column `c`, then sorts the file list by + the min row. Phase 2 wires this helper into the optimizer's + `Unsupported` branch — `sort_files_within_groups_by_statistics` + does the per-group orchestration and decides whether any group is + non-overlapping after the sort. +2. 
**Check adjacency within each group**: walk each sorted file group + independently and ask whether `file[i].max ≤ file[i+1].min` for + every adjacent pair (touching at the boundary is fine — value `v` + showing up as the last row of one file and the first row of the + next still produces a sorted stream). The check is **per file + group**, not across groups; cross-group ordering is the job of + `SortPreservingMergeExec` at runtime (more on this below). +3. **Upgrade `Unsupported` to `Exact`** when adjacency holds, the + table has a declared `output_ordering` (from `WITH ORDER` or + parquet `sorting_columns`), and the sort columns are null-free — + the last condition preserves `NULLS LAST`/`NULLS FIRST` semantics + across file boundaries. `PushdownSort` then removes the `SortExec` + itself and the plan becomes streamable. + +[`MinMaxStatistics`]: https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/statistics.rs +[#9593]: https://github.com/apache/datafusion/pull/9593 + +One caveat that comes straight from `MinMaxStatistics`: the stats +sort only fires when every `ORDER BY` expression is a plain column. +`ORDER BY date_trunc('hour', ts)` silently skips the upgrade — there +is no per-file min/max for the function output to compare against. +Extending sort pushdown across monotonic function wrappers is one of +the open follow-ups. + +Phase 2: detecting non-overlapping ranges via min/max statistics + +The diagram above contrasts the two cases. On the left, ranges are +non-overlapping after sort, so we can guarantee that emitting the +files in min-order produces a globally sorted stream. On the right, +the ranges overlap, so even after sorting the files by `min(ts)` we +cannot guarantee global ordering — Phase 2 correctly bails out and +keeps `SortExec` in place. 
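The sort-then-check logic for a single file group can be sketched as follows. This is a simplified illustration, not the code in `FileScanConfig::try_pushdown_sort`: the `FileRange` type and the `order_if_non_overlapping` function are hypothetical, the sort key is assumed to be a single `i64` column, and the additional conditions from the text (a declared `output_ordering`, null-free sort columns) are not modelled.

```rust
/// Hypothetical per-file summary: file name plus the min/max of the
/// sort column taken from the file's statistics.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FileRange {
    name: String,
    min: i64,
    max: i64,
}

/// Sort one file group by `min`, then check that every adjacent pair
/// satisfies `file[i].max <= file[i + 1].min` (touching boundaries are
/// allowed). Returns the reordered group when the check passes, and
/// `None` when any ranges overlap, in which case the sort must stay.
fn order_if_non_overlapping(mut files: Vec<FileRange>) -> Option<Vec<FileRange>> {
    files.sort_by_key(|f| f.min);
    let non_overlapping = files.windows(2).all(|pair| pair[0].max <= pair[1].min);
    if non_overlapping {
        Some(files)
    } else {
        None
    }
}
```

With the ranges from the diagrams, `a = [200, 300]`, `b = [100, 200]`, `c = [0, 100]` reorders to `c, b, a` and passes, while `x = [0, 180]`, `y = [80, 260]`, `z = [140, 300]` fails the adjacency check.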
+ +The implementation handles a few edge cases worth calling out: + +* **Buffering the eliminated `SortExec`.** When the `SortExec` was + sitting under a `SortPreservingMergeExec` with + `preserve_partitioning=true`, it wasn't just sorting — it was also + acting as an *implicit in-memory buffer* for the SPM above it. The + SPM picks rows from each partition stream one at a time; without + the upstream `SortExec` holding batches in memory, the SPM would + read directly from I/O-bound sources and stall on every pick. Phase + 2 compensates by inserting a [`BufferExec`] in the `SortExec`'s + place — bounded streaming buffer, same throughput shape, no + blocking sort. Capacity is configurable via + [`sort_pushdown_buffer_capacity`] ([#21426]). +* **`fetch` preservation** through `EnforceDistribution`. The + distribution rule sometimes strips a `SortExec`'s `fetch` field and + re-adds the node later. Phase 2 plumbs `fetch` through so a + surviving `LIMIT` is not lost. +* **Per-group, not global, non-overlap.** Phase 2's adjacency check is + scoped to each file group. Two file groups can have *overlapping* + ranges and the upgrade still fires, as long as each group is + internally non-overlapping. That works because each group already + produces an independently ordered stream at runtime, and + `SortPreservingMergeExec` then picks rows across streams in value + order to produce the final globally sorted output. Phase 2 only has + to prove the per-stream property. +* **Single-partition vs multi-partition execution**. With the default + multi-partition setup, `EnforceDistribution` byte-range-splits files + into single-file groups, after which `validated_output_ordering()` + works correctly on its own. Phase 2 only triggers when files have + not been split — typically `--partitions 1` runs, or files small + enough that the splitter leaves them alone. 
In the typical `--partitions + 1` case the "per-group" distinction collapses (one group equals the + whole table), which is why the example earlier in this section is + drawn that way. + +[`BufferExec`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/buffer.rs +[`sort_pushdown_buffer_capacity`]: https://github.com/apache/datafusion/pull/21426 +[#21426]: https://github.com/apache/datafusion/pull/21426 + +## Benchmarks + +Sort pushdown phase 2 benchmark: 2x-49x speedup across four queries + +The [`sort_pushdown`] benchmark suite reproduces the +"wrong-order file list" scenario by generating Parquet files whose +names are intentionally reversed against their sort-key ranges. Numbers +below are `--partitions 1`, release build, on the merged Phase 2 +branch versus `main`: + +[`sort_pushdown`]: https://github.com/apache/datafusion/tree/main/benchmarks/queries/sort_pushdown + +| Query | Before | After | Speedup | +| ------------------------------------------- | -------:| -------:| -------:| +| Q1 — `ORDER BY key` (full scan) | 259 ms | 122 ms | **2.1×** | +| Q2 — `ORDER BY key LIMIT 100` | 80 ms | 3 ms | **27×** | +| Q3 — `SELECT * ORDER BY key` | 700 ms | 313 ms | **2.2×** | +| Q4 — `SELECT * ORDER BY key LIMIT 100` | 342 ms | 7 ms | **49×** | + +The shape of the speedup is what you would expect once `SortExec` is +removed: + +* **Full-scan queries (Q1, Q3)** still have to push every row through + the pipeline, so the gain is "just" the cost of the sort itself — + roughly half the original time. This matches the rule of thumb that + a blocking sort doubles end-to-end latency on data that fits in + memory. +* **`LIMIT` queries (Q2, Q4)** benefit much more because removing + `SortExec` converts the LIMIT into a static `fetch` on the data + source — the reader stops the moment K rows have been emitted, + instead of reading the full file, sorting, and truncating. 
+ This is the "early termination at batch granularity" case from + the runtime-difference section above. A 342 ms full-file scan + collapses into a 7 ms K-row read. + +It is worth saying explicitly what this change does **not** affect. +The default multi-partition execution path is unchanged: those plans +already produced correct orderings via byte-range splitting, so +Phase 2 simply does not trigger. There is no regression and no behavior +change for the typical multi-threaded query. + +## Reverse Scans for `ORDER BY ... DESC` + +Row-group reverse vs page reverse: 128MB and 8 pages vs 1MB and 1 page + +`ORDER BY ts DESC` is the same problem in reverse. If a file is sorted +ascending and the query wants descending, we should be able to skip +the sort — we just need to read the data in the opposite order. + +The first iteration of this lives in [#18817] and operates at the +**row group** level: it reverses the *iteration order of row groups* +so the last RG is opened first, but rows within each RG are still +decoded forward. The resulting stream is "RGs descending × rows +ascending" — close to the requested order, but not strictly DESC. The +optimizer therefore reports this as `Inexact` and leaves the +`SortExec` in place; the win is that `TopK`'s dynamic filter tightens +much faster, because the very first row groups read already contain +values near the final answer. A tight threshold means subsequent row +groups can be skipped via min/max statistics. This ships today and +is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted files. + +[#18817]: https://github.com/apache/datafusion/pull/18817 + +To turn this into `Exact` reverse — so the `SortExec` can be removed +outright — each emitted batch itself has to be in DESC order. 
The
+straightforward row-group-level approach (decode an entire RG forward,
+materialize all rows, reverse the buffer, then emit) is correct and
+was actually proposed first, in an earlier iteration of this work
+([#18817], later closed and split into smaller pieces). Review
+feedback there, primarily from [@2010YOUY01], flagged the memory
+profile as too aggressive. Caching an entire row group's worth of
+decoded rows before any batch can be emitted means:
+
+* **Peak buffer of one whole row group** (~128 MB by default), versus
+  the few-MB-per-batch streaming profile readers normally have.
+* **First-batch latency = full last-row-group decode**. For
+  `ORDER BY ts DESC LIMIT 10` that means decoding ~1 million rows to
+  return 10, which defeats the point of the `LIMIT`.
+
+The agreed direction coming out of that discussion was to ship the
+narrower `Inexact` row-group-reverse first (which became Phase 1 in
+[#19064]), and to build `Exact` reverse on a finer-grained primitive
+once `arrow-rs` exposed one.
+
+That primitive is the **page-level** reverse traversal. Parquet's
+`OffsetIndex` already gives us byte-precise locations for every data
+page in a column chunk, so we can `seek` directly to the last page,
+decode it forward, reverse the resulting batch, and emit. Peak buffer
+drops to one page (~1 MB) and first-batch latency drops to the cost
+of one page decode, so the row-group-level memory cliff disappears.
+
+We are landing this primitive upstream in arrow-rs as
+[#9937], with the discussion in [#9934]. Early numbers on a 100k-row,
+98-page column chunk show **~50× faster time-to-first-N** when `n`
+fits within one page and **~9× faster** when `n` spans 10 pages,
+compared with the row-group-level `Exact` reverse described above.
+The DataFusion-side integration that turns this primitive into an
+`Exact` result is a follow-up to #9937 and is gated on its merge.
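The shape of the page-level traversal can be shown with a toy model. This sketch does not use the arrow-rs or Parquet APIs: each inner `Vec<i64>` stands in for one data page of an ascending-sorted column, "decoding" is simulated by a copy, and the function name `reverse_top_n` is hypothetical.

```rust
/// Toy model of page-level reverse traversal over an ascending column.
/// Pages are visited last to first; each page is "decoded" forward
/// (simulated here by a copy), the decoded batch is reversed, and the
/// walk stops once `limit` rows have been collected.
/// Returns the rows in descending order and the number of pages decoded.
fn reverse_top_n(pages: &[Vec<i64>], limit: usize) -> (Vec<i64>, usize) {
    let mut out = Vec::with_capacity(limit);
    let mut pages_decoded = 0;
    for page in pages.iter().rev() {
        if out.len() >= limit {
            break;
        }
        // Forward decode of exactly one page; earlier pages stay untouched.
        let mut batch = page.clone();
        pages_decoded += 1;
        // Reverse the decoded batch so its rows come out in DESC order.
        batch.reverse();
        let remaining = limit - out.len();
        out.extend(batch.into_iter().take(remaining));
    }
    (out, pages_decoded)
}
```

In this model the peak buffer is a single page, and a small `limit` touches only the last page, which is the property the paragraph above attributes to the page-level primitive.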
+ +[@2010YOUY01]: https://github.com/2010YOUY01 + +[#9937]: https://github.com/apache/arrow-rs/pull/9937 +[#9934]: https://github.com/apache/arrow-rs/issues/9934 + +One natural question: why not reverse the rows *within* a page +directly? Because we can't. Parquet's page encodings (RLE, dictionary, +delta, bit-packing) are all forward streams — you cannot decode the +last value without decoding every value that came before it. The +design therefore is: **reverse the page traversal, forward-decode +each page, reverse the resulting RecordBatch**. This is the algorithm +shape that DataFusion's Phase-2 `RecordBatchReader` integration will +use once arrow-rs ships the primitive. + +The killer use case is **filtered reverse TopK**: + +```sql +SELECT * FROM events +WHERE user_id = 42 +ORDER BY ts DESC +LIMIT 10 +``` + +Here `RowSelection::with_limit` cannot help — you don't know in +advance which rows match `user_id = 42`, so you can't pre-compute a +selection of the "last 10 matching rows". The only correct strategy +is to stream pages backward, evaluate the filter on each, and stop +when 10 matches are collected. Row-group reverse stops at a +~128 MB granularity. Page reverse stops at ~1 MB granularity. For a +selective filter, the saving compounds. + +## What's Next + +Sort pushdown is a long-running line of work and there is more to do. +Beyond the `Exact` path described above, there is a complementary +**dynamic / TopK-driven path** that helps when `Exact` cannot apply — +e.g. when file ranges genuinely overlap, or when the sort is on a +function output rather than a plain column. The two directions are +not alternatives; they compose: + +* **`Exact` reverse for `ORDER BY ... DESC`.** Today's row-group + reverse returns `Inexact` and the `SortExec` stays on top; the + arrow-rs page-level reverse primitive ([#9937]) is what unlocks + `Exact` reverse on `DESC` queries (and therefore full `SortExec` + elimination on `DESC`). 
Memory + first-batch latency rule out doing + the same thing at the row-group level. Gated on #9937. +* **Dynamic / TopK-driven path.** When `Exact` cannot fire, `TopK`'s + [dynamic filter][dyn-filters-blog] still benefits enormously from + reading the *best* data first. This thread also builds on the + [limit pruning][limit-pruning-blog] work that turned `LIMIT` into + an I/O optimization across the pruning pipeline. The + recently-merged morsel-style work scheduling in `FileStream` + ([#21351]) gives sibling partitions a *shared work queue* with + file-level work-stealing — no CPU sits idle when one partition + runs out of files. The proposed [#21733] sorts files in + that shared queue by per-file statistics *before* any partition + picks, so the first file read is globally optimal and tightens the + dynamic filter immediately. Combined with **TopK threshold init from + parquet statistics** ([#21712]) and **row-group reorder within each + file** ([#21956]), the threshold can be set before reading a single + byte. The combined statistics-driven `TopK` pipeline is in flight + as [#21580]. + + The mechanism here is **RG-level pruning, not mid-stream early + return**. With the threshold known up front, the parquet + `PruningPredicate` rejects entire row groups against their min/max + statistics before any I/O — those row groups are never decoded. + The row group(s) the reader *does* open still have their sort + column decoded in full to feed the dynamic filter. On the #21580 + microbenchmark (single file, 61 sorted row groups, `--partitions 1`), + **60 of the 61 row groups are skipped** and only one is decoded: + + | Query | Baseline | With pipeline | Speedup | + | ------------------------------ | -------: | ------------: | ------: | + | `ORDER BY col DESC LIMIT 100` | 28.5 ms | 1.64 ms | **17×** | + | `ORDER BY col DESC LIMIT 1000` | 22.2 ms | 0.37 ms | **60×** | + | `SELECT * ORDER BY ... LIMIT 100` | 22.5 ms | 0.66 ms | **34×** | + | `SELECT * ORDER BY ... 
LIMIT 1000` | 22.4 ms | 0.61 ms | **37×** | + + The stack reports `Inexact` — the `SortExec` stays on top to + enforce correctness across overlapping ranges — so this path + cannot do *true* mid-stream early return. Once the parquet reader + opens a row group, the sort column has to be decoded all the way + through; once a `FileStream` picks up a file from the shared work + queue, it has to finish that file. Today's dynamic work scheduling + ([#21351]) is **file-granular**: idle partitions stop pulling + new files from the queue once a global limit is satisfied, but + the partition that's currently inside a file decodes that file's + remaining row groups regardless. Mid-file RG-level early return + on `TopK` convergence is **not implemented yet** — the work + queue holds `PartitionedFile`, not row-group descriptors. + + Closing the tap the moment `TopK` has K confirmed winners therefore + needs either: + + * the **`Exact` path**, where the `SortExec` is gone entirely and + the data source's own `fetch` becomes a static limit that the + reader can honour at batch granularity; or + * **finer-grained dynamic scheduling** — having the shared queue + hold row-group descriptors instead of whole files, so a partition + can release its current file's remaining row groups back to the + pool once a global signal says enough TopK winners have been + found. This is a natural extension of [#21351] and [#21733] but + is not yet on a PR. + + The three mechanisms compose. Stats pruning saves the row groups + that *can't* matter (skipped without I/O). The dynamic filter + narrows what's decoded inside the row groups the reader does + open. `Exact` or finer-grained scheduling is what eventually + closes the tap once `TopK` is satisfied. 
+* **Phase 3 — filter + sort early termination.** `WHERE filter ORDER + BY ts DESC LIMIT N` is the dominant observability query shape and + the one where the arrow-rs page-reverse primitive matters most: + `RowSelection::with_limit` cannot pre-compute the last `N` matching + rows when the filter is selective, so the only correct strategy is + to stream pages backward, evaluate the filter, and stop when `N` + matches are collected. The DataFusion-side integration is the + follow-up to #9937. +* **Unifying `EnforceDistribution` and `EnforceSorting`** into a + single `EnsureRequirements` rule ([#21976]). The two existing rules + are coupled through `SortExec.preserve_partitioning`, which makes + their composition non-idempotent and has caused a class of + production bugs. Other engines (Spark's `EnsureRequirements`, + Trino's `AddExchanges`) handle both in a single rule. Merging them + also gives future sort-related optimizations a single coherent place + to live. In progress. +* **OFFSET pushdown to parquet** ([#21828]) so `ORDER BY ts LIMIT K + OFFSET N` queries can skip the first `N` rows at the row-group level + instead of decoding and discarding them. In progress. + +[#21976]: https://github.com/apache/datafusion/pull/21976 +[#21956]: https://github.com/apache/datafusion/pull/21956 +[#21712]: https://github.com/apache/datafusion/pull/21712 +[#21580]: https://github.com/apache/datafusion/pull/21580 +[#21828]: https://github.com/apache/datafusion/pull/21828 +[#21351]: https://github.com/apache/datafusion/pull/21351 +[#21733]: https://github.com/apache/datafusion/issues/21733 + +Concretely useful issues for new contributors: + +* [#17348] — the umbrella issue for sort pushdown. +* [#21317] — sort pushdown: reorder row groups by statistics within + each file. +* [#19394] — add more `ExecutionPlan` impls to support sort pushdown. 
+ +[#17348]: https://github.com/apache/datafusion/issues/17348 +[#21317]: https://github.com/apache/datafusion/issues/21317 +[#19394]: https://github.com/apache/datafusion/issues/19394 + +## Acknowledgements + +Thank you to [@alamb], [@adriangb], [@xudong963], [@2010YOUY01], and +[@Dandandan] for reviewing the design and the patches across many +iterations. The DataFusion community's willingness to engage deeply +with optimizer changes — including the ones that touch foundational +invariants — is what made this work possible. + +[@alamb]: https://github.com/alamb +[@adriangb]: https://github.com/adriangb +[@xudong963]: https://github.com/xudong963 +[@2010YOUY01]: https://github.com/2010YOUY01 +[@Dandandan]: https://github.com/Dandandan + +## References + +Prior posts this work builds on: + +* [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries][dyn-filters-blog] — the dynamic filter primitive `TopK` uses. +* [Turning LIMIT into an I/O Optimization: Inside DataFusion's Multi-Layer Pruning Stack][limit-pruning-blog] — the pruning pipeline this work plugs into. 
+
+Issues and PRs:
+
+* Umbrella issue: [apache/datafusion#17348][#17348]
+* `MinMaxStatistics` foundation: [apache/datafusion#9593][#9593]
+* Phase 1: [apache/datafusion#19064][#19064]
+* Phase 2: [apache/datafusion#21182][#21182]
+* `BufferExec` capacity for sort elimination: [apache/datafusion#21426][#21426]
+* Dynamic / TopK-driven path: [apache/datafusion#21351][#21351] (morsel-style work scheduling),
+  [apache/datafusion#21733][#21733] (global file reorder in shared queue)
+* Benchmark suite: [`sort_pushdown`]
+* Row-group reverse scan: [apache/datafusion#18817][#18817]
+* Page-level reverse (arrow-rs): [apache/arrow-rs#9934][#9934],
+  [apache/arrow-rs#9937][#9937]
+* `EnsureRequirements`: [apache/datafusion#21976][#21976]
diff --git a/content/images/sort-pushdown/benchmark.svg b/content/images/sort-pushdown/benchmark.svg
new file mode 100644
index 00000000..30afb7b2
--- /dev/null
+++ b/content/images/sort-pushdown/benchmark.svg
@@ -0,0 +1,75 @@
[SVG markup not preserved; text labels only. Title: "sort_pushdown benchmark (single partition, release, reversed-name data)". Bar chart, y-axis "latency (ms)", legend "main (before)" vs "sort pushdown phase 2", lower is better, `--partitions 1`, release. Bars: Q1 ORDER BY full 259 → 122 (2.1×); Q2 ORDER BY LIMIT 80 → 3 (27×); Q3 SELECT * ORDER BY 700 → 313 (2.2×); Q4 SELECT * ORDER BY LIMIT 342 → 7 (49×).]
diff --git a/content/images/sort-pushdown/phase1-file-reorder.svg b/content/images/sort-pushdown/phase1-file-reorder.svg
new file mode 100644
index 00000000..9ae798ba
--- /dev/null
+++ b/content/images/sort-pushdown/phase1-file-reorder.svg
@@ -0,0 +1,88 @@
[SVG markup not preserved; text labels only. Title: "Phase 1: file rearrangement by declared ordering". Query: `SELECT * FROM events ORDER BY ts`. Panel "Before (directory order)": a.parquet ts ∈ [200, 300], b.parquet ts ∈ [100, 200], c.parquet ts ∈ [0, 100]; validated_output_ordering() = None → SortExec required. Arrow labelled "PushdownSort, sort by min(ts)". Panel "After (sorted by stats)": c.parquet ts ∈ [0, 100], b.parquet ts ∈ [100, 200], a.parquet ts ∈ [200, 300]; validated_output_ordering() = Exact → SortExec removed. Panel "Range layout": axis 0 to 300 with c, b, a in sequence; "Non-overlapping → ordering provable".]
diff --git a/content/images/sort-pushdown/phase2-stats-overlap.svg b/content/images/sort-pushdown/phase2-stats-overlap.svg
new file mode 100644
index 00000000..027860ef
--- /dev/null
+++ b/content/images/sort-pushdown/phase2-stats-overlap.svg
@@ -0,0 +1,79 @@
[SVG markup not preserved; text labels only. Title: "Phase 2: use min/max statistics to prove non-overlap". Axis: min(ts) / max(ts), 0 to 300. Panel "Non-overlapping ranges": file_c [0..100], file_b [100..200], file_a [200..300]; Ordering: Exact ✓, SortExec can be removed. Panel "Overlapping ranges": file_x [0..180], file_y [80..260], file_z [140..300]; Ordering: Inexact (or stripped), SortExec stays. Caption: "PushdownSort sorts files by min, checks adjacency, upgrades to Exact only when ranges don't overlap."]
diff --git a/content/images/sort-pushdown/plan-diff.svg b/content/images/sort-pushdown/plan-diff.svg
new file mode 100644
index 00000000..a4d08673
--- /dev/null
+++ b/content/images/sort-pushdown/plan-diff.svg
@@ -0,0 +1,70 @@
[SVG markup not preserved; text labels only. Title: "EXPLAIN before / after sort pushdown". Panel "Before (SortExec on top)": CoalescePartitionsExec → SortExec (expr=[ts ASC], full sort) → RepartitionExec → DataSourceExec (files: [a.parquet, b.parquet, c.parquet]). Panel "After (SortExec eliminated)": SortPreservingMergeExec → SortExec (removed, no longer needed) → RepartitionExec → DataSourceExec (files: [c.parquet, b.parquet, a.parquet]).]
diff --git a/content/images/sort-pushdown/reverse-scan.svg b/content/images/sort-pushdown/reverse-scan.svg
new file mode 100644
index 00000000..443a0a1c
--- /dev/null
+++ b/content/images/sort-pushdown/reverse-scan.svg
@@ -0,0 +1,100 @@
[SVG markup not preserved; text labels only. Title: "ORDER BY ts DESC LIMIT 10: row-group reverse vs page reverse". Panel "Row-group reverse (today, merged)": RowGroup (last, ~128 MB), pages P0 to P7; "Decode the entire row group, reverse in memory, take 10."; Peak buffer: ~128 MB; Pages decoded: 8; Time-to-first-N: ~29 µs. Panel "Page reverse (upstream POC, arrow-rs #9937)": RowGroup (last), pages P0 to P7; "Seek to last page only via OffsetIndex, decode, reverse, return."]
+ Peak buffer: ~1 MB + Pages decoded: 1 + Time-to-first-N: ~565 ns (≈ 50× faster) + From 47e45dd28d7b2d2f07b2ce1e344ce25f82391de9 Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Thu, 14 May 2026 15:34:17 +0800 Subject: [PATCH 02/14] Push draft date to 2026-05-25 --- ...{2026-05-11-sort-pushdown.md => 2026-05-25-sort-pushdown.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename content/blog/{2026-05-11-sort-pushdown.md => 2026-05-25-sort-pushdown.md} (99%) diff --git a/content/blog/2026-05-11-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md similarity index 99% rename from content/blog/2026-05-11-sort-pushdown.md rename to content/blog/2026-05-25-sort-pushdown.md index d8726038..8dd9d8b8 100644 --- a/content/blog/2026-05-11-sort-pushdown.md +++ b/content/blog/2026-05-25-sort-pushdown.md @@ -1,7 +1,7 @@ --- layout: post title: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O -date: 2026-05-11 +date: 2026-05-25 author: Qi Zhu categories: [performance] --- From dab94fd74037eb6c1f57861574caa0497fcea9ae Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Fri, 15 May 2026 17:40:51 +0800 Subject: [PATCH 03/14] Add Phase 3 section covering #21956 runtime reorder pipeline MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The post previously only mentioned #21956 in passing. The PR is landing the full mechanism — `try_pushdown_sort` decision tree, two flags on `ParquetSource`, three composable runtime steps (file reorder + RG reorder + reverse), and a `sort_prefix`- preserving short-circuit — so cover it as a dedicated phase between Phase 2 and the existing reverse-scan section. - TL;DR: add a Phase 3 bullet alongside Phase 1 and Phase 2. - Phase 1: replace the in-flight `#21956` aside with a cross-link to the new section. - Phase 2: keep the caveat about function-wrapped sorts but note that #21956's `Inexact` path now covers them via monotonicity inference. 
- New `## Phase 3` section with two SVG diagrams: a decision tree for
  the three protocol outcomes, and a three-step pipeline for the
  `Inexact` runtime. Covers the two-flag design, the nested file/RG
  layers, when EXPLAIN surfaces each flag, and four scenarios where
  Phase 3 does not fire (aggregations, multi-column secondary keys,
  function-wrapped sorts without a declared ordering, source declares
  a forward prefix of the request).
- "What's Next": rename the old "Phase 3 — filter + sort" bullet to
  "Filtered reverse TopK end-to-end" so the label doesn't clash with
  the new section, and add a follow-up bullet referencing #22198 for
  multi-column / function-wrapped reorder.
---
 content/blog/2026-05-25-sort-pushdown.md      | 178 ++++++++++++++++--
 .../images/sort-pushdown/pr21956-decision.svg |  66 +++++++
 .../pr21956-runtime-pipeline.svg              |  69 +++++++
 3 files changed, 298 insertions(+), 15 deletions(-)
 create mode 100644 content/images/sort-pushdown/pr21956-decision.svg
 create mode 100644 content/images/sort-pushdown/pr21956-runtime-pipeline.svg

diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 8dd9d8b8..32366246 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -56,8 +56,10 @@ adding upstream in [arrow-rs] will push the gains further still.
 ## TL;DR
 
 * DataFusion can now **skip `SortExec` entirely** when input files are
-  already in the requested order.
-* Two phases:
+  already in the requested order, and **read the most-promising data
+  first** when they aren't — so `TopK` converges fast and the rest
+  gets pruned by statistics.
+* Three phases:
   * **Phase 1** — establish the `PushdownSort` rule and the
     `Exact` / `Inexact` / `Unsupported` protocol; ship the reverse
     row-group case for `ORDER BY ... DESC` (reports `Inexact`).
@@ -65,9 +67,16 @@ adding upstream in [arrow-rs] will push the gains further still.
     `min/max` statistics and *prove* non-overlap, upgrading
     `Unsupported` to `Exact` so `PushdownSort` removes the `SortExec`
     that `EnforceSorting` inserted earlier.
-* Real-world benchmarks on the `sort_pushdown` suite:
-  `ORDER BY ... LIMIT` queries get **27× and 49× faster**; full
-  `ORDER BY` scans get **~2×** faster.
+  * **Phase 3** ([#21956]) — generalise `Inexact`: whenever the
+    leading sort key is a plain column in the file schema (or the
+    source's reversed declared ordering satisfies the request),
+    `try_pushdown_sort` stamps two flags on the source and the
+    opener runs a three-step runtime pipeline — file-level reorder
+    in the shared morsel queue, row-group reorder by min/max stats,
+    then optional iteration reverse for `DESC` requests.
+* Real-world benchmarks on the `sort_pushdown` suite (Phase 2's
+  `Exact` upgrade): `ORDER BY ... LIMIT` queries get **27× and 49×
+  faster**; full `ORDER BY` scans get **~2×** faster.
 * Reverse scans (`ORDER BY ... DESC`) ride the same machinery: a
   merged row-group-level reverse returns `Inexact` (Sort stays, but
   `TopK` terminates early); the page-level reverse primitive needed
@@ -183,10 +192,9 @@ covered later in this post); otherwise it returns `Unsupported`.
 Phase 1's scope was deliberately narrow. It set up the API and
 delivered the reverse-scan case end-to-end, but it did **not** add any
 statistics-based file rearrangement — that came later in Phase 2.
-A finer-grained extension that reorders row groups *within* each file
-by min/max statistics — so the row group with the best sort-key value
-is read first and TopK can tighten its threshold faster — is also
-in progress in [#21956].
+A finer-grained extension that broadens this `Inexact` path with a
+three-step runtime reorder pipeline lands in [#21956] — covered in
+[Phase 3](#phase-3-runtime-reorder-for-inexact-pushdown) below.
 
 Phase 1 also produced a useful side improvement:
 
@@ -259,6 +267,12 @@ is no per-file min/max for the function output to compare against.
 Extending sort pushdown across monotonic function wrappers is one of
 the open follow-ups.
 
+*(Within #21956's `Inexact` path, `EquivalenceProperties`'s
+monotonicity inference does let function-wrapped sorts benefit from
+row-group iteration reverse when the source declares a compatible
+natural ordering — but stats-based reorder still needs a plain
+column.)*
+
 [Figure: Phase 2: detecting non-overlapping ranges via min/max statistics]
 
 The diagram above contrasts the two cases. On the left, ranges are
@@ -348,6 +362,130 @@ already produced correct orderings via byte-range splitting, so Phase 2
 simply does not trigger. There is no regression and no behavior change
 for the typical multi-threaded query.
 
+## Phase 3: Runtime Reorder for Inexact Pushdown
+
+Phase 2 handles the `Exact` upgrade — strong correctness, sort
+elimination — but only when the table has a declared
+`output_ordering` *and* the files are provably non-overlapping after
+sorting by min. Two large classes of queries fall outside that
+window:
+
+* **Unsorted data** — no `WITH ORDER`, no parquet `sorting_columns`.
+  Phase 2 cannot fire because there is no ordering claim to upgrade.
+* **Overlapping ranges** — files written by different ingestion
+  jobs share time windows. Phase 2 keeps the `SortExec` because the
+  global ordering can't be proven, even though the files often do
+  contain large stretches of in-order data.
+
+For both, a full external `SortExec` is overkill. The parquet
+metadata is right there, and reading the *most-promising* data
+first lets `TopK`'s dynamic filter threshold tighten quickly so the
+rest gets pruned. Phase 3 ([#21956]) wires that up by generalising
+the `Inexact` path Phase 1 introduced.
+
+### `try_pushdown_sort` — one decision, three outcomes
+
+[Figure: try_pushdown_sort decision tree: Exact, Inexact, or Unsupported]
+
+The `Exact` / `Inexact` / `Unsupported` protocol from Phase 1 stays.
+Phase 3 broadens the **conditions** that route a query into
+`Inexact`:
+
+| Condition | Outcome |
+| --- | --- |
+| `eq_properties.ordering_satisfy(request)` | `Exact` — Phase 1 / 2 sort elimination |
+| Leading sort key is a plain `Column` in the file schema, **or** the source's reversed declared ordering satisfies the request | `Inexact` — Phase 3 runtime pipeline |
+| Neither | `Unsupported` — `SortExec` stays, no source-side optimisation |
+
+The "reversed satisfies" branch is what handles function-wrapped
+sorts (`date_trunc('day', ts) DESC`, `ceil(value) DESC`,
+`CAST(x AS Date) DESC`) — `EquivalenceProperties`'s monotonicity
+reasoning recognises that `f(col) DESC` is satisfied by `col ASC`
+reversed, even though parquet has no stats keyed by `f(col)`
+itself.
+
+### Two flags on `ParquetSource`, three runtime steps
+
+[Figure: Phase 3 runtime pipeline: file reorder, RG reorder, then optional reverse]
+
+When `try_pushdown_sort` returns `Inexact`, it stamps two fields on
+the `ParquetSource`:
+
+```rust
+struct ParquetSource {
+    sort_order_for_reorder: Option, // what to reorder by
+    reverse_row_groups: bool, // whether to flip iteration
+    // ...
+}
+```
+
+The opener reads them at scan time to drive three composable steps:
+
+1. **File-level reorder.** `FileSource::reorder_files` sits in the
+   shared morsel queue (the [#21351] work-stealing primitive) and
+   sorts the partitioned-file list by `min(col)`. The first file
+   picked across all partitions is globally the most-promising one.
+2. **Row-group-level reorder.** Once a file is opened,
+   `PreparedAccessPlan::reorder_by_statistics` sorts that file's
+   `row_group_indexes` by `min(col)` ASC. The row group most likely
+   to contribute to `TopK` is decoded first.
+3. **Reverse.** For `DESC` requests,
+   `PreparedAccessPlan::reverse` flips the iteration after the
+   stats reorder normalises everything to ASC-by-min. Same
+   primitive Phase 1 introduced for declared reverse scans — Phase
+   3 just routes more queries through it.
+
+The two layers **nest by construction**: file `i`'s `min(col)` is
+a lower bound on every row group inside it, so the file queue's
+order is a natural prefix of the within-file row-group order.
+Choosing the same key (`min`) in both layers keeps the strategies
+consistent.
+
+`reverse_row_groups`'s meaning depends on which way `Inexact` was
+reached. When the column-in-schema condition fires, the stats
+reorder produces ASC-by-min, so `reverse_row_groups` simply mirrors
+the request direction. When only the reversed-equivalence
+condition fires (function-wrapped case with a declared source
+ordering), `reverse_row_groups` is `true` unconditionally — there
+is no stats reorder to compose with, just a flip of the file's
+natural order.
+
+Both flags surface on the `DataSourceExec` line in `EXPLAIN` so
+plan inspection and snapshot tests can confirm the pushdown fired:
+
+```text
+DataSourceExec: file_groups=..., file_type=parquet,
+  sort_order_for_reorder=[a@0 ASC], reverse_row_groups=true
+```
+
+Absence of either flag means the corresponding runtime step is a
+no-op.
+
+### When Phase 3 does *not* fire
+
+* **Aggregations on top of the sort key.** `SELECT URL, COUNT(*) AS c
+  FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10` (the ClickBench
+  TopK shape) — the leading sort key (`c`) is an aggregation result
+  and has no per-RG stats in the parquet file, so the
+  column-in-schema check fails. Pushing sort metadata through
+  `AggregateExec` is a separate problem: the aggregated value
+  doesn't exist before aggregation, so even if the metadata reached
+  the scan there'd be nothing actionable to do with it.
+* **Multi-column sort secondary keys.** The reorder currently only
+  uses the leading sort expression — secondary keys are ignored.
+  Tracked as a follow-up in [#22198].
+* **Function-wrapped sort without a source-declared ordering.**
+  Without a declared ordering to invert, the reversed-equivalence
+  branch has nothing to satisfy. Tracked in the same follow-up.
+* **Source declares a forward prefix of the request.** When the
+  source's declared `output_ordering` is a non-empty proper prefix
+  of the request (e.g. source `[a DESC, b ASC]`, request
+  `[a DESC, b ASC, c DESC]`), `try_pushdown_sort` returns
+  `Unsupported` so the surrounding `SortExec` can keep its
+  `sort_prefix` annotation — prefix-aware early termination in
+  `TopK` is strictly better than the Phase 3 reorder on data that
+  is already in prefix order on disk.
+
 ## Reverse Scans for `ORDER BY ... DESC`
 
 [Figure: Row-group reverse vs page reverse: 128MB and 8 pages vs 1MB and 1 page]
@@ -464,9 +602,11 @@ not alternatives; they compose:
   that shared queue by per-file statistics *before* any partition picks,
   so the first file read is globally optimal and tightens the dynamic
   filter immediately. Combined with **TopK threshold init from
-  parquet statistics** ([#21712]) and **row-group reorder within each
-  file** ([#21956]), the threshold can be set before reading a single
-  byte. The combined statistics-driven `TopK` pipeline is in flight
+  parquet statistics** ([#21712]) and **`try_pushdown_sort` driving
+  runtime row-group / file reorder + reverse** ([#21956], landed),
+  the threshold can be set before reading a single byte. The reorder
+  mechanism applies to any `ORDER BY [LIMIT N]` on
+  parquet, not just TopK queries with a dynamic filter. The combined statistics-driven `TopK` pipeline is in flight
   as [#21580].
 
   The mechanism here is **RG-level pruning, not mid-stream early
@@ -516,9 +656,9 @@ not alternatives; they compose:
   narrows what's decoded inside the row groups the reader does open.
   `Exact` or finer-grained scheduling is what eventually closes the tap
   once `TopK` is satisfied.
-* **Phase 3 — filter + sort early termination.** `WHERE filter ORDER
-  BY ts DESC LIMIT N` is the dominant observability query shape and
-  the one where the arrow-rs page-reverse primitive matters most:
+* **Filtered reverse TopK end-to-end.** `WHERE filter ORDER BY ts
+  DESC LIMIT N` is the dominant observability query shape and the
+  one where the arrow-rs page-reverse primitive matters most:
   `RowSelection::with_limit` cannot pre-compute the last `N` matching
   rows when the filter is selective, so the only correct strategy is to
   stream pages backward, evaluate the filter, and stop when `N`
@@ -535,9 +675,17 @@ not alternatives; they compose:
 * **OFFSET pushdown to parquet** ([#21828]) so `ORDER BY ts LIMIT K
   OFFSET N` queries can skip the first `N` rows at the row-group level
   instead of decoding and discarding them. In progress.
+* **Multi-column and function-wrapped reorder follow-ups** ([#22198]).
+  The reorder mechanism in #21956 currently only uses the leading
+  sort key and only fires on plain columns. Lexicographic multi-key
+  reorder via `arrow::compute::lexsort_to_indices` is low-hanging
+  fruit; extending to monotonic function wrappers via leaf-column
+  extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs a
+  bit more `EquivalenceProperties` integration but is doable.
 [#21976]: https://github.com/apache/datafusion/pull/21976
 [#21956]: https://github.com/apache/datafusion/pull/21956
+[#22198]: https://github.com/apache/datafusion/issues/22198
 [#21712]: https://github.com/apache/datafusion/pull/21712
 [#21580]: https://github.com/apache/datafusion/pull/21580
 [#21828]: https://github.com/apache/datafusion/pull/21828
diff --git a/content/images/sort-pushdown/pr21956-decision.svg b/content/images/sort-pushdown/pr21956-decision.svg
new file mode 100644
index 00000000..a8203241
--- /dev/null
+++ b/content/images/sort-pushdown/pr21956-decision.svg
@@ -0,0 +1,66 @@
[Figure (SVG): "try_pushdown_sort: Exact / Inexact / Unsupported decision". The PushdownSort rule calls source.try_pushdown_sort(req, eq). First check: eq.ordering_satisfy(req)? (natural ordering already matches?) If yes: Exact, drop SortExec. If no, second check: column_in_file_schema || reversed_satisfies? If yes: Inexact, set both flags. If no: Unsupported, SortExec stays. Legend: Exact → Phase 2 sort elimination · fetch becomes static limit. Inexact → #21956 runtime pipeline: file reorder + RG reorder + reverse · SortExec / TopK kept on top for correctness.]
diff --git a/content/images/sort-pushdown/pr21956-runtime-pipeline.svg b/content/images/sort-pushdown/pr21956-runtime-pipeline.svg
new file mode 100644
index 00000000..5bb8d678
--- /dev/null
+++ b/content/images/sort-pushdown/pr21956-runtime-pipeline.svg
@@ -0,0 +1,69 @@
[Figure (SVG): "Inexact pushdown: two flags drive a three-step runtime pipeline". Header box: ParquetSource carries the inexact-pushdown decision; sort_order_for_reorder = Some([req_col ASC | DESC]); reverse_row_groups = bool; both set by try_pushdown_sort, read by the opener at scan time. Step 1, "File-level reorder · shared morsel queue": FileSource::reorder_files → sort files by min(col); first file picked across all partitions is globally the most-promising one. For each opened file, step 2, "Row-group-level reorder · per file": PreparedAccessPlan::reorder_by_statistics → row_group_indexes sorted ASC by min(col) using parquet column statistics. If reverse_row_groups, step 3, "Reverse iteration · DESC requests": PreparedAccessPlan::reverse → row_group_indexes.into_iter().rev(). Footer: Decoder reads row groups in this order; SortExec / TopK above the source still enforces final ordering, since the stats reorder is approximate, not strict.]

From d2ad95c7633b31be54408c63f57a82b7e1f46bd9 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sat, 16 May 2026 09:42:52 +0800
Subject: [PATCH 04/14] Add empirical note: why we keep an out-of-tree RG-level Exact reverse
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add an ### Empirical note subsection inside "Reverse Scans for
`ORDER BY ... DESC`" that records what we found running an in-house
RG-level `Exact` reverse against upstream `Inexact` + `TopK`:

- `LIMIT N` does not propagate as a static stop signal in the Inexact
  path. The dynamic filter pushdown can stats-prune *subsequent* row
  groups once the threshold tightens, but inside the row group `TopK`
  is currently reading, the sort column has to be fully decoded so
  the filter can be evaluated row by row. `LIMIT 10` on a 1M-row row
  group is still ~1M sort-column decodes regardless of N. LIMIT only
  saves work on non-sort columns inside that row group and on whole
  subsequent row groups the threshold prunes.
- `SortExec` stays on top of `Inexact`, so the final ordering pass and
  per-row heap maintenance are both extra costs the `Exact` path
  (which deletes `SortExec`) does not pay.

Then explain why we run RG-level `Exact` in production but did not
upstream it: parquet does not allow partial row-group reads, so any
RG-level `Exact` implementation peaks at one whole row group
(~128 MB) of decoded data in memory — the same constraint that
closed `#18817`.
Our runtime advantage comes from skipping heap / filter / `SortExec`
overhead, not from decoding less.

Frame the page-level `Exact` reverse work in arrow-rs `#9937` as the
path that keeps the runtime win we measured while bringing peak
memory back into the streaming regime via `OffsetIndex` page-level
seek.
---
 content/blog/2026-05-25-sort-pushdown.md | 74 ++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 32366246..96708cd0 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -529,6 +529,80 @@ narrower `Inexact` row-group-reverse first (which became Phase 1 in
 [#19064]), and to build `Exact` reverse on a finer-grained primitive
 once `arrow-rs` exposed one.
 
+### Empirical note — runtime cost of `Inexact` + `TopK`
+
+We run an internal row-group-level `Exact` reverse implementation in
+production and tested swapping in upstream's `Inexact` row-group
+reverse + `TopK` on `ORDER BY ts DESC LIMIT N` queries. End-to-end
+latency went **up**, not down. A few cost components stack up on the
+`Inexact` + `TopK` side:
+
+* **`LIMIT N` does not propagate as a static stop signal to the
+  source.** In the `Inexact` path the `SortExec` stays on top and
+  `TopK`'s fetch belongs to `SortExec`, not to the parquet scan. The
+  only mechanism that can cut work below the `SortExec` is the
+  dynamic-filter pushdown: as the heap fills, the filter (`ts >
+  threshold`) is pushed to the source and its threshold tightens
+  with every batch. That filter is enough to **stats-prune
+  subsequent, not-yet-opened row groups** entirely — if a row
+  group's `max(ts) < threshold` it is skipped without decode. But
+  inside the row group the source is currently reading, the
+  filter pushdown does not unwind to "stop": the sort column has
+  to be **fully decoded** so the filter can be evaluated row by
+  row, the surviving rows feed the heap to tighten the threshold,
+  and only then can the resulting `RowSelection` skip the *other*
+  columns for rows that didn't pass. For
+  `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that is still
+  ~1M sort-column decodes regardless of `N`; the LIMIT only saves
+  work on non-sort columns inside the same row group and on whole
+  *subsequent* row groups that the tightened threshold can prune.
+  The internal `Exact` reverse path, by contrast, deletes the
+  `SortExec` so the LIMIT becomes a static fetch on the source.
+  The source walks pages of the target row group from the back,
+  decodes each batch, reverses the batch row-wise, emits — and
+  stops the moment K rows have been delivered. For
+  `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that is one
+  batch worth of decode work, not 1M. No filter machinery, no
+  heap, no per-row threshold check.
+* **`SortExec` itself adds ordering work on top of `Inexact`.** The
+  reversed-RG stream is not strictly DESC (rows within each RG are
+  still forward), so `Inexact` keeps the surrounding `SortExec`.
+  Even when the heap is settled and the dynamic filter has
+  pruned the tail, the outer operator does its own final ordering
+  pass — overhead that `Exact` (which deletes the `SortExec`)
+  does not pay.
+
+Why didn't we just upstream the internal `Exact` reverse, then?
+**Memory.** Parquet does not allow reading only part of a row
+group, so any RG-level `Exact` implementation — ours included —
+has to decode the entire row group, reverse the buffer in
+memory, and only then emit. That is the same memory profile that
+`#18817` was rejected for: a peak of one whole row group
+(~128 MB) of decoded data, vs. the few-MB-per-batch streaming
+profile readers normally have. Our runtime advantage over
+`Inexact` + `TopK` does *not* come from decoding less — both
+paths decode the relevant row group's sort column in full — it
+comes from skipping the per-row heap maintenance, the dynamic
+filter evaluation, and the `SortExec` final ordering pass that
+`Inexact` keeps on top. So we end up running our `Exact` reverse
+in-house but cannot land it as the upstream default for the same
+memory reason that closed `#18817`.
+
+**That is the direct motivation behind the page-level `Exact`
+reverse work we are pushing upstream in arrow-rs `#9937`.** It
+shrinks the unit of work from one whole row group down to one
+page (~1 MB): the reader uses parquet's `OffsetIndex` to `seek`
+to the last page of the column chunk, decode it forward, reverse
+the resulting batch, and emit — without ever materialising the
+rest of the row group in memory. The streaming memory profile is
+preserved and the runtime advantage we measured internally is
+kept. Once `#9937` and the DataFusion follow-up land, the
+upstream default for `ORDER BY ts DESC LIMIT N` becomes `Exact`
+reverse at page granularity: `SortExec` removed, static fetch on
+the source, peak memory in the streaming regime, no `TopK` heap
+overhead, with K rows returned after roughly one page's worth of
+decode work.
+
 That primitive is the **page-level** reverse traversal. Parquet's
 `OffsetIndex` already gives us byte-precise locations for every data
 page in a column chunk, so we can `seek` directly to the last page,

From 8921f90d572ab26bfd11dabe40ae3b5c1f0bb672 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sat, 16 May 2026 16:13:48 +0800
Subject: [PATCH 05/14] Correct internal RG-Exact description and trim arrow-rs #9937 duplication
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two corrections in the empirical-note / reverse-scans section:

1. The internal RG-level `Exact` reverse path was incorrectly
   described as "walks pages from the back, decodes each batch,
   reverses row-wise, stops the moment K rows have been delivered."
   That is actually the page-level `Exact` shape (arrow-rs #9937),
   not the in-house RG-level implementation. Parquet does not allow
   partial row-group reads, so the in-house path has to decode the
   entire target row group, reverse the buffer in memory, take the
   first K rows, and stop — same memory profile as #18817's
   proposal. The runtime advantage over `Inexact` + `TopK` comes
   from removing the per-row heap maintenance, dynamic-filter
   evaluation, and `SortExec` final ordering pass, not from decoding
   less sort-column data. Sort col decode on the target row group is
   the same on both paths.

2. The arrow-rs #9937 paragraph I previously added duplicated the
   technical detail already present in the long-standing "That
   primitive is the page-level reverse traversal..." paragraph.
   Replaced with a one-sentence bridge ("The fix that keeps both the
   runtime win and a streaming memory profile is page-level `Exact`
   reverse via arrow-rs #9937, described next.") so the existing
   paragraph carries the explanation without repetition.
---
 content/blog/2026-05-25-sort-pushdown.md | 35 +++++++++---------------
 1 file changed, 13 insertions(+), 22 deletions(-)

diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 96708cd0..acafec65 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -556,14 +556,16 @@ latency went **up**, not down. A few cost components stack up on the
   ~1M sort-column decodes regardless of `N`; the LIMIT only saves
   work on non-sort columns inside the same row group and on whole
   *subsequent* row groups that the tightened threshold can prune.
-  The internal `Exact` reverse path, by contrast, deletes the
-  `SortExec` so the LIMIT becomes a static fetch on the source.
-  The source walks pages of the target row group from the back,
-  decodes each batch, reverses the batch row-wise, emits — and
-  stops the moment K rows have been delivered. For
-  `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that is one
-  batch worth of decode work, not 1M. No filter machinery, no
-  heap, no per-row threshold check.
+  The internal RG-level `Exact` reverse path, by contrast, deletes
+  the `SortExec` so the LIMIT becomes a static fetch on the source.
+  The source still has to decode the target row group in full —
+  parquet does not allow partial row-group reads, so this part is
+  the same as `Inexact` — but it then reverses the buffer in
+  memory, takes the first K rows, and **stops**. No subsequent row
+  group is opened, no stats check, no filter machinery, no per-row
+  heap maintenance, no `SortExec` final ordering pass. The wins
+  come from removing those per-row and per-RG overheads on top, not
+  from decoding less sort-column data on the target row group.
 * **`SortExec` itself adds ordering work on top of `Inexact`.** The
   reversed-RG stream is not strictly DESC (rows within each RG are
   still forward), so `Inexact` keeps the surrounding `SortExec`.
@@ -588,20 +590,9 @@ filter evaluation, and the `SortExec` final ordering pass that
 in-house but cannot land it as the upstream default for the same
 memory reason that closed `#18817`.
 
-**That is the direct motivation behind the page-level `Exact`
-reverse work we are pushing upstream in arrow-rs `#9937`.** It
-shrinks the unit of work from one whole row group down to one
-page (~1 MB): the reader uses parquet's `OffsetIndex` to `seek`
-to the last page of the column chunk, decode it forward, reverse
-the resulting batch, and emit — without ever materialising the
-rest of the row group in memory. The streaming memory profile is
-preserved and the runtime advantage we measured internally is
-kept. Once `#9937` and the DataFusion follow-up land, the
-upstream default for `ORDER BY ts DESC LIMIT N` becomes `Exact`
-reverse at page granularity: `SortExec` removed, static fetch on
-the source, peak memory in the streaming regime, no `TopK` heap
-overhead, with K rows returned after roughly one page's worth of
-decode work.
+**The fix that keeps both the runtime win and a streaming memory
+profile is page-level `Exact` reverse via arrow-rs [#9937]**,
+described next.
 
 That primitive is the **page-level** reverse traversal. Parquet's
 `OffsetIndex` already gives us byte-precise locations for every data

From 33951193eb2ec8594ff3038949ce79468f93dc25 Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 13:16:47 +0800
Subject: [PATCH 06/14] Mark #21956 as landed (merged via merge queue)

---
 content/blog/2026-05-25-sort-pushdown.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index acafec65..082dede7 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -193,7 +193,7 @@ Phase 1's scope was deliberately narrow. It set up the API and
 delivered the reverse-scan case end-to-end, but it did **not** add any
 statistics-based file rearrangement — that came later in Phase 2.
 A finer-grained extension that broadens this `Inexact` path with a
-three-step runtime reorder pipeline lands in [#21956] — covered in
+three-step runtime reorder pipeline landed in [#21956] — covered in
 [Phase 3](#phase-3-runtime-reorder-for-inexact-pushdown) below.
 
 Phase 1 also produced a useful side improvement:

From 7c51d24342f42006ec0129ad2bb5674e8b8e542e Mon Sep 17 00:00:00 2001
From: Qi Zhu <821684824@qq.com>
Date: Sun, 17 May 2026 14:25:09 +0800
Subject: [PATCH 07/14] =?UTF-8?q?Reframe=20blog=20from=20Phase=20N=20?=
 =?UTF-8?q?=E2=86=92=20capability-based=20sections?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

All three sort-pushdown PRs have now landed, so the chronological
'Phase 1/2/3' framing is less useful for readers than a capability
breakdown. Sections are now:

- The PushdownSort Rule (#19064)
- Sort Elimination via Statistics (#21182)
- Runtime Reorder for TopK Convergence (#21956)
- Reverse Scans for ORDER BY ... DESC (unchanged)

In-body Phase references replaced with PR numbers or capability
names; anchor links updated; references section restructured.
---
 content/blog/2026-05-25-sort-pushdown.md | 194 ++++++++++++-----------
 1 file changed, 104 insertions(+), 90 deletions(-)

diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
index 082dede7..205d4807 100644
--- a/content/blog/2026-05-25-sort-pushdown.md
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -44,12 +44,13 @@ already in that order. CPU wasted. Memory wasted. Streaming defeated.
 
 [Apache DataFusion]: https://datafusion.apache.org/
 
-This post walks through the **sort pushdown** work that closed that gap.
-It is structured in two phases — file rearrangement first, then a
-statistics-based proof of non-overlap — and lands real benchmark
-speedups of **2.1×–49× on common queries**. The same machinery extends
-to `ORDER BY ... DESC`, and the page-level reverse primitive we are
-adding upstream in [arrow-rs] will push the gains further still.
+This post walks through the **sort pushdown** work that closed that
+gap. It covers two complementary capabilities — sort elimination via
+statistics, and runtime reorder for `TopK` convergence — and lands
+real benchmark speedups of **2.1×–49× on common queries**. The same
+machinery extends to `ORDER BY ... DESC`, and the page-level reverse
+primitive we are adding upstream in [arrow-rs] will push the gains
+further still.
 
 [arrow-rs]: https://github.com/apache/arrow-rs
 
@@ -59,29 +60,36 @@ adding upstream in [arrow-rs] will push the gains further still.
   already in the requested order, and **read the most-promising data
   first** when they aren't — so `TopK` converges fast and the rest
   gets pruned by statistics.
-* Three phases:
-  * **Phase 1** — establish the `PushdownSort` rule and the
-    `Exact` / `Inexact` / `Unsupported` protocol; ship the reverse
-    row-group case for `ORDER BY ... DESC` (reports `Inexact`).
-  * **Phase 2** — sort files within each partition by Parquet
-    `min/max` statistics and *prove* non-overlap, upgrading
-    `Unsupported` to `Exact` so `PushdownSort` removes the `SortExec`
-    that `EnforceSorting` inserted earlier.
-  * **Phase 3** ([#21956]) — generalise `Inexact`: whenever the
-    leading sort key is a plain column in the file schema (or the
-    source's reversed declared ordering satisfies the request),
+* What's supported today:
+  * **The `PushdownSort` rule** ([#19064]) — a physical optimizer
+    rule that asks each `ExecutionPlan` "can you produce output in
+    *this* ordering?" and uses the
+    `Exact` / `Inexact` / `Unsupported` answer to decide whether to
+    delete the surrounding `SortExec`, leave it in place with a
+    hint, or give up.
+  * **Sort elimination via statistics** ([#21182]) — `PushdownSort`
+    sorts files within each partition by Parquet `min/max`
+    statistics and, when the resulting ranges are provably
+    non-overlapping, upgrades the source's ordering claim from
+    `Unsupported` to `Exact` and **removes the `SortExec`** that
+    `EnforceSorting` inserted earlier.
+ * **Runtime reorder for `TopK` convergence** ([#21956]) — whenever + the leading sort key is a plain column in the file schema (or + the source's reversed declared ordering satisfies the request), `try_pushdown_sort` stamps two flags on the source and the opener runs a three-step runtime pipeline — file-level reorder in the shared morsel queue, row-group reorder by min/max stats, - then optional iteration reverse for `DESC` requests. -* Real-world benchmarks on the `sort_pushdown` suite (Phase 2's - `Exact` upgrade): `ORDER BY ... LIMIT` queries get **27× and 49× - faster**; full `ORDER BY` scans get **~2×** faster. -* Reverse scans (`ORDER BY ... DESC`) ride the same machinery: a - merged row-group-level reverse returns `Inexact` (Sort stays, but - `TopK` terminates early); the page-level reverse primitive needed - for `Exact` reverse — and so for full `SortExec` removal on `DESC` - queries — is in flight in arrow-rs. + then optional iteration reverse for `DESC` requests. `SortExec` + stays, but `TopK`'s dynamic filter tightens fast on the + most-promising data and the rest is pruned. + * **Reverse scans for `ORDER BY ... DESC`** ([#19446], [#19557]) — + a row-group-level reverse returns `Inexact` (Sort stays, but + `TopK` terminates early). The page-level reverse primitive + needed for `Exact` reverse — and so for full `SortExec` removal + on `DESC` queries — is in flight in arrow-rs ([#9937]). +* Real-world benchmarks on the `sort_pushdown` suite (`Exact` path): + `ORDER BY ... LIMIT` queries get **27× and 49× faster**; full + `ORDER BY` scans get **~2×** faster. ## Why Sort Pushdown Matters @@ -169,11 +177,10 @@ pushdown loses the `SortExec` node. Everything downstream — the optimizer to convince itself that the bottom of the plan is producing the order requested. 
-## Phase 1: The Pushdown API and Reverse Scans +## The `PushdownSort` Rule -Phase 1 ([#19064]) introduced the **`PushdownSort`** physical -optimizer rule and a uniform API for asking each `ExecutionPlan` two -questions: +[#19064] introduced the **`PushdownSort`** physical optimizer rule +and a uniform API for asking each `ExecutionPlan` two questions: [#19064]: https://github.com/apache/datafusion/pull/19064 @@ -189,14 +196,17 @@ it returns `Inexact` and flips on `reverse_row_groups=true` so the scan reads row groups from last to first (the row-group-level reverse covered later in this post); otherwise it returns `Unsupported`. -Phase 1's scope was deliberately narrow. It set up the API and -delivered the reverse-scan case end-to-end, but it did **not** add -any statistics-based file rearrangement — that came later in Phase 2. -A finer-grained extension that broadens this `Inexact` path with a -three-step runtime reorder pipeline landed in [#21956] — covered in -[Phase 3](#phase-3-runtime-reorder-for-inexact-pushdown) below. +The initial PR's scope was deliberately narrow. It set up the API +and delivered the reverse-scan case end-to-end, but did **not** add +any statistics-based file rearrangement — that came later via +[#21182], covered in +[Sort Elimination via Statistics](#sort-elimination-via-statistics) +below. A finer-grained extension that broadens this `Inexact` path +with a three-step runtime reorder pipeline landed in [#21956] — +covered in +[Runtime Reorder for TopK Convergence](#runtime-reorder-for-topk-convergence). 
-Phase 1 also produced a useful side improvement: +[#19064] also produced a useful side improvement: * **Reverse-output redesign** ([#19446], [#19557]) extended the same rule to `DESC` queries — picked up again in the reverse-scan @@ -205,12 +215,13 @@ Phase 1 also produced a useful side improvement: [#19446]: https://github.com/apache/datafusion/pull/19446 [#19557]: https://github.com/apache/datafusion/pull/19557 -## Phase 2: Use Statistics to Prove Non-Overlap +## Sort Elimination via Statistics -Phase 2: rearranging files within a partition by min/max statistics so the file list is in range order +Sort elimination: rearranging files within a partition by min/max statistics so the file list is in range order -Phase 1 left a sharp edge that motivated Phase 2 ([#21182]). Consider -this realistic scenario: +The initial `Inexact`-only path left a sharp edge that motivated +stats-based sort elimination ([#21182]). Consider this realistic +scenario: [#21182]: https://github.com/apache/datafusion/pull/21182 @@ -228,7 +239,7 @@ the scan now has no declared ordering, so `EnforceSorting` (which runs earlier in the pipeline) inserts a `SortExec`. The data is sorted on disk; the optimizer just can't tell. -Phase 2 fixes this in `PushdownSort`, which runs late — after +[#21182] fixes this in `PushdownSort`, which runs late — after `EnforceDistribution` and `EnforceSorting` have already shaped the plan. When `PushdownSort` finds a `SortExec` above a file scan whose ordering was stripped (a `FileSource` `Unsupported` result), it does @@ -239,7 +250,7 @@ three things inside `FileScanConfig::try_pushdown_sort`: pre-existing [`MinMaxStatistics`] helper (introduced in [#9593]) reads each file's `column_statistics[c].min_value` / `.max_value` for each sort column `c`, then sorts the file list by - the min row. Phase 2 wires this helper into the optimizer's + the min row. 
The PR wires this helper into the optimizer's `Unsupported` branch — `sort_files_within_groups_by_statistics` does the per-group orchestration and decides whether any group is non-overlapping after the sort. @@ -273,14 +284,14 @@ row-group iteration reverse when the source declares a compatible natural ordering — but stats-based reorder still needs a plain column.)* -Phase 2: detecting non-overlapping ranges via min/max statistics +Detecting non-overlapping ranges via min/max statistics The diagram above contrasts the two cases. On the left, ranges are non-overlapping after sort, so we can guarantee that emitting the files in min-order produces a globally sorted stream. On the right, the ranges overlap, so even after sorting the files by `min(ts)` we -cannot guarantee global ordering — Phase 2 correctly bails out and -keeps `SortExec` in place. +cannot guarantee global ordering — the upgrade is skipped and +`SortExec` stays in place. The implementation handles a few edge cases worth calling out: @@ -290,32 +301,32 @@ The implementation handles a few edge cases worth calling out: acting as an *implicit in-memory buffer* for the SPM above it. The SPM picks rows from each partition stream one at a time; without the upstream `SortExec` holding batches in memory, the SPM would - read directly from I/O-bound sources and stall on every pick. Phase - 2 compensates by inserting a [`BufferExec`] in the `SortExec`'s + read directly from I/O-bound sources and stall on every pick. The + rule compensates by inserting a [`BufferExec`] in the `SortExec`'s place — bounded streaming buffer, same throughput shape, no blocking sort. Capacity is configurable via [`sort_pushdown_buffer_capacity`] ([#21426]). * **`fetch` preservation** through `EnforceDistribution`. The distribution rule sometimes strips a `SortExec`'s `fetch` field and - re-adds the node later. Phase 2 plumbs `fetch` through so a + re-adds the node later. The PR plumbs `fetch` through so a surviving `LIMIT` is not lost. 
-* **Per-group, not global, non-overlap.** Phase 2's adjacency check is +* **Per-group, not global, non-overlap.** The adjacency check is scoped to each file group. Two file groups can have *overlapping* ranges and the upgrade still fires, as long as each group is internally non-overlapping. That works because each group already produces an independently ordered stream at runtime, and `SortPreservingMergeExec` then picks rows across streams in value - order to produce the final globally sorted output. Phase 2 only has - to prove the per-stream property. + order to produce the final globally sorted output. The rule only + has to prove the per-stream property. * **Single-partition vs multi-partition execution**. With the default multi-partition setup, `EnforceDistribution` byte-range-splits files into single-file groups, after which `validated_output_ordering()` - works correctly on its own. Phase 2 only triggers when files have - not been split — typically `--partitions 1` runs, or files small - enough that the splitter leaves them alone. In the typical `--partitions - 1` case the "per-group" distinction collapses (one group equals the - whole table), which is why the example earlier in this section is - drawn that way. + works correctly on its own. Stats-based reorder only triggers when + files have not been split — typically `--partitions 1` runs, or + files small enough that the splitter leaves them alone. In the + typical `--partitions 1` case the "per-group" distinction collapses + (one group equals the whole table), which is why the example earlier + in this section is drawn that way. 
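The sort-by-min step and the per-group adjacency check described in this section can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not DataFusion's implementation: `FileRange` and `sort_and_check_non_overlapping` are hypothetical names, and the statistics are simplified to a single `i64` sort column.

```rust
/// Hypothetical, simplified stand-in for one file's min/max statistics
/// on a single sort column (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: &'static str,
    min: i64,
    max: i64,
}

/// Sorts one file group by `min`, then reports whether every adjacent
/// pair satisfies `file[i].max <= file[i + 1].min`. Ranges that only
/// touch at the boundary count as non-overlapping.
fn sort_and_check_non_overlapping(group: &mut [FileRange]) -> bool {
    group.sort_by_key(|f| f.min);
    group.windows(2).all(|pair| pair[0].max <= pair[1].min)
}

fn main() {
    // File names are in the "wrong" order relative to their value ranges.
    let mut disjoint = vec![
        FileRange { name: "a.parquet", min: 200, max: 299 },
        FileRange { name: "b.parquet", min: 100, max: 199 },
        FileRange { name: "c.parquet", min: 0, max: 99 },
    ];
    let ok = sort_and_check_non_overlapping(&mut disjoint);
    let names: Vec<_> = disjoint.iter().map(|f| f.name).collect();
    println!("sorted = {names:?}, non-overlapping = {ok}");

    // Overlapping ranges: sorting by min is still possible, but the
    // adjacency check fails, so a sort would have to stay in the plan.
    let mut overlapping = vec![
        FileRange { name: "x.parquet", min: 0, max: 150 },
        FileRange { name: "y.parquet", min: 100, max: 250 },
    ];
    let ok = sort_and_check_non_overlapping(&mut overlapping);
    println!("overlapping group, non-overlapping = {ok}");
}
```

The check runs on one file group at a time, matching the per-group scoping above: groups may overlap each other as long as each group is internally non-overlapping.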
[`BufferExec`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/buffer.rs [`sort_pushdown_buffer_capacity`]: https://github.com/apache/datafusion/pull/21426 @@ -323,13 +334,13 @@ The implementation handles a few edge cases worth calling out: ## Benchmarks -Sort pushdown phase 2 benchmark: 2x-49x speedup across four queries +Sort pushdown benchmark: 2x-49x speedup across four queries The [`sort_pushdown`] benchmark suite reproduces the "wrong-order file list" scenario by generating Parquet files whose names are intentionally reversed against their sort-key ranges. Numbers -below are `--partitions 1`, release build, on the merged Phase 2 -branch versus `main`: +below are `--partitions 1`, release build, with stats-based sort +elimination ([#21182]) enabled, versus `main`: [`sort_pushdown`]: https://github.com/apache/datafusion/tree/main/benchmarks/queries/sort_pushdown @@ -359,42 +370,44 @@ removed: It is worth saying explicitly what this change does **not** affect. The default multi-partition execution path is unchanged: those plans already produced correct orderings via byte-range splitting, so -Phase 2 simply does not trigger. There is no regression and no behavior -change for the typical multi-threaded query. +stats-based sort elimination simply does not trigger. There is no +regression and no behavior change for the typical multi-threaded +query. -## Phase 3: Runtime Reorder for Inexact Pushdown +## Runtime Reorder for TopK Convergence -Phase 2 handles the `Exact` upgrade — strong correctness, sort -elimination — but only when the table has a declared -`output_ordering` *and* the files are provably non-overlapping after -sorting by min. Two large classes of queries fall outside that -window: +Stats-based sort elimination handles the `Exact` upgrade — strong +correctness, sort elimination — but only when the table has a +declared `output_ordering` *and* the files are provably +non-overlapping after sorting by min. 
Two large classes of queries +fall outside that window: * **Unsorted data** — no `WITH ORDER`, no parquet `sorting_columns`. - Phase 2 cannot fire because there is no ordering claim to upgrade. + The `Exact` upgrade cannot fire because there is no ordering + claim to upgrade. * **Overlapping ranges** — files written by different ingestion - jobs share time windows. Phase 2 keeps the `SortExec` because the - global ordering can't be proven, even though the files often do - contain large stretches of in-order data. + jobs share time windows. The `Exact` upgrade keeps the `SortExec` + because the global ordering can't be proven, even though the + files often do contain large stretches of in-order data. For both, a full external `SortExec` is overkill. The parquet metadata is right there, and reading the *most-promising* data first lets `TopK`'s dynamic filter threshold tighten quickly so the -rest gets pruned. Phase 3 ([#21956]) wires that up by generalising -the `Inexact` path Phase 1 introduced. +rest gets pruned. [#21956] wires that up by generalising the +`Inexact` path that [#19064] introduced. ### `try_pushdown_sort` — one decision, three outcomes try_pushdown_sort decision tree: Exact, Inexact, or Unsupported -The `Exact` / `Inexact` / `Unsupported` protocol from Phase 1 stays. -Phase 3 broadens the **conditions** that route a query into -`Inexact`: +The `Exact` / `Inexact` / `Unsupported` protocol from [#19064] +stays. 
The new PR broadens the **conditions** that route a query +into `Inexact`: | Condition | Outcome | | --- | --- | -| `eq_properties.ordering_satisfy(request)` | `Exact` — Phase 1 / 2 sort elimination | -| Leading sort key is a plain `Column` in the file schema, **or** the source's reversed declared ordering satisfies the request | `Inexact` — Phase 3 runtime pipeline | +| `eq_properties.ordering_satisfy(request)` | `Exact` — sort elimination | +| Leading sort key is a plain `Column` in the file schema, **or** the source's reversed declared ordering satisfies the request | `Inexact` — runtime reorder pipeline | | Neither | `Unsupported` — `SortExec` stays, no source-side optimisation | The "reversed satisfies" branch is what handles function-wrapped @@ -406,7 +419,7 @@ itself. ### Two flags on `ParquetSource`, three runtime steps -Phase 3 runtime pipeline: file reorder, RG reorder, then optional reverse +Runtime reorder pipeline: file reorder, RG reorder, then optional reverse When `try_pushdown_sort` returns `Inexact`, it stamps two fields on the `ParquetSource`: @@ -432,8 +445,8 @@ The opener reads them at scan time to drive three composable steps: 3. **Reverse.** For `DESC` requests, `PreparedAccessPlan::reverse` flips the iteration after the stats reorder normalises everything to ASC-by-min. Same - primitive Phase 1 introduced for declared reverse scans — Phase - 3 just routes more queries through it. + primitive [#19064] introduced for declared reverse scans — + [#21956] just routes more queries through it. The two layers **nest by construction**: file `i`'s `min(col)` is a lower bound on every row group inside it, so the file queue's @@ -461,7 +474,7 @@ DataSourceExec: file_groups=..., file_type=parquet, Absence of either flag means the corresponding runtime step is a no-op. 
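The three-row decision table earlier in this section can be read as a small classification function. The sketch below is illustrative only: `PushdownOutcome`, `SourceFacts`, and `classify` are hypothetical names, the three booleans stand in for checks that the real optimizer performs against equivalence properties and the file schema, and special cases beyond the table are not modeled.

```rust
/// Hypothetical mirror of the three outcomes named in the table.
#[derive(Debug, PartialEq)]
enum PushdownOutcome {
    Exact,
    Inexact,
    Unsupported,
}

/// Hypothetical summary of what the source knows about a sort request.
/// Each field stands in for one condition from the table.
struct SourceFacts {
    ordering_satisfies_request: bool,
    leading_key_is_plain_column: bool,
    reversed_ordering_satisfies_request: bool,
}

/// Applies the table's rows in order: first match wins.
fn classify(facts: &SourceFacts) -> PushdownOutcome {
    if facts.ordering_satisfies_request {
        // The source already produces the requested order.
        PushdownOutcome::Exact
    } else if facts.leading_key_is_plain_column
        || facts.reversed_ordering_satisfies_request
    {
        // The sort stays, but the source can read promising data first.
        PushdownOutcome::Inexact
    } else {
        PushdownOutcome::Unsupported
    }
}

fn main() {
    let unsorted_plain_column = SourceFacts {
        ordering_satisfies_request: false,
        leading_key_is_plain_column: true,
        reversed_ordering_satisfies_request: false,
    };
    println!("{:?}", classify(&unsorted_plain_column));
}
```

Checking the `Exact` condition first reflects the table's ordering: a source that already satisfies the request never falls through to the weaker `Inexact` path.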
-### When Phase 3 does *not* fire +### When runtime reorder does *not* fire * **Aggregations on top of the sort key.** `SELECT URL, COUNT(*) AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10` (the ClickBench @@ -483,7 +496,7 @@ no-op. `[a DESC, b ASC, c DESC]`), `try_pushdown_sort` returns `Unsupported` so the surrounding `SortExec` can keep its `sort_prefix` annotation — prefix-aware early termination in - `TopK` is strictly better than the Phase 3 reorder on data that + `TopK` is strictly better than the runtime reorder on data that is already in prefix order on disk. ## Reverse Scans for `ORDER BY ... DESC` @@ -525,7 +538,7 @@ decoded rows before any batch can be emitted is roughly: return 10 — defeating the point of the `LIMIT`. The agreed direction coming out of that discussion was to ship the -narrower `Inexact` row-group-reverse first (which became Phase 1 in +narrower `Inexact` row-group-reverse first (which landed in [#19064]), and to build `Exact` reverse on a finer-grained primitive once `arrow-rs` exposed one. @@ -620,8 +633,8 @@ delta, bit-packing) are all forward streams — you cannot decode the last value without decoding every value that came before it. The design therefore is: **reverse the page traversal, forward-decode each page, reverse the resulting RecordBatch**. This is the algorithm -shape that DataFusion's Phase-2 `RecordBatchReader` integration will -use once arrow-rs ships the primitive. +shape DataFusion's `RecordBatchReader` integration will use once +arrow-rs ships the primitive. 
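The "reverse the page traversal, forward-decode each page, reverse the resulting batch" shape can be illustrated without any Parquet machinery. In the sketch below, pages are modeled as plain `Vec<i64>` values and "decoding" is a copy; `reverse_scan_with_limit` is a hypothetical name, and nothing here reflects the actual arrow-rs API.

```rust
/// Walks pages from last to first, "decodes" each page in its stored
/// (forward) order, reverses the decoded rows, and stops as soon as
/// `limit` rows have been collected. Returns the rows together with the
/// number of pages that had to be decoded.
fn reverse_scan_with_limit(pages: &[Vec<i64>], limit: usize) -> (Vec<i64>, usize) {
    let mut out = Vec::with_capacity(limit);
    let mut pages_decoded = 0;
    for page in pages.iter().rev() {
        if out.len() >= limit {
            break;
        }
        pages_decoded += 1;
        // Stand-in for a forward decode: copy the rows in stored order.
        let mut batch: Vec<i64> = page.clone();
        // Reverse the decoded batch so rows come out descending.
        batch.reverse();
        let remaining = limit - out.len();
        out.extend(batch.into_iter().take(remaining));
    }
    (out, pages_decoded)
}

fn main() {
    // Three pages of an ascending column.
    let pages = vec![vec![1, 2, 3], vec![4, 5, 6], vec![7, 8, 9]];
    let (rows, decoded) = reverse_scan_with_limit(&pages, 4);
    println!("rows = {rows:?}, pages decoded = {decoded}");
}
```

Only the pages needed to satisfy the limit are touched, which is the property that keeps the memory profile in the streaming regime.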
The killer use case is **filtered reverse TopK**: @@ -793,8 +806,9 @@ Issues and PRs: * Umbrella issue: [apache/datafusion#17348][#17348] * `MinMaxStatistics` foundation: [apache/datafusion#9593][#9593] -* Phase 1: [apache/datafusion#19064][#19064] -* Phase 2: [apache/datafusion#21182][#21182] +* `PushdownSort` rule + row-group reverse: [apache/datafusion#19064][#19064] +* Sort elimination via statistics: [apache/datafusion#21182][#21182] +* Runtime reorder for TopK convergence: [apache/datafusion#21956][#21956] * `BufferExec` capacity for sort elimination: [apache/datafusion#21426][#21426] * Dynamic / TopK-driven path: [apache/datafusion#21351][#21351] (morsel-style work scheduling), [apache/datafusion#21733][#21733] (global file reorder in shared queue) From e246b2c6fd5ff78557ae809cc70745838fc3f049 Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Sun, 17 May 2026 14:49:43 +0800 Subject: [PATCH 08/14] Clarify two-layer composability: same key (min), not 'nest by construction' The previous wording 'nest by construction' could be read as a code- enforced property. It's actually a logical consequence of using the same sort key (min) at both file and row-group level: a file's min(col) is just the minimum over its row groups' min(col) values, so the most-promising file contains the most-promising row group. The rewritten paragraph spells that out and ties it to why TopK's dynamic filter tightens fast. --- content/blog/2026-05-25-sort-pushdown.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md index 205d4807..05bf141f 100644 --- a/content/blog/2026-05-25-sort-pushdown.md +++ b/content/blog/2026-05-25-sort-pushdown.md @@ -448,11 +448,15 @@ The opener reads them at scan time to drive three composable steps: primitive [#19064] introduced for declared reverse scans — [#21956] just routes more queries through it. 
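Steps 1 and 2 of the list above (file-level reorder, then row-group reorder, both keyed on `min`) can be sketched as follows. This is a simplified model under stated assumptions: `RowGroupStats`, `FileStats`, and `plan_scan_order` are hypothetical names, statistics are reduced to one `i64` minimum per row group, and the optional reverse step for `DESC` requests is deliberately left out.

```rust
/// Hypothetical per-row-group statistics: original index plus min value.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RowGroupStats {
    index: usize,
    min: i64,
}

/// Hypothetical per-file metadata.
#[derive(Debug, Clone)]
struct FileStats {
    name: &'static str,
    row_groups: Vec<RowGroupStats>,
}

impl FileStats {
    /// A file's min is the minimum over its row groups' min values.
    fn min(&self) -> i64 {
        self.row_groups.iter().map(|rg| rg.min).min().unwrap_or(i64::MAX)
    }
}

/// Orders files by min, then row groups within each file by min, and
/// returns the resulting scan order as (file name, row group index).
fn plan_scan_order(mut files: Vec<FileStats>) -> Vec<(&'static str, usize)> {
    files.sort_by_key(|f| f.min());
    let mut order = Vec::new();
    for file in &mut files {
        file.row_groups.sort_by_key(|rg| rg.min);
        for rg in &file.row_groups {
            order.push((file.name, rg.index));
        }
    }
    order
}

fn main() {
    let files = vec![
        FileStats {
            name: "late.parquet",
            row_groups: vec![
                RowGroupStats { index: 0, min: 500 },
                RowGroupStats { index: 1, min: 300 },
            ],
        },
        FileStats {
            name: "early.parquet",
            row_groups: vec![
                RowGroupStats { index: 0, min: 120 },
                RowGroupStats { index: 1, min: 40 },
            ],
        },
    ];
    for (file, rg) in plan_scan_order(files) {
        println!("{file} row group {rg}");
    }
}
```

Because both levels use the same key, the first entry in the scan order is always the row group with the smallest `min` across all files.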
-The two layers **nest by construction**: file `i`'s `min(col)` is -a lower bound on every row group inside it, so the file queue's -order is a natural prefix of the within-file row-group order. -Choosing the same key (`min`) in both layers keeps the strategies -consistent. +The two layers compose naturally because they sort by the same +key. A file's `min(col)` is the minimum over its row groups' +`min(col)` values, so the file with the smallest `min` contains +the row group with the smallest `min`. Sorting files by `min(col)` +and then sorting row groups by `min(col)` within each file +produces an approximately min-ordered global stream — the first +batch comes from the most-promising row group in the +most-promising file, exactly what `TopK`'s dynamic filter needs +to tighten its threshold fast. `reverse_row_groups`'s meaning depends on which way `Inexact` was reached. When the column-in-schema condition fires, the stats From 6c6d58434fd9bedb202afedc9e3a2db2a17e6642 Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Sun, 17 May 2026 15:11:02 +0800 Subject: [PATCH 09/14] Strip inline PR refs from narrative; collect all links in References Match the dynamic-filter blog's style: narrative talks about capabilities/mechanisms, not 'PR #21956 did X / PR #19064 introduced Y'. The 81 inline PR/issue references in the body were dropping the reader out of the narrative; they belong in a single Issues-and-PRs list at the end. 
Changes: - TL;DR: drop 6 inline PR refs from the 4 capability bullets - Body sections (PushdownSort Rule, Sort Elimination, Runtime Reorder, Reverse Scans, Empirical Note): drop ~30 inline refs to historical PRs; replace with capability names or 'the rule' / 'the runtime reorder path' style descriptions - What's Next: switch from [#NNNNN] format to named markdown links (matching dynamic-filter's Future Work style) - Issues for new contributors: same conversion - References section: rewrite using full URLs (no link-ref indirection); split into 'Landed' vs 'In flight / open' for clarity Net: ~90 lines removed, all PR/issue numbers now consolidated at the bottom of the post. --- content/blog/2026-05-25-sort-pushdown.md | 435 +++++++++++------------ 1 file changed, 214 insertions(+), 221 deletions(-) diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md index 05bf141f..8e12170c 100644 --- a/content/blog/2026-05-25-sort-pushdown.md +++ b/content/blog/2026-05-25-sort-pushdown.md @@ -61,32 +61,31 @@ further still. first** when they aren't — so `TopK` converges fast and the rest gets pruned by statistics. * What's supported today: - * **The `PushdownSort` rule** ([#19064]) — a physical optimizer - rule that asks each `ExecutionPlan` "can you produce output in - *this* ordering?" and uses the - `Exact` / `Inexact` / `Unsupported` answer to decide whether to - delete the surrounding `SortExec`, leave it in place with a - hint, or give up. - * **Sort elimination via statistics** ([#21182]) — `PushdownSort` - sorts files within each partition by Parquet `min/max` - statistics and, when the resulting ranges are provably - non-overlapping, upgrades the source's ordering claim from - `Unsupported` to `Exact` and **removes the `SortExec`** that - `EnforceSorting` inserted earlier. 
- * **Runtime reorder for `TopK` convergence** ([#21956]) — whenever - the leading sort key is a plain column in the file schema (or - the source's reversed declared ordering satisfies the request), + * **The `PushdownSort` rule** — a physical optimizer rule that + asks each `ExecutionPlan` "can you produce output in *this* + ordering?" and uses the `Exact` / `Inexact` / `Unsupported` + answer to decide whether to delete the surrounding `SortExec`, + leave it in place with a hint, or give up. + * **Sort elimination via statistics** — `PushdownSort` sorts + files within each partition by Parquet `min/max` statistics + and, when the resulting ranges are provably non-overlapping, + upgrades the source's ordering claim from `Unsupported` to + `Exact` and **removes the `SortExec`** that `EnforceSorting` + inserted earlier. + * **Runtime reorder for `TopK` convergence** — whenever the + leading sort key is a plain column in the file schema (or the + source's reversed declared ordering satisfies the request), `try_pushdown_sort` stamps two flags on the source and the opener runs a three-step runtime pipeline — file-level reorder in the shared morsel queue, row-group reorder by min/max stats, then optional iteration reverse for `DESC` requests. `SortExec` stays, but `TopK`'s dynamic filter tightens fast on the most-promising data and the rest is pruned. - * **Reverse scans for `ORDER BY ... DESC`** ([#19446], [#19557]) — - a row-group-level reverse returns `Inexact` (Sort stays, but - `TopK` terminates early). The page-level reverse primitive - needed for `Exact` reverse — and so for full `SortExec` removal - on `DESC` queries — is in flight in arrow-rs ([#9937]). + * **Reverse scans for `ORDER BY ... DESC`** — a row-group-level + reverse returns `Inexact` (Sort stays, but `TopK` terminates + early). The page-level reverse primitive needed for `Exact` + reverse — and so for full `SortExec` removal on `DESC` queries + — is in flight in arrow-rs. 
* Real-world benchmarks on the `sort_pushdown` suite (`Exact` path): `ORDER BY ... LIMIT` queries get **27× and 49× faster**; full `ORDER BY` scans get **~2×** faster. @@ -179,10 +178,8 @@ producing the order requested. ## The `PushdownSort` Rule -[#19064] introduced the **`PushdownSort`** physical optimizer rule -and a uniform API for asking each `ExecutionPlan` two questions: - -[#19064]: https://github.com/apache/datafusion/pull/19064 +The **`PushdownSort`** physical optimizer rule defines a uniform +API for asking each `ExecutionPlan` two questions: 1. "Can you produce output in *this* ordering?" 2. "If yes, please rearrange yourself so that it actually does." @@ -196,34 +193,24 @@ it returns `Inexact` and flips on `reverse_row_groups=true` so the scan reads row groups from last to first (the row-group-level reverse covered later in this post); otherwise it returns `Unsupported`. -The initial PR's scope was deliberately narrow. It set up the API -and delivered the reverse-scan case end-to-end, but did **not** add -any statistics-based file rearrangement — that came later via -[#21182], covered in +The rule's initial scope was deliberately narrow. It set up the +API and delivered the reverse-scan case end-to-end, but did **not** +add any statistics-based file rearrangement — that came later, +covered in [Sort Elimination via Statistics](#sort-elimination-via-statistics) below. A finer-grained extension that broadens this `Inexact` path -with a three-step runtime reorder pipeline landed in [#21956] — -covered in +with a three-step runtime reorder pipeline is covered in [Runtime Reorder for TopK Convergence](#runtime-reorder-for-topk-convergence). -[#19064] also produced a useful side improvement: - -* **Reverse-output redesign** ([#19446], [#19557]) extended the same - rule to `DESC` queries — picked up again in the reverse-scan - section below. 
- -[#19446]: https://github.com/apache/datafusion/pull/19446 -[#19557]: https://github.com/apache/datafusion/pull/19557 +The same rule also handles **reverse-output** for `DESC` queries — +picked up again in the reverse-scan section below. ## Sort Elimination via Statistics Sort elimination: rearranging files within a partition by min/max statistics so the file list is in range order The initial `Inexact`-only path left a sharp edge that motivated -stats-based sort elimination ([#21182]). Consider this realistic -scenario: - -[#21182]: https://github.com/apache/datafusion/pull/21182 +stats-based sort elimination. Consider this realistic scenario: * Three files: `a.parquet`, `b.parquet`, `c.parquet`. * Each declares `WITH ORDER (ts ASC)`. @@ -239,21 +226,21 @@ the scan now has no declared ordering, so `EnforceSorting` (which runs earlier in the pipeline) inserts a `SortExec`. The data is sorted on disk; the optimizer just can't tell. -[#21182] fixes this in `PushdownSort`, which runs late — after -`EnforceDistribution` and `EnforceSorting` have already shaped the -plan. When `PushdownSort` finds a `SortExec` above a file scan whose -ordering was stripped (a `FileSource` `Unsupported` result), it does -three things inside `FileScanConfig::try_pushdown_sort`: +Stats-based sort elimination fixes this in `PushdownSort`, which +runs late — after `EnforceDistribution` and `EnforceSorting` have +already shaped the plan. When `PushdownSort` finds a `SortExec` +above a file scan whose ordering was stripped (a `FileSource` +`Unsupported` result), it does three things inside +`FileScanConfig::try_pushdown_sort`: 1. **Sort the file list by per-file statistics on the sort column(s)** within each file group (the diagram above). The - pre-existing [`MinMaxStatistics`] helper (introduced in [#9593]) - reads each file's `column_statistics[c].min_value` / - `.max_value` for each sort column `c`, then sorts the file list by - the min row. 
The PR wires this helper into the optimizer's - `Unsupported` branch — `sort_files_within_groups_by_statistics` - does the per-group orchestration and decides whether any group is - non-overlapping after the sort. + pre-existing [`MinMaxStatistics`] helper reads each file's + `column_statistics[c].min_value` / `.max_value` for each sort + column `c`, then sorts the file list by the min row. + `sort_files_within_groups_by_statistics` does the per-group + orchestration and decides whether any group is non-overlapping + after the sort. 2. **Check adjacency within each group**: walk each sorted file group independently and ask whether `file[i].max ≤ file[i+1].min` for every adjacent pair (touching at the boundary is fine — value `v` @@ -269,7 +256,6 @@ three things inside `FileScanConfig::try_pushdown_sort`: itself and the plan becomes streamable. [`MinMaxStatistics`]: https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/statistics.rs -[#9593]: https://github.com/apache/datafusion/pull/9593 One caveat that comes straight from `MinMaxStatistics`: the stats sort only fires when every `ORDER BY` expression is a plain column. @@ -278,11 +264,11 @@ is no per-file min/max for the function output to compare against. Extending sort pushdown across monotonic function wrappers is one of the open follow-ups. 
-*(Within #21956's `Inexact` path, `EquivalenceProperties`'s -monotonicity inference does let function-wrapped sorts benefit from -row-group iteration reverse when the source declares a compatible -natural ordering — but stats-based reorder still needs a plain -column.)* +*(The runtime reorder path covered later does let function-wrapped +sorts benefit from row-group iteration reverse via +`EquivalenceProperties`'s monotonicity inference, when the source +declares a compatible natural ordering — but stats-based sort +elimination still needs a plain column.)* Detecting non-overlapping ranges via min/max statistics @@ -305,7 +291,7 @@ The implementation handles a few edge cases worth calling out: rule compensates by inserting a [`BufferExec`] in the `SortExec`'s place — bounded streaming buffer, same throughput shape, no blocking sort. Capacity is configurable via - [`sort_pushdown_buffer_capacity`] ([#21426]). + [`sort_pushdown_buffer_capacity`]. * **`fetch` preservation** through `EnforceDistribution`. The distribution rule sometimes strips a `SortExec`'s `fetch` field and re-adds the node later. The PR plumbs `fetch` through so a @@ -330,7 +316,6 @@ The implementation handles a few edge cases worth calling out: [`BufferExec`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/buffer.rs [`sort_pushdown_buffer_capacity`]: https://github.com/apache/datafusion/pull/21426 -[#21426]: https://github.com/apache/datafusion/pull/21426 ## Benchmarks @@ -340,7 +325,7 @@ The [`sort_pushdown`] benchmark suite reproduces the "wrong-order file list" scenario by generating Parquet files whose names are intentionally reversed against their sort-key ranges. 
Numbers below are `--partitions 1`, release build, with stats-based sort -elimination ([#21182]) enabled, versus `main`: +elimination enabled, versus `main`: [`sort_pushdown`]: https://github.com/apache/datafusion/tree/main/benchmarks/queries/sort_pushdown @@ -393,16 +378,16 @@ fall outside that window: For both, a full external `SortExec` is overkill. The parquet metadata is right there, and reading the *most-promising* data first lets `TopK`'s dynamic filter threshold tighten quickly so the -rest gets pruned. [#21956] wires that up by generalising the -`Inexact` path that [#19064] introduced. +rest gets pruned. Runtime reorder wires that up by generalising +the `Inexact` path the rule introduced. ### `try_pushdown_sort` — one decision, three outcomes try_pushdown_sort decision tree: Exact, Inexact, or Unsupported -The `Exact` / `Inexact` / `Unsupported` protocol from [#19064] -stays. The new PR broadens the **conditions** that route a query -into `Inexact`: +The `Exact` / `Inexact` / `Unsupported` protocol stays. The +runtime reorder path broadens the **conditions** that route a +query into `Inexact`: | Condition | Outcome | | --- | --- | @@ -435,9 +420,10 @@ struct ParquetSource { The opener reads them at scan time to drive three composable steps: 1. **File-level reorder.** `FileSource::reorder_files` sits in the - shared morsel queue (the [#21351] work-stealing primitive) and - sorts the partitioned-file list by `min(col)`. The first file - picked across all partitions is globally the most-promising one. + shared morsel queue (a work-stealing primitive that lets sibling + partitions share a single file pool) and sorts the + partitioned-file list by `min(col)`. The first file picked across + all partitions is globally the most-promising one. 2. **Row-group-level reorder.** Once a file is opened, `PreparedAccessPlan::reorder_by_statistics` sorts that file's `row_group_indexes` by `min(col)` ASC. 
The row group most likely @@ -445,8 +431,9 @@ The opener reads them at scan time to drive three composable steps: 3. **Reverse.** For `DESC` requests, `PreparedAccessPlan::reverse` flips the iteration after the stats reorder normalises everything to ASC-by-min. Same - primitive [#19064] introduced for declared reverse scans — - [#21956] just routes more queries through it. + primitive the rule originally introduced for declared reverse + scans — the runtime pipeline just routes more queries through + it. The two layers compose naturally because they sort by the same key. A file's `min(col)` is the minimum over its row groups' @@ -490,10 +477,10 @@ no-op. the scan there'd be nothing actionable to do with it. * **Multi-column sort secondary keys.** The reorder currently only uses the leading sort expression — secondary keys are ignored. - Tracked as a follow-up in [#22198]. + An open follow-up. * **Function-wrapped sort without a source-declared ordering.** Without a declared ordering to invert, the reversed-equivalence - branch has nothing to satisfy. Tracked in the same follow-up. + branch has nothing to satisfy. Same follow-up. * **Source declares a forward prefix of the request.** When the source's declared `output_ordering` is a non-empty proper prefix of the request (e.g. source `[a DESC, b ASC]`, request @@ -511,26 +498,25 @@ no-op. ascending and the query wants descending, we should be able to skip the sort — we just need to read the data in the opposite order. -The first iteration of this lives in [#18817] and operates at the -**row group** level: it reverses the *iteration order of row groups* -so the last RG is opened first, but rows within each RG are still -decoded forward. The resulting stream is "RGs descending × rows -ascending" — close to the requested order, but not strictly DESC. 
The -optimizer therefore reports this as `Inexact` and leaves the -`SortExec` in place; the win is that `TopK`'s dynamic filter tightens -much faster, because the very first row groups read already contain -values near the final answer. A tight threshold means subsequent row -groups can be skipped via min/max statistics. This ships today and -is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted files. - -[#18817]: https://github.com/apache/datafusion/pull/18817 +The first iteration of this operates at the **row group** level: +it reverses the *iteration order of row groups* so the last RG is +opened first, but rows within each RG are still decoded forward. +The resulting stream is "RGs descending × rows ascending" — close +to the requested order, but not strictly DESC. The optimizer +therefore reports this as `Inexact` and leaves the `SortExec` in +place; the win is that `TopK`'s dynamic filter tightens much +faster, because the very first row groups read already contain +values near the final answer. A tight threshold means subsequent +row groups can be skipped via min/max statistics. This ships today +and is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted +files. To turn this into `Exact` reverse — so the `SortExec` can be removed outright — each emitted batch itself has to be in DESC order. The straightforward row-group-level approach (decode an entire RG forward, materialize all rows, reverse the buffer, then emit) is correct and was actually proposed first, in an earlier iteration of this work -([#18817], later closed and split into smaller pieces). Review +that was later closed and split into smaller pieces. Review feedback there — primarily from [@2010YOUY01] — flagged the memory profile as too aggressive: caching an entire row group's worth of decoded rows before any batch can be emitted is roughly: @@ -542,9 +528,8 @@ decoded rows before any batch can be emitted is roughly: return 10 — defeating the point of the `LIMIT`. 
The agreed direction coming out of that discussion was to ship the -narrower `Inexact` row-group-reverse first (which landed in -[#19064]), and to build `Exact` reverse on a finer-grained primitive -once `arrow-rs` exposed one. +narrower `Inexact` row-group-reverse first, and to build `Exact` +reverse on a finer-grained primitive once `arrow-rs` exposed one. ### Empirical note — runtime cost of `Inexact` + `TopK` @@ -596,20 +581,20 @@ Why didn't we just upstream the internal `Exact` reverse, then? group, so any RG-level `Exact` implementation — ours included — has to decode the entire row group, reverse the buffer in memory, and only then emit. That is the same memory profile that -`#18817` was rejected for: a peak of one whole row group -(~128 MB) of decoded data, vs. the few-MB-per-batch streaming -profile readers normally have. Our runtime advantage over -`Inexact` + `TopK` does *not* come from decoding less — both -paths decode the relevant row group's sort column in full — it -comes from skipping the per-row heap maintenance, the dynamic +got the earlier RG-level proposal rejected: a peak of one whole +row group (~128 MB) of decoded data, vs. the few-MB-per-batch +streaming profile readers normally have. Our runtime advantage +over `Inexact` + `TopK` does *not* come from decoding less — +both paths decode the relevant row group's sort column in full — +it comes from skipping the per-row heap maintenance, the dynamic filter evaluation, and the `SortExec` final ordering pass that `Inexact` keeps on top. So we end up running our `Exact` reverse -in-house but cannot land it as the upstream default for the same -memory reason that closed `#18817`. +in-house but cannot land it as the upstream default, for the +same memory reason that closed the earlier proposal. **The fix that keeps both the runtime win and a streaming memory -profile is page-level `Exact` reverse via arrow-rs [#9937]**, -described next. 
+profile is page-level `Exact` reverse via arrow-rs**, described +next. That primitive is the **page-level** reverse traversal. Parquet's `OffsetIndex` already gives us byte-precise locations for every data @@ -618,19 +603,16 @@ decode it forward, reverse the resulting batch, and emit. Peak buffer drops to one page (~1 MB) and first-batch latency drops to the cost of one page decode — the row-group-level memory cliff disappears. -We are landing this primitive upstream in arrow-rs as -[#9937], with the discussion in [#9934]. Early numbers on a 100k-row, -98-page column chunk show **~50× faster time-to-first-N** for `n ≤ 1 -page` and **~9× faster** for `n` spanning 10 pages, compared with the -row-group-level Exact reverse described above. The DataFusion-side -integration that turns this primitive into an `Exact` result is a -follow-up to #9937 and is gated on its merge. +We are landing this primitive upstream in arrow-rs. Early numbers +on a 100k-row, 98-page column chunk show **~50× faster +time-to-first-N** for `n ≤ 1 page` and **~9× faster** for `n` +spanning 10 pages, compared with the row-group-level Exact reverse +described above. The DataFusion-side integration that turns this +primitive into an `Exact` result is a follow-up and is gated on +the arrow-rs merge. [@2010YOUY01]: https://github.com/2010YOUY01 -[#9937]: https://github.com/apache/arrow-rs/pull/9937 -[#9934]: https://github.com/apache/arrow-rs/issues/9934 - One natural question: why not reverse the rows *within* a page directly? Because we can't. Parquet's page encodings (RLE, dictionary, delta, bit-packing) are all forward streams — you cannot decode the @@ -666,39 +648,42 @@ e.g. when file ranges genuinely overlap, or when the sort is on a function output rather than a plain column. The two directions are not alternatives; they compose: -* **`Exact` reverse for `ORDER BY ... DESC`.** Today's row-group +* [`Exact` reverse for `ORDER BY ... DESC`]. 
Today's row-group reverse returns `Inexact` and the `SortExec` stays on top; the - arrow-rs page-level reverse primitive ([#9937]) is what unlocks - `Exact` reverse on `DESC` queries (and therefore full `SortExec` - elimination on `DESC`). Memory + first-batch latency rule out doing - the same thing at the row-group level. Gated on #9937. -* **Dynamic / TopK-driven path.** When `Exact` cannot fire, `TopK`'s - [dynamic filter][dyn-filters-blog] still benefits enormously from - reading the *best* data first. This thread also builds on the - [limit pruning][limit-pruning-blog] work that turned `LIMIT` into - an I/O optimization across the pruning pipeline. The - recently-merged morsel-style work scheduling in `FileStream` - ([#21351]) gives sibling partitions a *shared work queue* with + arrow-rs page-level reverse primitive is what unlocks `Exact` + reverse on `DESC` queries (and therefore full `SortExec` + elimination on `DESC`). Memory + first-batch latency rule out + doing the same thing at the row-group level. Gated on the + arrow-rs side. +* **Dynamic / TopK-driven path.** When `Exact` cannot fire, + `TopK`'s [dynamic filter][dyn-filters-blog] still benefits + enormously from reading the *best* data first. This thread also + builds on the [limit pruning][limit-pruning-blog] work that + turned `LIMIT` into an I/O optimization across the pruning + pipeline. The recently-merged [morsel-style work scheduling] in + `FileStream` gives sibling partitions a *shared work queue* with file-level work-stealing — no CPU sits idle when one partition - runs out of files. The proposed [#21733] sorts files in - that shared queue by per-file statistics *before* any partition - picks, so the first file read is globally optimal and tightens the - dynamic filter immediately. 
Combined with **TopK threshold init from - parquet statistics** ([#21712]) and **`try_pushdown_sort` driving - runtime row-group / file reorder + reverse** ([#21956], landed), - the threshold can be set before reading a single byte. The reorder + runs out of files. The proposed + [global file reorder in the shared queue] sorts files in that + shared queue by per-file statistics *before* any partition + picks, so the first file read is globally optimal and tightens + the dynamic filter immediately. Combined with + [TopK threshold init from parquet statistics] and the runtime + row-group / file reorder + reverse path described above, the + threshold can be set before reading a single byte. The reorder mechanism applies to any `ORDER BY [LIMIT N]` on - parquet, not just TopK queries with a dynamic filter. The combined statistics-driven `TopK` pipeline is in flight - as [#21580]. + parquet, not just TopK queries with a dynamic filter. The + [combined statistics-driven `TopK` pipeline] is in flight. The mechanism here is **RG-level pruning, not mid-stream early return**. With the threshold known up front, the parquet - `PruningPredicate` rejects entire row groups against their min/max - statistics before any I/O — those row groups are never decoded. - The row group(s) the reader *does* open still have their sort - column decoded in full to feed the dynamic filter. On the #21580 - microbenchmark (single file, 61 sorted row groups, `--partitions 1`), - **60 of the 61 row groups are skipped** and only one is decoded: + `PruningPredicate` rejects entire row groups against their + min/max statistics before any I/O — those row groups are never + decoded. The row group(s) the reader *does* open still have + their sort column decoded in full to feed the dynamic filter. 
+ On the in-flight microbenchmark (single file, 61 sorted row + groups, `--partitions 1`), **60 of the 61 row groups are + skipped** and only one is decoded: | Query | Baseline | With pipeline | Speedup | | ------------------------------ | -------: | ------------: | ------: | @@ -709,81 +694,81 @@ not alternatives; they compose: The stack reports `Inexact` — the `SortExec` stays on top to enforce correctness across overlapping ranges — so this path - cannot do *true* mid-stream early return. Once the parquet reader - opens a row group, the sort column has to be decoded all the way - through; once a `FileStream` picks up a file from the shared work - queue, it has to finish that file. Today's dynamic work scheduling - ([#21351]) is **file-granular**: idle partitions stop pulling - new files from the queue once a global limit is satisfied, but - the partition that's currently inside a file decodes that file's - remaining row groups regardless. Mid-file RG-level early return - on `TopK` convergence is **not implemented yet** — the work - queue holds `PartitionedFile`, not row-group descriptors. - - Closing the tap the moment `TopK` has K confirmed winners therefore - needs either: - - * the **`Exact` path**, where the `SortExec` is gone entirely and - the data source's own `fetch` becomes a static limit that the - reader can honour at batch granularity; or + cannot do *true* mid-stream early return. Once the parquet + reader opens a row group, the sort column has to be decoded all + the way through; once a `FileStream` picks up a file from the + shared work queue, it has to finish that file. Today's dynamic + work scheduling is **file-granular**: idle partitions stop + pulling new files from the queue once a global limit is + satisfied, but the partition that's currently inside a file + decodes that file's remaining row groups regardless. 
Mid-file + RG-level early return on `TopK` convergence is **not implemented + yet** — the work queue holds `PartitionedFile`, not row-group + descriptors. + + Closing the tap the moment `TopK` has K confirmed winners + therefore needs either: + + * the **`Exact` path**, where the `SortExec` is gone entirely + and the data source's own `fetch` becomes a static limit that + the reader can honour at batch granularity; or * **finer-grained dynamic scheduling** — having the shared queue - hold row-group descriptors instead of whole files, so a partition - can release its current file's remaining row groups back to the - pool once a global signal says enough TopK winners have been - found. This is a natural extension of [#21351] and [#21733] but - is not yet on a PR. - - The three mechanisms compose. Stats pruning saves the row groups - that *can't* matter (skipped without I/O). The dynamic filter - narrows what's decoded inside the row groups the reader does - open. `Exact` or finer-grained scheduling is what eventually - closes the tap once `TopK` is satisfied. -* **Filtered reverse TopK end-to-end.** `WHERE filter ORDER BY ts - DESC LIMIT N` is the dominant observability query shape and the - one where the arrow-rs page-reverse primitive matters most: - `RowSelection::with_limit` cannot pre-compute the last `N` matching - rows when the filter is selective, so the only correct strategy is - to stream pages backward, evaluate the filter, and stop when `N` - matches are collected. The DataFusion-side integration is the - follow-up to #9937. -* **Unifying `EnforceDistribution` and `EnforceSorting`** into a - single `EnsureRequirements` rule ([#21976]). The two existing rules - are coupled through `SortExec.preserve_partitioning`, which makes + hold row-group descriptors instead of whole files, so a + partition can release its current file's remaining row groups + back to the pool once a global signal says enough TopK + winners have been found. 
A natural extension of the existing + morsel work but not yet on a PR. + + The three mechanisms compose. Stats pruning saves the row + groups that *can't* matter (skipped without I/O). The dynamic + filter narrows what's decoded inside the row groups the reader + does open. `Exact` or finer-grained scheduling is what + eventually closes the tap once `TopK` is satisfied. +* **Filtered reverse `TopK` end-to-end.** `WHERE filter ORDER BY + ts DESC LIMIT N` is the dominant observability query shape and + the one where the arrow-rs page-reverse primitive matters most: + `RowSelection::with_limit` cannot pre-compute the last `N` + matching rows when the filter is selective, so the only correct + strategy is to stream pages backward, evaluate the filter, and + stop when `N` matches are collected. The DataFusion-side + integration is a follow-up to the arrow-rs primitive. +* [Unifying `EnforceDistribution` and `EnforceSorting`] into a + single `EnsureRequirements` rule. The two existing rules are + coupled through `SortExec.preserve_partitioning`, which makes their composition non-idempotent and has caused a class of production bugs. Other engines (Spark's `EnsureRequirements`, - Trino's `AddExchanges`) handle both in a single rule. Merging them - also gives future sort-related optimizations a single coherent place - to live. In progress. -* **OFFSET pushdown to parquet** ([#21828]) so `ORDER BY ts LIMIT K - OFFSET N` queries can skip the first `N` rows at the row-group level + Trino's `AddExchanges`) handle both in a single rule. Merging + them also gives future sort-related optimizations a single + coherent place to live. In progress. +* [OFFSET pushdown to parquet] so `ORDER BY ts LIMIT K OFFSET N` + queries can skip the first `N` rows at the row-group level instead of decoding and discarding them. In progress. -* **Multi-column and function-wrapped reorder follow-ups** ([#22198]). 
- The reorder mechanism in #21956 currently only uses the leading - sort key and only fires on plain columns. Lexicographic multi-key - reorder via `arrow::compute::lexsort_to_indices` is low-hanging - fruit; extending to monotonic function wrappers via leaf-column - extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs a - bit more `EquivalenceProperties` integration but is doable. - -[#21976]: https://github.com/apache/datafusion/pull/21976 -[#21956]: https://github.com/apache/datafusion/pull/21956 -[#22198]: https://github.com/apache/datafusion/issues/22198 -[#21712]: https://github.com/apache/datafusion/pull/21712 -[#21580]: https://github.com/apache/datafusion/pull/21580 -[#21828]: https://github.com/apache/datafusion/pull/21828 -[#21351]: https://github.com/apache/datafusion/pull/21351 -[#21733]: https://github.com/apache/datafusion/issues/21733 +* [Multi-column and function-wrapped reorder follow-ups]. The + reorder mechanism currently only uses the leading sort key and + only fires on plain columns. Lexicographic multi-key reorder + via `arrow::compute::lexsort_to_indices` is low-hanging fruit; + extending to monotonic function wrappers via leaf-column + extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs + a bit more `EquivalenceProperties` integration but is doable. + +[`Exact` reverse for `ORDER BY ... 
DESC`]: https://github.com/apache/arrow-rs/pull/9937 +[morsel-style work scheduling]: https://github.com/apache/datafusion/pull/21351 +[global file reorder in the shared queue]: https://github.com/apache/datafusion/issues/21733 +[TopK threshold init from parquet statistics]: https://github.com/apache/datafusion/pull/21712 +[combined statistics-driven `TopK` pipeline]: https://github.com/apache/datafusion/pull/21580 +[Unifying `EnforceDistribution` and `EnforceSorting`]: https://github.com/apache/datafusion/pull/21976 +[OFFSET pushdown to parquet]: https://github.com/apache/datafusion/pull/21828 +[Multi-column and function-wrapped reorder follow-ups]: https://github.com/apache/datafusion/issues/22198 Concretely useful issues for new contributors: -* [#17348] — the umbrella issue for sort pushdown. -* [#21317] — sort pushdown: reorder row groups by statistics within - each file. -* [#19394] — add more `ExecutionPlan` impls to support sort pushdown. +* [Umbrella issue for sort pushdown][umbrella-issue]. +* [Reorder row groups by statistics within each file][rg-reorder-issue]. +* [Add more `ExecutionPlan` impls to support sort pushdown][more-impls-issue]. -[#17348]: https://github.com/apache/datafusion/issues/17348 -[#21317]: https://github.com/apache/datafusion/issues/21317 -[#19394]: https://github.com/apache/datafusion/issues/19394 +[umbrella-issue]: https://github.com/apache/datafusion/issues/17348 +[rg-reorder-issue]: https://github.com/apache/datafusion/issues/21317 +[more-impls-issue]: https://github.com/apache/datafusion/issues/19394 ## Acknowledgements @@ -806,18 +791,26 @@ Prior posts this work builds on: * [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries][dyn-filters-blog] — the dynamic filter primitive `TopK` uses. * [Turning LIMIT into an I/O Optimization: Inside DataFusion's Multi-Layer Pruning Stack][limit-pruning-blog] — the pruning pipeline this work plugs into. 
-Issues and PRs: - -* Umbrella issue: [apache/datafusion#17348][#17348] -* `MinMaxStatistics` foundation: [apache/datafusion#9593][#9593] -* `PushdownSort` rule + row-group reverse: [apache/datafusion#19064][#19064] -* Sort elimination via statistics: [apache/datafusion#21182][#21182] -* Runtime reorder for TopK convergence: [apache/datafusion#21956][#21956] -* `BufferExec` capacity for sort elimination: [apache/datafusion#21426][#21426] -* Dynamic / TopK-driven path: [apache/datafusion#21351][#21351] (morsel-style work scheduling), - [apache/datafusion#21733][#21733] (global file reorder in shared queue) -* Benchmark suite: [`sort_pushdown`] -* Row-group reverse scan: [apache/datafusion#18817][#18817] -* Page-level reverse (arrow-rs): [apache/arrow-rs#9934][#9934], - [apache/arrow-rs#9937][#9937] -* `EnsureRequirements`: [apache/datafusion#21976][#21976] +Landed PRs that make up this work: + +* `MinMaxStatistics` foundation: [apache/datafusion#9593](https://github.com/apache/datafusion/pull/9593) +* `PushdownSort` rule + row-group reverse: [apache/datafusion#19064](https://github.com/apache/datafusion/pull/19064) +* Reverse-output redesign: [apache/datafusion#19446](https://github.com/apache/datafusion/pull/19446), [apache/datafusion#19557](https://github.com/apache/datafusion/pull/19557) +* Sort elimination via statistics: [apache/datafusion#21182](https://github.com/apache/datafusion/pull/21182) +* `BufferExec` capacity for sort elimination: [apache/datafusion#21426](https://github.com/apache/datafusion/pull/21426) +* Morsel-style work scheduling: [apache/datafusion#21351](https://github.com/apache/datafusion/pull/21351) +* Runtime reorder for `TopK` convergence: [apache/datafusion#21956](https://github.com/apache/datafusion/pull/21956) +* Row-group-level `Inexact` reverse: [apache/datafusion#18817](https://github.com/apache/datafusion/pull/18817) + +In flight / open: + +* Page-level reverse (arrow-rs): 
[apache/arrow-rs#9937](https://github.com/apache/arrow-rs/pull/9937), discussion in [apache/arrow-rs#9934](https://github.com/apache/arrow-rs/issues/9934) +* `EnsureRequirements`: [apache/datafusion#21976](https://github.com/apache/datafusion/pull/21976) +* OFFSET pushdown to parquet: [apache/datafusion#21828](https://github.com/apache/datafusion/pull/21828) +* TopK threshold init from parquet statistics: [apache/datafusion#21712](https://github.com/apache/datafusion/pull/21712) +* Combined statistics-driven `TopK` pipeline: [apache/datafusion#21580](https://github.com/apache/datafusion/pull/21580) +* Global file reorder in shared queue: [apache/datafusion#21733](https://github.com/apache/datafusion/issues/21733) +* Multi-column / function-wrapped reorder follow-ups: [apache/datafusion#22198](https://github.com/apache/datafusion/issues/22198) +* Umbrella issue for sort pushdown: [apache/datafusion#17348](https://github.com/apache/datafusion/issues/17348) + +Benchmark suite: [`sort_pushdown`] From 3d069008cfeffcaf103cf39808dde518e1ee6a18 Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Sun, 17 May 2026 15:24:39 +0800 Subject: [PATCH 10/14] Restructure: split into merged-features / bottlenecks / roadmap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous draft mixed merged work, in-flight work, and runtime-cost analysis into a single 'Reverse Scans' section and a sprawling 'What's Next' section. Reorganize so the post answers three clear questions in sequence: 1. What's merged today? (Sort Elimination via Statistics + benchmark, Runtime Reorder for TopK Convergence, Reverse Scans for DESC) — unchanged content, just kept tight. 2. Where do those merged features still leave performance on the table? New 'Current Bottlenecks' section with three explicitly numbered bottlenecks: SortExec stays / sort column fully decoded inside open RG / file-granular scheduling can't close the tap mid-file. 
Pulls in the runtime-cost content that used to be buried in an 'Empirical note' subsection. 3. How does each next-step optimization remove a specific bottleneck? New 'Roadmap' section maps page-level Exact reverse to bottlenecks 1+2, row-group-level dynamic early termination to bottleneck 3, and shows the in-flight 17x-60x pipeline benchmark as a preview of what stacking these mechanisms can deliver. Smaller follow-ups (EnsureRequirements, OFFSET pushdown, multi-column reorder) collected at the end of the roadmap section as a short 'Other follow-ups' bullet list. --- content/blog/2026-05-25-sort-pushdown.md | 365 ++++++++++------------- 1 file changed, 152 insertions(+), 213 deletions(-) diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md index 8e12170c..7306d21a 100644 --- a/content/blog/2026-05-25-sort-pushdown.md +++ b/content/blog/2026-05-25-sort-pushdown.md @@ -511,118 +511,107 @@ row groups can be skipped via min/max statistics. This ships today and is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted files. -To turn this into `Exact` reverse — so the `SortExec` can be removed -outright — each emitted batch itself has to be in DESC order. The -straightforward row-group-level approach (decode an entire RG forward, -materialize all rows, reverse the buffer, then emit) is correct and -was actually proposed first, in an earlier iteration of this work -that was later closed and split into smaller pieces. Review -feedback there — primarily from [@2010YOUY01] — flagged the memory -profile as too aggressive: caching an entire row group's worth of -decoded rows before any batch can be emitted is roughly: - -* **Peak buffer of one whole row group** (~128 MB by default), versus - the few-MB-per-batch streaming profile readers normally have. -* **First-batch latency = full last-row-group decode**. For - `ORDER BY ts DESC LIMIT 10` that means decoding ~1 million rows to - return 10 — defeating the point of the `LIMIT`. 
- -The agreed direction coming out of that discussion was to ship the -narrower `Inexact` row-group-reverse first, and to build `Exact` -reverse on a finer-grained primitive once `arrow-rs` exposed one. - -### Empirical note — runtime cost of `Inexact` + `TopK` - -We run an internal row-group-level `Exact` reverse implementation in -production and tested swapping in upstream's `Inexact` row-group -reverse + `TopK` on `ORDER BY ts DESC LIMIT N` queries. End-to-end -latency went **up**, not down. A few cost components stack up on the -`Inexact` + `TopK` side: - -* **`LIMIT N` does not propagate as a static stop signal to the - source.** In the `Inexact` path the `SortExec` stays on top and - `TopK`'s fetch belongs to `SortExec`, not to the parquet scan. The - only mechanism that can cut work below the `SortExec` is the - dynamic-filter pushdown: as the heap fills, the filter (`ts > - threshold`) is pushed to the source and its threshold tightens - with every batch. That filter is enough to **stats-prune - subsequent, not-yet-opened row groups** entirely — if a row - group's `max(ts) < threshold` it is skipped without decode. But - inside the row group the source is currently reading, the - filter pushdown does not unwind to "stop": the sort column has - to be **fully decoded** so the filter can be evaluated row by - row, the surviving rows feed the heap to tighten the threshold, - and only then can the resulting `RowSelection` skip the *other* - columns for rows that didn't pass. For - `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that is still - ~1M sort-column decodes regardless of `N`; the LIMIT only saves - work on non-sort columns inside the same row group and on whole - *subsequent* row groups that the tightened threshold can prune. - The internal RG-level `Exact` reverse path, by contrast, deletes - the `SortExec` so the LIMIT becomes a static fetch on the source. 
- The source still has to decode the target row group in full — - parquet does not allow partial row-group reads, so this part is - the same as `Inexact` — but it then reverses the buffer in - memory, takes the first K rows, and **stops**. No subsequent row - group is opened, no stats check, no filter machinery, no per-row - heap maintenance, no `SortExec` final ordering pass. The wins - come from removing those per-row and per-RG overheads on top, not - from decoding less sort-column data on the target row group. -* **`SortExec` itself adds ordering work on top of `Inexact`.** The - reversed-RG stream is not strictly DESC (rows within each RG are - still forward), so `Inexact` keeps the surrounding `SortExec`. - Even when the heap is settled and the dynamic filter has - pruned the tail, the outer operator does its own final ordering - pass — overhead that `Exact` (which deletes the `SortExec`) - does not pay. - -Why didn't we just upstream the internal `Exact` reverse, then? -**Memory.** Parquet does not allow reading only part of a row -group, so any RG-level `Exact` implementation — ours included — -has to decode the entire row group, reverse the buffer in -memory, and only then emit. That is the same memory profile that -got the earlier RG-level proposal rejected: a peak of one whole -row group (~128 MB) of decoded data, vs. the few-MB-per-batch -streaming profile readers normally have. Our runtime advantage -over `Inexact` + `TopK` does *not* come from decoding less — -both paths decode the relevant row group's sort column in full — -it comes from skipping the per-row heap maintenance, the dynamic -filter evaluation, and the `SortExec` final ordering pass that -`Inexact` keeps on top. So we end up running our `Exact` reverse -in-house but cannot land it as the upstream default, for the -same memory reason that closed the earlier proposal. 
- -**The fix that keeps both the runtime win and a streaming memory -profile is page-level `Exact` reverse via arrow-rs**, described -next. - -That primitive is the **page-level** reverse traversal. Parquet's -`OffsetIndex` already gives us byte-precise locations for every data -page in a column chunk, so we can `seek` directly to the last page, -decode it forward, reverse the resulting batch, and emit. Peak buffer -drops to one page (~1 MB) and first-batch latency drops to the cost -of one page decode — the row-group-level memory cliff disappears. - -We are landing this primitive upstream in arrow-rs. Early numbers -on a 100k-row, 98-page column chunk show **~50× faster -time-to-first-N** for `n ≤ 1 page` and **~9× faster** for `n` -spanning 10 pages, compared with the row-group-level Exact reverse -described above. The DataFusion-side integration that turns this -primitive into an `Exact` result is a follow-up and is gated on -the arrow-rs merge. +Why not full `Exact` reverse, which would delete the `SortExec` +outright? An earlier proposal — primarily reviewed by +[@2010YOUY01] — that decoded an entire row group forward, +materialized all rows, reversed the buffer, then emitted was +correct but had a prohibitive memory profile: a peak of one whole +row group (~128 MB) of decoded data vs. the few-MB-per-batch +streaming profile readers normally have. The agreed direction was +to ship the narrower `Inexact` row-group-reverse first, and to +build `Exact` reverse on a finer-grained primitive once `arrow-rs` +exposed one. The bottleneck section below details what that +`Inexact`-keeps-`SortExec` decision costs at runtime, and the +roadmap section after it describes how the page-level primitive +removes the cost. [@2010YOUY01]: https://github.com/2010YOUY01 -One natural question: why not reverse the rows *within* a page -directly? Because we can't. 
Parquet's page encodings (RLE, dictionary, -delta, bit-packing) are all forward streams — you cannot decode the -last value without decoding every value that came before it. The -design therefore is: **reverse the page traversal, forward-decode -each page, reverse the resulting RecordBatch**. This is the algorithm -shape DataFusion's `RecordBatchReader` integration will use once -arrow-rs ships the primitive. +## Current Bottlenecks + +Stats-based **sort elimination** removes the `SortExec` entirely +when ranges are non-overlapping — there's nothing more to optimize +on that path. But the `Inexact` paths (**runtime reorder** for +`TopK`, and **row-group reverse** for `DESC`) leave three concrete +inefficiencies on the table when `Exact` cannot fire: + +### Bottleneck 1: `SortExec` stays on top, so `LIMIT N` does not propagate as a static stop signal + +In the `Inexact` path the `SortExec` stays in the plan and +`TopK`'s fetch belongs to `SortExec`, not to the parquet scan. +The only thing that can cut work below the `SortExec` is the +dynamic-filter pushdown: as the heap fills, the filter +(`ts > threshold`) is pushed to the source and its threshold +tightens with every batch. That filter does **stats-prune +subsequent, not-yet-opened row groups** — if a row group's +`max(ts) < threshold` it is skipped without decode. But the +`SortExec` keeps pulling batches, and the outer operator does its +own final ordering pass on the "RGs descending × rows ascending" +stream even after the heap is settled. We have measured this +in-house: swapping our internal `Exact` reverse for upstream's +`Inexact` reverse + `TopK` on `ORDER BY ts DESC LIMIT N` makes +end-to-end latency go **up**, not down — exactly because the +`SortExec` final pass and the per-row heap maintenance pile up on +top. 
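As a rough illustration of the stats-pruning step described under Bottleneck 1, the sketch below keeps only the row groups that could still hold a row passing the dynamic filter `ts > threshold`. The names `RowGroupStats` and `surviving_row_groups` are hypothetical and are not DataFusion or arrow-rs APIs; the real check goes through the parquet pruning machinery. The sketch assumes an `i64` sort column and treats "no threshold yet" as "the heap is not full, keep everything".

```rust
/// Hypothetical per-row-group statistic for the sort column `ts`.
/// Only the maximum matters for a `ts > threshold` filter.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    max_ts: i64,
}

/// Returns the indices of row groups that may still contain a row with
/// `ts > threshold`. A row group whose maximum cannot pass the filter
/// is skipped without being decoded.
fn surviving_row_groups(stats: &[RowGroupStats], threshold: Option<i64>) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, rg)| match threshold {
            // Heap not full yet: no threshold, so nothing can be pruned.
            None => true,
            // Keep the row group only if its largest value could pass.
            Some(t) => rg.max_ts > t,
        })
        .map(|(idx, _)| idx)
        .collect()
}
```

Note what the sketch does not do: it only rejects row groups that have not been opened yet. The row group already being read, and the `SortExec` final pass above it, are untouched, which is the overhead this bottleneck describes.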
+ +### Bottleneck 2: Inside the currently-open row group, the sort column is fully decoded + +Even with the dynamic filter pushed all the way to parquet, the +filter has to be evaluated row-by-row inside the open row group: +the sort column has to be **fully decoded** so each value can be +compared against the threshold, the surviving rows feed the heap +to tighten the threshold, and only then can the resulting +`RowSelection` skip the *other* columns for rows that didn't +pass. For `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that +is ~1M sort-column decodes regardless of `N`. Parquet doesn't +allow partial row-group reads, so even an RG-level `Exact` +reverse would pay this same cost — the only way to materially +reduce it is to drop to page granularity. + +### Bottleneck 3: File-granular work scheduling can't close the tap mid-file + +Once a `FileStream` picks up a file from the shared work queue, +it has to finish that file. Today's dynamic work scheduling is +**file-granular**: idle partitions stop pulling new files from +the queue once a global limit is satisfied, but the partition +that's currently inside a file decodes that file's remaining row +groups regardless. The work queue holds `PartitionedFile`, not +row-group descriptors. So even with a tight threshold and +aggressive stats pruning of un-opened row groups, the *currently +open* file gets read to completion. + +## Roadmap: Removing the Bottlenecks + +### Page-level `Exact` reverse — addresses bottlenecks 1 + 2 + +Parquet's `OffsetIndex` gives us byte-precise locations for every +data page in a column chunk, so we can `seek` directly to the last +page, decode it forward, reverse the resulting batch, and emit. +Peak buffer drops from ~128 MB (one row group) to ~1 MB (one +page), and first-batch latency drops to the cost of one page +decode — the row-group-level memory cliff disappears. 
With each +batch already in DESC order, `PushdownSort` can finally return +`Exact` for `DESC` requests, the `SortExec` is removed, and +`LIMIT N` becomes a static fetch on the source. The +`Inexact`-final-ordering-pass overhead from Bottleneck 1 goes +away outright, and the Bottleneck-2 decode reduces to the rows +the page-level seek actually pulls in. + +Why not reverse the rows *within* a page directly? Because we +can't. Parquet's page encodings (RLE, dictionary, delta, +bit-packing) are all forward streams — you cannot decode the last +value without decoding every value that came before it. The +design is: **reverse the page traversal, forward-decode each +page, reverse the resulting `RecordBatch`**. + +The primitive is landing upstream in arrow-rs. Early numbers on a +100k-row, 98-page column chunk show **~50× faster +time-to-first-N** for `n ≤ 1 page` and **~9× faster** for `n` +spanning 10 pages, compared with the row-group-level `Exact` +reverse. The DataFusion-side integration that turns this primitive +into an `Exact` result is a follow-up gated on the arrow-rs merge. -The killer use case is **filtered reverse TopK**: +The killer use case is **filtered reverse `TopK`**: ```sql SELECT * FROM events @@ -631,115 +620,66 @@ ORDER BY ts DESC LIMIT 10 ``` -Here `RowSelection::with_limit` cannot help — you don't know in -advance which rows match `user_id = 42`, so you can't pre-compute a -selection of the "last 10 matching rows". The only correct strategy -is to stream pages backward, evaluate the filter on each, and stop -when 10 matches are collected. Row-group reverse stops at a -~128 MB granularity. Page reverse stops at ~1 MB granularity. For a -selective filter, the saving compounds. - -## What's Next - -Sort pushdown is a long-running line of work and there is more to do. -Beyond the `Exact` path described above, there is a complementary -**dynamic / TopK-driven path** that helps when `Exact` cannot apply — -e.g. 
when file ranges genuinely overlap, or when the sort is on a -function output rather than a plain column. The two directions are -not alternatives; they compose: - -* [`Exact` reverse for `ORDER BY ... DESC`]. Today's row-group - reverse returns `Inexact` and the `SortExec` stays on top; the - arrow-rs page-level reverse primitive is what unlocks `Exact` - reverse on `DESC` queries (and therefore full `SortExec` - elimination on `DESC`). Memory + first-batch latency rule out - doing the same thing at the row-group level. Gated on the - arrow-rs side. -* **Dynamic / TopK-driven path.** When `Exact` cannot fire, - `TopK`'s [dynamic filter][dyn-filters-blog] still benefits - enormously from reading the *best* data first. This thread also - builds on the [limit pruning][limit-pruning-blog] work that - turned `LIMIT` into an I/O optimization across the pruning - pipeline. The recently-merged [morsel-style work scheduling] in - `FileStream` gives sibling partitions a *shared work queue* with - file-level work-stealing — no CPU sits idle when one partition - runs out of files. The proposed - [global file reorder in the shared queue] sorts files in that - shared queue by per-file statistics *before* any partition - picks, so the first file read is globally optimal and tightens - the dynamic filter immediately. Combined with - [TopK threshold init from parquet statistics] and the runtime - row-group / file reorder + reverse path described above, the - threshold can be set before reading a single byte. The reorder - mechanism applies to any `ORDER BY [LIMIT N]` on - parquet, not just TopK queries with a dynamic filter. The - [combined statistics-driven `TopK` pipeline] is in flight. - - The mechanism here is **RG-level pruning, not mid-stream early - return**. With the threshold known up front, the parquet - `PruningPredicate` rejects entire row groups against their - min/max statistics before any I/O — those row groups are never - decoded. 
The row group(s) the reader *does* open still have - their sort column decoded in full to feed the dynamic filter. - On the in-flight microbenchmark (single file, 61 sorted row - groups, `--partitions 1`), **60 of the 61 row groups are - skipped** and only one is decoded: - - | Query | Baseline | With pipeline | Speedup | - | ------------------------------ | -------: | ------------: | ------: | - | `ORDER BY col DESC LIMIT 100` | 28.5 ms | 1.64 ms | **17×** | - | `ORDER BY col DESC LIMIT 1000` | 22.2 ms | 0.37 ms | **60×** | - | `SELECT * ORDER BY ... LIMIT 100` | 22.5 ms | 0.66 ms | **34×** | - | `SELECT * ORDER BY ... LIMIT 1000` | 22.4 ms | 0.61 ms | **37×** | - - The stack reports `Inexact` — the `SortExec` stays on top to - enforce correctness across overlapping ranges — so this path - cannot do *true* mid-stream early return. Once the parquet - reader opens a row group, the sort column has to be decoded all - the way through; once a `FileStream` picks up a file from the - shared work queue, it has to finish that file. Today's dynamic - work scheduling is **file-granular**: idle partitions stop - pulling new files from the queue once a global limit is - satisfied, but the partition that's currently inside a file - decodes that file's remaining row groups regardless. Mid-file - RG-level early return on `TopK` convergence is **not implemented - yet** — the work queue holds `PartitionedFile`, not row-group - descriptors. - - Closing the tap the moment `TopK` has K confirmed winners - therefore needs either: - - * the **`Exact` path**, where the `SortExec` is gone entirely - and the data source's own `fetch` becomes a static limit that - the reader can honour at batch granularity; or - * **finer-grained dynamic scheduling** — having the shared queue - hold row-group descriptors instead of whole files, so a - partition can release its current file's remaining row groups - back to the pool once a global signal says enough TopK - winners have been found. 
A natural extension of the existing - morsel work but not yet on a PR. - - The three mechanisms compose. Stats pruning saves the row - groups that *can't* matter (skipped without I/O). The dynamic - filter narrows what's decoded inside the row groups the reader - does open. `Exact` or finer-grained scheduling is what - eventually closes the tap once `TopK` is satisfied. -* **Filtered reverse `TopK` end-to-end.** `WHERE filter ORDER BY - ts DESC LIMIT N` is the dominant observability query shape and - the one where the arrow-rs page-reverse primitive matters most: - `RowSelection::with_limit` cannot pre-compute the last `N` - matching rows when the filter is selective, so the only correct - strategy is to stream pages backward, evaluate the filter, and - stop when `N` matches are collected. The DataFusion-side - integration is a follow-up to the arrow-rs primitive. +`RowSelection::with_limit` cannot help here — you don't know in +advance which rows match `user_id = 42`, so you can't pre-compute +a selection of the "last 10 matching rows". The only correct +strategy is to stream pages backward, evaluate the filter on +each, and stop when 10 matches are collected. Row-group reverse +stops at a ~128 MB granularity. Page reverse stops at ~1 MB +granularity. For a selective filter, the saving compounds. + +### Row-group-level dynamic early termination — addresses bottleneck 3 + +The work queue today holds `PartitionedFile`. Switching it to +hold **row-group descriptors** instead lets a partition release +its current file's remaining row groups back to the pool the +moment a global signal says enough `TopK` winners have been +found. A natural extension of the existing morsel-style work +scheduling but not yet on a PR. + +The two roadmap items above are *complementary*, not +alternatives: + +* `Exact` reverse closes the tap for `DESC` queries by removing + the `SortExec` entirely. 
+* Row-group-level scheduling closes the tap for `Inexact` queries + where `Exact` still cannot fire (function-wrapped sorts, + overlapping ranges) — the `SortExec` stays, but the scan stops + pulling row groups once `TopK` is satisfied. + +### Preview: the combined statistics-driven `TopK` pipeline + +The [combined statistics-driven `TopK` pipeline] is the in-flight +work that stacks several of these mechanisms: pre-scan +[TopK threshold init from parquet statistics], +[global file reorder in the shared queue], and the runtime +row-group / file reorder + reverse already merged. On a +microbenchmark (single file, 61 sorted row groups, `--partitions 1`) +**60 of the 61 row groups are skipped**, only one is decoded: + +| Query | Baseline | With pipeline | Speedup | +| ------------------------------ | -------: | ------------: | ------: | +| `ORDER BY col DESC LIMIT 100` | 28.5 ms | 1.64 ms | **17×** | +| `ORDER BY col DESC LIMIT 1000` | 22.2 ms | 0.37 ms | **60×** | +| `SELECT * ORDER BY ... LIMIT 100` | 22.5 ms | 0.66 ms | **34×** | +| `SELECT * ORDER BY ... LIMIT 1000` | 22.4 ms | 0.61 ms | **37×** | + +This pipeline still reports `Inexact` — the `SortExec` stays on +top to enforce correctness across overlapping ranges — so it pays +the Bottleneck-1 and Bottleneck-3 overheads listed above. The +17×–60× is what statistics-driven RG-level pruning alone can +deliver; `Exact` reverse + row-group-level early termination is +what pushes it further. + +### Other follow-ups + * [Unifying `EnforceDistribution` and `EnforceSorting`] into a single `EnsureRequirements` rule. The two existing rules are coupled through `SortExec.preserve_partitioning`, which makes their composition non-idempotent and has caused a class of production bugs. Other engines (Spark's `EnsureRequirements`, - Trino's `AddExchanges`) handle both in a single rule. Merging - them also gives future sort-related optimizations a single - coherent place to live. In progress. 
+ Trino's `AddExchanges`) handle both in a single rule. In + progress. * [OFFSET pushdown to parquet] so `ORDER BY ts LIMIT K OFFSET N` queries can skip the first `N` rows at the row-group level instead of decoding and discarding them. In progress. @@ -751,7 +691,6 @@ not alternatives; they compose: extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs a bit more `EquivalenceProperties` integration but is doable. -[`Exact` reverse for `ORDER BY ... DESC`]: https://github.com/apache/arrow-rs/pull/9937 [morsel-style work scheduling]: https://github.com/apache/datafusion/pull/21351 [global file reorder in the shared queue]: https://github.com/apache/datafusion/issues/21733 [TopK threshold init from parquet statistics]: https://github.com/apache/datafusion/pull/21712 From 2c22ce303efd46dadcedaf5891ec61ef4d2e84ae Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Sun, 17 May 2026 15:29:27 +0800 Subject: [PATCH 11/14] Tighten verbose paragraphs; drop unrelated limit-pruning blog reference MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - 'Why not full Exact reverse' paragraph: cut reviewer attribution and forward-pointers that were already in the bottleneck/roadmap sections that follow. - TL;DR: trim Runtime Reorder + Reverse Scans bullets to capability and impact; drop implementation mechanics like 'stamps two flags' and 'three-step pipeline'. - 'The PushdownSort Rule' section: cut three paragraphs of 'covered in X below' forward-references that were repeating the section TOC. - Function-wrapped parenthetical in Sort Elimination: 4 lines to 2. - Single-partition vs multi-partition edge case: drop the trailing 'which is why the example is drawn that way' tangent. - 'What this change does not affect' note: trimmed redundant prose. 
- Remove all references to the limit-pruning blog (intro mention, link definition, References section bullet) — that work is about static LIMIT as an I/O optimization, separate problem from sort ordering. --- content/blog/2026-05-25-sort-pushdown.md | 116 ++++++++--------------- 1 file changed, 41 insertions(+), 75 deletions(-) diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md index 7306d21a..d7bbdbff 100644 --- a/content/blog/2026-05-25-sort-pushdown.md +++ b/content/blog/2026-05-25-sort-pushdown.md @@ -72,20 +72,16 @@ further still. upgrades the source's ordering claim from `Unsupported` to `Exact` and **removes the `SortExec`** that `EnforceSorting` inserted earlier. - * **Runtime reorder for `TopK` convergence** — whenever the - leading sort key is a plain column in the file schema (or the - source's reversed declared ordering satisfies the request), - `try_pushdown_sort` stamps two flags on the source and the - opener runs a three-step runtime pipeline — file-level reorder - in the shared morsel queue, row-group reorder by min/max stats, - then optional iteration reverse for `DESC` requests. `SortExec` - stays, but `TopK`'s dynamic filter tightens fast on the - most-promising data and the rest is pruned. + * **Runtime reorder for `TopK` convergence** — when the leading + sort key is a plain column (or the reversed source ordering + satisfies the request), the scan reorders files and row groups + by `min/max` stats so the most-promising data is read first. + `SortExec` stays, but `TopK`'s dynamic filter tightens fast + and the rest is pruned. * **Reverse scans for `ORDER BY ... DESC`** — a row-group-level - reverse returns `Inexact` (Sort stays, but `TopK` terminates - early). The page-level reverse primitive needed for `Exact` - reverse — and so for full `SortExec` removal on `DESC` queries - — is in flight in arrow-rs. + reverse returns `Inexact`. 
Full `SortExec` removal on `DESC` + requires a page-level reverse primitive that's in flight in + arrow-rs. * Real-world benchmarks on the `sort_pushdown` suite (`Exact` path): `ORDER BY ... LIMIT` queries get **27× and 49× faster**; full `ORDER BY` scans get **~2×** faster. @@ -158,8 +154,8 @@ follow: heap tightens, the filter's threshold tightens with it, and entire row groups can be skipped by checking the live threshold against the row group's min/max statistics. (See the earlier - [dynamic filters][dyn-filters-blog] and [limit pruning][limit-pruning-blog] - posts for the full background on this mechanism.) + [dynamic filters][dyn-filters-blog] post for the full background + on this mechanism.) Both paths use the same underlying min/max statistics, but for different purposes: `Exact` uses them at plan time to prove @@ -167,7 +163,6 @@ non-overlap and justify removing the sort; `Inexact` uses them at runtime to skip row groups that can no longer improve the heap. [dyn-filters-blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/ -[limit-pruning-blog]: https://datafusion.apache.org/blog/2026/03/20/limit-pruning/ The diagram above shows the result we want: the plan after sort pushdown loses the `SortExec` node. Everything downstream — the @@ -178,32 +173,19 @@ producing the order requested. ## The `PushdownSort` Rule -The **`PushdownSort`** physical optimizer rule defines a uniform -API for asking each `ExecutionPlan` two questions: +The **`PushdownSort`** physical optimizer rule asks each +`ExecutionPlan` two questions: 1. "Can you produce output in *this* ordering?" 2. "If yes, please rearrange yourself so that it actually does." -The protocol uses three results — `Exact`, `Inexact`, `Unsupported` — -that downstream operators can interpret uniformly. 
The Parquet -`FileSource` answers by comparing the requested ordering against the -per-file declared ordering: if natural ordering satisfies the request, -it returns `Exact`; if the *reverse* of the declared ordering does, -it returns `Inexact` and flips on `reverse_row_groups=true` so the -scan reads row groups from last to first (the row-group-level reverse -covered later in this post); otherwise it returns `Unsupported`. - -The rule's initial scope was deliberately narrow. It set up the -API and delivered the reverse-scan case end-to-end, but did **not** -add any statistics-based file rearrangement — that came later, -covered in -[Sort Elimination via Statistics](#sort-elimination-via-statistics) -below. A finer-grained extension that broadens this `Inexact` path -with a three-step runtime reorder pipeline is covered in -[Runtime Reorder for TopK Convergence](#runtime-reorder-for-topk-convergence). - -The same rule also handles **reverse-output** for `DESC` queries — -picked up again in the reverse-scan section below. +The answer is one of `Exact`, `Inexact`, `Unsupported`. The Parquet +`FileSource` answers by comparing the requested ordering against +the per-file declared ordering: natural ordering satisfies → +`Exact`; reversed satisfies → `Inexact` (sets +`reverse_row_groups=true`); otherwise → `Unsupported`. The rest of +this post is what each merged capability does on top of this +protocol. ## Sort Elimination via Statistics @@ -264,11 +246,9 @@ is no per-file min/max for the function output to compare against. Extending sort pushdown across monotonic function wrappers is one of the open follow-ups. 
-*(The runtime reorder path covered later does let function-wrapped -sorts benefit from row-group iteration reverse via -`EquivalenceProperties`'s monotonicity inference, when the source -declares a compatible natural ordering — but stats-based sort -elimination still needs a plain column.)* +(Runtime reorder covered later does handle some function-wrapped +sorts via monotonicity inference — but stats-based sort elimination +still needs a plain column.) Detecting non-overlapping ranges via min/max statistics @@ -304,15 +284,12 @@ The implementation handles a few edge cases worth calling out: `SortPreservingMergeExec` then picks rows across streams in value order to produce the final globally sorted output. The rule only has to prove the per-stream property. -* **Single-partition vs multi-partition execution**. With the default - multi-partition setup, `EnforceDistribution` byte-range-splits files - into single-file groups, after which `validated_output_ordering()` - works correctly on its own. Stats-based reorder only triggers when - files have not been split — typically `--partitions 1` runs, or - files small enough that the splitter leaves them alone. In the - typical `--partitions 1` case the "per-group" distinction collapses - (one group equals the whole table), which is why the example earlier - in this section is drawn that way. +* **Single-partition vs multi-partition execution.** The default + multi-partition setup byte-range-splits files into single-file + groups, after which `validated_output_ordering()` works on its + own. Stats-based reorder only fires when files aren't split — + typically `--partitions 1` or files small enough that the + splitter leaves them alone. [`BufferExec`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/buffer.rs [`sort_pushdown_buffer_capacity`]: https://github.com/apache/datafusion/pull/21426 @@ -352,12 +329,11 @@ removed: the runtime-difference section above. 
A 342 ms full-file scan collapses into a 7 ms K-row read. -It is worth saying explicitly what this change does **not** affect. -The default multi-partition execution path is unchanged: those plans -already produced correct orderings via byte-range splitting, so -stats-based sort elimination simply does not trigger. There is no -regression and no behavior change for the typical multi-threaded -query. +The default multi-partition execution path is unaffected: those +plans already produce correct orderings via byte-range splitting, +so stats-based sort elimination simply does not fire there. No +regression and no behavior change for typical multi-threaded +queries. ## Runtime Reorder for TopK Convergence @@ -511,21 +487,12 @@ row groups can be skipped via min/max statistics. This ships today and is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted files. -Why not full `Exact` reverse, which would delete the `SortExec` -outright? An earlier proposal — primarily reviewed by -[@2010YOUY01] — that decoded an entire row group forward, -materialized all rows, reversed the buffer, then emitted was -correct but had a prohibitive memory profile: a peak of one whole -row group (~128 MB) of decoded data vs. the few-MB-per-batch -streaming profile readers normally have. The agreed direction was -to ship the narrower `Inexact` row-group-reverse first, and to -build `Exact` reverse on a finer-grained primitive once `arrow-rs` -exposed one. The bottleneck section below details what that -`Inexact`-keeps-`SortExec` decision costs at runtime, and the -roadmap section after it describes how the page-level primitive -removes the cost. - -[@2010YOUY01]: https://github.com/2010YOUY01 +Why not full `Exact` reverse that deletes the `SortExec`? +Decoding a whole row group forward, reversing the buffer, and +emitting works — but peaks at ~128 MB vs. the few-MB-per-batch +streaming profile readers expect. 
`Exact` reverse waits on a +page-level primitive that keeps the runtime win on a streaming +memory budget — see the roadmap below. ## Current Bottlenecks @@ -725,10 +692,9 @@ invariants — is what made this work possible. ## References -Prior posts this work builds on: +Prior post this work builds on: * [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries][dyn-filters-blog] — the dynamic filter primitive `TopK` uses. -* [Turning LIMIT into an I/O Optimization: Inside DataFusion's Multi-Layer Pruning Stack][limit-pruning-blog] — the pruning pipeline this work plugs into. Landed PRs that make up this work: From e389c3f3561213abdb03136ad5ae20fd975c085d Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Sun, 17 May 2026 15:56:55 +0800 Subject: [PATCH 12/14] Fold reverse as case of Inexact runtime reorder; make two-trigger asymmetry explicit MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per code in datasource-parquet/src/source.rs:849-870, the reversed- satisfies branch is 'strictly more powerful than the column-in-schema check' — it runs the request through EquivalenceProperties's full reasoning machinery and handles function monotonicity, constants from filters, equivalence relationships, and multi-column composite orderings. The blog had been treating reverse as just step 3 of the runtime pipeline, which undersold its standalone reach. Structural changes: - Drop the standalone 'Reverse Scans for ORDER BY DESC' H2 section; reverse is now a case of the Inexact runtime reorder path. - Rename Runtime Reorder section to 'Runtime Reorder for TopK and DESC Queries'; intro now lists three classes that fall outside Exact (unsorted, overlapping, DESC). - 'try_pushdown_sort' subsection rewritten as 'Two independent triggers for Inexact', describing column-in-schema vs reversed- satisfies as separate signals with the latter being strictly more powerful. 
- 'Three runtime steps' subsection: step 3 now explicitly notes when steps 1-2 are skipped and only the iteration reverse runs. - New 'ORDER BY DESC in practice' subsection right after the 3-step pipeline, explaining the RGs-descending-x-rows-ascending stream. - Move reverse-scan.svg from the deleted Reverse Scans section into the Roadmap > Page-level Exact reverse subsection where it illustrates the 128 MB vs 1 MB peak comparison directly. Accuracy fix: - 'Multi-column reorder follow-ups' bullet was inaccurate — said the reorder 'only fires on plain columns'. The reverse path does handle function-wrapped and multi-column cases via EquivalenceProperties; only the stats reorder step is restricted. Updated wording to scope the limitation correctly. --- content/blog/2026-05-25-sort-pushdown.md | 278 +++++++++++------------ 1 file changed, 131 insertions(+), 147 deletions(-) diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md index d7bbdbff..7681e77e 100644 --- a/content/blog/2026-05-25-sort-pushdown.md +++ b/content/blog/2026-05-25-sort-pushdown.md @@ -44,12 +44,14 @@ already in that order. CPU wasted. Memory wasted. Streaming defeated. [Apache DataFusion]: https://datafusion.apache.org/ -This post walks through the **sort pushdown** work that closed that -gap. It covers two complementary capabilities — sort elimination via -statistics, and runtime reorder for `TopK` convergence — and lands -real benchmark speedups of **2.1×–49× on common queries**. The same -machinery extends to `ORDER BY ... DESC`, and the page-level reverse -primitive we are adding upstream in [arrow-rs] will push the gains +This post walks through the **sort pushdown** work that closed +that gap. 
It covers two complementary capabilities — **sort +elimination via statistics** (the `Exact` path, which deletes the +`SortExec`) and **runtime reorder** (the `Inexact` path, which +keeps the `SortExec` but reads the most-promising data first for +`TopK` and `DESC` queries) — and lands real benchmark speedups of +**2.1×–49× on common queries**. The page-level reverse primitive +we are adding upstream in [arrow-rs] will push the `DESC` gains further still. [arrow-rs]: https://github.com/apache/arrow-rs @@ -72,16 +74,15 @@ further still. upgrades the source's ordering claim from `Unsupported` to `Exact` and **removes the `SortExec`** that `EnforceSorting` inserted earlier. - * **Runtime reorder for `TopK` convergence** — when the leading - sort key is a plain column (or the reversed source ordering - satisfies the request), the scan reorders files and row groups - by `min/max` stats so the most-promising data is read first. - `SortExec` stays, but `TopK`'s dynamic filter tightens fast - and the rest is pruned. - * **Reverse scans for `ORDER BY ... DESC`** — a row-group-level - reverse returns `Inexact`. Full `SortExec` removal on `DESC` - requires a page-level reverse primitive that's in flight in - arrow-rs. + * **Runtime reorder for `TopK` and `DESC` queries** — when the + leading sort key is a plain column (or the reversed source + ordering satisfies the request), the scan reorders files and + row groups by `min/max` stats so the most-promising data is + read first; for `DESC` requests it additionally flips + iteration. `SortExec` stays `Inexact`, but `TopK`'s dynamic + filter tightens fast and the rest is pruned. Full `SortExec` + removal on `DESC` requires a page-level reverse primitive + that's in flight in arrow-rs. * Real-world benchmarks on the `sort_pushdown` suite (`Exact` path): `ORDER BY ... LIMIT` queries get **27× and 49× faster**; full `ORDER BY` scans get **~2×** faster. 
@@ -179,13 +180,13 @@ The **`PushdownSort`** physical optimizer rule asks each 1. "Can you produce output in *this* ordering?" 2. "If yes, please rearrange yourself so that it actually does." -The answer is one of `Exact`, `Inexact`, `Unsupported`. The Parquet -`FileSource` answers by comparing the requested ordering against -the per-file declared ordering: natural ordering satisfies → -`Exact`; reversed satisfies → `Inexact` (sets -`reverse_row_groups=true`); otherwise → `Unsupported`. The rest of -this post is what each merged capability does on top of this -protocol. +The answer is one of `Exact`, `Inexact`, `Unsupported`. `Exact` +means the surrounding `SortExec` can be deleted entirely; `Inexact` +means the source will read the data in a near-sorted order so +`TopK` and other consumers benefit, but `SortExec` stays for +strict correctness. The rest of this post is what each merged +capability does on top of this protocol — first the `Exact` path, +then the `Inexact` path. ## Sort Elimination via Statistics @@ -335,12 +336,12 @@ so stats-based sort elimination simply does not fire there. No regression and no behavior change for typical multi-threaded queries. -## Runtime Reorder for TopK Convergence +## Runtime Reorder for `TopK` and `DESC` Queries Stats-based sort elimination handles the `Exact` upgrade — strong correctness, sort elimination — but only when the table has a declared `output_ordering` *and* the files are provably -non-overlapping after sorting by min. Two large classes of queries +non-overlapping after sorting by min. Three classes of queries fall outside that window: * **Unsorted data** — no `WITH ORDER`, no parquet `sorting_columns`. @@ -350,40 +351,58 @@ fall outside that window: jobs share time windows. The `Exact` upgrade keeps the `SortExec` because the global ordering can't be proven, even though the files often do contain large stretches of in-order data. +* **`ORDER BY ... 
DESC` on ASC-sorted data** — flipping iteration + at the row-group level emits "RGs descending × rows ascending", + close to the requested order but not strictly DESC, so the + `SortExec` has to stay for correctness. -For both, a full external `SortExec` is overkill. The parquet +For all three, a full external `SortExec` is overkill. The parquet metadata is right there, and reading the *most-promising* data first lets `TopK`'s dynamic filter threshold tighten quickly so the rest gets pruned. Runtime reorder wires that up by generalising the `Inexact` path the rule introduced. -### `try_pushdown_sort` — one decision, three outcomes +### Two independent triggers for `Inexact` try_pushdown_sort decision tree: Exact, Inexact, or Unsupported -The `Exact` / `Inexact` / `Unsupported` protocol stays. The -runtime reorder path broadens the **conditions** that route a -query into `Inexact`: - -| Condition | Outcome | -| --- | --- | -| `eq_properties.ordering_satisfy(request)` | `Exact` — sort elimination | -| Leading sort key is a plain `Column` in the file schema, **or** the source's reversed declared ordering satisfies the request | `Inexact` — runtime reorder pipeline | -| Neither | `Unsupported` — `SortExec` stays, no source-side optimisation | - -The "reversed satisfies" branch is what handles function-wrapped -sorts (`date_trunc('day', ts) DESC`, `ceil(value) DESC`, -`CAST(x AS Date) DESC`) — `EquivalenceProperties`'s monotonicity -reasoning recognises that `f(col) DESC` is satisfied by `col ASC` -reversed, even though parquet has no stats keyed by `f(col)` -itself. - -### Two flags on `ParquetSource`, three runtime steps +`try_pushdown_sort` first checks whether the natural ordering +already satisfies the request (→ `Exact`) or whether a non-empty +*proper prefix* of the request is already satisfied (→ +`Unsupported`, so the outer `SortExec`'s `sort_prefix` +optimisation can fire instead). 
Otherwise it looks at two +**independent** Inexact signals — either one is enough, and they +compose when both apply: + +**Stats-based RG reorder** — fires when the leading sort key is a +plain `Column` in the file schema. The opener sorts row groups by +`min(col)` via parquet statistics. Restrictive (plain physical +column only), but lets the scan globally reorder data so the +most-promising row group is decoded first. + +**Iteration reverse** — fires when the source's declared ordering, +**reversed**, satisfies the request. This goes through the full +`EquivalenceProperties` reasoning machinery and is **strictly more +powerful** than the column-in-schema check above. It fires for: + +* **Function monotonicity** — file declares `ts DESC`, request is + `date_trunc('day', ts) ASC` → reversed `ts ASC` satisfies the + request via monotonicity even though parquet has no stats keyed + by the function. Same for `ceil(value)`, `CAST(x AS Date)`, etc. +* **Constant columns from filters** — `WHERE region = 'us'` marks + `region` as constant in the equivalence class, so a request + involving `region` is trivially satisfied. +* **Equivalence relationships** — `WHERE a = b` transfers a known + ordering on `a` to a request on `b`. +* **Multi-column composite orderings** — the source's declared + multi-key ordering reversed satisfies the multi-key request as a + whole. + +### Three runtime steps in the opener Runtime reorder pipeline: file reorder, RG reorder, then optional reverse -When `try_pushdown_sort` returns `Inexact`, it stamps two fields on -the `ParquetSource`: +The two triggers above set two fields on `ParquetSource`: ```rust struct ParquetSource { @@ -393,114 +412,76 @@ struct ParquetSource { } ``` -The opener reads them at scan time to drive three composable steps: - -1. 
**File-level reorder.** `FileSource::reorder_files` sits in the - shared morsel queue (a work-stealing primitive that lets sibling - partitions share a single file pool) and sorts the - partitioned-file list by `min(col)`. The first file picked across - all partitions is globally the most-promising one. -2. **Row-group-level reorder.** Once a file is opened, - `PreparedAccessPlan::reorder_by_statistics` sorts that file's - `row_group_indexes` by `min(col)` ASC. The row group most likely - to contribute to `TopK` is decoded first. -3. **Reverse.** For `DESC` requests, - `PreparedAccessPlan::reverse` flips the iteration after the - stats reorder normalises everything to ASC-by-min. Same - primitive the rule originally introduced for declared reverse - scans — the runtime pipeline just routes more queries through - it. - -The two layers compose naturally because they sort by the same -key. A file's `min(col)` is the minimum over its row groups' -`min(col)` values, so the file with the smallest `min` contains -the row group with the smallest `min`. Sorting files by `min(col)` -and then sorting row groups by `min(col)` within each file -produces an approximately min-ordered global stream — the first -batch comes from the most-promising row group in the -most-promising file, exactly what `TopK`'s dynamic filter needs -to tighten its threshold fast. - -`reverse_row_groups`'s meaning depends on which way `Inexact` was -reached. When the column-in-schema condition fires, the stats -reorder produces ASC-by-min, so `reverse_row_groups` simply mirrors -the request direction. When only the reversed-equivalence -condition fires (function-wrapped case with a declared source -ordering), `reverse_row_groups` is `true` unconditionally — there -is no stats reorder to compose with, just a flip of the file's -natural order. 
- -Both flags surface on the `DataSourceExec` line in `EXPLAIN` so -plan inspection and snapshot tests can confirm the pushdown fired: +The opener consumes them in three composable steps: + +1. **File-level reorder** (`FileSource::reorder_files`). The shared + morsel queue — a work-stealing primitive that lets sibling + partitions share a single file pool — sorts the partitioned-file + list by `min(col)`. The first file picked across all partitions + is globally the most-promising one. Skipped when the stats + reorder trigger didn't fire. +2. **Row-group-level reorder** + (`PreparedAccessPlan::reorder_by_statistics`). Once a file is + opened, sort its row groups by `min(col)` ASC so the most-promising + row group is decoded first. Same trigger as step 1; the two + layers nest because a file's `min(col)` is the minimum over its + row groups' `min(col)` values. +3. **Iteration reverse** (`PreparedAccessPlan::reverse`). Flips the + row-group iteration order. For `DESC` requests on a plain + column the flip composes with steps 1–2 (ASC-by-min → reverse → + DESC-by-min). For the function-wrapped / constants-from-filters / + multi-column cases, steps 1–2 are skipped and this is the only + step that runs — just a flip of the file's natural order. + +Both flags surface on the `DataSourceExec` line in `EXPLAIN`: ```text DataSourceExec: file_groups=..., file_type=parquet, sort_order_for_reorder=[a@0 ASC], reverse_row_groups=true ``` -Absence of either flag means the corresponding runtime step is a -no-op. - -### When runtime reorder does *not* fire - -* **Aggregations on top of the sort key.** `SELECT URL, COUNT(*) AS c - FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10` (the ClickBench - TopK shape) — the leading sort key (`c`) is an aggregation result - and has no per-RG stats in the parquet file, so the - column-in-schema check fails. 
Pushing sort metadata through - `AggregateExec` is a separate problem: the aggregated value - doesn't exist before aggregation, so even if the metadata reached - the scan there'd be nothing actionable to do with it. -* **Multi-column sort secondary keys.** The reorder currently only - uses the leading sort expression — secondary keys are ignored. - An open follow-up. -* **Function-wrapped sort without a source-declared ordering.** - Without a declared ordering to invert, the reversed-equivalence - branch has nothing to satisfy. Same follow-up. -* **Source declares a forward prefix of the request.** When the - source's declared `output_ordering` is a non-empty proper prefix - of the request (e.g. source `[a DESC, b ASC]`, request - `[a DESC, b ASC, c DESC]`), `try_pushdown_sort` returns - `Unsupported` so the surrounding `SortExec` can keep its - `sort_prefix` annotation — prefix-aware early termination in - `TopK` is strictly better than the runtime reorder on data that - is already in prefix order on disk. - -## Reverse Scans for `ORDER BY ... DESC` - -Row-group reverse vs page reverse: 128MB and 8 pages vs 1MB and 1 page - -`ORDER BY ts DESC` is the same problem in reverse. If a file is sorted -ascending and the query wants descending, we should be able to skip -the sort — we just need to read the data in the opposite order. - -The first iteration of this operates at the **row group** level: -it reverses the *iteration order of row groups* so the last RG is -opened first, but rows within each RG are still decoded forward. -The resulting stream is "RGs descending × rows ascending" — close -to the requested order, but not strictly DESC. The optimizer -therefore reports this as `Inexact` and leaves the `SortExec` in -place; the win is that `TopK`'s dynamic filter tightens much -faster, because the very first row groups read already contain -values near the final answer. A tight threshold means subsequent -row groups can be skipped via min/max statistics. 
This ships today -and is what powers fast `ORDER BY ts DESC LIMIT N` on ASC-sorted -files. - -Why not full `Exact` reverse that deletes the `SortExec`? -Decoding a whole row group forward, reversing the buffer, and +### `ORDER BY ... DESC` in practice + +A `DESC` request on an ASC-sorted plain column goes through both +triggers — the stats reorder normalises to ASC-by-min and the +iteration reverse flips to DESC-by-min. The result is *"RGs +descending × rows ascending"* — close to the requested order but +not strictly DESC, hence `Inexact`. The `SortExec` stays for +correctness, but `TopK`'s dynamic filter tightens fast because the +first row groups read already contain values near the final +answer, so subsequent row groups can be skipped via min/max +statistics. This is what powers fast `ORDER BY ts DESC LIMIT N` on +ASC-sorted files today. + +Why not full `Exact` reverse that deletes the `SortExec` outright? +Decoding a whole row group forward, reversing the buffer, then emitting works — but peaks at ~128 MB vs. the few-MB-per-batch streaming profile readers expect. `Exact` reverse waits on a page-level primitive that keeps the runtime win on a streaming -memory budget — see the roadmap below. +memory budget — covered in the roadmap below. + +### When neither Inexact trigger fires + +* **Aggregations on the sort key** — `SELECT URL, COUNT(*) AS c FROM + hits GROUP BY URL ORDER BY c DESC LIMIT 10` (the ClickBench TopK + shape). The leading sort key `c` is an aggregate result with no + per-RG stats and no equivalence to a file column, so neither + trigger fires. Pushing sort metadata through `AggregateExec` is a + separate problem entirely. +* **Function-wrapped sort with no source-declared ordering** — the + reversed-equivalence branch has nothing to invert. 
+* **Source declares a forward prefix of the request** — + `try_pushdown_sort` returns `Unsupported` so the surrounding + `SortExec` can keep its `sort_prefix` annotation; prefix-aware + early termination in `TopK` is strictly better than reorder on + data that's already in prefix order on disk. ## Current Bottlenecks -Stats-based **sort elimination** removes the `SortExec` entirely -when ranges are non-overlapping — there's nothing more to optimize -on that path. But the `Inexact` paths (**runtime reorder** for -`TopK`, and **row-group reverse** for `DESC`) leave three concrete -inefficiencies on the table when `Exact` cannot fire: +Sort elimination removes the `SortExec` entirely when ranges are +non-overlapping — there's nothing more to optimize on that path. +The `Inexact` runtime-reorder path is where the merged work still +leaves performance on the table. Three concrete inefficiencies: ### Bottleneck 1: `SortExec` stays on top, so `LIMIT N` does not propagate as a static stop signal @@ -551,6 +532,8 @@ open* file gets read to completion. ### Page-level `Exact` reverse — addresses bottlenecks 1 + 2 +Row-group reverse (128 MB peak, ~8 pages decoded) vs page reverse (1 MB peak, 1 page decoded) + Parquet's `OffsetIndex` gives us byte-precise locations for every data page in a column chunk, so we can `seek` directly to the last page, decode it forward, reverse the resulting batch, and emit. @@ -651,10 +634,11 @@ what pushes it further. queries can skip the first `N` rows at the row-group level instead of decoding and discarding them. In progress. * [Multi-column and function-wrapped reorder follow-ups]. The - reorder mechanism currently only uses the leading sort key and - only fires on plain columns. 
Lexicographic multi-key reorder - via `arrow::compute::lexsort_to_indices` is low-hanging fruit; - extending to monotonic function wrappers via leaf-column + **stats reorder step** currently only uses the leading sort key + on a plain column (reverse handles the rest via + `EquivalenceProperties` reasoning). Lexicographic multi-key + reorder via `arrow::compute::lexsort_to_indices` is low-hanging + fruit; extending to monotonic function wrappers via leaf-column extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs a bit more `EquivalenceProperties` integration but is doable. From 3d7b080c8172478cdfb0ebcd62dc1f37da3a6131 Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Sun, 17 May 2026 15:59:17 +0800 Subject: [PATCH 13/14] =?UTF-8?q?Drop=20EnsureRequirements=20+=20OFFSET=20?= =?UTF-8?q?pushdown=20=E2=80=94=20both=20unrelated=20to=20sort=20pushdown?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit EnsureRequirements (#21976) is a rule-unification effort for EnforceDistribution+EnforceSorting. Touches the same area but isn't a sort pushdown optimization. OFFSET pushdown (#21828) is about LIMIT/OFFSET pruning. Same kind of tangent as the limit-pruning blog reference removed earlier — it's LIMIT optimization, not sort pushdown. The remaining 'Multi-column and function-wrapped reorder follow-ups' bullet is actually directly about sort pushdown's reorder step (#22198), so it stays. With the other two removed, 'Other follow-ups' collapsed to a single point — promoted to its own subsection 'Extending the stats reorder step' for clarity. Also dropped the corresponding entries from the References section. 
--- content/blog/2026-05-25-sort-pushdown.md | 38 ++++++++---------------- 1 file changed, 13 insertions(+), 25 deletions(-) diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md index 7681e77e..6dbd4821 100644 --- a/content/blog/2026-05-25-sort-pushdown.md +++ b/content/blog/2026-05-25-sort-pushdown.md @@ -621,34 +621,24 @@ the Bottleneck-1 and Bottleneck-3 overheads listed above. The deliver; `Exact` reverse + row-group-level early termination is what pushes it further. -### Other follow-ups - -* [Unifying `EnforceDistribution` and `EnforceSorting`] into a - single `EnsureRequirements` rule. The two existing rules are - coupled through `SortExec.preserve_partitioning`, which makes - their composition non-idempotent and has caused a class of - production bugs. Other engines (Spark's `EnsureRequirements`, - Trino's `AddExchanges`) handle both in a single rule. In - progress. -* [OFFSET pushdown to parquet] so `ORDER BY ts LIMIT K OFFSET N` - queries can skip the first `N` rows at the row-group level - instead of decoding and discarding them. In progress. -* [Multi-column and function-wrapped reorder follow-ups]. The - **stats reorder step** currently only uses the leading sort key - on a plain column (reverse handles the rest via - `EquivalenceProperties` reasoning). Lexicographic multi-key - reorder via `arrow::compute::lexsort_to_indices` is low-hanging - fruit; extending to monotonic function wrappers via leaf-column - extraction (e.g. `date_trunc('day', ts)` → use `min(ts)`) needs - a bit more `EquivalenceProperties` integration but is doable. +### Extending the stats reorder step + +Alongside removing the bottlenecks above, the +[stats reorder step itself has room to grow][stats-reorder-followup]. 
+Today it only uses the leading sort key on a plain column — reverse +already handles function-wrapped and multi-column cases via +`EquivalenceProperties` reasoning, but stats-based RG ordering only +fires on a plain leading column. Lexicographic multi-key reorder via +`arrow::compute::lexsort_to_indices` is low-hanging fruit; extending +to monotonic function wrappers via leaf-column extraction (e.g. +`date_trunc('day', ts)` → use `min(ts)`) needs a bit more +`EquivalenceProperties` integration but is doable. [morsel-style work scheduling]: https://github.com/apache/datafusion/pull/21351 [global file reorder in the shared queue]: https://github.com/apache/datafusion/issues/21733 [TopK threshold init from parquet statistics]: https://github.com/apache/datafusion/pull/21712 [combined statistics-driven `TopK` pipeline]: https://github.com/apache/datafusion/pull/21580 -[Unifying `EnforceDistribution` and `EnforceSorting`]: https://github.com/apache/datafusion/pull/21976 -[OFFSET pushdown to parquet]: https://github.com/apache/datafusion/pull/21828 -[Multi-column and function-wrapped reorder follow-ups]: https://github.com/apache/datafusion/issues/22198 +[stats-reorder-followup]: https://github.com/apache/datafusion/issues/22198 Concretely useful issues for new contributors: @@ -694,8 +684,6 @@ Landed PRs that make up this work: In flight / open: * Page-level reverse (arrow-rs): [apache/arrow-rs#9937](https://github.com/apache/arrow-rs/pull/9937), discussion in [apache/arrow-rs#9934](https://github.com/apache/arrow-rs/issues/9934) -* `EnsureRequirements`: [apache/datafusion#21976](https://github.com/apache/datafusion/pull/21976) -* OFFSET pushdown to parquet: [apache/datafusion#21828](https://github.com/apache/datafusion/pull/21828) * TopK threshold init from parquet statistics: [apache/datafusion#21712](https://github.com/apache/datafusion/pull/21712) * Combined statistics-driven `TopK` pipeline: 
[apache/datafusion#21580](https://github.com/apache/datafusion/pull/21580) * Global file reorder in shared queue: [apache/datafusion#21733](https://github.com/apache/datafusion/issues/21733) From 970b6d6f07b586a6c0d44ed968699cef1362961a Mon Sep 17 00:00:00 2001 From: Qi Zhu <821684824@qq.com> Date: Sun, 17 May 2026 16:02:07 +0800 Subject: [PATCH 14/14] Roadmap RG-level early termination: split into non-overlap vs overlap (k-way merge) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The 'switch work queue from PartitionedFile to RG descriptors' fix is sufficient for non-overlapping ranges (post-reorder), where the first file globally has the smallest values and subsequent files are already stats-pruned. For overlapping ranges, the next smallest value could sit in any of several open files — matching the non-overlap efficiency requires explicit k-way merge across open files' next-RG mins. The dynamic filter does this implicitly (for an ascending sort, RGs with min > threshold are dropped), but explicit comparison closes the tap earlier when the filter tightens slowly. --- content/blog/2026-05-25-sort-pushdown.md | 28 +++++++++++++++++++----- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md index 6dbd4821..070698d4 100644 --- a/content/blog/2026-05-25-sort-pushdown.md +++ b/content/blog/2026-05-25-sort-pushdown.md @@ -581,11 +581,29 @@ granularity. For a selective filter, the saving compounds. ### Row-group-level dynamic early termination — addresses bottleneck 3 The work queue today holds `PartitionedFile`. Switching it to -hold **row-group descriptors** instead lets a partition release -its current file's remaining row groups back to the pool the -moment a global signal says enough `TopK` winners have been -found. A natural extension of the existing morsel-style work -scheduling but not yet on a PR. 
+hold **row-group descriptors** lets a partition stop mid-file the +moment a global signal says `TopK` has K confirmed winners. There are two +flavors, depending on whether file ranges actually overlap after +stats reorder: + +* **Non-overlapping ranges.** The first file globally contains + the smallest values, the second contains the next batch, and so + on. Once `TopK`'s threshold passes file 0's max, every + subsequent file is pruned by stats already — the only fix + needed is the RG-granular queue so the partition currently + inside file 0 also stops at the right RG. +* **Overlapping ranges.** The smallest *next* value could sit in + any of several open files. Matching the non-overlap efficiency + requires actively comparing each open file's next-RG `min` and + pulling from whichever is smallest — a **k-way merge across + files** at RG granularity. The dynamic-filter pushdown already + approximates this implicitly (for this ascending case, an RG whose `min` is above the threshold is + dropped), but explicit k-way comparison would close the tap + earlier when the filter tightens slowly across overlapping + files. + +This is a natural extension of the existing morsel-style work scheduling +but is not yet in a PR. The two roadmap items above are *complementary*, not alternatives: