diff --git a/content/blog/2026-05-25-sort-pushdown.md b/content/blog/2026-05-25-sort-pushdown.md
new file mode 100644
index 00000000..070698d4
--- /dev/null
+++ b/content/blog/2026-05-25-sort-pushdown.md
@@ -0,0 +1,711 @@
+---
+layout: post
+title: Sort Pushdown in DataFusion: Skip Sorts, Skip I/O
+date: 2026-05-25
+author: Qi Zhu
+categories: [performance]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+*Qi Zhu, [Massive](https://www.massive.com/)*
+
+Many [Apache Parquet] datasets are already sorted on disk. Time-series
+files are usually written in ingestion-time order. Event logs are sharded
+and sorted by event id. Partitioned tables come with a natural ordering
+implied by the partition key. The information about that ordering is
+sitting right there in the file metadata.
+
+[Apache Parquet]: https://parquet.apache.org/
+
+Until recently, [Apache DataFusion] would still re-sort those files on
+every `ORDER BY` query. Every `SELECT ... ORDER BY ts LIMIT 100` pushed
+the entire scan through a `SortExec`, even though the data was already
+in that order. CPU wasted. Memory wasted. Streaming defeated.
+
+[Apache DataFusion]: https://datafusion.apache.org/
+
+This post walks through the **sort pushdown** work that closed
+that gap. It covers two complementary capabilities — **sort
+elimination via statistics** (the `Exact` path, which deletes the
+`SortExec`) and **runtime reorder** (the `Inexact` path, which
+keeps the `SortExec` but reads the most-promising data first for
+`TopK` and `DESC` queries) — and lands real benchmark speedups of
+**2.1×–49× on common queries**. The page-level reverse primitive
+we are adding upstream in [arrow-rs] will push the `DESC` gains
+further still.
+
+[arrow-rs]: https://github.com/apache/arrow-rs
+
+## TL;DR
+
+* DataFusion can now **skip `SortExec` entirely** when input files are
+  already in the requested order, and **read the most-promising data
+  first** when they aren't — so `TopK` converges fast and the rest
+  gets pruned by statistics.
+* What's supported today:
+  * **The `PushdownSort` rule** — a physical optimizer rule that
+    asks each `ExecutionPlan` "can you produce output in *this*
+    ordering?" and uses the `Exact` / `Inexact` / `Unsupported`
+    answer to decide whether to delete the surrounding `SortExec`,
+    leave it in place with a hint, or give up.
+  * **Sort elimination via statistics** — `PushdownSort` sorts
+    files within each partition by Parquet `min/max` statistics
+    and, when the resulting ranges are provably non-overlapping,
+    upgrades the source's ordering claim from `Unsupported` to
+    `Exact` and **removes the `SortExec`** that `EnforceSorting`
+    inserted earlier.
+  * **Runtime reorder for `TopK` and `DESC` queries** — when the
+    leading sort key is a plain column (or the reversed source
+    ordering satisfies the request), the scan reorders files and
+    row groups by `min/max` stats so the most-promising data is
+    read first; for `DESC` requests it additionally flips
+    iteration. The result is `Inexact`, so the `SortExec` stays,
+    but `TopK`'s dynamic filter tightens fast and the rest is
+    pruned. Full `SortExec` removal on `DESC` requires a page-level
+    reverse primitive that's in flight in arrow-rs.
+* Real-world benchmarks on the `sort_pushdown` suite (`Exact` path):
+  `ORDER BY ... LIMIT` queries get **27× and 49× faster**; full
+  `ORDER BY` scans get **~2×** faster.
+
+## Why Sort Pushdown Matters
+
+`SortExec` is one of the most expensive operators in a query plan.
+It is blocking by construction — no row can leave until every input
+row has been seen and compared — so it tends to dominate both latency
+and peak memory. The cost gets paid even when:
+
+* the file is already ordered by the sort key (very common for
+  timestamp columns);
+* the query only needs the top *N* rows (`ORDER BY ts LIMIT 100`), in
+  which case full sort + truncate is wildly wasteful;
+* the next operator (`SortPreservingMergeExec`, `SortMergeJoinExec`,
+  a window function) was going to consume ordered input anyway.
+
+The data DataFusion needs to avoid this work is **already in the file
+metadata**. Parquet writers can record per-column statistics (`min`,
+`max`) at the row-group level. Files written by Spark, DuckDB,
+arrow-rs, and others routinely include them. And explicit `WITH ORDER`
+clauses in DataFusion's SQL `CREATE EXTERNAL TABLE` give the optimizer
+a direct ordering hint. The job of sort pushdown is to **use that
+information**.
+
+## How DataFusion Tracks Ordering
+
+<img src="/blog/images/sort-pushdown/plan-diff.svg" alt="EXPLAIN before / after: SortExec eliminated once ordering is Exact" width="100%" class="img-fluid"/>
+
+Each `FileScanConfig` carries an `output_ordering` — the ordering
+that the optimizer is willing to claim for the scan's output. There
+are two flavours:
+
+* **`Exact`** — the optimizer is *certain* the output is in this order.
+  Sort-handling rules treat an `Exact` ordering as a proof and **remove
+  the surrounding `SortExec`**. ([`EnforceSorting`] does this when the
+  scan declares `Exact` from the start; the sort pushdown rule covered
+  in this post upgrades the claim later in the pipeline and then
+  removes the `SortExec` itself.)
+* **`Inexact`** — the optimizer *believes* the output is probably
+  ordered, but cannot prove it. Downstream operators like
+  `SortPreservingMergeExec` can still benefit from this hint, but the
+  explicit `SortExec` stays for safety.
+
+[`EnforceSorting`]: https://docs.rs/datafusion-physical-optimizer/latest/datafusion_physical_optimizer/enforce_sorting/struct.EnforceSorting.html
+
+A helper called `validated_output_ordering()` is the gatekeeper. It
+walks the list of files inside a partition, checks whether the
+declared per-file ordering is consistent with the file order on disk,
+and either confirms the ordering or **strips it entirely** if it
+sees something ambiguous (e.g. file `b` comes before file `a` in the
+file list but file `a`'s range comes first).
+
+### `Exact` and `Inexact` at runtime
+
+`Exact` and `Inexact` lead to different runtime behaviour, and
+distinguishing them up front makes the rest of this post easier to
+follow:
+
+* With **`Exact`**, the `SortExec` is removed and the LIMIT becomes
+  a **static fetch** on the source. The reader stops the moment the
+  requested number of rows has been emitted — early termination
+  at batch granularity, no dynamic state needed.
+* With **`Inexact`**, the `SortExec` stays in place. The LIMIT
+  materialises inside the sort as a `TopK` heap of size K. `TopK`
+  exposes a [**dynamic filter**][dyn-filters-blog] — a runtime
+  expression of the form *"only rows that could still beat the
+  current K-th-best value are worth considering"* — and pushes it
+  back to the parquet scanner. As more data is processed and the
+  heap tightens, the filter's threshold tightens with it, and entire
+  row groups can be skipped by checking the live threshold against
+  the row group's min/max statistics. (See the earlier
+  [dynamic filters][dyn-filters-blog] post for the full background
+  on this mechanism.)
+
+Both paths use the same underlying min/max statistics, but for
+different purposes: `Exact` uses them at plan time to prove
+non-overlap and justify removing the sort; `Inexact` uses them at
+runtime to skip row groups that can no longer improve the heap.
+
+[dyn-filters-blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
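+
+The pruning loop described above can be sketched in a few lines of
+plain Rust. This is a minimal model under stated assumptions, not
+DataFusion's `TopK` operator: `RowGroup` and `top_k_smallest` are
+hypothetical names, values are plain `i64`, and the query is an
+ascending `ORDER BY ... LIMIT k` (a `DESC` query would compare
+against each row group's `max` instead).
+
+```rust
+use std::collections::BinaryHeap;
+
+/// Hypothetical stand-in for a Parquet row group: the `min` statistic
+/// of the sort column plus the column's decoded values.
+struct RowGroup {
+    min: i64,
+    values: Vec<i64>,
+}
+
+/// Returns the `k` smallest values in ascending order, plus the number
+/// of row groups that were skipped using statistics alone.
+fn top_k_smallest(row_groups: &[RowGroup], k: usize) -> (Vec<i64>, usize) {
+    // Max-heap: once it holds `k` values, `peek()` is the current
+    // K-th-best value, which plays the role of the filter threshold.
+    let mut heap: BinaryHeap<i64> = BinaryHeap::new();
+    let mut skipped = 0;
+    for rg in row_groups {
+        if heap.len() == k {
+            if let Some(&threshold) = heap.peek() {
+                // Nothing in this row group can beat the threshold.
+                if rg.min >= threshold {
+                    skipped += 1;
+                    continue;
+                }
+            }
+        }
+        for &v in &rg.values {
+            if heap.len() < k {
+                heap.push(v);
+            } else if let Some(&threshold) = heap.peek() {
+                if v < threshold {
+                    heap.pop();
+                    heap.push(v);
+                }
+            }
+        }
+    }
+    (heap.into_sorted_vec(), skipped)
+}
+```
+
+In this model, reading the row group with the smallest `min` first
+fills the heap with near-final values, so later row groups fail the
+statistics check and are never decoded.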
+
+The plan diagram at the start of this section shows the result we
+want: the plan after sort pushdown loses the `SortExec` node.
+Everything else in it (the `SortPreservingMergeExec`, the
+`RepartitionExec`, the `DataSourceExec`) was already in the plan.
+We just need the optimizer to convince itself that the bottom of
+the plan is producing the order requested.
+
+## The `PushdownSort` Rule
+
+The **`PushdownSort`** physical optimizer rule puts a question and a
+request to each `ExecutionPlan`:
+
+1. "Can you produce output in *this* ordering?"
+2. "If yes, please rearrange yourself so that you actually do."
+
+The answer is one of `Exact`, `Inexact`, or `Unsupported`. `Exact`
+means the surrounding `SortExec` can be deleted entirely; `Inexact`
+means the source will read the data in a near-sorted order so
+`TopK` and other consumers benefit, but `SortExec` stays for
+strict correctness; `Unsupported` means the rule leaves the plan
+as it is. The rest of this post covers what each merged capability
+does on top of this protocol: first the `Exact` path, then the
+`Inexact` path.
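+
+As a rough illustration of that three-way protocol, the sketch below
+maps each answer to what happens to the surrounding `SortExec`. The
+type and function names (`PushdownAnswer`, `SortAction`,
+`apply_answer`) are hypothetical and chosen for this example; they are
+not DataFusion's actual types.
+
+```rust
+/// Hypothetical model of the three possible answers from a plan node.
+#[derive(Debug, Clone, Copy, PartialEq)]
+enum PushdownAnswer {
+    Exact,
+    Inexact,
+    Unsupported,
+}
+
+/// Hypothetical model of what the rule does with the `SortExec`.
+#[derive(Debug, Clone, Copy, PartialEq)]
+enum SortAction {
+    /// Output order is proven, so the sort is deleted.
+    RemoveSort,
+    /// Source reads in a near-sorted order; the sort stays on top.
+    KeepSortOverReorderedSource,
+    /// Nothing can be pushed down; the plan is left as it was.
+    KeepSortUnchanged,
+}
+
+fn apply_answer(answer: PushdownAnswer) -> SortAction {
+    match answer {
+        PushdownAnswer::Exact => SortAction::RemoveSort,
+        PushdownAnswer::Inexact => SortAction::KeepSortOverReorderedSource,
+        PushdownAnswer::Unsupported => SortAction::KeepSortUnchanged,
+    }
+}
+```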
+
+## Sort Elimination via Statistics
+
+<img src="/blog/images/sort-pushdown/phase1-file-reorder.svg" alt="Sort elimination: rearranging files within a partition by min/max statistics so the file list is in range order" width="100%" class="img-fluid"/>
+
+The initial `Inexact`-only path left a sharp edge that motivated
+stats-based sort elimination. Consider this realistic scenario:
+
+* Three files: `a.parquet`, `b.parquet`, `c.parquet`.
+* Each declares `WITH ORDER (ts ASC)`.
+* Internally each file *is* sorted by `ts`.
+* But they were written by different ingestion jobs and end up listed
+  in the **wrong order** on disk (e.g. alphabetical by name, not by
+  time).
+
+`validated_output_ordering()` looks at this, sees that the
+file-internal ordering disagrees with the file-list order, and
+**strips the ordering entirely**. From the optimizer's point of view
+the scan now has no declared ordering, so `EnforceSorting` (which runs
+earlier in the pipeline) inserts a `SortExec`. The data is sorted on
+disk; the optimizer just can't tell.
+
+Stats-based sort elimination fixes this in `PushdownSort`, which
+runs late — after `EnforceDistribution` and `EnforceSorting` have
+already shaped the plan. When `PushdownSort` finds a `SortExec`
+above a file scan whose ordering was stripped (a `FileSource`
+`Unsupported` result), it does three things inside
+`FileScanConfig::try_pushdown_sort`:
+
+1. **Sort the file list by per-file statistics on the sort
+   column(s)** within each file group (the diagram above). The
+   pre-existing [`MinMaxStatistics`] helper reads each file's
+   `column_statistics[c].min_value` / `.max_value` for each sort
+   column `c`, then sorts the file list by those minimum values.
+   `sort_files_within_groups_by_statistics` does the per-group
+   orchestration and decides whether each group is non-overlapping
+   after the sort.
+2. **Check adjacency within each group**: walk each sorted file group
+   independently and ask whether `file[i].max ≤ file[i+1].min` for
+   every adjacent pair (touching at the boundary is fine — value `v`
+   showing up as the last row of one file and the first row of the
+   next still produces a sorted stream). The check is **per file
+   group**, not across groups; cross-group ordering is the job of
+   `SortPreservingMergeExec` at runtime (more on this below).
+3. **Upgrade `Unsupported` to `Exact`** when adjacency holds, the
+   table has a declared `output_ordering` (from `WITH ORDER` or
+   parquet `sorting_columns`), and the sort columns are null-free —
+   the last condition preserves `NULLS LAST`/`NULLS FIRST` semantics
+   across file boundaries. `PushdownSort` then removes the `SortExec`
+   itself and the plan becomes streamable.
+
+[`MinMaxStatistics`]: https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/statistics.rs
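+
+Steps 1 and 2 above can be sketched for a single file group and a
+single `i64` sort column. This is an illustrative model, not the code
+in `FileScanConfig`: `FileRange` and `sort_and_check` are hypothetical
+names, and the null-free and declared-ordering conditions from step 3
+are assumed to hold already.
+
+```rust
+/// Hypothetical per-file min/max statistics for one sort column.
+#[derive(Debug, Clone, PartialEq)]
+struct FileRange {
+    name: &'static str,
+    min: i64,
+    max: i64,
+}
+
+/// Sorts one file group by `min`, then checks that every adjacent
+/// pair satisfies `file[i].max <= file[i + 1].min`.
+fn sort_and_check(mut files: Vec<FileRange>) -> (Vec<FileRange>, bool) {
+    // Step 1: put the file list in range order.
+    files.sort_by_key(|f| f.min);
+    // Step 2: touching at the boundary is allowed, overlap is not.
+    let non_overlapping = files.windows(2).all(|pair| pair[0].max <= pair[1].min);
+    (files, non_overlapping)
+}
+```
+
+When the returned flag is `true`, emitting the files in the returned
+order yields a sorted stream for that group, which is the property
+the `Exact` upgrade relies on.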
+
+One caveat that comes straight from `MinMaxStatistics`: the stats
+sort only fires when every `ORDER BY` expression is a plain column.
+`ORDER BY date_trunc('hour', ts)` silently skips the upgrade — there
+is no per-file min/max for the function output to compare against.
+Extending sort pushdown across monotonic function wrappers is one of
+the open follow-ups.
+
+(Runtime reorder covered later does handle some function-wrapped
+sorts via monotonicity inference — but stats-based sort elimination
+still needs a plain column.)
+
+<img src="/blog/images/sort-pushdown/phase2-stats-overlap.svg" alt="Detecting non-overlapping ranges via min/max statistics" width="100%" class="img-fluid"/>
+
+The diagram above contrasts the two cases. On the left, ranges are
+non-overlapping after sort, so we can guarantee that emitting the
+files in min-order produces a globally sorted stream. On the right,
+the ranges overlap, so even after sorting the files by `min(ts)` we
+cannot guarantee global ordering — the upgrade is skipped and
+`SortExec` stays in place.
+
+The implementation handles a few edge cases worth calling out:
+
+* **Buffering the eliminated `SortExec`.** When a `SortExec` with
+  `preserve_partitioning=true` was sitting under a
+  `SortPreservingMergeExec` (SPM), it wasn't just sorting: it was also
+  acting as an *implicit in-memory buffer* for the SPM above it. The
+  SPM picks rows from each partition stream one at a time; without
+  the upstream `SortExec` holding batches in memory, the SPM would
+  read directly from I/O-bound sources and stall on every pick. The
+  rule compensates by inserting a [`BufferExec`] in the `SortExec`'s
+  place — bounded streaming buffer, same throughput shape, no
+  blocking sort. Capacity is configurable via
+  [`sort_pushdown_buffer_capacity`].
+* **`fetch` preservation** through `EnforceDistribution`. The
+  distribution rule sometimes strips a `SortExec`'s `fetch` field and
+  re-adds the node later. The sort pushdown work plumbs `fetch`
+  through so a surviving `LIMIT` is not lost.
+* **Per-group, not global, non-overlap.** The adjacency check is
+  scoped to each file group. Two file groups can have *overlapping*
+  ranges and the upgrade still fires, as long as each group is
+  internally non-overlapping. That works because each group already
+  produces an independently ordered stream at runtime, and
+  `SortPreservingMergeExec` then picks rows across streams in value
+  order to produce the final globally sorted output. The rule only
+  has to prove the per-stream property.
+* **Single-partition vs multi-partition execution.** The default
+  multi-partition setup byte-range-splits files into single-file
+  groups, after which `validated_output_ordering()` works on its
+  own. Stats-based reorder only fires when files aren't split —
+  typically `--partitions 1` or files small enough that the
+  splitter leaves them alone.
+
+[`BufferExec`]: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/buffer.rs
+[`sort_pushdown_buffer_capacity`]: https://github.com/apache/datafusion/pull/21426
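+
+To see why per-group ordering is enough, here is a small k-way merge
+over already-sorted streams, standing in for what
+`SortPreservingMergeExec` does across file groups. It is a sketch with
+a hypothetical function name (`merge_sorted_groups`) over plain `i64`
+values, not the operator's implementation.
+
+```rust
+use std::cmp::Reverse;
+use std::collections::BinaryHeap;
+
+/// Merges streams that are each sorted ascending into one sorted
+/// output. The value ranges of different streams may overlap.
+fn merge_sorted_groups(groups: &[Vec<i64>]) -> Vec<i64> {
+    // Min-heap of (value, group index, row index within the group).
+    let mut heap = BinaryHeap::new();
+    for (g, rows) in groups.iter().enumerate() {
+        if let Some(&first) = rows.first() {
+            heap.push(Reverse((first, g, 0usize)));
+        }
+    }
+    let mut out = Vec::new();
+    while let Some(Reverse((value, g, i))) = heap.pop() {
+        out.push(value);
+        // Refill from the stream the value came from.
+        if let Some(&next) = groups[g].get(i + 1) {
+            heap.push(Reverse((next, g, i + 1)));
+        }
+    }
+    out
+}
+```
+
+The merge only needs each input stream to be sorted on its own, which
+is the property the per-group adjacency check proves.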
+
+## Benchmarks
+
+<img src="/blog/images/sort-pushdown/benchmark.svg" alt="Sort pushdown benchmark: 2x-49x speedup across four queries" width="100%" class="img-fluid"/>
+
+The [`sort_pushdown`] benchmark suite reproduces the
+"wrong-order file list" scenario by generating Parquet files whose
+names are intentionally reversed against their sort-key ranges. Numbers
+below are `--partitions 1`, release build, with stats-based sort
+elimination enabled, versus `main`:
+
+[`sort_pushdown`]: https://github.com/apache/datafusion/tree/main/benchmarks/queries/sort_pushdown
+
+| Query                                       | Before  | After   | Speedup |
+| ------------------------------------------- | -------:| -------:| -------:|
+| Q1 — `ORDER BY key` (full scan)             | 259 ms  | 122 ms  | **2.1×** |
+| Q2 — `ORDER BY key LIMIT 100`               |  80 ms  |   3 ms  | **27×**  |
+| Q3 — `SELECT * ORDER BY key`                | 700 ms  | 313 ms  | **2.2×** |
+| Q4 — `SELECT * ORDER BY key LIMIT 100`      | 342 ms  |   7 ms  | **49×**  |
+
+The shape of the speedup is what you would expect once `SortExec` is
+removed:
+
+* **Full-scan queries (Q1, Q3)** still have to push every row through
+  the pipeline, so the gain is "just" the cost of the sort itself —
+  roughly half the original time. This matches the rule of thumb that
+  a blocking sort doubles end-to-end latency on data that fits in
+  memory.
+* **`LIMIT` queries (Q2, Q4)** benefit much more because removing
+  `SortExec` converts the LIMIT into a static `fetch` on the data
+  source — the reader stops the moment K rows have been emitted,
+  instead of reading the full file, sorting, and truncating.
+  This is the "early termination at batch granularity" case from
+  the runtime-difference section above. A 342 ms full-file scan
+  collapses into a 7 ms K-row read.
+
+The default multi-partition execution path is unaffected: those
+plans already produce correct orderings via byte-range splitting,
+so stats-based sort elimination simply does not fire there. No
+regression and no behavior change for typical multi-threaded
+queries.
+
+## Runtime Reorder for `TopK` and `DESC` Queries
+
+Stats-based sort elimination handles the `Exact` upgrade — strong
+correctness, sort elimination — but only when the table has a
+declared `output_ordering` *and* the files are provably
+non-overlapping after sorting by min. Three classes of queries
+fall outside that window:
+
+* **Unsorted data** — no `WITH ORDER`, no parquet `sorting_columns`.
+  The `Exact` upgrade cannot fire because there is no ordering
+  claim to upgrade.
+* **Overlapping ranges** — files written by different ingestion
+  jobs share time windows. The `Exact` upgrade keeps the `SortExec`
+  because the global ordering can't be proven, even though the
+  files often do contain large stretches of in-order data.
+* **`ORDER BY ... DESC` on ASC-sorted data** — flipping iteration
+  at the row-group (RG) level emits "RGs descending × rows ascending",
+  close to the requested order but not strictly DESC, so the
+  `SortExec` has to stay for correctness.
+
+For all three, a full external `SortExec` is overkill. The parquet
+metadata is right there, and reading the *most-promising* data
+first lets `TopK`'s dynamic filter threshold tighten quickly so the
+rest gets pruned. Runtime reorder wires that up by generalising
+the `Inexact` path the rule introduced.
+
+### Two independent triggers for `Inexact`
+
+<img src="/blog/images/sort-pushdown/pr21956-decision.svg" alt="try_pushdown_sort decision tree: Exact, Inexact, or Unsupported" width="100%" class="img-fluid"/>
+
+`try_pushdown_sort` first checks whether the natural ordering
+already satisfies the request (→ `Exact`) or whether a non-empty
+*proper prefix* of the request is already satisfied (→
+`Unsupported`, so the outer `SortExec`'s `sort_prefix`
+optimisation can fire instead). Otherwise it looks at two
+**independent** `Inexact` signals; either one is enough, and they
+compose when both apply:
+
+**Stats-based RG reorder** — fires when the leading sort key is a
+plain `Column` in the file schema. The opener sorts row groups by
+`min(col)` via parquet statistics. Restrictive (plain physical
+column only), but lets the scan globally reorder data so the
+most-promising row group is decoded first.
+
+**Iteration reverse** — fires when the source's declared ordering,
+**reversed**, satisfies the request. This goes through the full
+`EquivalenceProperties` reasoning machinery and is **strictly more
+powerful** than the column-in-schema check above. It fires for:
+
+* **Function monotonicity** — file declares `ts DESC`, request is
+  `date_trunc('day', ts) ASC` → reversed `ts ASC` satisfies the
+  request via monotonicity even though parquet has no stats keyed
+  by the function. Same for `ceil(value)`, `CAST(x AS Date)`, etc.
+* **Constant columns from filters** — `WHERE region = 'us'` marks
+  `region` as constant in the equivalence class, so a request
+  involving `region` is trivially satisfied.
+* **Equivalence relationships** — `WHERE a = b` transfers a known
+  ordering on `a` to a request on `b`.
+* **Multi-column composite orderings** — the source's declared
+  multi-key ordering reversed satisfies the multi-key request as a
+  whole.
+
+### Three runtime steps in the opener
+
+<img src="/blog/images/sort-pushdown/pr21956-runtime-pipeline.svg" alt="Runtime reorder pipeline: file reorder, RG reorder, then optional reverse" width="100%" class="img-fluid"/>
+
+The two triggers above set two fields on `ParquetSource`:
+
+```rust
+struct ParquetSource {
+    sort_order_for_reorder: Option<LexOrdering>,  // what to reorder by
+    reverse_row_groups:     bool,                 // whether to flip iteration
+    // ...
+}
+```
+
+The opener consumes them in three composable steps:
+
+1. **File-level reorder** (`FileSource::reorder_files`). The shared
+   morsel queue — a work-stealing primitive that lets sibling
+   partitions share a single file pool — sorts the partitioned-file
+   list by `min(col)`. The first file picked across all partitions
+   is globally the most-promising one. Skipped when the stats
+   reorder trigger didn't fire.
+2. **Row-group-level reorder**
+   (`PreparedAccessPlan::reorder_by_statistics`). Once a file is
+   opened, sort its row groups by `min(col)` ASC so the most-promising
+   row group is decoded first. Same trigger as step 1; the two
+   layers nest because a file's `min(col)` is the minimum over its
+   row groups' `min(col)` values.
+3. **Iteration reverse** (`PreparedAccessPlan::reverse`). Flips the
+   row-group iteration order. For `DESC` requests on a plain
+   column the flip composes with steps 1–2 (ASC-by-min → reverse →
+   DESC-by-min). For the function-wrapped / constants-from-filters /
+   multi-column cases, steps 1–2 are skipped and this is the only
+   step that runs — just a flip of the file's natural order.
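+
+Steps 2 and 3 compose as a sort followed by an optional flip. The
+sketch below models that on row-group indices only; the two boolean
+parameters loosely mirror the `sort_order_for_reorder` and
+`reverse_row_groups` fields, while `RowGroupStats` and
+`plan_row_group_order` are hypothetical names for this example.
+
+```rust
+/// Hypothetical per-row-group statistic for the leading sort column.
+#[derive(Debug, Clone, Copy, PartialEq)]
+struct RowGroupStats {
+    index: usize,
+    min: i64,
+}
+
+/// Returns the order in which row groups would be read.
+fn plan_row_group_order(
+    stats: &[RowGroupStats],
+    reorder_by_min: bool,
+    reverse: bool,
+) -> Vec<usize> {
+    let mut order = stats.to_vec();
+    if reorder_by_min {
+        // Stats reorder: ascending by `min` (stable sort).
+        order.sort_by_key(|rg| rg.min);
+    }
+    if reverse {
+        // Iteration reverse: flip whatever order we have so far.
+        order.reverse();
+    }
+    order.into_iter().map(|rg| rg.index).collect()
+}
+```
+
+With both flags set, the plan is "ASC by min, then reversed", which is
+the `DESC` on a plain column case. With only `reverse` set, the file's
+natural row-group order is flipped and no statistics are consulted.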
+
+Both flags surface on the `DataSourceExec` line in `EXPLAIN`:
+
+```text
+DataSourceExec: file_groups=..., file_type=parquet,
+  sort_order_for_reorder=[a@0 ASC], reverse_row_groups=true
+```
+
+### `ORDER BY ... DESC` in practice
+
+A `DESC` request on an ASC-sorted plain column goes through both
+triggers — the stats reorder normalises to ASC-by-min and the
+iteration reverse flips to DESC-by-min. The result is *"RGs
+descending × rows ascending"* — close to the requested order but
+not strictly DESC, hence `Inexact`. The `SortExec` stays for
+correctness, but `TopK`'s dynamic filter tightens fast because the
+first row groups read already contain values near the final
+answer, so subsequent row groups can be skipped via min/max
+statistics. This is what powers fast `ORDER BY ts DESC LIMIT N` on
+ASC-sorted files today.
+
+Why not full `Exact` reverse that deletes the `SortExec` outright?
+Decoding a whole row group forward, reversing the buffer, then
+emitting works — but peaks at ~128 MB vs. the few-MB-per-batch
+streaming profile readers expect. `Exact` reverse waits on a
+page-level primitive that keeps the runtime win on a streaming
+memory budget — covered in the roadmap below.
+
+### When neither `Inexact` trigger fires
+
+* **Aggregations on the sort key** — `SELECT URL, COUNT(*) AS c FROM
+  hits GROUP BY URL ORDER BY c DESC LIMIT 10` (the ClickBench TopK
+  shape). The leading sort key `c` is an aggregate result with no
+  per-RG stats and no equivalence to a file column, so neither
+  trigger fires. Pushing sort metadata through `AggregateExec` is a
+  separate problem entirely.
+* **Function-wrapped sort with no source-declared ordering** — the
+  reversed-equivalence branch has nothing to invert.
+* **Source declares a forward prefix of the request** —
+  `try_pushdown_sort` returns `Unsupported` so the surrounding
+  `SortExec` can keep its `sort_prefix` annotation; prefix-aware
+  early termination in `TopK` is strictly better than reorder on
+  data that's already in prefix order on disk.
+
+## Current Bottlenecks
+
+Sort elimination removes the `SortExec` entirely when ranges are
+non-overlapping — there's nothing more to optimize on that path.
+The `Inexact` runtime-reorder path is where the merged work still
+leaves performance on the table. Three concrete inefficiencies:
+
+### Bottleneck 1: `SortExec` stays on top, so `LIMIT N` does not propagate as a static stop signal
+
+In the `Inexact` path the `SortExec` stays in the plan and
+`TopK`'s fetch belongs to `SortExec`, not to the parquet scan.
+The only thing that can cut work below the `SortExec` is the
+dynamic-filter pushdown: as the heap fills, the filter
+(`ts > threshold`) is pushed to the source and its threshold
+tightens with every batch. That filter does **stats-prune
+subsequent, not-yet-opened row groups** — if a row group's
+`max(ts) < threshold` it is skipped without decode. But the
+`SortExec` keeps pulling batches, and the outer operator does its
+own final ordering pass on the "RGs descending × rows ascending"
+stream even after the heap is settled. We have measured this
+in-house: swapping our internal `Exact` reverse for upstream's
+`Inexact` reverse + `TopK` on `ORDER BY ts DESC LIMIT N` makes
+end-to-end latency go **up**, not down — exactly because the
+`SortExec` final pass and the per-row heap maintenance pile up on
+top.
+
+### Bottleneck 2: Inside the currently-open row group, the sort column is fully decoded
+
+Even with the dynamic filter pushed all the way to parquet, the
+filter has to be evaluated row-by-row inside the open row group:
+the sort column has to be **fully decoded** so each value can be
+compared against the threshold, the surviving rows feed the heap
+to tighten the threshold, and only then can the resulting
+`RowSelection` skip the *other* columns for rows that didn't
+pass. For `ORDER BY ts DESC LIMIT 10` on a 1M-row row group that
+is ~1M sort-column decodes regardless of `N`. Parquet doesn't
+allow partial row-group reads, so even an RG-level `Exact`
+reverse would pay this same cost — the only way to materially
+reduce it is to drop to page granularity.
+
+### Bottleneck 3: File-granular work scheduling can't close the tap mid-file
+
+Once a `FileStream` picks up a file from the shared work queue,
+it has to finish that file. Today's dynamic work scheduling is
+**file-granular**: idle partitions stop pulling new files from
+the queue once a global limit is satisfied, but the partition
+that's currently inside a file decodes that file's remaining row
+groups regardless. The work queue holds `PartitionedFile`, not
+row-group descriptors. So even with a tight threshold and
+aggressive stats pruning of un-opened row groups, the *currently
+open* file gets read to completion.
+
+## Roadmap: Removing the Bottlenecks
+
+### Page-level `Exact` reverse — addresses bottlenecks 1 + 2
+
+<img src="/blog/images/sort-pushdown/reverse-scan.svg" alt="Row-group reverse (128 MB peak, ~8 pages decoded) vs page reverse (1 MB peak, 1 page decoded)" width="100%" class="img-fluid"/>
+
+Parquet's `OffsetIndex` gives us byte-precise locations for every
+data page in a column chunk, so we can `seek` directly to the last
+page, decode it forward, reverse the resulting batch, and emit.
+Peak buffer drops from ~128 MB (one row group) to ~1 MB (one
+page), and first-batch latency drops to the cost of one page
+decode — the row-group-level memory cliff disappears. With each
+batch already in DESC order, `PushdownSort` can finally return
+`Exact` for `DESC` requests, the `SortExec` is removed, and
+`LIMIT N` becomes a static fetch on the source. The
+`Inexact`-final-ordering-pass overhead from Bottleneck 1 goes
+away outright, and the Bottleneck-2 decode reduces to the rows
+the page-level seek actually pulls in.
+
+Why not reverse the rows *within* a page directly? Because we
+can't. Parquet's page encodings (RLE, dictionary, delta,
+bit-packing) are all forward streams — you cannot decode the last
+value without decoding every value that came before it. The
+design is: **reverse the page traversal, forward-decode each
+page, reverse the resulting `RecordBatch`**.
+
+The primitive is landing upstream in arrow-rs. Early numbers on a
+100k-row, 98-page column chunk show **~50× faster
+time-to-first-N** for `n ≤ 1 page` and **~9× faster** for `n`
+spanning 10 pages, compared with the row-group-level `Exact`
+reverse. The DataFusion-side integration that turns this primitive
+into an `Exact` result is a follow-up gated on the arrow-rs merge.
+
+The killer use case is **filtered reverse `TopK`**:
+
+```sql
+SELECT * FROM events
+WHERE user_id = 42
+ORDER BY ts DESC
+LIMIT 10
+```
+
+`RowSelection::with_limit` cannot help here — you don't know in
+advance which rows match `user_id = 42`, so you can't pre-compute
+a selection of the "last 10 matching rows". The only correct
+strategy is to stream pages backward, evaluate the filter on
+each, and stop when 10 matches are collected. Row-group reverse
+stops at a ~128 MB granularity. Page reverse stops at ~1 MB
+granularity. For a selective filter, the saving compounds.
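+
+The backward-streaming strategy can be modelled on in-memory pages.
+This is a sketch under stated assumptions, not the arrow-rs primitive:
+`Event` and `filtered_reverse_top_n` are hypothetical names, each
+inner `Vec` stands for one already-decoded page, and rows are assumed
+to be sorted by `ts` ascending across and within pages.
+
+```rust
+/// Hypothetical row type for the `events` table in the query above.
+#[derive(Debug, Clone, Copy, PartialEq)]
+struct Event {
+    ts: i64,
+    user_id: u32,
+}
+
+/// Walks pages from last to first and returns up to `n` matching rows
+/// in descending `ts` order, plus the number of pages touched.
+fn filtered_reverse_top_n(
+    pages: &[Vec<Event>],
+    user_id: u32,
+    n: usize,
+) -> (Vec<Event>, usize) {
+    let mut matches = Vec::new();
+    let mut pages_decoded = 0;
+    // Reverse the page traversal.
+    for page in pages.iter().rev() {
+        if matches.len() >= n {
+            break; // enough rows: earlier pages are never touched
+        }
+        pages_decoded += 1;
+        // The page is "decoded" forward, then its rows are visited
+        // in reverse while the filter is evaluated.
+        for event in page.iter().rev() {
+            if event.user_id == user_id {
+                matches.push(*event);
+                if matches.len() == n {
+                    break;
+                }
+            }
+        }
+    }
+    (matches, pages_decoded)
+}
+```
+
+The page counter makes the granularity argument concrete: the loop
+stops at the first page boundary after the `n`-th match, so the work
+done depends on how far back the matches are spread, not on the size
+of the row group.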
+
+### Row-group-level dynamic early termination — addresses bottleneck 3
+
+The work queue today holds `PartitionedFile`. Switching it to
+hold **row-group descriptors** lets a partition stop mid-file the
+moment a global signal says `TopK` has K confirmed winners. Two
+flavors depending on whether file ranges actually overlap after
+stats reorder:
+
+* **Non-overlapping ranges.** The first file globally contains
+  the smallest values, the second contains the next batch, and so
+  on. Once `TopK`'s threshold passes file 0's max, every
+  subsequent file is pruned by stats already — the only fix
+  needed is the RG-granular queue so the partition currently
+  inside file 0 also stops at the right RG.
+* **Overlapping ranges.** The smallest *next* value could sit in
+  any of several open files. Matching the non-overlap efficiency
+  requires actively comparing each open file's next-RG `min` and
+  pulling from whichever is smallest — a **k-way merge across
+  files** at RG granularity. The dynamic-filter pushdown already
+  approximates this implicitly (for this ascending case, an RG whose
+  `min` is above the threshold is dropped), but explicit k-way
+  comparison would close the tap earlier when the filter tightens
+  slowly across overlapping files.
+
+This is a natural extension of the existing
+[morsel-style work scheduling], but there is no PR for it yet.
+
+The two roadmap items above are *complementary*, not
+alternatives:
+
+* `Exact` reverse closes the tap for `DESC` queries by removing
+  the `SortExec` entirely.
+* Row-group-level scheduling closes the tap for `Inexact` queries
+  where `Exact` still cannot fire (function-wrapped sorts,
+  overlapping ranges) — the `SortExec` stays, but the scan stops
+  pulling row groups once `TopK` is satisfied.
+
+### Preview: the combined statistics-driven `TopK` pipeline
+
+The [combined statistics-driven `TopK` pipeline] is the in-flight
+work that stacks several of these mechanisms: pre-scan
+[TopK threshold init from parquet statistics],
+[global file reorder in the shared queue], and the runtime
+row-group / file reorder + reverse already merged. On a
+microbenchmark (single file, 61 sorted row groups, `--partitions 1`)
+**60 of the 61 row groups are skipped** and only one is decoded:
+
+| Query                              | Baseline | With pipeline | Speedup |
+| ---------------------------------- | -------: | ------------: | ------: |
+| `ORDER BY col DESC LIMIT 100`      | 28.5 ms  | 1.64 ms       | **17×** |
+| `ORDER BY col DESC LIMIT 1000`     | 22.2 ms  | 0.37 ms       | **60×** |
+| `SELECT * ORDER BY ... LIMIT 100`  | 22.5 ms  | 0.66 ms       | **34×** |
+| `SELECT * ORDER BY ... LIMIT 1000` | 22.4 ms  | 0.61 ms       | **37×** |
+
+This pipeline still reports `Inexact`: the `SortExec` stays on
+top to enforce correctness across overlapping ranges, so it still pays
+the Bottleneck-1 and Bottleneck-3 overheads listed above. The
+17×–60× speedup is what statistics-driven RG-level pruning alone can
+deliver; `Exact` reverse plus row-group-level early termination is
+what would push it further.
+
+### Extending the stats reorder step
+
+Alongside removing the bottlenecks above, the
+[stats reorder step itself has room to grow][stats-reorder-followup].
+Today it fires only when the leading sort key is a plain column.
+Reverse already handles function-wrapped and multi-column cases via
+`EquivalenceProperties` reasoning, but stats-based RG ordering does
+not yet. Lexicographic multi-key reorder via
+`arrow::compute::lexsort_to_indices` is low-hanging fruit; extending
+to monotonic function wrappers via leaf-column extraction (e.g.
+`date_trunc('day', ts)` → use `min(ts)`) needs a bit more
+`EquivalenceProperties` integration but is doable.
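+
+The lexicographic multi-key idea can be illustrated without Arrow. The
+sketch below is a std-only stand-in for what a
+`lexsort_to_indices`-based implementation would compute over arrays of
+statistics; `RowGroupMins` and `reorder_by_lex_mins` are hypothetical
+names, and two `i64` keys are assumed purely for brevity.
+
+```rust
+/// Hypothetical per-row-group statistics: the `min` of each sort key.
+#[derive(Debug, Clone, Copy)]
+struct RowGroupMins {
+    key1_min: i64,
+    key2_min: i64,
+}
+
+/// Returns row-group indexes ordered lexicographically by
+/// `(key1_min, key2_min)` instead of by the leading key alone.
+fn reorder_by_lex_mins(stats: &[RowGroupMins]) -> Vec<usize> {
+    let mut indexes: Vec<usize> = (0..stats.len()).collect();
+    // Tuples compare lexicographically; the sort is stable, so row
+    // groups with identical keys keep their original file order.
+    indexes.sort_by_key(|&i| (stats[i].key1_min, stats[i].key2_min));
+    indexes
+}
+```
+
+For row groups with minimums `(2, 0)`, `(1, 5)`, `(1, 3)`, a
+leading-key-only reorder leaves the two tied groups in file order,
+while the lexicographic version breaks the tie on the second key and
+reads the `(1, 3)` group first.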
+
+[morsel-style work scheduling]: https://github.com/apache/datafusion/pull/21351
+[global file reorder in the shared queue]: https://github.com/apache/datafusion/issues/21733
+[TopK threshold init from parquet statistics]: https://github.com/apache/datafusion/pull/21712
+[combined statistics-driven `TopK` pipeline]: https://github.com/apache/datafusion/pull/21580
+[stats-reorder-followup]: https://github.com/apache/datafusion/issues/22198
+
+Issues that make good entry points for new contributors:
+
+* [Umbrella issue for sort pushdown][umbrella-issue].
+* [Reorder row groups by statistics within each file][rg-reorder-issue].
+* [Add more `ExecutionPlan` impls to support sort pushdown][more-impls-issue].
+
+[umbrella-issue]: https://github.com/apache/datafusion/issues/17348
+[rg-reorder-issue]: https://github.com/apache/datafusion/issues/21317
+[more-impls-issue]: https://github.com/apache/datafusion/issues/19394
+
+## Acknowledgements
+
+Thank you to [@alamb], [@adriangb], [@xudong963], [@2010YOUY01], and
+[@Dandandan] for reviewing the design and the patches across many
+iterations. The DataFusion community's willingness to engage deeply
+with optimizer changes — including the ones that touch foundational
+invariants — is what made this work possible.
+
+[@alamb]: https://github.com/alamb
+[@adriangb]: https://github.com/adriangb
+[@xudong963]: https://github.com/xudong963
+[@2010YOUY01]: https://github.com/2010YOUY01
+[@Dandandan]: https://github.com/Dandandan
+
+## References
+
+Prior post this work builds on:
+
+* [Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries][dyn-filters-blog] — the dynamic filter primitive `TopK` uses.
+
+Landed PRs that make up this work:
+
+* `MinMaxStatistics` foundation: [apache/datafusion#9593](https://github.com/apache/datafusion/pull/9593)
+* `PushdownSort` rule + row-group reverse: [apache/datafusion#19064](https://github.com/apache/datafusion/pull/19064)
+* Reverse-output redesign: [apache/datafusion#19446](https://github.com/apache/datafusion/pull/19446), [apache/datafusion#19557](https://github.com/apache/datafusion/pull/19557)
+* Sort elimination via statistics: [apache/datafusion#21182](https://github.com/apache/datafusion/pull/21182)
+* `BufferExec` capacity for sort elimination: [apache/datafusion#21426](https://github.com/apache/datafusion/pull/21426)
+* Morsel-style work scheduling: [apache/datafusion#21351](https://github.com/apache/datafusion/pull/21351)
+* Runtime reorder for `TopK` convergence: [apache/datafusion#21956](https://github.com/apache/datafusion/pull/21956)
+* Row-group-level `Inexact` reverse: [apache/datafusion#18817](https://github.com/apache/datafusion/pull/18817)
+
+In flight / open:
+
+* Page-level reverse (arrow-rs): [apache/arrow-rs#9937](https://github.com/apache/arrow-rs/pull/9937), discussion in [apache/arrow-rs#9934](https://github.com/apache/arrow-rs/issues/9934)
+* TopK threshold init from parquet statistics: [apache/datafusion#21712](https://github.com/apache/datafusion/pull/21712)
+* Combined statistics-driven `TopK` pipeline: [apache/datafusion#21580](https://github.com/apache/datafusion/pull/21580)
+* Global file reorder in shared queue: [apache/datafusion#21733](https://github.com/apache/datafusion/issues/21733)
+* Multi-column / function-wrapped reorder follow-ups: [apache/datafusion#22198](https://github.com/apache/datafusion/issues/22198)
+* Umbrella issue for sort pushdown: [apache/datafusion#17348](https://github.com/apache/datafusion/issues/17348)
+
+Benchmark suite: [`sort_pushdown`]
diff --git a/content/images/sort-pushdown/benchmark.svg b/content/images/sort-pushdown/benchmark.svg
new file mode 100644
index 00000000..30afb7b2
--- /dev/null
+++ b/content/images/sort-pushdown/benchmark.svg
@@ -0,0 +1,75 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 820 380">
+  <style>
+    text { font-family: 'Segoe UI', Arial, sans-serif; }
+    .title { font-size: 14px; font-weight: 700; fill: #222; }
+    .axis-label { font-size: 11px; fill: #555; }
+    .qlabel { font-size: 12px; font-weight: 600; fill: #222; }
+    .qdesc { font-size: 10px; fill: #777; font-family: 'Courier New', monospace; }
+    .bartext-light { font-size: 10px; fill: #333; }
+    .bartext-on { font-size: 10px; font-weight: 700; fill: #fff; }
+    .speedup { font-size: 13px; font-weight: 700; fill: #27ae60; }
+    .legend { font-size: 11px; fill: #444; }
+  </style>
+
+  <text class="title" x="410" y="22" text-anchor="middle">sort_pushdown benchmark (single partition, release, reversed-name data)</text>
+
+  <!-- y-axis line + ticks (log-ish scale via manual placement) -->
+  <line x1="160" y1="60" x2="160" y2="320" stroke="#888" stroke-width="1"/>
+  <text class="axis-label" x="155" y="65" text-anchor="end">700ms</text>
+  <line x1="155" y1="62" x2="160" y2="62" stroke="#888"/>
+  <text class="axis-label" x="155" y="135" text-anchor="end">259ms</text>
+  <line x1="155" y1="132" x2="160" y2="132" stroke="#888"/>
+  <text class="axis-label" x="155" y="200" text-anchor="end">80ms</text>
+  <line x1="155" y1="197" x2="160" y2="197" stroke="#888"/>
+  <text class="axis-label" x="155" y="260" text-anchor="end">7ms</text>
+  <line x1="155" y1="257" x2="160" y2="257" stroke="#888"/>
+  <text class="axis-label" x="155" y="320" text-anchor="end">0</text>
+
+  <!-- bars per query: HEAD (gray) and PR (green), side-by-side -->
+  <!-- Q1: 259 -> 122, full ORDER BY -->
+  <text class="qlabel" x="225" y="345" text-anchor="middle">Q1</text>
+  <text class="qdesc" x="225" y="358" text-anchor="middle">ORDER BY full</text>
+  <rect x="180" y="135" width="40" height="185" fill="#bdc3c7"/>
+  <text class="bartext-on" x="200" y="155" text-anchor="middle">259</text>
+  <rect x="230" y="222" width="40" height="98" fill="#27ae60"/>
+  <text class="bartext-on" x="250" y="242" text-anchor="middle">122</text>
+  <text class="speedup" x="225" y="120" text-anchor="middle">2.1×</text>
+
+  <!-- Q2: 80 -> 3, ORDER BY LIMIT -->
+  <text class="qlabel" x="345" y="345" text-anchor="middle">Q2</text>
+  <text class="qdesc" x="345" y="358" text-anchor="middle">ORDER BY LIMIT</text>
+  <rect x="300" y="199" width="40" height="121" fill="#bdc3c7"/>
+  <text class="bartext-on" x="320" y="219" text-anchor="middle">80</text>
+  <rect x="350" y="313" width="40" height="7" fill="#27ae60"/>
+  <text class="bartext-light" x="370" y="313" text-anchor="start">3</text>
+  <text class="speedup" x="345" y="186" text-anchor="middle">27×</text>
+
+  <!-- Q3: 700 -> 313, SELECT * ORDER BY -->
+  <text class="qlabel" x="465" y="345" text-anchor="middle">Q3</text>
+  <text class="qdesc" x="465" y="358" text-anchor="middle">SELECT * ORDER BY</text>
+  <rect x="420" y="62" width="40" height="258" fill="#bdc3c7"/>
+  <text class="bartext-on" x="440" y="82" text-anchor="middle">700</text>
+  <rect x="470" y="174" width="40" height="146" fill="#27ae60"/>
+  <text class="bartext-on" x="490" y="194" text-anchor="middle">313</text>
+  <text class="speedup" x="465" y="50" text-anchor="middle">2.2×</text>
+
+  <!-- Q4: 342 -> 7, SELECT * ORDER BY LIMIT -->
+  <text class="qlabel" x="585" y="345" text-anchor="middle">Q4</text>
+  <text class="qdesc" x="585" y="358" text-anchor="middle">SELECT * ORDER BY LIMIT</text>
+  <rect x="540" y="98" width="40" height="222" fill="#bdc3c7"/>
+  <text class="bartext-on" x="560" y="118" text-anchor="middle">342</text>
+  <rect x="590" y="313" width="40" height="7" fill="#27ae60"/>
+  <text class="bartext-light" x="610" y="313" text-anchor="start">7</text>
+  <text class="speedup" x="585" y="85" text-anchor="middle">49×</text>
+
+  <!-- y-axis label -->
+  <text class="axis-label" x="78" y="190" text-anchor="middle" transform="rotate(-90 78 190)">latency (ms)</text>
+
+  <!-- legend -->
+  <rect x="660" y="80" width="14" height="14" fill="#bdc3c7"/>
+  <text class="legend" x="680" y="92">main (before)</text>
+  <rect x="660" y="105" width="14" height="14" fill="#27ae60"/>
+  <text class="legend" x="680" y="117">sort pushdown phase 2</text>
+  <text class="legend" x="660" y="145" style="fill:#777">Lower is better</text>
+  <text class="legend" x="660" y="160" style="fill:#777">--partitions 1, release</text>
+</svg>
diff --git a/content/images/sort-pushdown/phase1-file-reorder.svg b/content/images/sort-pushdown/phase1-file-reorder.svg
new file mode 100644
index 00000000..9ae798ba
--- /dev/null
+++ b/content/images/sort-pushdown/phase1-file-reorder.svg
@@ -0,0 +1,88 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 820 320">
+  <style>
+    text { font-family: 'Segoe UI', Arial, sans-serif; }
+    .title { font-size: 14px; font-weight: 700; fill: #222; }
+    .sub { font-size: 12px; fill: #555; }
+    .filename { font-size: 11px; font-weight: 600; fill: #fff; }
+    .range { font-size: 11px; fill: #333; font-family: 'Courier New', monospace; }
+    .label { font-size: 11px; fill: #555; }
+    .arrow { fill: none; stroke: #555; stroke-width: 1.6; marker-end: url(#arrow); }
+    .verdict-good { font-size: 12px; font-weight: 600; fill: #27ae60; }
+    .verdict-bad { font-size: 12px; font-weight: 600; fill: #c0392b; }
+  </style>
+  <defs>
+    <marker id="arrow" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">
+      <path d="M0,0 L10,5 L0,10 Z" fill="#555"/>
+    </marker>
+  </defs>
+
+  <!-- Title -->
+  <text class="title" x="410" y="22" text-anchor="middle">Phase 1: file rearrangement by declared ordering</text>
+
+  <!-- BEFORE column -->
+  <text class="sub" x="20" y="55">Before — directory order:</text>
+
+  <!-- file boxes (before) -->
+  <rect x="20" y="65" width="160" height="36" rx="4" fill="#5b8def"/>
+  <text class="filename" x="100" y="80" text-anchor="middle">a.parquet</text>
+  <text class="range" x="100" y="95" text-anchor="middle" style="fill:#fff">ts ∈ [200, 300]</text>
+
+  <rect x="20" y="111" width="160" height="36" rx="4" fill="#5b8def"/>
+  <text class="filename" x="100" y="126" text-anchor="middle">b.parquet</text>
+  <text class="range" x="100" y="141" text-anchor="middle" style="fill:#fff">ts ∈ [100, 200]</text>
+
+  <rect x="20" y="157" width="160" height="36" rx="4" fill="#5b8def"/>
+  <text class="filename" x="100" y="172" text-anchor="middle">c.parquet</text>
+  <text class="range" x="100" y="187" text-anchor="middle" style="fill:#fff">ts ∈ [0, 100]</text>
+
+  <text class="verdict-bad" x="100" y="220" text-anchor="middle">validated_output_ordering() = None</text>
+  <text class="label" x="100" y="238" text-anchor="middle">→ SortExec required</text>
+
+  <!-- arrow -->
+  <line class="arrow" x1="210" y1="130" x2="320" y2="130"/>
+  <text class="label" x="265" y="120" text-anchor="middle">PushdownSort</text>
+  <text class="label" x="265" y="146" text-anchor="middle">sort by min(ts)</text>
+
+  <!-- AFTER column -->
+  <text class="sub" x="350" y="55">After — sorted by stats:</text>
+
+  <rect x="350" y="65" width="160" height="36" rx="4" fill="#27ae60"/>
+  <text class="filename" x="430" y="80" text-anchor="middle">c.parquet</text>
+  <text class="range" x="430" y="95" text-anchor="middle" style="fill:#fff">ts ∈ [0, 100]</text>
+
+  <rect x="350" y="111" width="160" height="36" rx="4" fill="#27ae60"/>
+  <text class="filename" x="430" y="126" text-anchor="middle">b.parquet</text>
+  <text class="range" x="430" y="141" text-anchor="middle" style="fill:#fff">ts ∈ [100, 200]</text>
+
+  <rect x="350" y="157" width="160" height="36" rx="4" fill="#27ae60"/>
+  <text class="filename" x="430" y="172" text-anchor="middle">a.parquet</text>
+  <text class="range" x="430" y="187" text-anchor="middle" style="fill:#fff">ts ∈ [200, 300]</text>
+
+  <text class="verdict-good" x="430" y="220" text-anchor="middle">validated_output_ordering() = Exact</text>
+  <text class="label" x="430" y="238" text-anchor="middle">→ SortExec removed</text>
+
+  <!-- Right column: number line -->
+  <text class="sub" x="640" y="55" text-anchor="middle">Range layout</text>
+  <line x1="560" y1="130" x2="800" y2="130" stroke="#888" stroke-width="1"/>
+  <line x1="560" y1="125" x2="560" y2="135" stroke="#888"/>
+  <line x1="640" y1="125" x2="640" y2="135" stroke="#888"/>
+  <line x1="720" y1="125" x2="720" y2="135" stroke="#888"/>
+  <line x1="800" y1="125" x2="800" y2="135" stroke="#888"/>
+  <text class="range" x="560" y="148" text-anchor="middle">0</text>
+  <text class="range" x="640" y="148" text-anchor="middle">100</text>
+  <text class="range" x="720" y="148" text-anchor="middle">200</text>
+  <text class="range" x="800" y="148" text-anchor="middle">300</text>
+
+  <!-- range bars -->
+  <rect x="560" y="80" width="80" height="14" rx="3" fill="#27ae60"/>
+  <text class="range" x="600" y="91" text-anchor="middle" style="fill:#fff">c</text>
+  <rect x="640" y="98" width="80" height="14" rx="3" fill="#27ae60"/>
+  <text class="range" x="680" y="109" text-anchor="middle" style="fill:#fff">b</text>
+  <rect x="720" y="116" width="80" height="14" rx="3" fill="#27ae60"/>
+  <text class="range" x="760" y="127" text-anchor="middle" style="fill:#fff">a</text>
+
+  <text class="label" x="680" y="170" text-anchor="middle">Non-overlapping → ordering provable</text>
+
+  <!-- Bottom: SQL hint -->
+  <text x="410" y="290" text-anchor="middle" font-size="12" fill="#555" font-family="'Courier New', monospace">SELECT * FROM events ORDER BY ts</text>
+</svg>
diff --git a/content/images/sort-pushdown/phase2-stats-overlap.svg b/content/images/sort-pushdown/phase2-stats-overlap.svg
new file mode 100644
index 00000000..027860ef
--- /dev/null
+++ b/content/images/sort-pushdown/phase2-stats-overlap.svg
@@ -0,0 +1,79 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 820 360">
+  <style>
+    text { font-family: 'Segoe UI', Arial, sans-serif; }
+    .title { font-size: 14px; font-weight: 700; fill: #222; }
+    .panel-title { font-size: 13px; font-weight: 700; }
+    .label { font-size: 11px; fill: #555; }
+    .range { font-size: 11px; fill: #333; font-family: 'Courier New', monospace; }
+    .verdict-good { font-size: 12px; font-weight: 700; fill: #27ae60; }
+    .verdict-bad { font-size: 12px; font-weight: 700; fill: #c0392b; }
+    .axis { stroke: #888; stroke-width: 1; }
+    .tick { stroke: #888; stroke-width: 1; }
+  </style>
+
+  <text class="title" x="410" y="22" text-anchor="middle">Phase 2: use min/max statistics to prove non-overlap</text>
+
+  <!-- LEFT: Non-overlapping (Exact) -->
+  <rect x="20" y="40" width="380" height="280" rx="8" fill="#e8f5e9" stroke="#27ae60" stroke-width="1.5"/>
+  <text class="panel-title" x="210" y="62" text-anchor="middle" fill="#27ae60">Non-overlapping ranges</text>
+
+  <!-- axis -->
+  <line class="axis" x1="50" y1="240" x2="380" y2="240"/>
+  <line class="tick" x1="50" y1="235" x2="50" y2="245"/>
+  <line class="tick" x1="160" y1="235" x2="160" y2="245"/>
+  <line class="tick" x1="270" y1="235" x2="270" y2="245"/>
+  <line class="tick" x1="380" y1="235" x2="380" y2="245"/>
+  <text class="range" x="50" y="258" text-anchor="middle">0</text>
+  <text class="range" x="160" y="258" text-anchor="middle">100</text>
+  <text class="range" x="270" y="258" text-anchor="middle">200</text>
+  <text class="range" x="380" y="258" text-anchor="middle">300</text>
+  <text class="label" x="215" y="278" text-anchor="middle">min(ts) / max(ts)</text>
+
+  <!-- file ranges -->
+  <rect x="50" y="100" width="110" height="22" rx="3" fill="#27ae60"/>
+  <text class="range" x="105" y="115" text-anchor="middle" style="fill:#fff">file_c  [0..100]</text>
+  <rect x="160" y="140" width="110" height="22" rx="3" fill="#27ae60"/>
+  <text class="range" x="215" y="155" text-anchor="middle" style="fill:#fff">file_b  [100..200]</text>
+  <rect x="270" y="180" width="110" height="22" rx="3" fill="#27ae60"/>
+  <text class="range" x="325" y="195" text-anchor="middle" style="fill:#fff">file_a  [200..300]</text>
+
+  <!-- gap markers -->
+  <line x1="160" y1="92" x2="160" y2="210" stroke="#27ae60" stroke-dasharray="3,3" stroke-width="1"/>
+  <line x1="270" y1="132" x2="270" y2="210" stroke="#27ae60" stroke-dasharray="3,3" stroke-width="1"/>
+
+  <text class="verdict-good" x="210" y="305" text-anchor="middle">Ordering: Exact ✓</text>
+  <text class="label" x="210" y="320" text-anchor="middle">SortExec can be removed</text>
+
+  <!-- RIGHT: Overlapping (Inexact) -->
+  <rect x="420" y="40" width="380" height="280" rx="8" fill="#fde8e8" stroke="#c0392b" stroke-width="1.5"/>
+  <text class="panel-title" x="610" y="62" text-anchor="middle" fill="#c0392b">Overlapping ranges</text>
+
+  <line class="axis" x1="450" y1="240" x2="780" y2="240"/>
+  <line class="tick" x1="450" y1="235" x2="450" y2="245"/>
+  <line class="tick" x1="560" y1="235" x2="560" y2="245"/>
+  <line class="tick" x1="670" y1="235" x2="670" y2="245"/>
+  <line class="tick" x1="780" y1="235" x2="780" y2="245"/>
+  <text class="range" x="450" y="258" text-anchor="middle">0</text>
+  <text class="range" x="560" y="258" text-anchor="middle">100</text>
+  <text class="range" x="670" y="258" text-anchor="middle">200</text>
+  <text class="range" x="780" y="258" text-anchor="middle">300</text>
+  <text class="label" x="615" y="278" text-anchor="middle">min(ts) / max(ts)</text>
+
+  <!-- overlapping file bars -->
+  <rect x="450" y="100" width="200" height="22" rx="3" fill="#c0392b" opacity="0.85"/>
+  <text class="range" x="550" y="115" text-anchor="middle" style="fill:#fff">file_x  [0..180]</text>
+  <rect x="540" y="140" width="200" height="22" rx="3" fill="#c0392b" opacity="0.85"/>
+  <text class="range" x="640" y="155" text-anchor="middle" style="fill:#fff">file_y  [80..260]</text>
+  <rect x="620" y="180" width="160" height="22" rx="3" fill="#c0392b" opacity="0.85"/>
+  <text class="range" x="700" y="195" text-anchor="middle" style="fill:#fff">file_z  [140..300]</text>
+
+  <!-- overlap shading -->
+  <rect x="540" y="100" width="110" height="22" fill="#c0392b" opacity="0.25"/>
+  <rect x="620" y="140" width="120" height="22" fill="#c0392b" opacity="0.25"/>
+
+  <text class="verdict-bad" x="610" y="305" text-anchor="middle">Ordering: Inexact (or stripped)</text>
+  <text class="label" x="610" y="320" text-anchor="middle">SortExec stays</text>
+
+  <!-- footer -->
+  <text x="410" y="348" text-anchor="middle" font-size="11" fill="#555">PushdownSort sorts files by min, checks adjacency, upgrades to Exact only when ranges don't overlap.</text>
+</svg>
diff --git a/content/images/sort-pushdown/plan-diff.svg b/content/images/sort-pushdown/plan-diff.svg
new file mode 100644
index 00000000..a4d08673
--- /dev/null
+++ b/content/images/sort-pushdown/plan-diff.svg
@@ -0,0 +1,70 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 820 340">
+  <style>
+    text { font-family: 'Segoe UI', Arial, sans-serif; }
+    .title { font-size: 14px; font-weight: 700; fill: #222; }
+    .panel { font-size: 13px; font-weight: 700; }
+    .node { font-size: 12px; font-weight: 600; fill: #fff; }
+    .detail { font-size: 10px; font-family: 'Courier New', monospace; fill: #333; }
+    .label { font-size: 11px; fill: #555; }
+    .arrow { fill: none; stroke: #555; stroke-width: 1.4; marker-end: url(#arr2); }
+    .crossed { stroke: #c0392b; stroke-width: 2; }
+  </style>
+  <defs>
+    <marker id="arr2" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">
+      <path d="M0,0 L10,5 L0,10 Z" fill="#555"/>
+    </marker>
+  </defs>
+
+  <text class="title" x="410" y="22" text-anchor="middle">EXPLAIN before / after sort pushdown</text>
+
+  <!-- BEFORE -->
+  <rect x="20" y="40" width="370" height="280" rx="8" fill="#fff" stroke="#c0392b" stroke-width="1.5"/>
+  <text class="panel" x="205" y="62" text-anchor="middle" fill="#c0392b">Before — SortExec on top</text>
+
+  <rect x="105" y="78" width="200" height="34" rx="4" fill="#34495e"/>
+  <text class="node" x="205" y="100" text-anchor="middle">CoalescePartitionsExec</text>
+
+  <line class="arrow" x1="205" y1="118" x2="205" y2="132"/>
+
+  <rect x="105" y="134" width="200" height="44" rx="4" fill="#c0392b"/>
+  <text class="node" x="205" y="156" text-anchor="middle">SortExec</text>
+  <text class="detail" x="205" y="170" text-anchor="middle" style="fill:#fff">expr=[ts ASC], full sort</text>
+
+  <line class="arrow" x1="205" y1="184" x2="205" y2="200"/>
+
+  <rect x="105" y="202" width="200" height="34" rx="4" fill="#34495e"/>
+  <text class="node" x="205" y="223" text-anchor="middle">RepartitionExec</text>
+
+  <line class="arrow" x1="205" y1="240" x2="205" y2="254"/>
+
+  <rect x="65" y="256" width="280" height="44" rx="4" fill="#5b8def"/>
+  <text class="node" x="205" y="278" text-anchor="middle">DataSourceExec</text>
+  <text class="detail" x="205" y="292" text-anchor="middle" style="fill:#fff">files: [a.parquet, b.parquet, c.parquet]</text>
+
+  <!-- AFTER -->
+  <rect x="430" y="40" width="370" height="280" rx="8" fill="#fff" stroke="#27ae60" stroke-width="1.5"/>
+  <text class="panel" x="615" y="62" text-anchor="middle" fill="#27ae60">After — SortExec eliminated</text>
+
+  <rect x="515" y="78" width="200" height="34" rx="4" fill="#34495e"/>
+  <text class="node" x="615" y="100" text-anchor="middle">SortPreservingMergeExec</text>
+
+  <line class="arrow" x1="615" y1="118" x2="615" y2="132"/>
+
+  <!-- "removed" placeholder, crossed-out SortExec -->
+  <rect x="515" y="134" width="200" height="44" rx="4" fill="#f9f9f9" stroke="#c0392b" stroke-dasharray="4,3"/>
+  <text class="node" x="615" y="156" text-anchor="middle" style="fill:#c0392b">SortExec (removed)</text>
+  <text class="detail" x="615" y="170" text-anchor="middle" style="fill:#c0392b">no longer needed</text>
+  <line class="crossed" x1="515" y1="134" x2="715" y2="178"/>
+  <line class="crossed" x1="715" y1="134" x2="515" y2="178"/>
+
+  <line class="arrow" x1="615" y1="184" x2="615" y2="200"/>
+
+  <rect x="515" y="202" width="200" height="34" rx="4" fill="#34495e"/>
+  <text class="node" x="615" y="223" text-anchor="middle">RepartitionExec</text>
+
+  <line class="arrow" x1="615" y1="240" x2="615" y2="254"/>
+
+  <rect x="475" y="256" width="280" height="44" rx="4" fill="#27ae60"/>
+  <text class="node" x="615" y="278" text-anchor="middle">DataSourceExec</text>
+  <text class="detail" x="615" y="292" text-anchor="middle" style="fill:#fff">files: [c.parquet, b.parquet, a.parquet]</text>
+</svg>
diff --git a/content/images/sort-pushdown/pr21956-decision.svg b/content/images/sort-pushdown/pr21956-decision.svg
new file mode 100644
index 00000000..a8203241
--- /dev/null
+++ b/content/images/sort-pushdown/pr21956-decision.svg
@@ -0,0 +1,66 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 820 420">
+  <style>
+    text { font-family: 'Segoe UI', Arial, sans-serif; }
+    .title { font-size: 14px; font-weight: 700; fill: #222; }
+    .node { font-size: 12px; font-weight: 600; fill: #fff; }
+    .detail { font-size: 10px; font-family: 'Courier New', monospace; fill: #333; }
+    .branchlabel { font-size: 11px; font-weight: 600; fill: #555; }
+    .arrow { fill: none; stroke: #555; stroke-width: 1.4; marker-end: url(#arr); }
+  </style>
+  <defs>
+    <marker id="arr" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">
+      <path d="M0,0 L10,5 L0,10 Z" fill="#555"/>
+    </marker>
+  </defs>
+
+  <text class="title" x="410" y="24" text-anchor="middle">try_pushdown_sort: Exact / Inexact / Unsupported decision</text>
+
+  <!-- Entry node -->
+  <rect x="290" y="44" width="240" height="44" rx="8" fill="#34495e"/>
+  <text class="node" x="410" y="66" text-anchor="middle">PushdownSort rule</text>
+  <text class="detail" x="410" y="80" text-anchor="middle" style="fill:#fff">source.try_pushdown_sort(req, eq)</text>
+
+  <line class="arrow" x1="410" y1="92" x2="410" y2="110"/>
+
+  <!-- Diamond 1: natural ordering check -->
+  <polygon points="410,114 600,168 410,222 220,168" fill="#fff" stroke="#5b8def" stroke-width="1.6"/>
+  <text class="detail" x="410" y="160" text-anchor="middle">eq.ordering_satisfy(req)?</text>
+  <text class="detail" x="410" y="174" text-anchor="middle">(natural ordering already matches?)</text>
+
+  <!-- yes → Exact -->
+  <line class="arrow" x1="600" y1="168" x2="690" y2="168"/>
+  <text class="branchlabel" x="640" y="158" text-anchor="middle">yes</text>
+  <rect x="690" y="146" width="110" height="44" rx="8" fill="#27ae60"/>
+  <text class="node" x="745" y="168" text-anchor="middle">Exact</text>
+  <text class="detail" x="745" y="182" text-anchor="middle" style="fill:#fff">drop SortExec</text>
+
+  <!-- no → next diamond -->
+  <line class="arrow" x1="410" y1="222" x2="410" y2="246"/>
+  <text class="branchlabel" x="400" y="240" text-anchor="end">no</text>
+
+  <!-- Diamond 2: column_in_file_schema || reversed_satisfies -->
+  <polygon points="410,250 620,310 410,370 200,310" fill="#fff" stroke="#5b8def" stroke-width="1.6"/>
+  <text class="detail" x="410" y="302" text-anchor="middle">column_in_file_schema</text>
+  <text class="detail" x="410" y="316" text-anchor="middle">|| reversed_satisfies ?</text>
+
+  <!-- yes → Inexact -->
+  <line class="arrow" x1="620" y1="310" x2="690" y2="310"/>
+  <text class="branchlabel" x="655" y="300" text-anchor="middle">yes</text>
+  <rect x="690" y="288" width="110" height="44" rx="8" fill="#e67e22"/>
+  <text class="node" x="745" y="310" text-anchor="middle">Inexact</text>
+  <text class="detail" x="745" y="324" text-anchor="middle" style="fill:#fff">set both flags</text>
+
+  <!-- no → Unsupported -->
+  <line class="arrow" x1="200" y1="310" x2="130" y2="310"/>
+  <text class="branchlabel" x="165" y="300" text-anchor="middle">no</text>
+  <rect x="20" y="288" width="110" height="44" rx="8" fill="#95a5a6"/>
+  <text class="node" x="75" y="310" text-anchor="middle">Unsupported</text>
+  <text class="detail" x="75" y="324" text-anchor="middle" style="fill:#fff">SortExec stays</text>
+
+  <!-- Legend / explanation strip -->
+  <rect x="20" y="380" width="780" height="34" rx="6" fill="#f6f7fa" stroke="#cfd6dc"/>
+  <text class="detail" x="30" y="400">Exact</text>
+  <text class="detail" x="80" y="400">→ Phase 2 sort elimination · fetch becomes static limit</text>
+  <text class="detail" x="30" y="412">Inexact</text>
+  <text class="detail" x="80" y="412">→ #21956 runtime pipeline: file reorder + RG reorder + reverse · SortExec / TopK kept on top for correctness</text>
+</svg>
diff --git a/content/images/sort-pushdown/pr21956-runtime-pipeline.svg b/content/images/sort-pushdown/pr21956-runtime-pipeline.svg
new file mode 100644
index 00000000..5bb8d678
--- /dev/null
+++ b/content/images/sort-pushdown/pr21956-runtime-pipeline.svg
@@ -0,0 +1,69 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 820 560">
+  <style>
+    text { font-family: 'Segoe UI', Arial, sans-serif; }
+    .title { font-size: 14px; font-weight: 700; fill: #222; }
+    .stage-title { font-size: 13px; font-weight: 700; fill: #222; }
+    .step-num { font-size: 13px; font-weight: 700; }
+    .node { font-size: 12px; font-weight: 600; fill: #fff; }
+    .detail { font-size: 10px; font-family: 'Courier New', monospace; fill: #fff; }
+    .small-detail { font-size: 10px; font-family: 'Courier New', monospace; fill: #333; }
+    .arrow { fill: none; stroke: #555; stroke-width: 1.6; marker-end: url(#arr); }
+    .panel-label { font-size: 11px; font-weight: 600; fill: #888; }
+  </style>
+  <defs>
+    <marker id="arr" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">
+      <path d="M0,0 L10,5 L0,10 Z" fill="#555"/>
+    </marker>
+  </defs>
+
+  <text class="title" x="410" y="24" text-anchor="middle">Inexact pushdown: two flags drive a three-step runtime pipeline</text>
+
+  <!-- Source state box -->
+  <rect x="140" y="46" width="540" height="86" rx="8" fill="#f6f7fa" stroke="#34495e" stroke-width="1.5"/>
+  <text class="stage-title" x="410" y="66" text-anchor="middle">ParquetSource carries the inexact-pushdown decision</text>
+  <text class="small-detail" x="170" y="88">sort_order_for_reorder = Some([req_col ASC | DESC])</text>
+  <text class="small-detail" x="170" y="104">reverse_row_groups     = bool</text>
+  <text class="small-detail" x="170" y="124">// set by try_pushdown_sort, read by the opener at scan time</text>
+
+  <line class="arrow" x1="410" y1="136" x2="410" y2="154"/>
+
+  <!-- Step 1: file reorder -->
+  <rect x="100" y="156" width="620" height="82" rx="8" fill="#5b8def"/>
+  <circle cx="132" cy="197" r="16" fill="#fff"/>
+  <text class="step-num" x="132" y="202" text-anchor="middle" fill="#5b8def">1</text>
+  <text class="node" x="162" y="184">File-level reorder · shared morsel queue</text>
+  <text class="detail" x="162" y="206">FileSource::reorder_files</text>
+  <text class="detail" x="162" y="222">→ sort files by min(col); first file picked across all</text>
+  <text class="detail" x="162" y="238" fill="#fff">  partitions is globally the most-promising one</text>
+
+  <line class="arrow" x1="410" y1="240" x2="410" y2="258"/>
+  <text class="panel-label" x="440" y="256">for each opened file</text>
+
+  <!-- Step 2: RG reorder -->
+  <rect x="100" y="260" width="620" height="82" rx="8" fill="#27ae60"/>
+  <circle cx="132" cy="301" r="16" fill="#fff"/>
+  <text class="step-num" x="132" y="306" text-anchor="middle" fill="#27ae60">2</text>
+  <text class="node" x="162" y="288">Row-group-level reorder · per file</text>
+  <text class="detail" x="162" y="310">PreparedAccessPlan::reorder_by_statistics</text>
+  <text class="detail" x="162" y="326">→ row_group_indexes sorted ASC by min(col)</text>
+  <text class="detail" x="162" y="342">  using parquet column statistics</text>
+
+  <line class="arrow" x1="410" y1="344" x2="410" y2="362"/>
+  <text class="panel-label" x="440" y="360">if reverse_row_groups</text>
+
+  <!-- Step 3: reverse -->
+  <rect x="100" y="364" width="620" height="82" rx="8" fill="#e67e22"/>
+  <circle cx="132" cy="405" r="16" fill="#fff"/>
+  <text class="step-num" x="132" y="410" text-anchor="middle" fill="#e67e22">3</text>
+  <text class="node" x="162" y="392">Reverse iteration · DESC requests</text>
+  <text class="detail" x="162" y="414">PreparedAccessPlan::reverse</text>
+  <text class="detail" x="162" y="430">→ row_group_indexes.into_iter().rev()</text>
+
+  <line class="arrow" x1="410" y1="448" x2="410" y2="466"/>
+
+  <!-- Decoder + nesting note -->
+  <rect x="100" y="468" width="620" height="78" rx="8" fill="#34495e"/>
+  <text class="node" x="410" y="490" text-anchor="middle">Decoder reads row groups in this order</text>
+  <text class="detail" x="410" y="510" text-anchor="middle">SortExec / TopK above the source still enforces final ordering</text>
+  <text class="detail" x="410" y="528" text-anchor="middle">— the stats reorder is approximate, not strict —</text>
+</svg>
diff --git a/content/images/sort-pushdown/reverse-scan.svg b/content/images/sort-pushdown/reverse-scan.svg
new file mode 100644
index 00000000..443a0a1c
--- /dev/null
+++ b/content/images/sort-pushdown/reverse-scan.svg
@@ -0,0 +1,100 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 820 380">
+  <style>
+    text { font-family: 'Segoe UI', Arial, sans-serif; }
+    .title { font-size: 14px; font-weight: 700; fill: #222; }
+    .panel-title { font-size: 13px; font-weight: 700; }
+    .label { font-size: 11px; fill: #555; }
+    .small { font-size: 10px; fill: #444; }
+    .badge-good { font-size: 12px; font-weight: 700; fill: #27ae60; }
+    .badge-bad { font-size: 12px; font-weight: 700; fill: #c0392b; }
+    .arrow { fill: none; stroke: #c0392b; stroke-width: 1.8; marker-end: url(#arr3); }
+    .arrow-green { fill: none; stroke: #27ae60; stroke-width: 1.8; marker-end: url(#arr3g); }
+  </style>
+  <defs>
+    <marker id="arr3" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">
+      <path d="M0,0 L10,5 L0,10 Z" fill="#c0392b"/>
+    </marker>
+    <marker id="arr3g" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="6" markerHeight="6" orient="auto">
+      <path d="M0,0 L10,5 L0,10 Z" fill="#27ae60"/>
+    </marker>
+  </defs>
+
+  <text class="title" x="410" y="22" text-anchor="middle">ORDER BY ts DESC LIMIT 10 — row-group reverse vs page reverse</text>
+
+  <!-- row group reverse panel -->
+  <rect x="20" y="40" width="380" height="320" rx="8" fill="#fff8e1" stroke="#e67e22" stroke-width="1.5"/>
+  <text class="panel-title" x="210" y="62" text-anchor="middle" fill="#e67e22">Row-group reverse (today, merged)</text>
+
+  <!-- row group block with pages -->
+  <rect x="40" y="90" width="340" height="120" rx="6" fill="#fff" stroke="#bbb"/>
+  <text class="small" x="50" y="106">RowGroup (last, ~128 MB)</text>
+
+  <!-- pages inside the row group -->
+  <g>
+    <rect x="50"  y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="92"  y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="134" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="176" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="218" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="260" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="302" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="344" y="115" width="36" height="80" fill="#bdc3c7"/>
+    <text class="small" x="69" y="160" text-anchor="middle">P0</text>
+    <text class="small" x="111" y="160" text-anchor="middle">P1</text>
+    <text class="small" x="153" y="160" text-anchor="middle">P2</text>
+    <text class="small" x="195" y="160" text-anchor="middle">P3</text>
+    <text class="small" x="237" y="160" text-anchor="middle">P4</text>
+    <text class="small" x="279" y="160" text-anchor="middle">P5</text>
+    <text class="small" x="321" y="160" text-anchor="middle">P6</text>
+    <text class="small" x="362" y="160" text-anchor="middle">P7</text>
+  </g>
+
+  <text class="label" x="210" y="235" text-anchor="middle">Decode the entire row group, reverse in memory, take 10.</text>
+  <text class="badge-bad" x="210" y="265" text-anchor="middle">Peak buffer: ~128 MB</text>
+  <text class="badge-bad" x="210" y="285" text-anchor="middle">Pages decoded: 8</text>
+  <text class="badge-bad" x="210" y="305" text-anchor="middle">Time-to-first-N: ~29 µs</text>
+
+  <!-- arrows: all pages read -->
+  <line class="arrow" x1="362" y1="80" x2="362" y2="115"/>
+  <line class="arrow" x1="69"  y1="80" x2="69"  y2="115"/>
+  <line class="arrow" x1="111" y1="80" x2="111" y2="115"/>
+  <line class="arrow" x1="153" y1="80" x2="153" y2="115"/>
+  <line class="arrow" x1="195" y1="80" x2="195" y2="115"/>
+  <line class="arrow" x1="237" y1="80" x2="237" y2="115"/>
+  <line class="arrow" x1="279" y1="80" x2="279" y2="115"/>
+  <line class="arrow" x1="321" y1="80" x2="321" y2="115"/>
+
+  <!-- page reverse panel -->
+  <rect x="420" y="40" width="380" height="320" rx="8" fill="#e8f5e9" stroke="#27ae60" stroke-width="1.5"/>
+  <text class="panel-title" x="610" y="62" text-anchor="middle" fill="#27ae60">Page reverse (upstream POC, arrow-rs #9937)</text>
+
+  <rect x="440" y="90" width="340" height="120" rx="6" fill="#fff" stroke="#bbb"/>
+  <text class="small" x="450" y="106">RowGroup (last)</text>
+
+  <g>
+    <rect x="450" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="492" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="534" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="576" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="618" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="660" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="702" y="115" width="38" height="80" fill="#bdc3c7"/>
+    <rect x="744" y="115" width="36" height="80" fill="#27ae60"/>
+    <text class="small" x="469" y="160" text-anchor="middle">P0</text>
+    <text class="small" x="511" y="160" text-anchor="middle">P1</text>
+    <text class="small" x="553" y="160" text-anchor="middle">P2</text>
+    <text class="small" x="595" y="160" text-anchor="middle">P3</text>
+    <text class="small" x="637" y="160" text-anchor="middle">P4</text>
+    <text class="small" x="679" y="160" text-anchor="middle">P5</text>
+    <text class="small" x="721" y="160" text-anchor="middle">P6</text>
+    <text class="small" x="762" y="160" text-anchor="middle" fill="#fff">P7</text>
+  </g>
+
+  <!-- only one green arrow on last page -->
+  <line class="arrow-green" x1="762" y1="80" x2="762" y2="115"/>
+
+  <text class="label" x="610" y="235" text-anchor="middle">Seek only to the last page via OffsetIndex, decode, reverse, return.</text>
+  <text class="badge-good" x="610" y="265" text-anchor="middle">Peak buffer: ~1 MB</text>
+  <text class="badge-good" x="610" y="285" text-anchor="middle">Pages decoded: 1</text>
+  <text class="badge-good" x="610" y="305" text-anchor="middle">Time-to-first-N: ~565 ns (≈ 50× faster)</text>
+</svg>