Implement min, max, sum for run-end-encoded arrays.#9409

Open
brunal wants to merge 3 commits into apache:main from brunal:ree-agg
Conversation

Contributor

brunal commented Feb 13, 2026

Efficient implementations:

  • min & max work directly on the values child array.
  • sum folds over run lengths & values, without decompressing the array.
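
As a rough illustration of the two bullets above, here is a minimal sketch on a simplified stand-in for a run-end-encoded array. The `Ree` struct and the `min_max` and `sum` functions are hypothetical names for this example, not the arrow-rs types or the code in this PR; nulls, slicing and overflow are ignored.

```rust
/// Hypothetical, simplified model of a run-end-encoded array (not the arrow-rs type).
/// `run_ends[i]` is the exclusive logical end of run `i`; `values[i]` is its value.
pub struct Ree<'a> {
    pub run_ends: &'a [i32],
    pub values: &'a [i64],
}

/// min and max only need the values child: run lengths do not change the extremes.
pub fn min_max(ree: &Ree) -> Option<(i64, i64)> {
    let min = ree.values.iter().copied().min()?;
    let max = ree.values.iter().copied().max()?;
    Some((min, max))
}

/// sum folds over (run length, value) pairs instead of expanding every run.
pub fn sum(ree: &Ree) -> i64 {
    let mut prev_end = 0i64;
    let mut total = 0i64;
    for (&end, &v) in ree.run_ends.iter().zip(ree.values) {
        let run_len = end as i64 - prev_end;
        total += run_len * v;
        prev_end = end as i64;
    }
    total
}
```

With `run_ends = [2, 5, 6]` and `values = [3, -1, 10]`, the logical array is `[3, 3, -1, -1, -1, 10]`, and both functions visit three runs rather than six elements.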

In particular, these implementations take care of the logical offset & len of the run-end-encoded arrays. This is non-trivial:

  • We get the physical start & end indices in O(log(#runs)), but those are incorrect for empty arrays.
  • Slicing can happen in the middle of a run. For sum, we need to track the logical start & end and reduce the run length accordingly.
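
The two points above can be sketched as follows, again on a simplified model rather than the arrow-rs API. `SlicedRee`, `physical_index` and `sliced_sum` are hypothetical names, and the binary search uses the standard library's `partition_point`. The empty case is checked first because, in this model, the lookup for an empty slice positioned at the end of the array returns an index one past the last run.

```rust
/// Hypothetical model of a sliced run-end-encoded array (not the arrow-rs API).
pub struct SlicedRee<'a> {
    pub run_ends: &'a [i32], // exclusive logical end of each run, strictly increasing
    pub values: &'a [i64],   // one value per run
    pub offset: usize,       // logical start of the slice
    pub len: usize,          // logical length of the slice
}

/// Index of the run containing `logical`, found by binary search in O(log(#runs)).
fn physical_index(run_ends: &[i32], logical: usize) -> usize {
    run_ends.partition_point(|&end| (end as usize) <= logical)
}

pub fn sliced_sum(a: &SlicedRee) -> i64 {
    // An empty slice has no last element, so there is no valid physical end to look up.
    if a.len == 0 {
        return 0;
    }
    let (start, end) = (a.offset, a.offset + a.len);
    let first = physical_index(a.run_ends, start);
    let last = physical_index(a.run_ends, end - 1);
    let mut total = 0i64;
    for run in first..=last {
        let run_start = if run == 0 { 0 } else { a.run_ends[run - 1] as usize };
        let run_end = a.run_ends[run] as usize;
        // Clip the run to the logical window: the slice may start or end mid-run.
        let clipped = run_end.min(end) - run_start.max(start);
        total += clipped as i64 * a.values[run];
    }
    total
}
```

For the logical array `[3, 3, -1, -1, -1, 10]`, a slice with `offset = 1` and `len = 3` covers `[3, -1, -1]`: the first run contributes one element instead of two, and the second contributes two instead of three.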

Finally, one caveat: the aggregation functions only work when the child values array is a primitive array. That is almost always the case, but some clients might store the values in an unexpected type. They'll get either None or an Error, depending on the aggregation function used.

This feature is tracked in #3520.

github-actions bot added the arrow (Changes to the arrow crate) label Feb 13, 2026
brunal marked this pull request as ready for review February 13, 2026 12:51
Contributor Author

brunal commented Feb 13, 2026

Note that in a future MR, I'm likely to move most of ree::fold() into run_array.rs, providing an iterator over (run_idx_start, run_idx_end, value), and use that in cmp.rs.
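
One possible shape for such an iterator, as a sketch only: the function name `runs` is hypothetical, it works on plain slices rather than on `RunArray`, it ignores slicing, and it assumes that `run_idx_start` and `run_idx_end` are logical positions with an exclusive end.

```rust
/// Hypothetical helper (not in arrow-rs): yields (logical_start, logical_end, value)
/// for each run of an unsliced run-end-encoded array.
pub fn runs<'a>(
    run_ends: &'a [i32],
    values: &'a [i64],
) -> impl Iterator<Item = (usize, usize, i64)> + 'a {
    let mut start = 0usize;
    run_ends.iter().zip(values).map(move |(&end, &v)| {
        let item = (start, end as usize, v);
        // The next run starts where this one ends.
        start = end as usize;
        item
    })
}
```

A sum then becomes `runs(run_ends, values).map(|(s, e, v)| (e - s) as i64 * v).sum::<i64>()`, and a comparison kernel could consume the same triples.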

Contributor Author

brunal commented Feb 13, 2026

I'm not handling null values properly when computing sums. Back to draft.
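
For context, a null-aware fold could look roughly like the sketch below. It models nulls as `Option<i64>` on plain slices instead of using an Arrow null buffer, the name `sum_with_nulls` is hypothetical, and it assumes the intended semantics are that null runs are skipped and that the result is `None` when no non-null value is seen.

```rust
/// Hypothetical null-aware sum over runs; `None` in `values` marks a null run.
pub fn sum_with_nulls(run_ends: &[i32], values: &[Option<i64>]) -> Option<i64> {
    let mut prev_end = 0i64;
    let mut total: Option<i64> = None;
    for (&end, v) in run_ends.iter().zip(values) {
        let run_len = end as i64 - prev_end;
        // Advance past the run even when it is null, so later run lengths stay correct.
        prev_end = end as i64;
        if let Some(v) = v {
            total = Some(total.unwrap_or(0) + run_len * v);
        }
    }
    total
}
```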

brunal marked this pull request as draft February 13, 2026 15:01
Contributor Author

brunal commented Feb 14, 2026

Thank you for the review. values_slice/sliced_values are very helpful and clean up the implementations nicely. They do have a performance downside (the physical indices are looked up twice), but that seems worth it for the cleaner code.

brunal marked this pull request as ready for review February 14, 2026 14:41