New CTable.group_by() implementation by FrancescAlted · Pull Request #636 · Blosc/python-blosc2

FrancescAlted · 2026-05-15T12:25:15Z

This PR adds CTable.group_by() with grouped reductions, optimized Cython kernels, and a public blosc2.group_reduce() array API.

Highlights:

Added CTable.group_by() with:
- size(), count(), agg()
- convenience methods: sum(), mean(), min(), max()
- sort= and dropna= support
- persistent output via urlpath=
Added optimized Cython group-by paths:
- fused integer-key kernels for all integer widths
- size, count, sum, mean, min, max
- arbitrary float-key hash path
- conservative two-key integer/dictionary hash path

Added public:

  blosc2.group_reduce(keys, values=None, op="sum")

Extended benchmarks and documentation.
Added tests for group-by semantics, optimized paths, persistent output, and group_reduce().

Implemented: - fused dense integer-key Cython kernels covering `int8`, `uint8`, `int16`, `uint16`, `int32`, `uint32`, `int64`, and `uint64` keys; - dense integer/dictionary-code Cython path for `size`, `count`, `sum`, `mean`, `min`, and `max`; - float64 value kernels with NaN-null skipping where applicable; - int64 value kernels for integer/bool `sum`, `min`, and `max`; - shared key-presence tracking so groups with all-null values are still emitted correctly for `count` and nullable float aggregations.

Implement fused integer-key dense kernels for all integer widths, add Cython aggregations beyond sum, introduce arbitrary float-key hashing, and add a conservative two-key integer/dictionary hash path. Extend group-by benchmarks and tests for the new optimized cases.

…ances

Copilot

Pull request overview

This PR introduces group-by functionality for blosc2.CTable and a new array-oriented grouped reduction API, backed by optimized Cython kernels and covered by new tests/benchmarks/docs.

Changes:

Added CTable.group_by() returning a deferred CTableGroupBy with grouped reductions (size, count, agg, plus sum/mean/min/max) and optional persistent output via urlpath=.
Added public blosc2.group_reduce(keys, values=None, op=..., sort=..., dropna=...) plus NumPy fallback and optimized Cython execution paths.
Integrated a new groupby_ext Cython extension into the build, and added tests/benchmarks/documentation for the new APIs.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/test_group_reduce.py	New unit tests for `blosc2.group_reduce()` semantics, NaN handling, sorting, and input validation.
tests/ctable/test_object_spec.py	New tests for schema-less object columns (heterogeneous values, persistence, Arrow export rejection).
tests/ctable/test_nested_append.py	New tests for appending/extending nested dict rows into dotted (nested) column names.
tests/ctable/test_groupby.py	New comprehensive tests for `CTable.group_by()` behavior, fast paths, dropna/sort, multi-key, persistence.
src/blosc2/indexing_ext.pyx	Minor cleanup / section delimiter after removing group-by related kernels.
src/blosc2/groupby.py	New Phase-1 group-by implementation, multiple optimized execution paths, and `group_reduce()` public API.
src/blosc2/groupby_ext.pyx	New Cython kernels for dense integer keys, integral-float dense keys, arbitrary-float hash, and two-key hash.
src/blosc2/ctable.py	Added `CTable.group_by()` API and adjusted dictionary-column initialization sizing.
src/blosc2/init.py	Exported `CTableGroupBy` and `group_reduce` at the top level.
plans/ctable-groupby.md	Added/updated implementation plan and current status write-up.
doc/reference/reduction_functions.rst	Documented `group_reduce` in the reduction functions reference.
doc/reference/ctable.rst	Documented `CTable.group_by()` and `CTableGroupBy` API and examples.
CMakeLists.txt	Added build/install integration for the new `groupby_ext` extension module.
bench/ctable/groupby.py	New benchmark harness for group-by performance comparisons and options.
bench/ctable/bench_nested_filter_index.py	New benchmark for nested (dotted) vs flat column filter/index/aggregate overhead.

Comments suppressed due to low confidence (2)

src/blosc2/groupby_ext.pyx:623

groupby_dense_int_f64_max_checked is missing the basic input/state shape validation that the other "*_checked" kernels perform. Because the function runs with boundscheck=False, length mismatches can cause memory corruption. Please add explicit checks for keys/values/valid lengths and for maxs/has_value/keys_present lengths (and ideally dtype checks for public robustness).

def groupby_dense_int_f64_max_checked(
    dense_int_key_t[:] keys,
    double[:] values,
    np.npy_bool[:] valid,
    double[:] maxs,

src/blosc2/groupby_ext.pyx:694

groupby_dense_int_i64_max_checked lacks the standard length checks for inputs and state arrays. Since this runs in nogil with bounds checking disabled, a mismatch can crash Python. Add shape validation (and keep it consistent with sum/mean kernels) before calling _dense_int_key_scan.

def groupby_dense_int_i64_max_checked(
    dense_int_key_t[:] keys,
    int64_t[:] values,
    np.npy_bool[:] valid,
    int64_t[:] maxs,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+def groupby_dense_int_f64_min_checked(
+    dense_int_key_t[:] keys,
+    double[:] values,
+    np.npy_bool[:] valid,
+    double[:] mins,


+def groupby_dense_int_i64_min_checked(
+    dense_int_key_t[:] keys,
+    int64_t[:] values,
+    np.npy_bool[:] valid,
+    int64_t[:] mins,


+
+    order = list(acc)
+    if sort:
+        order.sort(key=lambda k: (1, "") if k is _NAN_KEY else (0, display[k]))


FrancescAlted added 17 commits May 15, 2026 07:02

Initial basic functionality (phase 1 of plan completed)

cfdd1b6

Fast path for one dense integer/dictionary-code key (8x speedup)

110f3e9

Optimizing thing in cython (phase 2 going on). More than 2x speedup.

40e58bd

Acceleration path for float32/64 as groupby keys. 25x speedup.

4bbd843

Factorized all groupby accelerations in groupby_ext.pyx

02e284b

Add new tests and bench files

e414af3

Updated plan

eadd91a

Implemented arbitrary float hash table for floating point acceleration

03b3583

Aborted the full index sorted group_by path, and an explanation on why

5df3737

Added a new public blosc2.group_reduce() that is apt for NDArray inst…

0a459a5

…ances

Add forgotten test unit

0185e79

Implemented group-by convenience methods (sum, mean, min, max)

98bb140

Updated plan with current state

9f964c9

Implemented persistent grouped output via urlpath=

74b6ee2

Protect tests when pyarrow is not installed

4e13713

FrancescAlted requested a review from Copilot May 15, 2026 16:19

Copilot started reviewing on behalf of FrancescAlted May 15, 2026 16:20 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

New CTable.group_by() implementation#636

New CTable.group_by() implementation#636
FrancescAlted wants to merge 17 commits into
mainfrom
ctable-groupby

FrancescAlted commented May 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

FrancescAlted commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FrancescAlted commented May 15, 2026 •

edited

Loading