Skip to content

New CTable.group_by() implementation#636

Open
FrancescAlted wants to merge 17 commits into
mainfrom
ctable-groupby
Open

New CTable.group_by() implementation#636
FrancescAlted wants to merge 17 commits into
mainfrom
ctable-groupby

Conversation

@FrancescAlted
Copy link
Copy Markdown
Member

@FrancescAlted FrancescAlted commented May 15, 2026

This PR adds CTable.group_by() with grouped reductions, optimized Cython kernels, and a public blosc2.group_reduce() array API.

Highlights:

  • Added CTable.group_by() with:
    • size(), count(), agg()
    • convenience methods: sum(), mean(), min(), max()
    • sort= and dropna= support
    • persistent output via urlpath=
  • Added optimized Cython group-by paths:
    • fused integer-key kernels for all integer widths
    • size, count, sum, mean, min, max
    • arbitrary float-key hash path
    • conservative two-key integer/dictionary hash path
  • Added public:
      blosc2.group_reduce(keys, values=None, op="sum")
  • Extended benchmarks and documentation.
  • Added tests for group-by semantics, optimized paths, persistent output, and group_reduce().

Implemented:

- fused dense integer-key Cython kernels covering `int8`, `uint8`,
  `int16`, `uint16`, `int32`, `uint32`, `int64`, and `uint64` keys;
- dense integer/dictionary-code Cython path for `size`, `count`, `sum`,
  `mean`, `min`, and `max`;
- float64 value kernels with NaN-null skipping where applicable;
- int64 value kernels for integer/bool `sum`, `min`, and `max`;
- shared key-presence tracking so groups with all-null values are still
  emitted correctly for `count` and nullable float aggregations.
 Implement fused integer-key dense kernels for all integer widths, add Cython aggregations beyond sum, introduce arbitrary float-key hashing, and add a conservative
 two-key integer/dictionary hash path. Extend group-by benchmarks and tests for the new optimized cases.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces group-by functionality for blosc2.CTable and a new array-oriented grouped reduction API, backed by optimized Cython kernels and covered by new tests/benchmarks/docs.

Changes:

  • Added CTable.group_by() returning a deferred CTableGroupBy with grouped reductions (size, count, agg, plus sum/mean/min/max) and optional persistent output via urlpath=.
  • Added public blosc2.group_reduce(keys, values=None, op=..., sort=..., dropna=...) plus NumPy fallback and optimized Cython execution paths.
  • Integrated a new groupby_ext Cython extension into the build, and added tests/benchmarks/documentation for the new APIs.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/test_group_reduce.py New unit tests for blosc2.group_reduce() semantics, NaN handling, sorting, and input validation.
tests/ctable/test_object_spec.py New tests for schema-less object columns (heterogeneous values, persistence, Arrow export rejection).
tests/ctable/test_nested_append.py New tests for appending/extending nested dict rows into dotted (nested) column names.
tests/ctable/test_groupby.py New comprehensive tests for CTable.group_by() behavior, fast paths, dropna/sort, multi-key, persistence.
src/blosc2/indexing_ext.pyx Minor cleanup / section delimiter after removing group-by related kernels.
src/blosc2/groupby.py New Phase-1 group-by implementation, multiple optimized execution paths, and group_reduce() public API.
src/blosc2/groupby_ext.pyx New Cython kernels for dense integer keys, integral-float dense keys, arbitrary-float hash, and two-key hash.
src/blosc2/ctable.py Added CTable.group_by() API and adjusted dictionary-column initialization sizing.
src/blosc2/init.py Exported CTableGroupBy and group_reduce at the top level.
plans/ctable-groupby.md Added/updated implementation plan and current status write-up.
doc/reference/reduction_functions.rst Documented group_reduce in the reduction functions reference.
doc/reference/ctable.rst Documented CTable.group_by() and CTableGroupBy API and examples.
CMakeLists.txt Added build/install integration for the new groupby_ext extension module.
bench/ctable/groupby.py New benchmark harness for group-by performance comparisons and options.
bench/ctable/bench_nested_filter_index.py New benchmark for nested (dotted) vs flat column filter/index/aggregate overhead.
Comments suppressed due to low confidence (2)

src/blosc2/groupby_ext.pyx:623

  • groupby_dense_int_f64_max_checked is missing the basic input/state shape validation that the other "*_checked" kernels perform. Because the function runs with boundscheck=False, length mismatches can cause memory corruption. Please add explicit checks for keys/values/valid lengths and for maxs/has_value/keys_present lengths (and ideally dtype checks for public robustness).
def groupby_dense_int_f64_max_checked(
    dense_int_key_t[:] keys,
    double[:] values,
    np.npy_bool[:] valid,
    double[:] maxs,

src/blosc2/groupby_ext.pyx:694

  • groupby_dense_int_i64_max_checked lacks the standard length checks for inputs and state arrays. Since this runs in nogil with bounds checking disabled, a mismatch can crash Python. Add shape validation (and keep it consistent with sum/mean kernels) before calling _dense_int_key_scan.
def groupby_dense_int_i64_max_checked(
    dense_int_key_t[:] keys,
    int64_t[:] values,
    np.npy_bool[:] valid,
    int64_t[:] maxs,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +582 to +586
def groupby_dense_int_f64_min_checked(
dense_int_key_t[:] keys,
double[:] values,
np.npy_bool[:] valid,
double[:] mins,
Comment on lines +656 to +660
def groupby_dense_int_i64_min_checked(
dense_int_key_t[:] keys,
int64_t[:] values,
np.npy_bool[:] valid,
int64_t[:] mins,
Comment thread src/blosc2/groupby.py

order = list(acc)
if sort:
order.sort(key=lambda k: (1, "") if k is _NAN_KEY else (0, display[k]))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants