New CTable.group_by() implementation#636
Open
FrancescAlted wants to merge 17 commits into
Open
Conversation
Implemented: - fused dense integer-key Cython kernels covering `int8`, `uint8`, `int16`, `uint16`, `int32`, `uint32`, `int64`, and `uint64` keys; - dense integer/dictionary-code Cython path for `size`, `count`, `sum`, `mean`, `min`, and `max`; - float64 value kernels with NaN-null skipping where applicable; - int64 value kernels for integer/bool `sum`, `min`, and `max`; - shared key-presence tracking so groups with all-null values are still emitted correctly for `count` and nullable float aggregations.
Implement fused integer-key dense kernels for all integer widths, add Cython aggregations beyond sum, introduce arbitrary float-key hashing, and add a conservative two-key integer/dictionary hash path. Extend group-by benchmarks and tests for the new optimized cases.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR introduces group-by functionality for blosc2.CTable and a new array-oriented grouped reduction API, backed by optimized Cython kernels and covered by new tests/benchmarks/docs.
Changes:
- Added
CTable.group_by()returning a deferredCTableGroupBywith grouped reductions (size,count,agg, plussum/mean/min/max) and optional persistent output viaurlpath=. - Added public
blosc2.group_reduce(keys, values=None, op=..., sort=..., dropna=...)plus NumPy fallback and optimized Cython execution paths. - Integrated a new
groupby_extCython extension into the build, and added tests/benchmarks/documentation for the new APIs.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_group_reduce.py | New unit tests for blosc2.group_reduce() semantics, NaN handling, sorting, and input validation. |
| tests/ctable/test_object_spec.py | New tests for schema-less object columns (heterogeneous values, persistence, Arrow export rejection). |
| tests/ctable/test_nested_append.py | New tests for appending/extending nested dict rows into dotted (nested) column names. |
| tests/ctable/test_groupby.py | New comprehensive tests for CTable.group_by() behavior, fast paths, dropna/sort, multi-key, persistence. |
| src/blosc2/indexing_ext.pyx | Minor cleanup / section delimiter after removing group-by related kernels. |
| src/blosc2/groupby.py | New Phase-1 group-by implementation, multiple optimized execution paths, and group_reduce() public API. |
| src/blosc2/groupby_ext.pyx | New Cython kernels for dense integer keys, integral-float dense keys, arbitrary-float hash, and two-key hash. |
| src/blosc2/ctable.py | Added CTable.group_by() API and adjusted dictionary-column initialization sizing. |
| src/blosc2/init.py | Exported CTableGroupBy and group_reduce at the top level. |
| plans/ctable-groupby.md | Added/updated implementation plan and current status write-up. |
| doc/reference/reduction_functions.rst | Documented group_reduce in the reduction functions reference. |
| doc/reference/ctable.rst | Documented CTable.group_by() and CTableGroupBy API and examples. |
| CMakeLists.txt | Added build/install integration for the new groupby_ext extension module. |
| bench/ctable/groupby.py | New benchmark harness for group-by performance comparisons and options. |
| bench/ctable/bench_nested_filter_index.py | New benchmark for nested (dotted) vs flat column filter/index/aggregate overhead. |
Comments suppressed due to low confidence (2)
src/blosc2/groupby_ext.pyx:623
groupby_dense_int_f64_max_checkedis missing the basic input/state shape validation that the other "*_checked" kernels perform. Because the function runs withboundscheck=False, length mismatches can cause memory corruption. Please add explicit checks forkeys/values/validlengths and formaxs/has_value/keys_presentlengths (and ideally dtype checks for public robustness).
def groupby_dense_int_f64_max_checked(
dense_int_key_t[:] keys,
double[:] values,
np.npy_bool[:] valid,
double[:] maxs,
src/blosc2/groupby_ext.pyx:694
groupby_dense_int_i64_max_checkedlacks the standard length checks for inputs and state arrays. Since this runs innogilwith bounds checking disabled, a mismatch can crash Python. Add shape validation (and keep it consistent with sum/mean kernels) before calling_dense_int_key_scan.
def groupby_dense_int_i64_max_checked(
dense_int_key_t[:] keys,
int64_t[:] values,
np.npy_bool[:] valid,
int64_t[:] maxs,
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+582
to
+586
| def groupby_dense_int_f64_min_checked( | ||
| dense_int_key_t[:] keys, | ||
| double[:] values, | ||
| np.npy_bool[:] valid, | ||
| double[:] mins, |
Comment on lines
+656
to
+660
| def groupby_dense_int_i64_min_checked( | ||
| dense_int_key_t[:] keys, | ||
| int64_t[:] values, | ||
| np.npy_bool[:] valid, | ||
| int64_t[:] mins, |
|
|
||
| order = list(acc) | ||
| if sort: | ||
| order.sort(key=lambda k: (1, "") if k is _NAN_KEY else (0, display[k])) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR adds CTable.group_by() with grouped reductions, optimized Cython kernels, and a public blosc2.group_reduce() array API.
Highlights:
sort=anddropna=supporturlpath=