-
-
Notifications
You must be signed in to change notification settings - Fork 379
Feature: support rectilinear chunk grid extension #3534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feature: support rectilinear chunk grid extension #3534
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3534 +/- ##
==========================================
+ Coverage 60.89% 61.13% +0.23%
==========================================
Files 86 86
Lines 10182 10767 +585
==========================================
+ Hits 6200 6582 +382
- Misses 3982 4185 +203
🚀 New features to boost your workflow:
|
…ature/rectilinear-chunk-grid
| With variable chunking, the standard `.chunks` property is not available since chunks | ||
| have different sizes. Instead, access chunk information through the chunk grid: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better if .chunks just had a different type (tuple of tuples of ints)
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class RectilinearChunkGrid(ChunkGrid): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts on just calling this class Rectilinear, and renaming the RegularChunkGrid to Regular? We could keep around a RegularChunkGrid class for compatibility. But I feel like people know these are chunk grids when they import them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
50/50. I think the more descriptive class is useful when looking at a tracebacks. Plus, this is currently in .core so its not meant to be used directly by users.
| ) | ||
|
|
||
| @cached_property | ||
| def _cumulative_sizes(self) -> tuple[tuple[int, ...], ...]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, this is cached
src/zarr/testing/strategies.py
Outdated
| chunk_shapes_rle = [ | ||
| [[c, r] for c, r in zip(draw(dim_chunks), draw(repeats), strict=True)] | ||
| for _ in range(ndim) | ||
| ] | ||
| return RectilinearChunkGrid(chunk_shapes=chunk_shapes_rle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs fixing
| @given(data=st.data()) | ||
| async def test_basic_indexing(data: st.DataObject) -> None: | ||
| zarray = data.draw(simple_arrays()) | ||
| @given(data=st.data(), zarray=st.one_of([simple_arrays(), complex_chunked_arrays()])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the search space for the standard arrays strategy is so large, i made a different one complex_chunked_arrays that purely checks different chunk grids
with simple_arrays() we are only spending 10% of our time trying RectilinearChunkGrid so using this approach. We should boost number of examples too.
| 2. **Not compatible with sharding**: You cannot use variable chunking together with | ||
| the sharding feature. Arrays must use either variable chunking or sharding, but not both. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope this is a temporary limitation! There's a natural extension of rectilinear chunk grids to rectilinear shard grids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
…arr-python into feature/rectilinear-chunk-grid
…ature/rectilinear-chunk-grid
…ature/rectilinear-chunk-grid
|
@d-v-b, @maxrjones, et al - this is ready for a real review. I'd be happy if we merged this in using an "experimental" warning of some kind and let it bake from 3.2->3.3. Suggestions on the best way to communicate the experimental nature of this are welcome. |
|
is it intentional that the chunk grid is import zarr
store = zarr.storage.MemoryStore()
arr = zarr.create_array(
store,
shape=(60, 100),
chunks=[[20, 20, 20], [25, 25, 25, 25]],
zarr_format=3,
dtype="uint8"
)
arr.metadata.chunk_grid # RectilinearChunkGridOr is Not sure how much this matters in practice ( |
@keewis - yes, this is intentional. We decided this was preferable because it will allow the user to add variable lengths later (through an extend/append workflow). |
CodSpeed Performance ReportMerging this PR will not alter performanceComparing Summary
|
|
|
||
| @property | ||
| @abstractmethod | ||
| def dimension_names(self) -> tuple[str | None, ...] | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this used? I feel like integer indexing should be fine for the chunk grid.
| chunk_shapes: Sequence[Sequence[ChunkEdgeLength]] | ||
|
|
||
|
|
||
| class ChunksType(tuple[Any, ...], ABC): # noqa: SLOT001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sold on subclassing tuple here. It's going to lead to a lot of surprises unless we carefully implement a lot of stuff that we could just use tuple for:
>>> RegularChunks((10,10,1))[:2]
(10, 10) # tuple, not RegularChunks
>>> RegularChunks((10,10,1)) + (1,)
(10, 10, 1, 1) # tuple, not RegularChunksI would much rather make a breaking change than introduce something super confusing
| @abstractmethod | ||
| def keys(self) -> Iterator[str]: | ||
| """Return dimension names, enabling dict(chunks) conversion.""" | ||
| ... | ||
|
|
||
| @abstractmethod | ||
| def values(self) -> Iterator[Any]: | ||
| """Return chunk values, enabling dict(chunks) conversion.""" | ||
| ... | ||
|
|
||
| @abstractmethod | ||
| def items(self) -> Iterator[tuple[str, Any]]: | ||
| """Return (name, value) pairs, enabling dict(chunks) conversion.""" | ||
| ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need key: value access?
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class ResolvedChunkSpec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be a class? It has no methods. Seems like either a TypedDict or just a tuple would work
| shards: tuple[int, ...] | None | ||
|
|
||
|
|
||
| def _validate_zarr_format_compatibility( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this here? Can't we check when we construct the array metadata document?
| ) | ||
|
|
||
|
|
||
| def _validate_sharding_compatibility( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be defined over in the array v3 metadata module, and used there to check that the array metadata document is valid?
| ) | ||
|
|
||
|
|
||
| def _validate_data_compatibility( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be stand-alone function? are we going to use this logic anywhere other than it's current usage site (resolve_chunk_spec)
|
|
||
| @property | ||
| def chunks(self) -> tuple[int, ...]: | ||
| def chunks(self) -> ChunksType: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be willing to drop this attribute entirely if it's too complicated to support in a clean way.
| out_dict["data_type"] = dtype_meta.to_json(zarr_format=3) # type: ignore[unreachable] | ||
| return out_dict | ||
|
|
||
| def update_shape(self, shape: tuple[int, ...]) -> Self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this needs to take a parameter that can define the new chunks (which can default to None or some other sentinel flagging "default behavior" semantics)
Summary
Adds support for
RectilinearChunkGridextension (Zarr v3), enabling arrays with variable chunk sizes per dimension, plus an enhanced type system for chunk specifications.Closes: #1595 | Replaces: #1483 | Related: zarr-extensions#25
Key Features
1. RectilinearChunkGrid
2. Enhanced Chunk Types
Design Decisions for
ChunksTypeReview Focus Areas
High Priority
Breaking Changes
None. Fully backward compatible.
TODO:
docs/user-guide/*.mdchanges/