From dbbec79fc2372a016c60bba17f6ec0d5c6a4779e Mon Sep 17 00:00:00 2001
From: Max Jones <14077947+maxrjones@users.noreply.github.com>
Date: Wed, 11 Mar 2026 13:20:08 -0400
Subject: [PATCH 1/4] docs: add glossary

---
 docs/user-guide/index.md | 4 ++++
 mkdocs.yml | 1 +
 2 files changed, 5 insertions(+)

diff --git a/docs/user-guide/index.md b/docs/user-guide/index.md
index fda9bcaa90..ff6e354d80 100644
--- a/docs/user-guide/index.md
+++ b/docs/user-guide/index.md
@@ -35,6 +35,10 @@ Take your skills to the next level:
 - **[Extending](extending.md)** - Extend functionality with custom code
 - **[Consolidated Metadata](consolidated_metadata.md)** - Advanced metadata management
 
+## Reference
+
+- **[Glossary](glossary.md)** - Definitions of key terms (chunks, shards, codecs, etc.)
+
 ## Need Help?
 
 - Browse the [API Reference](../api/zarr/index.md) for detailed function documentation
diff --git a/mkdocs.yml b/mkdocs.yml
index 61872b6234..c654d86223 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -27,6 +27,7 @@ nav:
 - user-guide/gpu.md
 - user-guide/consolidated_metadata.md
 - user-guide/experimental.md
+ - user-guide/glossary.md
 - Examples:
 - user-guide/examples/custom_dtype.md
 - API Reference:

From a311cfed2c38ec717b5b7ee0cf49a3e77a5a31d1 Mon Sep 17 00:00:00 2001
From: Max Jones <14077947+maxrjones@users.noreply.github.com>
Date: Wed, 11 Mar 2026 13:21:05 -0400
Subject: [PATCH 2/4] Add glossary

---
 docs/user-guide/glossary.md | 103 ++++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 docs/user-guide/glossary.md

diff --git a/docs/user-guide/glossary.md b/docs/user-guide/glossary.md
new file mode 100644
index 0000000000..cdb2ce3103
--- /dev/null
+++ b/docs/user-guide/glossary.md
@@ -0,0 +1,103 @@
+# Glossary
+
+This page defines key terms used throughout the zarr-python documentation and API.
+
+## Array Structure
+
+### Array
+
+An N-dimensional typed array stored in a Zarr [store](#store). An array's
+[metadata](#metadata) defines its shape, data type, chunk layout, and codecs.
+
+### Chunk
+
+The fundamental unit of data in a Zarr array. An array is divided into chunks
+along each dimension according to the [chunk grid](#chunk-grid), which is currently
+part of Zarr's private API. Each chunk is independently compressed and encoded
+through the array's [codec](#codec) pipeline.
+
+When [sharding](#shard) is used, "chunk" refers to the inner chunks within each
+shard, because those are the compressible units. The chunks are the smallest units
+that can be read independently.
+
+**API**: [`Array.chunks`][zarr.Array.chunks] returns the chunk shape. When
+sharding is used, this is the inner chunk shape.
+
+### Chunk Grid
+
+The partitioning of an array's elements into [chunks](#chunk). In Zarr V3, the
+chunk grid is defined in the array [metadata](#metadata) and determines the
+boundaries of each storage object.
+
+When sharding is used, the chunk grid defines the [shard](#shard) boundaries,
+not the inner chunk boundaries. The inner chunk shape is defined within the
+[sharding codec](#shard).
+
+**API**: The `chunk_grid` field in array metadata contains the storage-level
+grid.
+
+### Shard
+
+A storage object that contains multiple [chunks](#chunk). Sharding reduces the
+number of objects in a [store](#store) by grouping chunks together, which
+improves performance on file systems and object storage.
+
+Within each shard, chunks are compressed independently and can be read
+individually. However, writing requires updating the full shard for consistency,
+making shards the unit of writing and chunks the unit of reading.
+
+Sharding is implemented as a [codec](#codec) (the sharding indexed codec).
+When sharding is used:
+
+- The [chunk grid](#chunk-grid) in metadata defines the shard boundaries
+- The sharding codec's `chunk_shape` defines the inner chunk size
+- Each shard contains `shard_shape / chunk_shape` chunks per dimension
+
+**API**: [`Array.shards`][zarr.Array.shards] returns the shard shape, or `None`
+if sharding is not used. [`Array.chunks`][zarr.Array.chunks] returns the inner
+chunk shape.
+
+## Storage
+
+### Store
+
+A key-value storage backend that holds Zarr data and metadata. Stores implement
+the [`zarr.abc.store.Store`][] interface. Examples include local file systems,
+cloud object storage (S3, GCS, Azure), zip files, and in-memory dictionaries.
+
+Each [chunk](#chunk) or [shard](#shard) is stored as a single value (object or
+file) in the store, addressed by a key derived from its grid coordinates.
+
+### Metadata
+
+The JSON document (`zarr.json`) that describes an [array](#array) or group. For
+arrays, metadata includes the shape, data type, [chunk grid](#chunk-grid), fill
+value, and [codec](#codec) pipeline. Metadata is stored alongside the data in
+the [store](#store). Zarr-Python does not yet expose its internal metadata
+representation as part of its public API.
+
+## Codecs
+
+### Codec
+
+A transformation applied to array data during reading and writing. Codecs are
+chained into a pipeline and come in three types:
+
+- **Array-to-array**: Transforms like transpose that rearrange array elements
+- **Array-to-bytes**: Serialization that converts an array to a byte sequence
+ (exactly one required)
+- **Bytes-to-bytes**: Compression or checksums applied to the serialized bytes
+
+The [sharding indexed codec](#shard) is a special array-to-bytes codec that
+groups multiple [chunks](#chunk) into a single storage object.
+
+## API Properties
+
+The following properties are available on [`zarr.Array`][]:
+
+| Property | Description |
+|----------|-------------|
+| `.chunks` | Chunk shape — the inner chunk shape when sharding is used |
+| `.shards` | Shard shape, or `None` if no sharding |
+| `.nchunks` | Total number of independently compressible units across the array |
+| `.cdata_shape` | Number of independently compressible units per dimension |

From 63ff1bf42bd4e4f9a55373e655892b1dda0c7438 Mon Sep 17 00:00:00 2001
From: Max Jones <14077947+maxrjones@users.noreply.github.com>
Date: Wed, 11 Mar 2026 13:27:14 -0400
Subject: [PATCH 3/4] Update docs/user-guide/glossary.md

Co-authored-by: Davis Bennett
---
 docs/user-guide/glossary.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user-guide/glossary.md b/docs/user-guide/glossary.md
index cdb2ce3103..ad3bf2ba0e 100644
--- a/docs/user-guide/glossary.md
+++ b/docs/user-guide/glossary.md
@@ -38,7 +38,7 @@ grid.
 
 ### Shard
 
-A storage object that contains multiple [chunks](#chunk). Sharding reduces the
+A storage object that contains one or more [chunks](#chunk). Sharding reduces the
 number of objects in a [store](#store) by grouping chunks together, which
 improves performance on file systems and object storage.
 

From d3e7f0bc2abbbe4e18291e2eee928b186b375295 Mon Sep 17 00:00:00 2001
From: Max Jones <14077947+maxrjones@users.noreply.github.com>
Date: Wed, 11 Mar 2026 13:39:45 -0400
Subject: [PATCH 4/4] Add caveat

---
 docs/user-guide/glossary.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/user-guide/glossary.md b/docs/user-guide/glossary.md
index ad3bf2ba0e..a490b7c341 100644
--- a/docs/user-guide/glossary.md
+++ b/docs/user-guide/glossary.md
@@ -20,6 +20,13 @@ When [sharding](#shard) is used, "chunk" refers to the inner chunks within each
 shard, because those are the compressible units. The chunks are the smallest units
 that can be read independently.
 
+!!! warning "Convention specific to zarr-python"
+    The use of "chunk" to mean the inner sub-chunk within a shard is a convention
+    adopted by zarr-python's `Array` API. In the Zarr V3 specification and in other
+    Zarr implementations, "chunk" may refer to the top-level grid cells (which
+    zarr-python calls "shards" when the sharding codec is used). Be aware of this
+    distinction when working across libraries.
+
 **API**: [`Array.chunks`][zarr.Array.chunks] returns the chunk shape. When
 sharding is used, this is the inner chunk shape.
 
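
The shard and chunk arithmetic described in the glossary (`shard_shape / chunk_shape` chunks per dimension) can be sketched in plain Python. This is a minimal illustration, not zarr-python code: the helper names `chunks_per_shard` and `grid_shape` and the example shapes are hypothetical, and the sketch assumes that each shard dimension is an exact multiple of the inner chunk dimension and that the array shape is an exact multiple of the shard shape.

```python
from math import prod


def chunks_per_shard(shard_shape, chunk_shape):
    """Inner chunks per shard along each dimension: shard_shape / chunk_shape.

    Hypothetical helper for illustration; assumes exact divisibility.
    """
    if len(shard_shape) != len(chunk_shape):
        raise ValueError("shard_shape and chunk_shape must have the same rank")
    if any(s % c != 0 for s, c in zip(shard_shape, chunk_shape)):
        raise ValueError("each shard dimension must be a multiple of the chunk dimension")
    return tuple(s // c for s, c in zip(shard_shape, chunk_shape))


def grid_shape(array_shape, cell_shape):
    """Number of grid cells per dimension.

    Hypothetical helper for illustration; assumes cell_shape divides array_shape exactly.
    """
    if len(array_shape) != len(cell_shape):
        raise ValueError("array_shape and cell_shape must have the same rank")
    if any(a % c != 0 for a, c in zip(array_shape, cell_shape)):
        raise ValueError("this sketch only handles exactly divisible shapes")
    return tuple(a // c for a, c in zip(array_shape, cell_shape))


# Example shapes (assumptions chosen for this sketch).
array_shape = (1000, 1000)
shard_shape = (500, 500)  # top-level grid cell: one storage object each
chunk_shape = (100, 100)  # inner chunk: one independently compressed unit each

per_shard = chunks_per_shard(shard_shape, chunk_shape)
shard_grid = grid_shape(array_shape, shard_shape)
chunk_grid = grid_shape(array_shape, chunk_shape)

print("inner chunks per shard:", per_shard, "total", prod(per_shard))
print("shards (storage objects):", shard_grid, "total", prod(shard_grid))
print("inner chunks in the array:", chunk_grid, "total", prod(chunk_grid))
```

Under these assumptions the 2 x 2 shards are the storage objects (the unit of writing), while the 100 inner chunks are the units that can be read and decompressed independently.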