110 changes: 110 additions & 0 deletions docs/user-guide/glossary.md
@@ -0,0 +1,110 @@
# Glossary

This page defines key terms used throughout the zarr-python documentation and API.

## Array Structure

### Array

An N-dimensional typed array stored in a Zarr [store](#store). An array's
[metadata](#metadata) defines its shape, data type, chunk layout, and codecs.

### Chunk

The fundamental unit of data in a Zarr array. An array is divided into chunks
along each dimension according to the [chunk grid](#chunk-grid), which is currently
part of zarr-python's private API. Each chunk is independently compressed and
encoded through the array's [codec](#codec) pipeline.

When [sharding](#shard) is used, "chunk" refers to the inner chunks within each
shard, because those are the compressible units. These inner chunks are the
smallest units that can be read independently.

!!! warning "Convention specific to zarr-python"
    The use of "chunk" to mean the inner sub-chunk within a shard is a convention
    adopted by zarr-python's `Array` API. In the Zarr V3 specification and in other
    Zarr implementations, "chunk" may refer to the top-level grid cells (which
    zarr-python calls "shards" when the sharding codec is used). Be aware of this
    distinction when working across libraries.

**API**: [`Array.chunks`][zarr.Array.chunks] returns the chunk shape. When
sharding is used, this is the inner chunk shape.
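
To make the division into chunks concrete, the sketch below maps an element index to the chunk that holds it, assuming a regular grid in which every chunk has the same shape. The helper `locate_element` is hypothetical and is not part of the zarr-python API.

```python
def locate_element(index, chunk_shape):
    """Return (chunk coordinates, offset inside that chunk) for one element.

    Assumes a regular chunk grid: every chunk has the same shape.
    """
    chunk_coords = tuple(i // c for i, c in zip(index, chunk_shape))
    offset = tuple(i % c for i, c in zip(index, chunk_shape))
    return chunk_coords, offset


# Element (250, 1030) of an array chunked as (100, 100)
coords, offset = locate_element((250, 1030), chunk_shape=(100, 100))
print(coords, offset)  # (2, 10) (50, 30)
```

Reading that single element only requires decoding chunk `(2, 10)`; no other chunk needs to be touched.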

### Chunk Grid

The partitioning of an array's elements into [chunks](#chunk). In Zarr V3, the
chunk grid is defined in the array [metadata](#metadata) and determines the
boundaries of each storage object.

When sharding is used, the chunk grid defines the [shard](#shard) boundaries,
not the inner chunk boundaries. The inner chunk shape is defined within the
[sharding codec](#shard).

**API**: The `chunk_grid` field in array metadata contains the storage-level
grid.
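
The difference between the storage-level grid and the inner chunk shape can be seen by reading both out of an array metadata document. The sketch below uses a hand-written dictionary that follows the field names of the Zarr V3 specification; the shapes are illustrative, and the helper `grid_and_inner_chunk_shape` is hypothetical rather than a zarr-python function.

```python
# Hand-written, abbreviated stand-in for the contents of a zarr.json file.
metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [1000, 1000],
    "data_type": "float32",
    "chunk_grid": {
        "name": "regular",
        "configuration": {"chunk_shape": [500, 500]},
    },
    "codecs": [
        {
            "name": "sharding_indexed",
            "configuration": {"chunk_shape": [100, 100]},
        }
    ],
}


def grid_and_inner_chunk_shape(meta):
    """Return (storage-level grid cell shape, inner chunk shape or None)."""
    grid_shape = tuple(meta["chunk_grid"]["configuration"]["chunk_shape"])
    inner_shape = None
    for codec in meta["codecs"]:
        if codec["name"] == "sharding_indexed":
            inner_shape = tuple(codec["configuration"]["chunk_shape"])
    return grid_shape, inner_shape


grid_shape, inner_shape = grid_and_inner_chunk_shape(metadata)
print(grid_shape)   # (500, 500): the shard shape
print(inner_shape)  # (100, 100): the inner chunk shape
```

Without a sharding codec the second value is `None`, and the grid cell shape is itself the chunk shape.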

### Shard

A storage object that contains one or more [chunks](#chunk). Sharding reduces the
number of objects in a [store](#store) by grouping chunks together, which
improves performance on file systems and object storage, where large numbers of
small objects are costly to create, list, and fetch.

Within each shard, chunks are compressed independently and can be read
individually. However, writing requires updating the full shard for consistency,
making shards the unit of writing and chunks the unit of reading.

Sharding is implemented as a [codec](#codec) (the sharding indexed codec).
When sharding is used:

- The [chunk grid](#chunk-grid) in metadata defines the shard boundaries
- The sharding codec's `chunk_shape` defines the inner chunk size
- Each shard contains `shard_shape / chunk_shape` chunks per dimension

**API**: [`Array.shards`][zarr.Array.shards] returns the shard shape, or `None`
if sharding is not used. [`Array.chunks`][zarr.Array.chunks] returns the inner
chunk shape.
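
The relationship between the three shapes is simple integer arithmetic. The sketch below works it through for illustrative shapes; both helpers are hypothetical, and the divisibility check reflects the sharding codec's requirement that the inner chunk shape evenly divide the shard shape.

```python
def chunks_per_shard(shard_shape, chunk_shape):
    """Number of inner chunks along each dimension of one shard."""
    for s, c in zip(shard_shape, chunk_shape):
        if s % c != 0:
            raise ValueError("chunk shape must evenly divide shard shape")
    return tuple(s // c for s, c in zip(shard_shape, chunk_shape))


def shards_per_dimension(array_shape, shard_shape):
    """Number of shards along each dimension (ceiling division)."""
    return tuple(-(-n // s) for n, s in zip(array_shape, shard_shape))


per_shard = chunks_per_shard((500, 500), (100, 100))
n_shards = shards_per_dimension((1000, 1000), (500, 500))
print(per_shard)  # (5, 5): 25 inner chunks in each shard
print(n_shards)   # (2, 2): 4 storage objects in total
```

With these shapes, the same array stored without sharding at a chunk shape of `(100, 100)` would need 100 storage objects instead of 4.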

## Storage

### Store

A key-value storage backend that holds Zarr data and metadata. Stores implement
the [`zarr.abc.store.Store`][] interface. Examples include local file systems,
cloud object storage (S3, GCS, Azure), zip files, and in-memory dictionaries.

Each [chunk](#chunk) or [shard](#shard) is stored as a single value (object or
file) in the store, addressed by a key derived from its grid coordinates.
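
As an illustration of how a key can be derived from grid coordinates, the sketch below follows the default chunk key encoding of the Zarr V3 specification: the prefix `c` followed by the coordinates, joined with a separator. The function `chunk_key` is a hypothetical helper, and it produces only the chunk portion of the key; the array's own path within the store is assumed to be prepended separately.

```python
def chunk_key(coords, separator="/"):
    """Build the chunk portion of a store key from grid coordinates.

    Follows the "default" chunk key encoding of the Zarr V3 specification.
    """
    return separator.join(["c", *map(str, coords)])


print(chunk_key((2, 10)))                 # c/2/10
print(chunk_key((2, 10), separator="."))  # c.2.10
print(chunk_key(()))                      # c  (zero-dimensional array)
```

When sharding is used, the coordinates are those of the shard in the storage-level grid, not of an inner chunk.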

### Metadata

The JSON document (`zarr.json`) that describes an [array](#array) or group. For
arrays, metadata includes the shape, data type, [chunk grid](#chunk-grid), fill
value, and [codec](#codec) pipeline. Metadata is stored alongside the data in
the [store](#store). zarr-python does not yet expose its internal metadata
representation as part of its public API.

## Codecs

### Codec

A transformation applied to array data during reading and writing. Codecs are
chained into a pipeline and come in three types:

- **Array-to-array**: Transforms like transpose that rearrange array elements
- **Array-to-bytes**: Serialization that converts an array to a byte sequence
  (exactly one required)
- **Bytes-to-bytes**: Compression or checksums applied to the serialized bytes

The [sharding indexed codec](#shard) is a special array-to-bytes codec that
groups multiple [chunks](#chunk) into a single storage object.
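
The three stages can be sketched with the standard library alone. The example below is a toy pipeline, not zarr-python's codec classes: a transpose stands in for the array-to-array step, `struct` packing of little-endian 32-bit integers for the array-to-bytes step, and `zlib` for the bytes-to-bytes step. All function names are hypothetical.

```python
import struct
import zlib


def transpose(rows):
    """Array-to-array: rearrange elements without changing their values."""
    return [list(col) for col in zip(*rows)]


def to_bytes(rows):
    """Array-to-bytes: serialize as little-endian 32-bit integers."""
    flat = [value for row in rows for value in row]
    return struct.pack(f"<{len(flat)}i", *flat)


def from_bytes(data, n_rows, n_cols):
    """Inverse of to_bytes for a known shape."""
    flat = struct.unpack(f"<{n_rows * n_cols}i", data)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


def encode(rows):
    """Apply the stages in order: transpose, serialize, compress."""
    return zlib.compress(to_bytes(transpose(rows)))


def decode(blob, n_rows, n_cols):
    """Undo the stages in reverse order."""
    # The stored layout is transposed, so its shape is (n_cols, n_rows).
    stored = from_bytes(zlib.decompress(blob), n_cols, n_rows)
    return transpose(stored)


chunk = [[1, 2, 3], [4, 5, 6]]
blob = encode(chunk)
assert decode(blob, n_rows=2, n_cols=3) == chunk
```

Decoding applies the inverse of each stage in reverse order, which is why the serialization step sits between the other two.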

## API Properties

The following properties are available on [`zarr.Array`][]:

| Property | Description |
|----------|-------------|
| `.chunks` | Chunk shape — the inner chunk shape when sharding is used |
| `.shards` | Shard shape, or `None` if no sharding |
| `.nchunks` | Total number of independently compressible units across the array |
| `.cdata_shape` | Number of independently compressible units per dimension |
4 changes: 4 additions & 0 deletions docs/user-guide/index.md
@@ -35,6 +35,10 @@ Take your skills to the next level:
- **[Extending](extending.md)** - Extend functionality with custom code
- **[Consolidated Metadata](consolidated_metadata.md)** - Advanced metadata management

## Reference

- **[Glossary](glossary.md)** - Definitions of key terms (chunks, shards, codecs, etc.)

## Need Help?

- Browse the [API Reference](../api/zarr/index.md) for detailed function documentation
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -27,6 +27,7 @@ nav:
- user-guide/gpu.md
- user-guide/consolidated_metadata.md
- user-guide/experimental.md
- user-guide/glossary.md
- Examples:
- user-guide/examples/custom_dtype.md
- API Reference: