Zarr's Flexibility

One of Zarr's greatest strengths is its flexibility, or "hackability". This largely comes from the separation of distinct Zarr Components, but there is a range of other properties that make zarr flexible too.

Types of flexibility

This flexibility comes in several forms:

  • The Zarr protocol is device agnostic.
  • The Zarr data model is domain agnostic.
  • Key-value stores are an almost universal abstraction in data systems, and so can almost always be mapped to existing system interfaces.
  • The Zarr format on-disk is extremely simple.
  • Storing each chunk under a different key allows implementations to scale their IO throughput in a variety of simple ways.
  • The reference Zarr implementation is written in Python, a very hackable language, with ABCs you can use when creating new store implementations.
  • Components are separated: the protocol, file format, standard API, ABC, and store implementations are all separate.
  • There is no requirement to use more than one zarr component - individual projects can achieve powerful functionality by intelligently using only some of the Zarr components.
  • You can define your own codecs.
  • You are free to create your own domain-specific metadata standard and enforce it upon zarr stores however you like.
  • Zarr v3 has nascent support for other extension points, including defining your own type of chunk grid, data types, and more.
  • Zarr Enhancement Proposals (or "ZEPs") provide a mechanism for enhancing or adding to the specification in a community-standardized way.
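
To make the key-value and chunk-per-key points above concrete, here is a minimal sketch using only the Python standard library. It is not the real Zarr format and does not use zarr-python: the `meta.json` document, the helper functions, and the `c/<index>` key names are simplified stand-ins chosen for illustration. It only shows why any mapping from string keys to bytes can act as a store, and why chunks can be read independently of each other.

```python
import json
import struct
import zlib
from collections.abc import MutableMapping


def write_array(store: MutableMapping, values: list[int], chunk_len: int) -> None:
    """Write a 1-D list of ints into `store`, one key per chunk."""
    # Simplified metadata document (an assumption, NOT the real Zarr schema).
    store["meta.json"] = json.dumps(
        {"length": len(values), "chunk_len": chunk_len, "codec": "zlib"}
    ).encode()
    for start in range(0, len(values), chunk_len):
        chunk = values[start : start + chunk_len]
        raw = struct.pack(f"<{len(chunk)}q", *chunk)  # little-endian int64
        # Each chunk lives under its own key, so chunks can be fetched separately.
        store[f"c/{start // chunk_len}"] = zlib.compress(raw)


def read_chunk(store: MutableMapping, index: int) -> list[int]:
    """Decode a single chunk without touching any other key."""
    raw = zlib.decompress(store[f"c/{index}"])
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


def read_array(store: MutableMapping) -> list[int]:
    """Reassemble the full array from its metadata and chunks."""
    meta = json.loads(store["meta.json"])
    n_chunks = -(-meta["length"] // meta["chunk_len"])  # ceiling division
    out: list[int] = []
    for index in range(n_chunks):
        out.extend(read_chunk(store, index))
    return out


# A plain dict is enough to act as the store in this sketch.
store: dict[str, bytes] = {}
write_array(store, list(range(10)), chunk_len=4)
```

Because the functions only rely on the mapping interface, the `dict` could be swapped for any object that maps keys to bytes (a directory wrapper, an object-store client, a database table) without changing the read and write logic.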

Examples

Here are a few zarr-related software projects, each of which uses a subset of the zarr components to achieve interesting functionality. These particular projects are more than simply zarr implementations written in a different language (those are covered in the full list of implementations).

  • MongoDBStore is a concrete store implementation in python, which stores values in a MongoDB NoSQL database under zarr keys. It is therefore spec-compliant, and can be interacted with via the zarr-python user API, but does not write data in the native zarr format.

  • VirtualiZarr provides a concrete store implementation in python (the ManifestStore) which stores, inside "chunk manifests", references to the locations and byte ranges of chunks that reside inside files stored in other binary formats such as netCDF. These references are generated by "readers", which do the job of parsing the file structure and mapping the contents to the zarr data model. VirtualiZarr therefore eschews the native zarr format but still provides spec-compliant access to non-zarr-formatted data using zarr-python's API, without duplicating the original data. The manifests effectively act as an indirection layer between the zarr-spec-compliant key interface and the actual location of the chunks in storage.

  • NCZarr and Lindi can both in some sense be considered as the opposite of VirtualiZarr - they allow interacting with zarr-formatted data on disk via a non-zarr API. Lindi maps zarr's data model to the HDF data model and allows access via the h5py library through the LindiH5pyFile class. NCZarr allows interacting with zarr-formatted data via the netcdf-c library. Note that both libraries implement optional additional optimizations by going beyond the zarr specification and format on disk, which is not recommended.

  • Tensorstore is a general storage library written in C++ that can write to the Zarr format (so is a spec-compliant non-python "native" store implementation) but also to other array formats such as N5. As it can write to multiple different storage systems, it effectively has its own set of concrete store implementations. Additional features are provided, notably using an Optionally-Cooperative Distributed B+Tree (OCDBT) on top of a base key-value store to implement ACID transactions. It still stores all data using the native Zarr Format, but versions keys at the store level.

  • Icechunk is a cloud-native tensor storage engine which also provides ACID transactions, but does so via indirection between a zarr-spec-compliant key-value store interface and a specialized non-zarr-native storage layout on-disk (for which Icechunk has its own format specification). Whilst the core icechunk client is written in rust, the icechunk-python client implements a concrete subclass of the zarr-python Store ABC. Therefore libraries such as xarray can use the zarr-python user API to read and write to icechunk stores, effectively treating them as version-controlled zarr stores. Icechunk also integrates with VirtualiZarr as a serialization format for byte range references. Together they allow data stored in non-zarr formats to be committed to a persistent icechunk store and read back later via the zarr-python API without duplicating the original data chunks.
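
The indirection idea that several of these projects share can be sketched in a few lines of standard-library Python. This is an illustration only: `ByteRangeStore`, the manifest layout, and the fake "legacy" file are hypothetical and are not the VirtualiZarr or Icechunk API. The sketch shows a read-only key-value view whose keys resolve to byte ranges inside an existing file, so the chunk bytes are never copied into a new format.

```python
import os
import tempfile
from collections.abc import Iterator, Mapping


class ByteRangeStore(Mapping):
    """Read-only key-value view over byte ranges in existing files (hypothetical)."""

    def __init__(self, manifest: dict[str, tuple[str, int, int]]) -> None:
        # Manifest entries: key -> (file path, byte offset, byte length).
        self._manifest = manifest

    def __getitem__(self, key: str) -> bytes:
        path, offset, length = self._manifest[key]
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def __iter__(self) -> Iterator[str]:
        return iter(self._manifest)

    def __len__(self) -> int:
        return len(self._manifest)


# Stand-in for a file in some other binary format:
# an 8-byte header followed by two "chunks" of 4 and 6 bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(b"HEADER--" + b"AAAA" + b"BBBBBB")
    legacy_path = f.name

# A real reader would parse the file structure to produce these references;
# here they are written by hand.
store = ByteRangeStore({
    "c/0": (legacy_path, 8, 4),
    "c/1": (legacy_path, 12, 6),
})

chunk0 = store["c/0"]
chunk1 = store["c/1"]
os.remove(legacy_path)  # clean up the temporary file
```

The caller sees only keys and bytes; where those bytes physically live is a detail of the manifest, which is what lets a layer like this sit underneath a spec-compliant key interface.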

We also have a full list of zarr implementations.