
Add arrow-ipc Array -> Bytes codec #41

Open
rabernat wants to merge 1 commit into zarr-developers:main from rabernat:codecs/arrow-ipc

Conversation

@rabernat rabernat commented Dec 3, 2025

This adds an Array -> Bytes codec which uses the Arrow IPC protocol for serialization. Further context is available in the design doc from the Zarr Summit.

This is a step towards better Arrow interoperability. In Zarr Python, this only works with NumPy arrays whose dtypes map trivially to Arrow dtypes.

Implementation: zarr-developers/zarr-python#3613


LDeakin (Member) commented Dec 3, 2025

Nice one, Ryan. This looks easy enough to support.

Two thoughts:

  • Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used
  • Permit a 2D array input as well, so that multiple columns of the same data type could be stored


## Configuration parameters

- `column_name`: the name of the column used for generating an Arrow record batch from the Zarr array data. Implementations SHOULD use the name of the Zarr array here.
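For illustration, a codec entry using this parameter might look like the following; the codec name string `"arrow-ipc"` and the example column name are assumptions, not fixed by this document:

```python
import json

# Hypothetical codec entry as it might appear in a zarr.json "codecs"
# list; the name "arrow-ipc" and column name "temperature" are
# illustrative only.
codec_entry = {
    "name": "arrow-ipc",
    "configuration": {"column_name": "temperature"},
}
print(json.dumps(codec_entry, indent=2))
```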
Contributor
Why should implementations use the name of the array here? An array only gets a name when it's stored, so IMO it might be better to recommend a default that can be determined purely from information available when creating array metadata.

Contributor

Yes, deciding on the "name" of the array would depend on various heuristics about how it is stored. It might be simpler to eliminate this parameter altogether and always use a fixed name like `data`.


jbms (Contributor) commented Dec 3, 2025

In general the Parquet format seems more in line with typical Zarr usage (long-term persistent storage) than the Arrow IPC format, and in particular it supports various compression strategies.

While you could compose this codec with additional bytes -> bytes compression codecs, Parquet might offer some advantages, like permitting random access to individual fields/sub-fields while still supporting compression.
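The composition mentioned here might look like the following codecs chain; both codec names and the configuration values are illustrative assumptions, not part of any spec:

```python
# Hypothetical "codecs" chain composing an array -> bytes Arrow IPC codec
# with a bytes -> bytes compressor; names and parameters are illustrative.
codecs = [
    {"name": "arrow-ipc", "configuration": {"column_name": "data"}},
    {"name": "zstd", "configuration": {"level": 3}},
]
```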


rabernat (Author) commented Dec 3, 2025

> Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used

That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:

  • Fail at runtime when the Arrow IPC codec is unable to process the ND array
  • Implement some kind of dependency system between codecs to make sure they are compatible

So I'd prefer to keep flattening as part of the codec itself.


jbms (Contributor) commented Dec 3, 2025

> > Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used
>
> That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:
>
> - Fail at runtime when the Arrow IPC codec is unable to process the ND array

The mismatch in number of dimensions could be detected when creating or opening the array, depending on what sort of validation the implementation does. That would certainly be better than waiting until the first read or write operation.

> - Implement some kind of dependency system between codecs to make sure they are compatible

Yes, optionally the implementation could automatically insert a reshape codec.

> So I'd prefer to keep flattening as part of the codec itself.

Nonetheless it seems reasonable to convert to 1D automatically, just like the bytes and packbits codecs.

Note: There are also the arrow Tensor and SparseTensor message types (https://github.com/apache/arrow/blob/main/format/Tensor.fbs); they seem to be independent of the Schema and RecordBatch messages. While they provide a direct arrow representation of multi-dimensional arrays, I imagine they are not widely supported by other software, and more importantly, they only support simple, fixed-size data types, and it seems that this codec is most useful for variable-length and struct data types.


jbms (Contributor) commented Dec 3, 2025

The codec description should say explicitly that the encoded representation should start with a Schema message, if that is the intention. The Schema can anyway be determined from the array data type, but including it would presumably allow the raw chunk file to be more easily consumed by other software.

@ilan-gold

> While they provide a direct arrow representation of multi-dimensional arrays, I imagine they are not widely supported by other software, and more importantly, they only support simple, fixed-size data types and it seems that this codec is most useful for variable-length and struct data types.

Continuing this line of thinking, what would be the meaning of an ND arrow representation without this feature? Does it even exist? I am definitely mostly interested in this feature because of arrow's flexibility with data types and less about ND-ness, i.e., storing 1D arrow types directly in zarr. Looking at the documentation, the memory layout of arrow arrays doesn't seem to allow for ND-ness, since it only has a length, so that would be our own invention? I suppose we could handle nesting as ND-ness, but distinguishing this from other kinds of ragged-shape nesting seems hard.

Furthermore, as a consumer of this codec, I would expect 0-copy, pure arrow arrays to be spat out of `zarr.Array.__getitem__` (or a buffer class containing them) and not a numpy-ified version.

For example:

```python
import pyarrow as pa

pa.array([{'x': 1, 'y': True}, {'z': 3.4, 'x': 4}]).to_numpy()
```

This fails because zero-copy is the default and this object can't be zero-copied into a `numpy.ndarray`.

So why wouldn't we restrict to 1D arrays here if there is no natural in-memory representation of ND arrays? Or is there?


rabernat (Author) commented Feb 26, 2026

I agree fully @ilan-gold. To take full advantage of this, we would need the materialized, in-memory result of an Array selection operation to be both Arrow-like AND ND-array-like.

The Zarr spec doesn't really say what the in-memory representation should be; this is implementation specific. Zarr Python is quite constrained by its assumption that everything is NumPy arrays, but that could be relaxed.

As for ND Arrow arrays, I've always thought this was conceptually pretty trivial, at least for a single contiguous array. One would just need a lightweight wrapper over the flat 1D array to reshape it into an ND array (e.g. an Arrow array of length 6 whose logical shape is (3, 2)). Again it comes back to ecosystem support; one could make that abstraction, but what software would work with it? This object would be neither an Arrow array nor a NumPy array.


ilan-gold commented Feb 26, 2026

Taking into account

> result of an Array selection operation to be both Arrow-like AND ND-array-like.
>
> The Zarr spec doesn't really say what the in-memory representation should be; this is implementation specific. Zarr Python is quite constrained by its assumption that everything is NumPy arrays, but that could be relaxed.
>
> Again it comes back to ecosystem support; one could make that abstraction, but what software would work with it? This object would be neither an Arrow array nor a NumPy array.

I think a custom buffer class sounds like a strong candidate. It would:

a) wrap arrow arrays to make them internally compatible with zarr-python for I/O, metadata, etc.
b) enforce 1D-ness (hypothetically, if everyone agrees that's a good idea)
c) potentially enforce zero-copy-ness (i.e., no one should call `to_numpy` on an arrow class, etc.)

I don't think zarr-python as a whole is tied to numpy; only some/all current codecs are: https://github.com/search?q=repo%3Azarr-developers%2Fzarr-python%20%22as_numpy_array%22&type=code. A potential hiccup is the contract of `NDArrayLike`, but I wonder how much of this contract is codec-specific. For example, `def view(...)` only appears to be used inside other codecs, and `ravel()` appears only in the `BytesCodec`. But I don't think this would be a problem in theory, because the codec mechanism proposed here would probably not hook into these other codecs in a way that would require an arrow-buffer-wrapper to satisfy this contract.
