
Add arrow-ipc Array -> Bytes codec #41

Open
rabernat wants to merge 1 commit into zarr-developers:main from rabernat:codecs/arrow-ipc

Conversation

@rabernat rabernat commented Dec 3, 2025

This adds an Array -> Bytes codec which uses the Arrow IPC protocol for serialization. Further context is available in the design doc from the Zarr Summit.

This is a step towards better Arrow interoperability. In Zarr Python, this only works with NumPy arrays whose dtypes map trivially to Arrow dtypes.

Implementation: zarr-developers/zarr-python#3613


LDeakin (Member) commented Dec 3, 2025

Nice one, Ryan. This looks easy enough to support.

Two thoughts:

  • Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used
  • Permit a 2D array input as well, so that multiple columns of the same data type could be stored


## Configuration parameters

- `column_name`: the name of the column used for generating an Arrow record batch from the Zarr array data. Implementations SHOULD use the name of the Zarr array here.
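For illustration, a codec entry using this parameter might look like the following; the codec name string `"arrow-ipc"` and the example column name are assumptions, not fixed by this document:

```python
import json

# Hypothetical codec entry as it might appear in a zarr.json "codecs"
# list; the name "arrow-ipc" and column name "temperature" are
# illustrative only.
codec_entry = {
    "name": "arrow-ipc",
    "configuration": {"column_name": "temperature"},
}
print(json.dumps(codec_entry, indent=2))
```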
Contributor
Why should implementations use the name of the array here? An array only gets a name when it's stored, so IMO it might be better to recommend a default that can be determined purely from information available when creating array metadata.

Contributor

Yes, deciding on the "name" of the array would depend on various heuristics about how it is stored. It might be simpler to eliminate this parameter altogether and always use a fixed name like `data`.


jbms (Contributor) commented Dec 3, 2025

In general the Parquet format seems more in line with typical Zarr usage (long-term persistent storage) than the Arrow IPC format, and in particular it supports various compression strategies.

While you could compose this codec with additional bytes -> bytes compression codecs, Parquet might offer some advantages, like permitting random access to individual fields/sub-fields while still supporting compression.
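The composition mentioned here might look like the following codecs chain; both codec names and the configuration values are illustrative assumptions, not part of any spec:

```python
# Hypothetical "codecs" chain composing an array -> bytes Arrow IPC codec
# with a bytes -> bytes compressor; names and parameters are illustrative.
codecs = [
    {"name": "arrow-ipc", "configuration": {"column_name": "data"}},
    {"name": "zstd", "configuration": {"level": 3}},
]
```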


rabernat (Author) commented Dec 3, 2025

> Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used

That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:

  • Fail at runtime when the Arrow IPC codec is unable to process the ND array
  • Implement some kind of dependency system between codecs to make sure they are compatible

So I'd prefer to keep flattening as part of the codec itself.


jbms (Contributor) commented Dec 3, 2025

> > Rather than embedding the flattening in this codec, the reshape array-to-array codec could be used
>
> That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:
>
> - Fail at runtime when the Arrow IPC codec is unable to process the ND array

The mismatch in number of dimensions could be detected when creating or opening the array, depending on what sort of validation the implementation does. That would certainly be better than waiting until the first read or write operation.

> - Implement some kind of dependency system between codecs to make sure they are compatible

Yes, optionally the implementation could automatically insert a reshape codec.

> So I'd prefer to keep flattening as part of the codec itself.

Nonetheless it seems reasonable to convert to 1D automatically, just like the bytes and packbits codecs.

Note: There are also the arrow Tensor and SparseTensor message types (https://github.com/apache/arrow/blob/main/format/Tensor.fbs); they seem to be independent of the Schema and RecordBatch messages. While they provide a direct arrow representation of multi-dimensional arrays, I imagine they are not widely supported by other software, and more importantly, they only support simple, fixed-size data types, and it seems that this codec is most useful for variable-length and struct data types.


jbms (Contributor) commented Dec 3, 2025

The codec description should say explicitly that the encoded representation should start with a Schema message, if that is the intention. The Schema can anyway be determined from the array data type, but including it would presumably allow the raw chunk file to be more easily consumed by other software.

@ilan-gold

> While they provide a direct arrow representation of multi-dimensional arrays, I imagine they are not widely supported by other software, and more importantly, they only support simple, fixed-size data types and it seems that this codec is most useful for variable-length and struct data types.

Continuing this line of thinking, what would be the meaning of an ND arrow representation without this feature? Does it even exist? I am definitely mostly interested in this feature because of arrow's flexibility with data types and less about ND-ness, i.e., storing 1D arrow types directly in zarr. Looking at the documentation, the memory layout of arrow arrays doesn't seem to allow for ND-ness, since it only has a length, so that would be our own invention? I suppose we could handle nesting as ND-ness, but distinguishing this from other kinds of ragged-shape nesting seems hard.

Furthermore, as a consumer of this codec, I would expect 0-copy, pure arrow arrays to be spat out of `zarr.Array.__getitem__` (or a buffer class containing them) and not a numpy-ified version.

For example:

```python
import pyarrow as pa

pa.array([{'x': 1, 'y': True}, {'z': 3.4, 'x': 4}]).to_numpy()
```

This fails because zero-copy is the default and this object can't be zero-copied into a `numpy.ndarray`.

So why wouldn't we restrict to 1D arrays here if there is no natural in-memory representation of ND arrays? Or is there?


rabernat (Author) commented Feb 26, 2026

I agree fully @ilan-gold. To take full advantage of this, we would need the materialized, in-memory result of an Array selection operation to be both Arrow-like AND ND-array-like.

The Zarr spec doesn't really say what the in-memory representation should be; this is implementation specific. Zarr Python is quite constrained by its assumption that everything is NumPy arrays, but that could be relaxed.

As for ND Arrow arrays, I've always thought this was conceptually pretty trivial, at least for a single contiguous array. One would just need a lightweight wrapper over the flat 1D array to reshape it into an ND array (e.g. an Arrow array of length 6 whose logical shape is (3, 2)). Again it comes back to ecosystem support; one could make that abstraction, but what software would work with it? This object would be neither an Arrow array nor a NumPy array.


ilan-gold commented Feb 26, 2026

Taking into account

> result of an Array selection operation to be both Arrow-like AND ND-array-like.
>
> The Zarr spec doesn't really say what the in-memory representation should be; this is implementation specific. Zarr Python is quite constrained by its assumption that everything is NumPy arrays, but that could be relaxed.
>
> Again it comes back to ecosystem support; one could make that abstraction, but what software would work with it? This object would be neither an Arrow array nor a NumPy array.

I think a custom buffer class sounds like a strong candidate. It would:

a) wrap arrow arrays to make them internally compatible with zarr-python for I/O, metadata, etc.
b) enforce 1D-ness (hypothetically, if everyone agrees that's a good idea)
c) potentially enforce zero-copy-ness (i.e., no one should call `to_numpy` on an arrow class, etc.)

I don't think zarr-python as a whole is tied to numpy; only some/all current codecs are: https://github.com/search?q=repo%3Azarr-developers%2Fzarr-python%20%22as_numpy_array%22&type=code. A potential hiccup is the contract of `NDArrayLike`, but I wonder how much of this contract is codec-specific. For example, `def view(...)` only appears to be used inside other codecs, and `ravel()` appears only in the `BytesCodec`. But I don't think this would be a problem in theory, because the codec mechanism proposed here would probably not hook into these other codecs in a way that would require an arrow-buffer-wrapper to satisfy this contract.
