Add arrow-ipc Array -> Bytes codec #41
Conversation
Nice one, Ryan. This looks easy enough to support. Two thoughts:
## Configuration parameters
- `column_name`: the name of the column used for generating an Arrow record batch from the Zarr array data. Implementations SHOULD use the name of the Zarr array here.
Why should implementations use the name of the array here? An array only gets a name when it's stored, so IMO it might be better to recommend a default that can be determined purely from information available when creating array metadata.
Yes, deciding on the "name" of the array would depend on various heuristics depending on how it is stored. It might be simpler to eliminate this parameter altogether and always use a fixed name like `data`.
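For concreteness, codec metadata along these lines might look like the sketch below. The `"arrow-ipc"` codec name is taken from this PR's title; defaulting `column_name` to `"data"` is only the proposal from this thread, not settled spec text:

```json
{
  "name": "arrow-ipc",
  "configuration": {
    "column_name": "data"
  }
}
```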
In general the Parquet format seems to be more in line with normal usage of zarr (long-term persistent storage) than the Arrow IPC format, and in particular supports various compression strategies. While you could compose this codec with additional bytes -> bytes compression codecs, Parquet might offer some advantages, like permitting random access to individual fields/sub-fields while still supporting compression.
That's an interesting suggestion. My inclination is to resist it for the following reason: it introduces required coupling between codecs. Arrow Arrays MUST be 1D, full stop. What happens if we accept this suggestion and then I set up a Zarr Array without the reshape codec? It will either:
So I'd prefer to keep flattening as part of the codec itself.
The mismatch in number of dimensions could be detected when creating or opening the array, depending on what sort of validation the implementation does. That would certainly be better than waiting until the first read or write operation.
Yes, optionally the implementation could automatically insert a reshape codec.
Nonetheless it seems reasonable to convert to 1d automatically, just like the

Note: There are also the arrow
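Keeping the flattening inside the codec, as argued above, can be sketched roughly like this (a toy model of the encode/decode shape handling, not the actual zarr-python codec API):

```python
import numpy as np

def arrow_encode_prep(chunk: np.ndarray) -> np.ndarray:
    # Arrow arrays are strictly 1D, so the codec itself linearizes
    # the ND chunk in C order before building the record batch.
    return chunk.reshape(-1)

def arrow_decode_finish(flat: np.ndarray, shape: tuple) -> np.ndarray:
    # On decode, the chunk shape from the array metadata restores ND-ness,
    # so no separate reshape codec needs to be configured by the user.
    return flat.reshape(shape)

chunk = np.arange(6, dtype="int32").reshape(3, 2)
roundtripped = arrow_decode_finish(arrow_encode_prep(chunk), chunk.shape)
assert np.array_equal(roundtripped, chunk)
```

Because the shape restoration happens inside the codec, a mismatched or missing external reshape step simply cannot occur.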
The codec description should say explicitly that the encoded representation should start with a
Continuing on this line of thinking, what would be the meaning of an ND arrow representation without this feature? Does it even exist? I am definitely mostly interested in this feature personally because of arrow's flexibility with data types and less about ND-ness, i.e., storing 1D arrow types directly in zarr.

Looking at the documentation, the memory layout of arrow arrays doesn't seem to allow for ND-ness since it only has a length, so that would be our own feature? I suppose we could handle nesting as ND-ness, but distinguishing this from other kinds of ragged-shape nesting seems hard.

Furthermore, as a consumer of this codec, I would expect 0-copy, pure arrow arrays to be spat out of

For example:

Fails because 0-copy is the default and this object can't be 0-copied into

So why wouldn't we restrict to 1d arrays here if there is no natural in-memory representation of nd arrays? Or is there?
I agree fully @ilan-gold - to take full advantage of this, we would need the materialized, in-memory result of an Array selection operation to be both Arrow-like AND ND-array-like. The Zarr spec doesn't really say what the in-memory representation should be; this is implementation specific. Zarr Python is quite constrained by its assumption that everything is NumPy arrays, but that could be relaxed.

As for ND Arrow arrays, I've always thought this was conceptually pretty trivial, at least for a single contiguous array. One would just need a lightweight wrapper over the flat 1D array to reshape it into an ND array (e.g. Arrow array of length 6, actual shape is (3, 2)). Again it comes back to ecosystem support; one could make that abstraction, but what software would work with it? This object would be neither an Arrow array nor a NumPy array.
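The lightweight wrapper described here is easy to prototype. A minimal sketch (`NDView` is a hypothetical class; a plain Python list stands in for the flat Arrow buffer):

```python
import math

class NDView:
    """Hypothetical row-major (C-order) ND view over a flat 1D buffer."""

    def __init__(self, flat, shape):
        if len(flat) != math.prod(shape):
            raise ValueError("flat buffer length does not match shape")
        self.flat = flat
        self.shape = tuple(shape)
        # Row-major strides: the last axis varies fastest.
        strides = []
        acc = 1
        for dim in reversed(self.shape):
            strides.append(acc)
            acc *= dim
        self.strides = tuple(reversed(strides))

    def __getitem__(self, idx):
        # idx is a full tuple of integer indices, one per axis.
        pos = sum(i * s for i, s in zip(idx, self.strides))
        return self.flat[pos]

# Flat buffer of length 6 viewed with shape (3, 2), as in the example above.
view = NDView([10, 11, 20, 21, 30, 31], (3, 2))
assert view[0, 1] == 11
assert view[2, 0] == 30
```

The wrapper never copies; it only translates ND indices into offsets into the shared flat buffer, which is exactly the ecosystem-support question raised above: the object is cheap to build but foreign to both Arrow and NumPy consumers.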
Taking into account
I think a custom buffer class sounds like a strong candidate and it would a) wrap arrow arrays to make them internally compatible with zarr-python for i/o, metadata etc. I don't think zarr-python on the whole is tied into
This adds an Array -> Bytes codec which uses the Arrow IPC protocol for serialization. Further context available in the design doc from the Zarr Summit.
This is a step towards better Arrow interoperability. In Zarr Python, this only works with NumPy arrays whose dtypes map trivially to Arrow dtypes.
Implementation: zarr-developers/zarr-python#3613