-
Notifications
You must be signed in to change notification settings - Fork 110
GH-946: Add Variant extension type support #947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
|
We're about to overhaul the API for implementing extension writers so you may want to hold off |
Thanks @lidavidm, I will wait for that, then rebase. You are talking about the |
2b45ca4 to
a297cc6
Compare
| @Override | ||
| public Object getObject(int index) { | ||
| return getUnderlyingVector().getObject(index); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it is useful this way. I don't have a strong opinion though what should it return instead. @lidavidm, do you have an idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it stands, this returns the raw struct row? I suppose decoding to a Java object may be what's expected but that may be expensive. Or returning an explicit Variant object that has methods for inspecting/manipulating the data in that row
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if the variant representation of parquet-java would be suitable here. That is a java object backed by nio ByteBuffers. It would be beneficial as arrow-java would not need to maintain/test the related code. Meanwhile, it does not support returning "rich" objects (e.g. timestamps) but the primitive representations only (e.g. long for a timestamp). I'm not sure how it would fit the current concept.
Alternatively, we would be able to wrap the parquet-java Variant object and extend it with those rich types the different Arrow vectors support (e.g. LocalDateTime for a timestamp nano value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we go this route, I'd rather have an Arrow Variant to avoid having to break APIs down the line if we decide to extend functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be sure I understand you correctly, @lidavidm, do you agree with the following?
Add the new parquet-java dependency and use its variant implementation as a bases of a new Arrow Variant class to be returned by VariantVector.getObject. This class can return the corresponding java objects for the related primitives according to the java objects returned by their typed vectors. No parquet-java classes shall face the public API of Arrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with arrow-parquet-variant approach. It cleanly isolated the dependency (to avoid any side effect in other parts of Arrow).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I've misinterpreted @jbonofre's suggestion. I wanted to reference the parquet-java's parquet-variant module. That contains only 10 prod classes. But right, it has a dependency on other parquet-java modules as well. So, we have two options:
- Create the separate
arrow-variantmodule that contains any variant related classes, including the ones already included in this PR. - Duplicate the variant related code from
parquet-java, and implement our own classes based on it in the existingvectormodule.
I do not have a strong opinion which one would be the best choice for Arrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was my concern about parquet-variant: the "parent" and transitive dependencies. I don't think we should fetch a large part of Parquet in Arrow.
I have a preference to create arrow-variant module with clearly defined dependencies.
Thoughts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. @tmater, could you update your PR accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the discussion! I've updated the PR to introduce a separate arrow-variant module that depends on parquet-variant for the binary format handling. The Variant class wraps the Parquet implementation and is returned by VariantVector.getObject, keeping Parquet classes out of the public Arrow API.
Let me know what you think!
Implements VariantType extension type with VariantVector for storing variant data with metadata and value buffers. Includes reader/writer implementations and comprehensive test coverage.
a297cc6 to
3a0d927
Compare
gszadovszky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All variant related classes and code changes shall be placed in the new arrow-variant module so the user can decide if they need variant and accept the related transitive dependencies. No other arrow-java module shall depend on arrow-variant.
arrow-variant/src/main/java/org/apache/arrow/variant/extension/VariantType.java
Show resolved
Hide resolved
arrow-variant/src/main/java/org/apache/arrow/variant/extension/VariantVector.java
Show resolved
Hide resolved
vector/src/test/java/org/apache/arrow/vector/types/pojo/TestExtensionType.java
Show resolved
Hide resolved
| public ByteBuffer getValueBuffer() { | ||
| return delegate.getValueBuffer(); | ||
| } | ||
|
|
||
| public ByteBuffer getMetadataBuffer() { | ||
| return delegate.getMetadataBuffer(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should expose these to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This follows how parquet-variant defines it, the ByteBuffer is read-only.
I’m happy to change it; I just wanted to call it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it is useful for the users, but do not have a strong opinion to remove it.
gszadovszky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
@gszadovszky thanks ! I just ran the CI and fix the PR meta (milestone). I will do a new pass asap. |
Summary
This PR adds support for the Variant extension type in Arrow Java, enabling storage and manipulation of semi-structured variant data with metadata and value buffers.
Changes
A new
arrow-variantmodule introduces theVariantclass for parsing and working with variant data. This module is separated from the core vector module to isolate theparquet-variantdependency, so users of the Arrow vector library don't have to depend on Parquet. This also maintains a clean API boundary between Arrow's core functionality and variant-specific parsing logic.The core vector module gains
VariantTypeas an extension type along withVariantVectorfor storing variant data as metadata/value buffer pairs. The implementation includes reader and writer support throughVariantReaderImpl,VariantWriterImpl, andNullableVariantHolderReaderImpl, with corresponding holder classes for use in generated code paths.Testing
VariantType,VariantVector, andVariantparsingListVectorandMapVectorCloses #946