Skip to content

Conversation

@tmater
Copy link
Contributor

@tmater tmater commented Jan 7, 2026

Summary

This PR adds support for the Variant extension type in Arrow Java, enabling storage and manipulation of semi-structured variant data with metadata and value buffers.

Changes

A new arrow-variant module introduces the Variant class for parsing and working with variant data. This module is separated from the core vector module to isolate the parquet-variant dependency, so users of the Arrow vector library don't have to depend on Parquet. This also maintains a clean API boundary between Arrow's core functionality and variant-specific parsing logic.

The core vector module gains VariantType as an extension type along with VariantVector for storing variant data as metadata/value buffer pairs. The implementation includes reader and writer support through VariantReaderImpl, VariantWriterImpl, and NullableVariantHolderReaderImpl, with corresponding holder classes for use in generated code paths.

Testing

  • Unit tests for VariantType, VariantVector, and Variant parsing
  • Integration tests with ListVector and MapVector
  • Extension type round-trip tests

Closes #946

@github-actions

This comment has been minimized.

@lidavidm
Copy link
Member

lidavidm commented Jan 7, 2026

We're about to overhaul the API for implementing extension writers so you may want to hold off

@tmater
Copy link
Contributor Author

tmater commented Jan 7, 2026

We're about to overhaul the API for implementing extension writers so you may want to hold off

Thanks @lidavidm, I will wait for that, then rebase. You are talking about the ExtensionTypeWriterFactory related changes, right?

Comment on lines 148 to 171
@Override
public Object getObject(int index) {
return getUnderlyingVector().getObject(index);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is useful this way. I don't have a strong opinion though what should it return instead. @lidavidm, do you have an idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands, this returns the raw struct row? I suppose decoding to a Java object may be what's expected but that may be expensive. Or returning an explicit Variant object that has methods for inspecting/manipulating the data in that row

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the variant representation of parquet-java would be suitable here. That is a java object backed by nio ByteBuffers. It would be beneficial as arrow-java would not need to maintain/test the related code. Meanwhile, it does not support returning "rich" objects (e.g. timestamps) but the primitive representations only (e.g. long for a timestamp). I'm not sure how it would fit the current concept.
Alternatively, we would be able to wrap the parquet-java Variant object and extend it with those rich types the different Arrow vectors support (e.g. LocalDateTime for a timestamp nano value).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we go this route, I'd rather have an Arrow Variant to avoid having to break APIs down the line if we decide to extend functionality

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be sure I understand you correctly, @lidavidm, do you agree with the following?

Add the new parquet-java dependency and use its variant implementation as a bases of a new Arrow Variant class to be returned by VariantVector.getObject. This class can return the corresponding java objects for the related primitives according to the java objects returned by their typed vectors. No parquet-java classes shall face the public API of Arrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with arrow-parquet-variant approach. It cleanly isolated the dependency (to avoid any side effect in other parts of Arrow).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I've misinterpreted @jbonofre's suggestion. I wanted to reference the parquet-java's parquet-variant module. That contains only 10 prod classes. But right, it has a dependency on other parquet-java modules as well. So, we have two options:

  • Create the separate arrow-variant module that contains any variant related classes, including the ones already included in this PR.
  • Duplicate the variant related code from parquet-java, and implement our own classes based on it in the existing vector module.

I do not have a strong opinion which one would be the best choice for Arrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was my concern about parquet-variant: the "parent" and transitive dependencies. I don't think we should fetch a large part of Parquet in Arrow.

I have a preference to create arrow-variant module with clearly defined dependencies.

Thoughts ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. @tmater, could you update your PR accordingly?

Copy link
Contributor Author

@tmater tmater Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the discussion! I've updated the PR to introduce a separate arrow-variant module that depends on parquet-variant for the binary format handling. The Variant class wraps the Parquet implementation and is returned by VariantVector.getObject, keeping Parquet classes out of the public Arrow API.
Let me know what you think!

@lidavidm lidavidm added the enhancement PRs that add or improve features. label Jan 22, 2026
Implements VariantType extension type with VariantVector for storing
variant data with metadata and value buffers. Includes reader/writer
implementations and comprehensive test coverage.
Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All variant related classes and code changes shall be placed in the new arrow-variant module so the user can decide if they need variant and accept the related transitive dependencies. No other arrow-java module shall depend on arrow-variant.

Comment on lines +61 to +68
public ByteBuffer getValueBuffer() {
return delegate.getValueBuffer();
}

public ByteBuffer getMetadataBuffer() {
return delegate.getMetadataBuffer();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should expose these to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This follows how parquet-variant defines it, the ByteBuffer is read-only.
I’m happy to change it; I just wanted to call it out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is useful for the users, but do not have a strong opinion to remove it.

@tmater tmater requested a review from gszadovszky January 28, 2026 13:26
Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes, @tmater. It looks good to me.
@lidavidm, @jbonofre, could you check it out and also approve the awaiting workflows to validate the change?

@jbonofre jbonofre added this to the 19.0.0 milestone Jan 28, 2026
@jbonofre
Copy link
Member

@gszadovszky thanks ! I just ran the CI and fix the PR meta (milestone). I will do a new pass asap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement PRs that add or improve features.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logical type definition for parquet.variant in arrow-java

4 participants