GH-3414: Add parseJson to VariantBuilder for JSON-to-Variant conversion#3415
Open
gaurav7261 wants to merge 1 commit into apache:master from
Conversation
Author
@alamb @aihuaxu I read https://parquet.apache.org/blog/2026/02/27/variant-type-in-apache-parquet-for-semi-structured-data/ and checked the feasibility of having our S3 Sink connector write Variant. It turns out that parseJson could be a better fit here. What do you think? Does this make sense?
gaurav7261 added a commit to gaurav7261/kafka-connect-storage-cloud that referenced this pull request on Mar 8, 2026:
…uctured fields

Write Kafka Connect JSON fields (Debezium CDC, Confluent Protobuf Struct, custom messages, maps, arrays) as native Parquet VARIANT columns instead of plain STRING, improving storage efficiency and query performance.

- Upgrade parquet-java to 1.17.0 for VARIANT logical type support (https://parquet.apache.org/blog/2026/02/27/variant-type-in-apache-parquet-for-semi-structured-data/)
- Add configs: parquet.variant.enabled, parquet.variant.connect.names, parquet.variant.field.names
- Auto-detect recursive schemas (google.protobuf.Struct) as VARIANT
- Streaming JSON-to-Variant conversion ported from Apache Spark (a PR has also been raised in parquet-java; once approved, the code in this repo can be simplified: apache/parquet-java#3415)
- Unwrap Protobuf Struct/Value/ListValue/map-as-array to clean JSON
- Graceful fallback for non-JSON values (e.g. __debezium_unavailable_value)
- Fully opt-in (disabled by default), with zero impact on existing connectors
Author
@Fokko could you please review? Does this look good to you?
Rationale for this change
Every consumer of parquet-variant currently has to independently implement JSON-to-Variant parsing. Apache Spark has one in its common/variant module (source), our Kafka Connect S3 sink connector had to write one, and any other project (Flink, Trino, DuckDB-Java, etc.) would need to do the same. Since VariantBuilder already provides all the low-level append*() primitives, parseJson() is the natural completion of that API: a canonical, reusable entry point for the most common use case, converting a JSON string into a Variant.
What changes are included in this PR?
- parquet-variant/pom.xml: Added jackson-core (compile) and parquet-jackson (runtime) dependencies, following the same pattern as parquet-hadoop.
- VariantBuilder.java: Added two public static methods:
  - parseJson(String json) — convenience method that creates a Jackson streaming parser internally.
  - parseJson(JsonParser parser) — for callers who already have a positioned parser (e.g., reading from a stream).
- VariantBuilder.java internal helpers:
  - buildJson() — recursive single-pass streaming parser handling OBJECT, ARRAY, STRING, NUMBER_INT, NUMBER_FLOAT, TRUE, FALSE, NULL.
  - appendSmallestLong() — selects the smallest integer type (BYTE/SHORT/INT/LONG) based on value range.
  - tryAppendDecimal() — decimal-first encoding for floating-point numbers; falls back to double only for scientific notation or values exceeding DECIMAL16 precision (38 digits).
- TestVariantParseJson.java: 32 new tests covering all primitive types, objects (empty, simple, nested, null values, sorted keys, duplicate keys), arrays (empty, simple, nested, mixed types), and edge cases (unicode, escaped strings, deeply nested documents, scientific notation, integer overflow to decimal, malformed JSON).
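To make the numeric-encoding rules above concrete, here is a minimal standalone sketch of the selection logic that appendSmallestLong() and tryAppendDecimal() are described as implementing. The functions and class below are hypothetical illustrations written for this description, not the PR's actual code:

```java
import java.math.BigDecimal;

public class VariantNumericSketch {
    // Hypothetical sketch: pick the narrowest integer type that can hold the
    // value, mirroring the range checks described for appendSmallestLong().
    static String smallestIntType(long v) {
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) return "BYTE";
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) return "SHORT";
        if (v >= Integer.MIN_VALUE && v <= Integer.MAX_VALUE) return "INT";
        return "LONG";
    }

    // Hypothetical sketch of the decimal-first policy described for
    // tryAppendDecimal(): encode as decimal unless the JSON number token uses
    // scientific notation or needs more than DECIMAL16's 38 digits of precision.
    static String floatEncoding(String jsonNumberToken) {
        if (jsonNumberToken.indexOf('e') >= 0 || jsonNumberToken.indexOf('E') >= 0) {
            return "DOUBLE"; // scientific notation falls back to double
        }
        BigDecimal d = new BigDecimal(jsonNumberToken);
        if (d.precision() > 38) {
            return "DOUBLE"; // exceeds DECIMAL16 precision
        }
        return "DECIMAL";
    }

    public static void main(String[] args) {
        System.out.println(smallestIntType(42));      // fits in a byte
        System.out.println(smallestIntType(70000));   // needs an int
        System.out.println(floatEncoding("12.34"));   // decimal-first
        System.out.println(floatEncoding("1.5e300")); // double fallback
    }
}
```

The decimal-first choice matters because values like 12.34 round-trip exactly as decimals but not as IEEE 754 doubles.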
Are these changes tested?
Yes. 32 new tests in TestVariantParseJson.

Are there any user-facing changes?
Yes. Two new public static methods on VariantBuilder:

- VariantBuilder.parseJson(String json) — returns a Variant
- VariantBuilder.parseJson(JsonParser parser) — returns a Variant

These are purely additive, with no breaking changes to existing APIs.
Closes #3414 (parseJson implementation in VariantBuilder)