Skip to content

GH-3414: Add parseJson to VariantBuilder for JSON-to-Variant conversion#3415

Open
gaurav7261 wants to merge 1 commit intoapache:masterfrom
gaurav7261:gh-3414-variant-parse-json
Open

GH-3414: Add parseJson to VariantBuilder for JSON-to-Variant conversion#3415
gaurav7261 wants to merge 1 commit intoapache:masterfrom
gaurav7261:gh-3414-variant-parse-json

Conversation

@gaurav7261
Copy link

Rationale for this change

Every consumer of parquet-variant currently has to independently implement JSON-to-Variant parsing. Apache Spark has one in its common/variant module
(source), our Kafka Connect S3 sink connector had to write one, and any other project (Flink, Trino, DuckDB-Java, etc.) would need to do the
same. Since VariantBuilder already provides all the low-level append*() primitives, parseJson() is the natural completion of that API — a canonical, reusable entry point for the most common use
case: converting a JSON string into a Variant.

What changes are included in this PR?

  • parquet-variant/pom.xml: Added jackson-core (compile) and parquet-jackson (runtime) dependencies, following the same pattern as parquet-hadoop.
  • VariantBuilder.java: Added two public static methods:
    • parseJson(String json) — convenience method that creates a Jackson streaming parser internally.
    • parseJson(JsonParser parser) — for callers who already have a positioned parser (e.g., reading from a stream).
    • Internal helpers ported from Apache Spark's production VariantBuilder.buildJson:
      • buildJson() — recursive single-pass streaming parser handling OBJECT, ARRAY, STRING, NUMBER_INT, NUMBER_FLOAT, TRUE, FALSE, NULL.
      • appendSmallestLong() — selects the smallest integer type (BYTE/SHORT/INT/LONG) based on value range.
      • tryAppendDecimal() — decimal-first encoding for floating-point numbers; falls back to double only for scientific notation or values exceeding DECIMAL16 precision (38 digits).
  • TestVariantParseJson.java: 32 new tests covering all primitive types, objects (empty, simple, nested, null values, sorted keys, duplicate keys), arrays (empty, simple, nested, mixed types), and
    edge cases (unicode, escaped strings, deeply nested documents, scientific notation, integer overflow to decimal, malformed JSON).

Are these changes tested?

Yes. 32 new tests in TestVariantParseJson.

Are there any user-facing changes?

Yes. Two new public static methods on VariantBuilder:

  • VariantBuilder.parseJson(String json) — returns a Variant
  • VariantBuilder.parseJson(JsonParser parser) — returns a Variant
    These are additive API additions with no breaking changes to existing APIs.
    Closes parseJson implementation in VariantBuilder #3414

@gaurav7261
Copy link
Author

@alamb @aihuaxu read https://parquet.apache.org/blog/2026/02/27/variant-type-in-apache-parquet-for-semi-structured-data/ and check the feasibility of having our S3 Sink connector write variant, found out that parseJson can be a better fit here, wdyt? is it making sense

gaurav7261 added a commit to gaurav7261/kafka-connect-storage-cloud that referenced this pull request Mar 8, 2026
…uctured fields

Write Kafka Connect JSON fields (Debezium CDC, Confluent Protobuf Struct,
custom messages, maps, arrays) as native Parquet VARIANT columns instead of
plain STRING, improving storage efficiency and query performance.

- Upgrade parquet-java to 1.17.0 for VARIANT logical type support - https://parquet.apache.org/blog/2026/02/27/variant-type-in-apache-parquet-for-semi-structured-data/
- Add config: parquet.variant.enabled, parquet.variant.connect.names,
  parquet.variant.field.names
- Auto-detect recursive schemas (google.protobuf.Struct) as VARIANT
- Stream JSON-to-Variant conversion ported from Apache Spark (PR also raised in parquet-java, once approved, code will be neat here in this repo: apache/parquet-java#3415)
- Unwrap Protobuf Struct/Value/ListValue/map-as-array to clean JSON
- Graceful fallback for non-JSON values (e.g. __debezium_unavailable_value)
- Feature is fully opt-in (disabled by default), zero impact on existing connectors
@gaurav7261
Copy link
Author

@Fokko can you please review, is it looking good to you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

parseJson implementation in VariantBuilder

1 participant