Flink: Support writing shredded variant in Flink by Guosmilesmile · Pull Request #15596 · apache/iceberg

Guosmilesmile · 2026-03-12T08:09:38Z

This PR adds Flink support for writing shredded variant data to Iceberg tables.

It is based on #14297 and will be adjusted in sync with it.

Guosmilesmile · 2026-05-07T08:36:30Z

Hi @aihuaxu @nssalian @pvary @mxm. Since the Spark part has been merged, I have adjusted the Flink part accordingly. If you have time, please help review it.

Thanks!
GuoYu.

pvary · 2026-05-08T13:49:46Z

+        .tableProperty(TableProperties.PARQUET_SHRED_VARIANTS)
+        .defaultValue(TableProperties.PARQUET_SHRED_VARIANTS_DEFAULT)


How will we handle when ORC supports shredding variants?

Good catch. I renamed shred-variants to parquet-shred-variants to make clear that this feature only supports Parquet. If ORC adds support later, we can add another config.

Let's do parquet for now since we followed that pattern for the Spark implementation.

pvary · 2026-05-08T14:07:37Z

-                FlinkParquetReaders.buildReader(icebergSchema, fileSchema, idToConstant)));
+                FlinkParquetReaders.buildReader(icebergSchema, fileSchema, idToConstant),
+            new FlinkVariantShreddingAnalyzer(),
+            (row, rowType) -> new RowDataSerializer(rowType).copy(row)));


Isn't this costly to recreate every time when we copy a row?

It does add cost, but without the copy, data gets corrupted when rows are buffered. We ran into this during early development, and the unit tests can reproduce it.

Can we reuse the RowDataSerializer?

With the current BiFunction, (row, rowType) -> new RowDataSerializer(rowType).copy(row) creates a new RowDataSerializer for every buffered row (default buffer = 100). This construction is not free: it walks rowType.getChildren() and builds a TypeSerializer[] via InternalSerializers.create, a BinaryRowDataSerializer, and a RowData.FieldGetter[]. Since the engine schema is fixed for the entire file, a factory lets us build the serializer once per schema and reuse it for every incoming record.

Yes, we can use Function<S, UnaryOperator<D>> instead of BiFunction<D, S, D> to implement this.

+1. We should be able to reuse RowDataSerializer so we don't need to create a new instance for every row.
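
To make the difference between the two callback shapes concrete, here is a minimal sketch. It does not use Flink: CountingSerializer is a hypothetical stand-in for a schema-bound serializer such as RowDataSerializer, and the class and method names are invented for illustration. It only counts how often the serializer is constructed under BiFunction<D, S, D> versus Function<S, UnaryOperator<D>>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.UnaryOperator;

final class CopyFactorySketch {
  /** Hypothetical stand-in for a schema-bound serializer that is costly to construct. */
  static final class CountingSerializer {
    static int constructed = 0;
    private final List<String> schema;

    CountingSerializer(List<String> schema) {
      this.schema = List.copyOf(schema); // pretend this is the expensive per-schema setup
      constructed++;
    }

    Object[] copy(Object[] row) {
      return row.clone(); // defensive copy so a buffered row is not overwritten later
    }
  }

  /** BiFunction shape: the schema arrives with every row, so a serializer is built per row. */
  static int constructionsPerRow(int rows) {
    CountingSerializer.constructed = 0;
    List<String> schema = List.of("id", "address");
    BiFunction<Object[], List<String>, Object[]> copy =
        (row, s) -> new CountingSerializer(s).copy(row);
    List<Object[]> buffer = new ArrayList<>();
    for (int i = 0; i < rows; i++) {
      buffer.add(copy.apply(new Object[] {i, "addr-" + i}, schema));
    }
    return CountingSerializer.constructed;
  }

  /** Factory shape: the schema is applied once, and the returned operator is reused. */
  static int constructionsWithFactory(int rows) {
    CountingSerializer.constructed = 0;
    List<String> schema = List.of("id", "address");
    Function<List<String>, UnaryOperator<Object[]>> factory =
        s -> {
          CountingSerializer serializer = new CountingSerializer(s);
          return serializer::copy;
        };
    UnaryOperator<Object[]> copy = factory.apply(schema); // built once per schema
    List<Object[]> buffer = new ArrayList<>();
    for (int i = 0; i < rows; i++) {
      buffer.add(copy.apply(new Object[] {i, "addr-" + i}));
    }
    return CountingSerializer.constructed;
  }
}
```

Calling constructionsPerRow(100) and constructionsWithFactory(100) contrasts the two shapes for a buffer of 100 rows: the first builds one serializer per row, the second builds one in total.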

talatuyarer · 2026-05-15T01:08:12Z

+
+  @Override
+  protected int resolveColumnIndex(RowType flinkSchema, String columnName) {
+    try {


This try-catch is redundant. The catch is unreachable; the whole method should just be return flinkSchema.getFieldIndex(columnName);. I checked the RowType code: it does not throw IllegalArgumentException, it returns -1. https://github.com/apache/flink/blob/release-2.1/flink-table/flink-table-common/src/main/java/org/apache/flink/table/types/logical/RowType.java#L187

Good catch, I've removed the try-catch.
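
For illustration, the simplified method has roughly the following shape. This sketch does not depend on Flink: it uses a plain list of field names and List.indexOf as a stand-in for RowType.getFieldIndex, which the review above says returns -1 for an unknown name. The class name is hypothetical.

```java
import java.util.List;

final class ResolveColumnIndexSketch {
  /**
   * Stand-in for the reviewed method: a plain name lookup that reports a miss
   * as -1 instead of throwing, so no try-catch is needed around it.
   */
  static int resolveColumnIndex(List<String> fieldNames, String columnName) {
    return fieldNames.indexOf(columnName);
  }
}
```

Callers then branch on a negative index to handle a missing column.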

talatuyarer · 2026-05-15T01:11:38Z

+
+    String variantNullAbleTableName = "test_all_null_variant_column";
+    sql(
+        "CREATE TABLE %s (id int NOT NULL, address variant) with ('write.format.default'='%s','format-version'='3','shred-variants'='true','variant-inference-buffer-size'='10')",


This test uses the wrong setting, shred-variants, so shredding is never enabled. The test then asserts an unshredded schema and passes for the wrong reason.

Yeah, you're right. The config was wrong, I've already updated it. The tests passed because we were inserting all nulls, so the result should be identical with or without shredding enabled. The whole point here was just to verify that shredding handles nulls correctly. Good catch, thanks!

talatuyarer · 2026-05-15T01:16:07Z

 | compression-strategy                    | Table write.orc.compression-strategy       | Overrides this table's compression strategy for ORC tables for this write                                                                       |
 | write-parallelism                       | Upstream operator parallelism              | Overrides the writer parallelism                                                                                                                |
 | uid-suffix                              | As per table property                      | Overrides the uid suffix used in the underlying IcebergSink for this table                                                                      |
+| parquet-shred-variants                  | Table write.parquet.shred-variants | Overrides this table's shred variants for this write


Both new rows are missing the trailing |

Good catch! Added.

Guosmilesmile · 2026-05-15T14:51:29Z

I found these configs in Spark:

spark.sql.iceberg.shred-variants
spark.sql.iceberg.variant-inference-buffer-size
shred-variants
variant-inference-buffer-size

All of them hide the format, while TableProperties keeps the parquet prefix. So I've aligned our configs with Spark's approach: Flink now uses shred-variants, which overrides the table-level properties write.parquet.shred-variants and write.orc.shred-variants (if/when ORC supports it).

On the code side, we can implement it like this:

public boolean parquetShredVariants() {
  return confParser
      .booleanConf()
      .option(FlinkWriteOptions.SHRED_VARIANTS.key())
      .tableProperty(TableProperties.PARQUET_SHRED_VARIANTS)
      .defaultValue(TableProperties.PARQUET_SHRED_VARIANTS_DEFAULT)
      .parse();
}

public boolean orcShredVariants() {
  return confParser
      .booleanConf()
      .option(FlinkWriteOptions.SHRED_VARIANTS.key())
      .tableProperty(TableProperties.ORC_SHRED_VARIANTS)
      .defaultValue(TableProperties.ORC_SHRED_VARIANTS_DEFAULT)
      .parse();
}
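
The precedence described in this comment (write option first, then table property, then default) can be sketched without the real conf parser. The two key strings come from this thread; the class name, the resolve helper, and the default value of false are assumptions made only for illustration.

```java
import java.util.Map;

final class ShredVariantsConfSketch {
  // Key names as quoted in the thread.
  static final String WRITE_OPTION = "shred-variants";
  static final String PARQUET_TABLE_PROPERTY = "write.parquet.shred-variants";
  // Assumed default for this sketch; the real default lives in TableProperties.
  static final boolean ASSUMED_DEFAULT = false;

  /** Simplified stand-in for the conf parser chain: option, then table property, then default. */
  static boolean resolve(Map<String, String> writeOptions, Map<String, String> tableProperties) {
    String fromOption = writeOptions.get(WRITE_OPTION);
    if (fromOption != null) {
      return Boolean.parseBoolean(fromOption); // per-write option wins
    }
    String fromTable = tableProperties.get(PARQUET_TABLE_PROPERTY);
    if (fromTable != null) {
      return Boolean.parseBoolean(fromTable); // otherwise fall back to the table property
    }
    return ASSUMED_DEFAULT;
  }
}
```

With this shape, a format-neutral write option can sit in front of a format-specific table property, which is the alignment with Spark that the comment proposes.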

Guosmilesmile · 2026-05-19T03:11:02Z

Hi @pvary @talatuyarer @nssalian , I'd be grateful if you could take a look when you have time.

pvary · 2026-05-19T12:42:49Z

I don't have more comments.
Anybody else?

nssalian · 2026-05-19T16:19:16Z

                SparkParquetReaders.buildReader(icebergSchema, fileSchema, idToConstant),
            new SparkVariantShreddingAnalyzer(),
-            InternalRow::copy));
+            structType -> InternalRow::copy));


@Guosmilesmile could you update Spark 4.0 with the above suggestion as well?

talatuyarer's Comments

Guosmilesmile · 2026-05-20T02:49:05Z

@pvary @talatuyarer @nssalian @aihuaxu Hey all, I rebased main but ran into some CI failures. Looks like a new check was added recently that doesn't allow modifying the ParquetFormatModel parameter types directly.

As a workaround, I added a new method createWithCopyFuncFactory in ParquetFormatModel. The original create method now delegates to it, so the Spark code stays untouched, while FlinkFormatModels calls createWithCopyFuncFactory explicitly.

Would really appreciate it if you could help take another look at these changes. Thanks a lot!

java.method.parameterTypeChanged: The type of the parameter changed from 'java.util.function.UnaryOperator<D extends java.lang.Object>' to 'java.util.function.Function<S extends java.lang.Object, java.util.function.UnaryOperator<D extends java.lang.Object>>'.

old: parameter <D, S> org.apache.iceberg.parquet.ParquetFormatModel<D, S, org.apache.iceberg.parquet.ParquetValueReader<?>> org.apache.iceberg.parquet.ParquetFormatModel<D, S, R>::create(java.lang.Class<D>, java.lang.Class<S>, org.apache.iceberg.formats.BaseFormatModel.WriterFunction<org.apache.iceberg.parquet.ParquetValueWriter<?>, S, org.apache.parquet.schema.MessageType>, org.apache.iceberg.formats.BaseFormatModel.ReaderFunction<org.apache.iceberg.parquet.ParquetValueReader<?>, S, org.apache.parquet.schema.MessageType>, org.apache.iceberg.parquet.VariantShreddingAnalyzer<D, S>, ===java.util.function.UnaryOperator<D>===)
new: parameter <D, S> org.apache.iceberg.parquet.ParquetFormatModel<D, S, org.apache.iceberg.parquet.ParquetValueReader<?>> org.apache.iceberg.parquet.ParquetFormatModel<D, S, R>::create(java.lang.Class<D>, java.lang.Class<S>, org.apache.iceberg.formats.BaseFormatModel.WriterFunction<org.apache.iceberg.parquet.ParquetValueWriter<?>, S, org.apache.parquet.schema.MessageType>, org.apache.iceberg.formats.BaseFormatModel.ReaderFunction<org.apache.iceberg.parquet.ParquetValueReader<?>, S, org.apache.parquet.schema.MessageType>, org.apache.iceberg.parquet.VariantShreddingAnalyzer<D, S>, ===java.util.function.Function<S, java.util.function.UnaryOperator<D>>===)

https://github.com/apache/iceberg/actions/runs/26136650474/job/76873207059?pr=15596

Guosmilesmile · 2026-05-20T08:20:29Z

After discussing with @pvary , we decided to keep the same create name and go with deprecating the old method instead.
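
A reduced sketch of that approach follows: the old UnaryOperator overload is kept and deprecated, and it forwards to the new Function overload by ignoring the schema. Model is a hypothetical stand-in that keeps only the copy factory; the real ParquetFormatModel.create takes more parameters, and whether the deprecated overload delegates exactly like this is an assumption.

```java
import java.util.function.Function;
import java.util.function.UnaryOperator;

final class CreateOverloadSketch {
  /** Hypothetical, heavily reduced stand-in for a format model that only keeps the copy factory. */
  static final class Model<D, S> {
    private final Function<S, UnaryOperator<D>> copyFuncFactory;

    private Model(Function<S, UnaryOperator<D>> copyFuncFactory) {
      this.copyFuncFactory = copyFuncFactory;
    }

    D copy(S schema, D row) {
      return copyFuncFactory.apply(schema).apply(row);
    }
  }

  /**
   * Old signature, kept so existing callers still compile and link.
   *
   * @deprecated use {@link #create(Function)} instead
   */
  @Deprecated
  static <D, S> Model<D, S> create(UnaryOperator<D> copyFunc) {
    // Adapt the schema-independent copy function to the new factory shape.
    Function<S, UnaryOperator<D>> factory = ignoredSchema -> copyFunc;
    return create(factory);
  }

  /** New signature: the factory receives the engine schema and returns the copy function. */
  static <D, S> Model<D, S> create(Function<S, UnaryOperator<D>> copyFuncFactory) {
    return new Model<>(copyFuncFactory);
  }
}
```

One caveat with overloading on two functional interfaces is that an untyped lambda argument can be ambiguous between them, so callers may need a typed variable, a cast, or a method reference to pick the intended overload.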

Guosmilesmile marked this pull request as draft March 12, 2026 08:09

github-actions Bot added spark parquet flink ORC labels Mar 12, 2026

Guosmilesmile mentioned this pull request Mar 12, 2026

Spark: Support writing shredded variant in Iceberg-Spark #14297

Merged

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from 15ff223 to 5b448b9 Compare March 12, 2026 08:22

github-actions Bot removed the ORC label Mar 12, 2026

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch 2 times, most recently from 8f6198a to b03caf6 Compare March 12, 2026 08:59

Guosmilesmile changed the title ~~Core,Flink: Support writing shredded variant in Flink~~ Flink: Support writing shredded variant in Flink Mar 12, 2026

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch 3 times, most recently from 88045e1 to cbfa8c2 Compare March 13, 2026 07:17

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from fae2814 to f3a2fba Compare March 24, 2026 05:50

github-actions Bot added the core label Mar 24, 2026

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch 4 times, most recently from b07b00b to c95d78f Compare March 24, 2026 08:36

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch 3 times, most recently from fc8c45a to b116f25 Compare April 1, 2026 01:50

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from b116f25 to 650cb7a Compare April 10, 2026 09:42

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from 770d9c4 to 7d48389 Compare May 7, 2026 06:55

Guosmilesmile marked this pull request as ready for review May 7, 2026 08:34

pvary reviewed May 8, 2026

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from 63ae5ae to 0f2ae10 Compare May 9, 2026 05:33

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch 2 times, most recently from 8bf41e8 to b373680 Compare May 11, 2026 02:55

talatuyarer reviewed May 15, 2026

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from 0e376f4 to 4484b2b Compare May 15, 2026 08:19

pvary reviewed May 15, 2026

Comment thread flink/v2.1/flink/src/main/java/org/apache/iceberg/flink/data/FlinkVariantShreddingAnalyzer.java

pvary reviewed May 15, 2026

Comment thread docs/docs/flink-configuration.md

pvary reviewed May 15, 2026

Comment thread flink/v2.1/flink/src/main/java/org/apache/iceberg/flink/FlinkWriteOptions.java

pvary reviewed May 15, 2026

Comment thread flink/v2.1/flink/src/test/java/org/apache/iceberg/flink/TestFlinkVariantShreddingType.java Outdated

nssalian suggested changes May 19, 2026

Guosmilesmile added 12 commits May 20, 2026 09:59

Flink:Support writing shredded variant

fc48872

rename SHRED_VARIANTS to PARQUET_SHRED_VARIANTS

3e17878

fix spark 4.0

d8463a6

Fix RowDataSerializer create every row

fc8a53e

move set param to after

3c1ff52

Address aihua's Comment

41d832c

Address Comments

41eeeeb

Address talatuyarer's Comments

31a8bc0

Address Peter's Comments

8a5c995

Update the Flink config to prioritize the table-level setting.

cee4e35

rename to unused

d87d3ca

rename to unused for spark4.0

03fb902

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from d0fe6b7 to 03fb902 Compare May 20, 2026 02:00

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from 0cdb80e to e8ad73e Compare May 20, 2026 07:45

Add new create method in ParquetFormatModel

9480782

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from e8ad73e to 9480782 Compare May 20, 2026 07:50


5 participants
