Core, Data, Spark: Moving Spark to use the new FormatModel API #15328

pvary merged 7 commits into apache:main
Conversation
Resolved review thread (outdated) on spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkFileWriterFactory.java
Can we run the Spark benchmarks to see how they turn out after these changes? https://github.com/apache/iceberg/tree/main/spark/v4.1/spark/src/jmh/java/org/apache/iceberg/spark
Added some new tests for Parquet. The differences are barely noticeable in either direction; there should not be any real difference, as the resulting readers and writers use the same code.
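For reference, benchmarks like these can also be launched programmatically through JMH's Runner API rather than the project's Gradle jmh task; a minimal sketch (the class-name regex and fork count are illustrative choices, not project conventions):

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunSparkParquetBenchmarks {
  public static void main(String[] args) throws RunnerException {
    // Select the writer benchmarks touched by this PR by class-name regex.
    Options opts =
        new OptionsBuilder()
            .include("SparkParquetWritersFlatDataBenchmark")
            .forks(1)
            .build();
    new Runner(opts).run();
  }
}
```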
| .set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") | ||
| .set("spark.sql.caseSensitive", "false") | ||
| .set("spark.sql.parquet.fieldId.write.enabled", "false") | ||
| .set("spark.sql.parquet.variant.annotateLogicalType.enabled", "false") |
These tests were failing with Spark 4.1, but it's probably not worth creating a new PR for this.
I'm okay with this since it isn't production code. It's unlikely to cause problems when cherry-picking.
| .set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") | ||
| .set("spark.sql.caseSensitive", "false") | ||
| .set("spark.sql.parquet.fieldId.write.enabled", "false") | ||
| .set("spark.sql.parquet.variant.annotateLogicalType.enabled", "false") |
There was a problem hiding this comment.
These tests were failing with Spark 4.1, but probably doesn't worth to create a new PR for this.
There was a problem hiding this comment.
Should we file an issue to track the underlying Spark 4.1 test failure, so we can fix the root cause later?
I think both "true" and "false" are fine; the issue was that the config was not set at all.
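For illustration, the shape of the fix is simply pinning the config explicitly in the test session setup, whichever value is chosen; a minimal sketch, assuming a local SparkSession is how the benchmark boots Spark:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

class BenchmarkSessionSketch {
  // Minimal sketch: the failures came from the config being left unset under
  // Spark 4.1, so the setup now pins it explicitly; either "true" or "false"
  // would do.
  static SparkSession create() {
    SparkConf conf =
        new SparkConf()
            .setMaster("local[2]")
            .set("spark.sql.parquet.variant.annotateLogicalType.enabled", "false");
    return SparkSession.builder().config(conf).getOrCreate();
  }
}
```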
Resolved review thread (outdated) on ...c/jmh/java/org/apache/iceberg/spark/data/parquet/SparkParquetWritersNestedDataBenchmark.java
We mentioned this PR in the Iceberg <> Spark community sync today. Would be great to get some more 👀 on it!
```java
@Benchmark
@Threads(1)
public void readUsingRegistryReader(Blackhole blackHole) throws IOException {
```
Do we need to test the direct method vs the registry method? I would expect this to replace the current readUsingIcebergReaderUnsafe implementation, since it is the same reader implementation. We should make sure there is no regression by running these benchmarks (for which it would be fine to leave this method here), but I don't want to accumulate essentially dead code testing the same thing.
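To make the comparison concrete, here is a minimal JMH sketch of the two benchmark methods under discussion. RowReader and its wiring are hypothetical stand-ins for the PR's actual reader types; the point is that both methods end up exercising the same reader implementation:

```java
import java.io.IOException;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class ReaderPathComparisonSketch {
  // Hypothetical stand-in for the PR's reader types; both paths are expected
  // to resolve to the same underlying implementation.
  interface RowReader {
    Object readAll() throws IOException;
  }

  private RowReader directReader;
  private RowReader registryReader;

  @Setup
  public void setup() {
    // In the real benchmark these would be a reader built directly and the
    // same reader resolved through the new FormatModel registry.
    directReader = () -> new Object();
    registryReader = directReader;
  }

  @Benchmark
  @Threads(1)
  public void readUsingIcebergReaderUnsafe(Blackhole blackhole) throws IOException {
    // Baseline: the existing direct reader path.
    blackhole.consume(directReader.readAll());
  }

  @Benchmark
  @Threads(1)
  public void readUsingRegistryReader(Blackhole blackhole) throws IOException {
    // New path: same implementation resolved via the registry, so the two
    // methods should report near-identical numbers.
    blackhole.consume(registryReader.readAll());
  }
}
```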
Resolved review thread (outdated) on ...src/jmh/java/org/apache/iceberg/spark/data/parquet/SparkParquetWritersFlatDataBenchmark.java

Resolved review thread (outdated) on ...k/v4.1/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java
rdblue left a comment
Looks good overall. I'd prefer to update the benchmark cases since they use the same readers/writers, but that's fairly minor.
```java
@SuppressWarnings("unchecked")
public static <T> ParquetValueWriter<T> buildWriter(
    Schema icebergSchema, MessageType type, StructType dfSchema) {
  return (ParquetValueWriter<T>)
      ParquetWithSparkSchemaVisitor.visit(
          dfSchema != null ? dfSchema : SparkSchemaUtil.convert(icebergSchema),
          type,
          new WriteBuilder(type));
}

public static <T> ParquetValueWriter<T> buildWriter(
    StructType dfSchema, MessageType type, Schema icebergSchema) {
  return (ParquetValueWriter<T>)
      ParquetWithSparkSchemaVisitor.visit(
          dfSchema != null ? dfSchema : SparkSchemaUtil.convert(icebergSchema),
          type,
          new WriteBuilder(type));
}
```
Is it intentional to have two functions with different signature ordering here? It might be confusing.
I've given this quite a bit of thought. On the caller side we use the order icebergSchema, fileSchema, engineSchema, and I believe this is the most logical ordering. If anyone feels strongly otherwise, I'm happy to adjust it.
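For illustration, a hypothetical call site following that ordering, assuming the surrounding class is SparkParquetWriters and using the first overload from the diff above (variable and method names here are illustrative only):

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.parquet.ParquetValueWriter;
import org.apache.iceberg.spark.data.SparkParquetWriters;
import org.apache.parquet.schema.MessageType;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.StructType;

class BuildWriterCallSite {
  // Illustrative call site: Iceberg schema first, then the file (Parquet)
  // schema, then the engine (Spark) schema.
  static ParquetValueWriter<InternalRow> writerFor(
      Schema icebergSchema, MessageType fileSchema, StructType sparkType) {
    return SparkParquetWriters.buildWriter(icebergSchema, fileSchema, sparkType);
  }
}
```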
```java
.metricsConfig(metricsConfig)
.withPartition(partition)
.overwrite()
.metricsConfig(metricsConfig)
```
Suggested change:

```diff
 .metricsConfig(metricsConfig)
 .withPartition(partition)
 .overwrite()
-.metricsConfig(metricsConfig)
```

`.metricsConfig(metricsConfig)` is called twice.
Sorry, I saw this too late. Created another PR: #15356
```java
this.equalityDeleteSparkType = equalityDeleteSparkType;
this.positionDeleteSparkType = null;
this.table = table;
this.format = dataFileFormat;
```
Suggested change:

```diff
-this.format = dataFileFormat;
+this.format = deleteFileFormat;
```

`this.format` is used in `newPositionDeleteWriter`, so I think it should be the delete file format.
```java
this.equalityDeleteSparkType = equalityDeleteSparkType;
this.positionDeleteSparkType = positionDeleteSparkType;
this.table = table;
this.format = dataFileFormat;
```
Suggested change:

```diff
-this.format = dataFileFormat;
+this.format = deleteFileFormat;
```

`this.format` is used in `newPositionDeleteWriter`, so I think it should be the delete file format.
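A minimal sketch of the point, with illustrative class and method names rather than the PR's exact code: the factory keeps a single format field that the position-delete writer path consults, so it has to be initialized from the delete file format:

```java
import org.apache.iceberg.FileFormat;

// Illustrative sketch, not the PR's actual class: storing the data file
// format here would make position deletes come out in the wrong format.
class DeleteWriterFactorySketch {
  private final FileFormat format;

  DeleteWriterFactorySketch(FileFormat dataFileFormat, FileFormat deleteFileFormat) {
    // The suggested fix: position deletes must be written in the delete
    // file format, so `format` is initialized from deleteFileFormat.
    this.format = deleteFileFormat;
  }

  FileFormat positionDeleteFormat() {
    return format; // what a newPositionDeleteWriter-style method would consult
  }
}
```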
Merged to main.
Note that #15356 contains a follow-up fix to address the comments above.
Part of: #12298
Implementation of the new API: #12774
SparkFormatModel and related changes