Core, Spark 4.1: Fix querying equality deletes with schema evolution #15268

Merged
aokolnychyi merged 3 commits into apache:main from aokolnychyi:fix-equality-delete-schema-evolution
Feb 10, 2026
Conversation

Contributor

@aokolnychyi aokolnychyi commented Feb 8, 2026

This PR fixes querying tables with equality deletes when the columns used to create the equality delete file are no longer part of the current or time-travel schema.

The following scenario fails without the changes in this PR:

  • Add equality deletes on column status
  • Drop status from the table
  • Query the table expecting that all equality deletes would have been applied correctly
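The failure mode in this scenario can be simulated with a small, self-contained sketch (all names and the map-based "schemas" below are invented for illustration; the real code resolves Types.NestedField objects by field ID): resolving a dropped column's field ID against the current schema alone returns null, while falling back to every schema the table has had still finds it.

```java
import java.util.Map;

// Toy model of the bug (all names hypothetical): an equality delete file
// references the field IDs it was written with. After `status` (field ID 2)
// is dropped, the current schema alone cannot resolve ID 2.
class DroppedColumnLookup {
  // current schema after `status` was dropped
  static final Map<Integer, String> CURRENT_SCHEMA = Map.of(1, "id");

  // every schema the table has ever had, keyed by schema ID
  static final Map<Integer, Map<Integer, String>> ALL_SCHEMAS =
      Map.of(0, Map.of(1, "id", 2, "status"), 1, Map.of(1, "id"));

  // old behaviour: resolve against the current schema only
  static String currentOnly(int fieldId) {
    return CURRENT_SCHEMA.get(fieldId); // null for dropped fields
  }

  // fixed behaviour: fall back to all schemas when the current one misses
  static String acrossAllSchemas(int fieldId) {
    String field = CURRENT_SCHEMA.get(fieldId);
    if (field != null) {
      return field;
    }
    for (Map<Integer, String> schema : ALL_SCHEMAS.values()) {
      String candidate = schema.get(fieldId);
      if (candidate != null) {
        return candidate;
      }
    }
    return null;
  }
}
```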

@aokolnychyi aokolnychyi force-pushed the fix-equality-delete-schema-evolution branch from 5d78e6b to 8528e5f Compare February 8, 2026 23:11
@singhpk234
Contributor

do we want this in 1.11?

@aokolnychyi
Contributor Author

@singhpk234, it would be nice to include since the use case is pretty basic. Can you help review?

@amogh-jahagirdar amogh-jahagirdar self-requested a review February 9, 2026 18:32
private static final Logger LOG = LoggerFactory.getLogger(BaseReader.class);

private final Table table;
private final Schema tableSchema;
Contributor Author


We only used tableSchema to resolve equality delete fields. We have to switch to all schemas in the table.

Contributor


Hm, have we considered just building a union schema?

Contributor


Ah nvm, we probably want to lazily do that, since on average we wouldn't expect to have to reference the historical fields.

Contributor Author


It is tricky as we don't know which IDs are used in equality delete files.
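The lazy strategy discussed here can be sketched as follows (LazyFieldLookup and the string-valued fields are hypothetical; the real lookup returns Types.NestedField): consult the current schema first, and build the historic field map at most once, only on the first miss, since fetching historic schemas is assumed expensive.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a lazy field lookup: the historic map is built
// only when a field ID misses the current schema, and then cached.
class LazyFieldLookup implements Function<Integer, String> {
  private final Map<Integer, String> currentSchema;
  private final List<Map<Integer, String>> historicSchemas;
  private Map<Integer, String> historicFields; // built lazily, at most once
  int historicLoads = 0; // exposed so the laziness is observable

  LazyFieldLookup(Map<Integer, String> current, List<Map<Integer, String>> historic) {
    this.currentSchema = current;
    this.historicSchemas = historic;
  }

  @Override
  public String apply(Integer fieldId) {
    String field = currentSchema.get(fieldId);
    return field != null ? field : historicSchemaFields().get(fieldId);
  }

  private Map<Integer, String> historicSchemaFields() {
    if (historicFields == null) {
      historicLoads++; // stands in for the expensive metadata fetch
      historicFields = new HashMap<>();
      for (Map<Integer, String> schema : historicSchemas) {
        historicFields.putAll(schema);
      }
    }
    return historicFields;
  }
}
```

On the happy path (all equality delete fields still in the current schema) the historic schemas are never touched.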

// field lookup for serializable tables that assumes fetching historic schemas is expensive
private static class FieldLookup implements Function<Integer, Types.NestedField> {
private final Table table;

Contributor Author


Maybe drop the empty line.

// this class is not meant to be exposed beyond the delete file index
private static class EqualityDeleteFile {
private final PartitionSpec spec;
private final Function<Integer, Types.NestedField> fieldLookup;
Contributor Author


We previously used spec.schema() to resolve equality fields. That's not OK.

return field != null ? field : historicSchemaFields().get(id);
}

private Map<Integer, Types.NestedField> historicSchemaFields() {
Contributor Author


Not sure about historic. Maybe previousSchemaFields() or oldSchemaFields()?

Contributor


I actually like historic, but I'm also good with previous or prior; not very opinionated here.

Contributor Author


Then historic it is.

Contributor

@singhpk234 singhpk234 left a comment


Thanks @aokolnychyi, this LGTM overall. I made some minor suggestions and linked a discussion you might be interested in.

Comment on lines +283 to +288
private static Collection<Schema> historicSchemas(Table table) {
return table.schemas().values().stream()
.filter(schema -> schema.schemaId() != table.schema().schemaId())
.collect(Collectors.toList());
}
}
Contributor

@singhpk234 singhpk234 Feb 9, 2026


This would now require loading the TableMetadata on the executor, which is absolutely fine.

Please check why this could be a potential issue: #14944

Contributor

@amogh-jahagirdar amogh-jahagirdar Feb 10, 2026


We've been loading the table metadata on the executor for a while, at least for other use cases like metadata tables, so we're not quite just doing it "now". I believe the specific concern is that, when loading table metadata from executors, executors read the table metadata JSON lazily when they need certain parts of table metadata that are not stored in SerializableTable. For use cases like server-side planning, there may not be any credentials to actually read the metadata JSON directly.

I think that's still ultimately a separate issue, and given this change, I believe we'd see it specifically when equality deletes are returned as part of scan planning.

I think we can address that as needed so that executors can resolve table metadata through different approaches (naively, broadcast the whole serializable table for those cases, or lazily resolve the metadata by issuing load-table requests), not just by reading the metadata file.

Since the immediate use cases for scan planning tend not to return equality deletes, and the worst case is a failure anyway, I don't think this is something to concern ourselves with for the 1.11 release, but it's something we should fix at some point.

Contributor


To add, I would just test our remote planning with this change to make sure things behave as expected. I was planning on doing that for the 1.11 release voting anyway.

Contributor


I don't think this is something to concern ourselves with for the 1.11 release

I am on the same page :), just posted it here as an FYI since there have been some discussions about it!

Contributor Author


Correct, that's why I implemented the happy path for when the field is in the current schema. So the only case where we will need to load the extra schemas and read the JSON file on executors is when we can't find the equality delete fields in the current schema.

What do you think about this as is, @amogh-jahagirdar @singhpk234?

Contributor


Sounds great to me! I think this is totally fine, since the case we have in mind (untrusted clients) will not return equality deletes from the server.

}
}

// indexes all fields from schemas, preferring field definitions from higher schema IDs
Contributor


Minor: maybe mention this is for cases like type promotion?
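As a sketch of what "prefer field definitions from higher schema IDs" buys for type promotion (all names are hypothetical and field types are modeled as strings; the real code indexes Types.NestedField): iterating schemas in ascending schema-ID order and letting later puts overwrite earlier ones keeps a promoted field at its newest type while dropped fields remain resolvable.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FieldIndexer {
  // hypothetical stand-in for a schema: ID plus field-ID -> type-name map
  record Schema(int schemaId, Map<Integer, String> fieldTypes) {}

  // index every field across schemas; later (higher-ID) definitions win,
  // so a type-promoted field keeps its newest type
  static Map<Integer, String> indexFields(List<Schema> schemas) {
    Map<Integer, String> index = new HashMap<>();
    schemas.stream()
        .sorted(Comparator.comparingInt(Schema::schemaId))
        .forEach(schema -> index.putAll(schema.fieldTypes()));
    return index;
  }
}
```

Here a field promoted from int to long in the newer schema resolves to long, while a field dropped by the newer schema is still found via the older one.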

Comment on lines -831 to +859
- Types.NestedField field = spec.schema().findField(id);
+ Types.NestedField field = fieldLookup.apply(id);
  Preconditions.checkArgument(field != null, "Cannot find field for ID %s", id);
Contributor


[doubt] What would the previous behaviour have been when the field was null in spec.schema()? Did we fail later?

Contributor

@amogh-jahagirdar amogh-jahagirdar Feb 9, 2026


I just copied the new test into a branch without the fix to check this: yes, it fails when we try to prune which equality deletes to apply based on stats, because we won't be able to resolve the field in the first place.

https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java#L233 fails with an NPE

Contributor Author


Correct, the test I added previously failed.

Set<Integer> seenSchemaIds = Sets.newHashSet();

for (Schema schema : sortByIdAsc(schemas)) {
if (!seenSchemaIds.contains(schema.schemaId())) {
Contributor


Can we make it a sorted set instead?

Contributor Author


That's a good idea.
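The sorted-set suggestion could look like this (names hypothetical; the real code deduplicates Schema objects by schema ID): a TreeSet ordered by schema ID replaces the separate sort plus seen-IDs HashSet, giving deduplication and ascending iteration in one structure.

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class SchemaDedup {
  // hypothetical stand-in for a schema carrying only its ID
  record Schema(int schemaId) {}

  // TreeSet keyed by schema ID deduplicates and sorts in one pass
  static List<Integer> distinctIdsAscending(List<Schema> schemas) {
    TreeSet<Schema> sorted = new TreeSet<>(Comparator.comparingInt(Schema::schemaId));
    sorted.addAll(schemas); // schemas with an already-seen ID are dropped
    return sorted.stream().map(Schema::schemaId).toList();
  }
}
```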

Comment thread: api/src/main/java/org/apache/iceberg/Schema.java
Contributor

@singhpk234 singhpk234 left a comment


LGTM, thanks @aokolnychyi!

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment


Thanks @aokolnychyi, this looks good to me!

@aokolnychyi aokolnychyi merged commit 00df493 into apache:main Feb 10, 2026
33 checks passed
@aokolnychyi
Contributor Author

Thank you, @singhpk234 @amogh-jahagirdar!
