Core, Spark: Fix invalid snapshot ID filtering in RewriteTablePath copy plan #16220

Open
krisnaru wants to merge 1 commit into apache:main from krisnaru:copy-table-snapshot-filter

@krisnaru krisnaru commented May 5, 2026

Fixes #14458

Summary

  • Removes the broken snapshotIds.contains(entry.snapshotId()) filter from RewriteTablePathUtil, which silently dropped live, unreplicated files when the snapshot that added them had expired
  • Replaces snapshot-ID-based entry filtering with entry.isLive() only — all live content files are included in the copy plan
  • Adds anti-join dedup for incremental copies: filters out previously-replicated files by comparing against the start version's content files via contentFileDS
  • Fixes both full copy (expired snapshot IDs missing from endMetadata.snapshots()) and incremental copy (expired snapshot IDs missing from deltaSnapshotIds after RewriteManifests + ExpireSnapshots)

Problem

RewriteTablePathUtil.writeDataFileEntry filtered copy plan entries using snapshotIds.contains(entry.snapshotId()). This check is unsound because entry.snapshotId() is the ID of the snapshot that originally added the file, and table maintenance can expire that snapshot at any time while the file itself stays live. When RewriteManifests moves entries from expired snapshots into new manifests, the manifest is correctly selected for processing, but the entry-level filter drops the files, causing silent data loss at the target.

Scenario:

  1. Append creates snapshot S3, which adds file C.
  2. RewriteManifests creates snapshot S4.
  3. Snapshot S3 is expired.
  4. Incremental replication builds deltaSnapshotIds = {S4}, but file C has entry.snapshotId() = S3. Since {S4}.contains(S3) == false, file C is excluded from the copy plan.
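
The scenario can be reproduced with a small stand-alone model. This is a minimal sketch in plain Python, not the Iceberg code: `Entry`, `old_filter`, and `new_filter` are hypothetical names, and the old filter is assumed to have combined the liveness check with the snapshot ID check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a manifest entry."""
    path: str
    snapshot_id: int  # snapshot that originally added the file
    live: bool = True


def old_filter(entries, snapshot_ids):
    # Assumed shape of the removed check: live AND adding snapshot is known.
    return [e.path for e in entries if e.live and e.snapshot_id in snapshot_ids]


def new_filter(entries):
    # Liveness only, as described in the summary.
    return [e.path for e in entries if e.live]


# File C was added by S3; its entry now sits in a manifest written by S4.
rewritten_manifest = [Entry("C", snapshot_id=3)]
delta_snapshot_ids = {4}  # S3 has been expired, so only S4 remains

dropped = old_filter(rewritten_manifest, delta_snapshot_ids)
kept = new_filter(rewritten_manifest)
print(dropped, kept)
```

Under these assumptions the old filter returns an empty plan for the rewritten manifest, while the liveness-only filter keeps file C.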

Approach

  1. Entry level: Include all live entries in copy plan (no snapshot ID check)
  2. Incremental dedup: Anti-join against start version's content files using contentFileDS from BaseSparkAction
  3. Manifest selection: Unchanged — manifestsToRewrite still uses deltaSnapshotIds from version history
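
The incremental dedup step amounts to an anti-join on file paths. The sketch below uses Python sets in place of the Spark datasets the PR works with; `incremental_copy_plan` and its arguments are hypothetical names chosen for illustration, and keying on the file path alone is an assumption.

```python
def incremental_copy_plan(end_live_paths, start_paths):
    """Return live files at the end version that the start version lacks.

    Plain-Python stand-in for a left anti-join: keep each row on the left
    (end version) that has no match on the right (start version).
    """
    already_replicated = set(start_paths)
    return [p for p in end_live_paths if p not in already_replicated]


# A and B were copied by an earlier run; C is new since the start version.
plan = incremental_copy_plan(["A", "B", "C"], ["A", "B"])
print(plan)
```

Because the entry-level filter now admits every live entry, this step is what keeps already-replicated files out of an incremental plan.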

Test plan

  • Verify existing TestRewriteTablePathsAction tests pass (full copy, incremental, expire scenarios)
  • Verify existing TestRewriteTablePathUtil tests pass
  • Add integration test: incremental replication after append + RewriteManifests + ExpireSnapshots
  • Verify incremental copy produces correct file counts (no duplicate copies of already-replicated files)
  • Verify full copy after many snapshots with expired snapshot IDs includes all live data

@krisnaru krisnaru force-pushed the copy-table-snapshot-filter branch 2 times, most recently from fc3db69 to 805b520 Compare May 5, 2026 22:07
@krisnaru krisnaru changed the title Fix invalid snapshot ID filtering in RewriteTablePath copy plan Core, Spark: Fix invalid snapshot ID filtering in RewriteTablePath copy plan May 5, 2026
@krisnaru krisnaru force-pushed the copy-table-snapshot-filter branch 2 times, most recently from 69b44c9 to f9503fd Compare May 6, 2026 01:14
@krisnaru krisnaru force-pushed the copy-table-snapshot-filter branch from f9503fd to ce254d6 Compare May 6, 2026 03:56

Development

Successfully merging this pull request may close these issues.

Remove incorrect snapshotId filtering in RewriteTablePathSparkAction