feat(table): materialize _row_id and _last_updated_sequence_number on read by abnobdoss · Pull Request #20 · abnobdoss/iceberg-python

abnobdoss · 2026-06-25T00:34:12Z

No description provided.

Enables writing v3 table metadata and implements spec-conformant row lineage (next-row-id / first-row-id / added-rows) end to end.

- metadata.py: bump SUPPORTED_TABLE_FORMAT_VERSION to 3; remove the TableMetadataV3.model_dump_json NotImplementedError so v3 serializes; initialize next_row_id=0 for new v3 tables.
- manifest.py: DEFAULT_READ_VERSION=3 so manifest-list first_row_id (field 520) round-trips; add ManifestWriterV3, ManifestListWriterV3 (assigns first_row_id only to DATA manifests lacking one, advancing by existing+added rows), and the ManifestFile.first_row_id accessor.
- update/__init__.py: upgrade-to-v3 seeds next_row_id=0; AddSnapshotUpdate computes next_row_id = next_row_id + added_rows with NO silent fallback to None (raises instead), fixing the bug where v3 commits collapsed next_row_id to None.
- update/snapshot.py: _commit derives added_rows from the manifest-list writer's row-id advance and sets first_row_id; merge manager inherits min(first_row_id) for merged data manifests and refuses to merge v3 manifests whose row-id ranges are non-contiguous/out-of-order (correctness over compaction).
- tests: acceptance suite (create v3, append twice, merge-append, delete+merge gap, json round-trip) plus negative guards.
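The row-id bookkeeping described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `ManifestSummary`, `assign_first_row_ids`, and `commit_next_row_id` are hypothetical names, and the plain dataclass only stands in for PyIceberg's manifest-list entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ManifestSummary:
    """Hypothetical stand-in for a manifest-list entry (not PyIceberg's ManifestFile)."""

    is_data: bool
    existing_rows: int
    added_rows: int
    first_row_id: Optional[int] = None


def assign_first_row_ids(manifests: List[ManifestSummary], next_row_id: int) -> int:
    """Give each DATA manifest that lacks a first_row_id the current next_row_id,
    then advance by existing + added rows. Returns the advanced next_row_id."""
    for manifest in manifests:
        if manifest.is_data and manifest.first_row_id is None:
            manifest.first_row_id = next_row_id
            next_row_id += manifest.existing_rows + manifest.added_rows
    return next_row_id


def commit_next_row_id(table_next_row_id: Optional[int], added_rows: Optional[int]) -> int:
    """Compute the table's new next_row_id, raising instead of falling back to None."""
    if table_next_row_id is None or added_rows is None:
        raise ValueError("v3 commit requires both next_row_id and added_rows")
    return table_next_row_id + added_rows


# One manifest already has a range, one is new, one is a delete manifest.
manifests = [
    ManifestSummary(is_data=True, existing_rows=5, added_rows=0, first_row_id=0),
    ManifestSummary(is_data=True, existing_rows=0, added_rows=3),
    ManifestSummary(is_data=False, existing_rows=0, added_rows=0),
]
advanced = assign_first_row_ids(manifests, next_row_id=5)
added_rows = advanced - 5  # the writer's row-id advance
new_next_row_id = commit_next_row_id(5, added_rows)
```

In this sketch the delete manifest is left untouched and the manifest that already carries a first_row_id keeps it, so only newly written data manifests consume row ids.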

…erent upgrade gate

Remediate four defects found by the t1-foundation skeptic audit:

1. [HIGH] Copy-on-write delete no longer re-numbers surviving rows.
   - Whole-file deletes (incl. shared-manifest case) materialize each survivor's absolute _row_id into DataFile field 142 and inherit the source manifest's first_row_id, so next_row_id and survivors' row ids are preserved.
   - Partial deletes that need a physical data-file rewrite now FAIL LOUDLY on v3 (NotImplementedError) instead of silently re-numbering survivors, because PyIceberg has no read-side _row_id materialization yet.
2. [HIGH] v3 manifest merge was silently disabled by a descending-vs-ascending ordering check in _v3_row_ids_are_contiguous. The check now sorts ranges and verifies gapless/non-overlapping regardless of input order; _create_manifest writes merged entries in ascending row-id order so min(first_row_id) inheritance is correct. The apache#3070 double-count fix is now actually exercised.
3. [MED] upgrade_table_version gate hardcoded {1,2}; now uses SUPPORTED_TABLE_FORMAT_VERSION so v2->v3 upgrade works end to end (seeds next_row_id=0), consistent with create-allows-v3.
4. [LOW] The two mislabeled "merge" tests now instrument _create_manifest and assert a merge actually runs; added whole-file delete-lineage tests asserting field-142 _row_id values, a loud-fail test, a v2 regression test, and an upgrade test.

DataFile gains a first_row_id (field 142) accessor and a __copy__ that isolates its _data list so lineage materialization does not leak into source DataFiles.
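An order-independent contiguity check of the kind item 2 describes might look like the sketch below. The function name and the `(first_row_id, row_count)` tuple shape are assumptions made for illustration; the actual `_v3_row_ids_are_contiguous` signature is not shown on this page.

```python
from typing import Iterable, Tuple


def row_id_ranges_are_contiguous(ranges: Iterable[Tuple[int, int]]) -> bool:
    """Return True when the (first_row_id, row_count) ranges, taken in ascending
    order, form one gapless, non-overlapping run. Input order does not matter."""
    ordered = sorted(ranges)
    if not ordered:
        return True
    expected = ordered[0][0]
    for first_row_id, row_count in ordered:
        # A gap leaves first_row_id above expected; an overlap leaves it below.
        if first_row_id != expected:
            return False
        expected = first_row_id + row_count
    return True


descending_but_contiguous = row_id_ranges_are_contiguous([(10, 5), (0, 10)])
with_gap = row_id_ranges_are_contiguous([(0, 10), (12, 3)])
with_overlap = row_id_ranges_are_contiguous([(0, 10), (5, 10)])
```

Sorting first is what lets manifests listed newest-first still qualify for merging, which is the failure mode the commit message calls out.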

Address skeptic re-audit LOW nit: from_args always allocates the full v3-width record, so the length-based "non-v3" guard never fires for from_args-built files. Clarify the comment; the guard is retained only for directly-constructed short records.

Surface per-row v3 row-lineage metadata columns at read time, completing the read half of row lineage (the write path already persists first_row_id / next_row_id).

- schema.py: reserved metadata column constants (_row_id = 2147483540, _last_updated_sequence_number = 2147483539, both optional LongType) + name lookup.
- table/__init__.py:
  - TableScan.projection now lets callers opt into the reserved columns by name (case-sensitive or not); select("*") does NOT auto-include them.
  - FileScanTask carries the manifest entry's data_sequence_number; planner wires it.
  - _open_manifest materializes each data file's inherited first_row_id from the manifest base per the v3 "First Row ID Inheritance" rule, counting deleted null-first_row_id entries toward the running base (fetch discard_deleted=False, drop deleted entries only after the base is computed).
- io/pyarrow.py: _task_to_record_batches computes _row_id = first_row_id + in-file position and _last_updated_sequence_number = data sequence number, with per-row coalesce over any physically-materialized column. Row filter is applied after scan (not pushed down) when _row_id is requested so positions stay physical; positions ride through positional-delete take() and the post-scan filter to stay aligned.

Tests: tests/table/test_v3_row_id_read.py (exact-value _row_id across two appends, projection alongside real columns, positional-delete+filter alignment, v2 raises, select("*") excludes metadata, inheritance counts deleted entries).

STRETCH (partial-delete _row_id preservation) deferred: lifting the foundation's copy-on-write loud-fail requires physically writing a _row_id column into rewritten data files, which the schema-bound write path does not support yet. The read-side coalesce that would consume such a column is already implemented.
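The inheritance and per-row computation described above can be sketched in plain Python. `Entry`, `inherit_first_row_ids`, and `materialize_row_ids` are hypothetical names, lists stand in for Arrow arrays, and two behaviors are assumptions rather than statements about the PR's code: entries that already carry a first_row_id keep it without advancing the base, and a null first_row_id yields null _row_id values (the latter taken from a later commit title in this PR).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Entry:
    """Hypothetical manifest entry: only the fields this sketch needs."""

    deleted: bool
    record_count: int
    first_row_id: Optional[int] = None


def inherit_first_row_ids(entries: List[Entry], manifest_first_row_id: Optional[int]) -> List[Entry]:
    """Assign inherited first_row_id values, counting deleted entries toward the
    running base, and drop deleted entries only after the base is computed."""
    base = manifest_first_row_id
    for entry in entries:
        if entry.first_row_id is None and base is not None:
            entry.first_row_id = base
            base += entry.record_count
    return [entry for entry in entries if not entry.deleted]


def materialize_row_ids(
    first_row_id: Optional[int],
    positions: Sequence[int],
    stored: Optional[Sequence[Optional[int]]] = None,
) -> List[Optional[int]]:
    """Per-row coalesce: prefer a physically stored _row_id, otherwise use
    first_row_id + physical in-file position (null if first_row_id is null)."""
    result: List[Optional[int]] = []
    for i, position in enumerate(positions):
        physical = stored[i] if stored is not None else None
        if physical is not None:
            result.append(physical)
        elif first_row_id is None:
            result.append(None)
        else:
            result.append(first_row_id + position)
    return result


live = inherit_first_row_ids([Entry(True, 4), Entry(False, 3)], manifest_first_row_id=100)
# Position 1 was removed by a positional delete; surviving positions stay physical.
row_ids = materialize_row_ids(live[0].first_row_id, positions=[0, 2])
```

Because the positions passed in are the surviving physical positions, a positional delete or a post-scan filter drops rows without shifting the ids of the rows that remain.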

Abanoub Doss added 6 commits June 23, 2026 09:45

v3: read _row_id as null when first_row_id is null; gate lineage columns to v3+

cedbefc

chore: satisfy mypy/ruff on v3 row-id read paths and tests

345dbfd

abnobdoss closed this Jun 25, 2026

abnobdoss reopened this Jun 25, 2026

abnobdoss wants to merge 6 commits into main from v3/row-id-read
