feat(table): materialize _row_id and _last_updated_sequence_number on read #20
Open
abnobdoss wants to merge 6 commits into
added 6 commits
June 23, 2026 09:45
Enables writing v3 table metadata and implements spec-conformant row lineage
(next-row-id / first-row-id / added-rows) end to end.
- metadata.py: bump SUPPORTED_TABLE_FORMAT_VERSION to 3; remove the
  TableMetadataV3.model_dump_json NotImplementedError so v3 serializes;
  initialize next_row_id=0 for new v3 tables.
- manifest.py: DEFAULT_READ_VERSION=3 so manifest-list first_row_id (field 520)
  round-trips; add ManifestWriterV3, ManifestListWriterV3 (assigns first_row_id
  only to DATA manifests lacking one, advancing by existing+added rows), and the
  ManifestFile.first_row_id accessor.
- update/__init__.py: upgrade-to-v3 seeds next_row_id=0; AddSnapshotUpdate
  computes next_row_id = next_row_id + added_rows with NO silent fallback to None
  (raises instead), fixing the bug where v3 commits collapsed next_row_id to None.
- update/snapshot.py: _commit derives added_rows from the manifest-list writer's
  row-id advance and sets first_row_id; merge manager inherits min(first_row_id)
  for merged data manifests and refuses to merge v3 manifests whose row-id ranges
  are non-contiguous/out-of-order (correctness over compaction).
- tests: acceptance suite (create v3, append twice, merge-append, delete+merge gap,
  json round-trip) plus negative guards.
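The row-id bookkeeping described in this commit can be sketched in a few lines. This is a minimal illustration, not the code in the PR: `ManifestSummary` and `assign_first_row_ids` are hypothetical names, and the sketch assumes that non-data manifests and manifests that already carry a first_row_id are left untouched.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ManifestSummary:
    """Hypothetical stand-in for a manifest-list entry (not PyIceberg's ManifestFile)."""

    is_data: bool
    added_rows: int
    existing_rows: int
    first_row_id: Optional[int] = None


def assign_first_row_ids(
    manifests: List[ManifestSummary], next_row_id: Optional[int]
) -> Tuple[int, int]:
    """Assign first_row_id to data manifests lacking one.

    Returns (new_next_row_id, added_rows), where added_rows is how far the
    row-id counter advanced while writing the manifest list.
    """
    if next_row_id is None:
        # Mirrors the "no silent fallback to None" rule: fail instead of guessing.
        raise ValueError("v3 table metadata must carry next_row_id")
    cursor = next_row_id
    for manifest in manifests:
        if manifest.is_data and manifest.first_row_id is None:
            manifest.first_row_id = cursor
            # Advance by existing + added rows, as the commit message describes.
            cursor += manifest.existing_rows + manifest.added_rows
    return cursor, cursor - next_row_id


manifests = [
    ManifestSummary(is_data=True, added_rows=3, existing_rows=0, first_row_id=0),
    ManifestSummary(is_data=False, added_rows=5, existing_rows=0),
    ManifestSummary(is_data=True, added_rows=4, existing_rows=1),
]
new_next_row_id, added_rows = assign_first_row_ids(manifests, next_row_id=3)
```

With a table whose next_row_id is 3, only the last manifest is assigned a first_row_id (3), and the counter advances by 5 to 8.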
…erent upgrade gate
Remediate four defects found by the t1-foundation skeptic audit:
1. [HIGH] Copy-on-write delete no longer re-numbers surviving rows.
   - Whole-file deletes (incl. shared-manifest case) materialize each survivor's
     absolute _row_id into DataFile field 142 and inherit the source manifest's
     first_row_id, so next_row_id and survivors' row ids are preserved.
   - Partial deletes that need a physical data-file rewrite now FAIL LOUDLY on v3
     (NotImplementedError) instead of silently re-numbering survivors, because
     PyIceberg has no read-side _row_id materialization yet.
2. [HIGH] v3 manifest merge was silently disabled by a descending-vs-ascending
   ordering check in _v3_row_ids_are_contiguous. The check now sorts ranges and
   verifies gapless/non-overlapping regardless of input order; _create_manifest
   writes merged entries in ascending row-id order so min(first_row_id) inheritance
   is correct. The apache#3070 double-count fix is now actually exercised.
3. [MED] upgrade_table_version gate hardcoded {1,2}; now uses
   SUPPORTED_TABLE_FORMAT_VERSION so v2->v3 upgrade works end to end (seeds
   next_row_id=0), consistent with create-allows-v3.
4. [LOW] The two mislabeled "merge" tests now instrument _create_manifest and
   assert a merge actually runs; added whole-file delete-lineage tests asserting
   field-142 _row_id values, a loud-fail test, a v2 regression test, and an
   upgrade test.
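The order-independent contiguity check from item 2 could look roughly like the sketch below. It is an illustration only: the real `_v3_row_ids_are_contiguous` signature is not shown in this PR page, so the function name and the `(first_row_id, row_count)` tuple shape are assumptions.

```python
from typing import Iterable, Tuple


def row_id_ranges_are_contiguous(ranges: Iterable[Tuple[int, int]]) -> bool:
    """Return True if the (first_row_id, row_count) ranges tile one gapless span.

    Input order does not matter: the ranges are sorted first, then each range
    must start exactly where the previous one ended (no gap, no overlap).
    """
    ordered = sorted(ranges)
    if not ordered:
        return True
    expected_start = ordered[0][0]
    for first_row_id, row_count in ordered:
        if first_row_id != expected_start:
            return False
        expected_start = first_row_id + row_count
    return True


descending_but_contiguous = row_id_ranges_are_contiguous([(5, 3), (0, 5)])
has_gap = row_id_ranges_are_contiguous([(0, 5), (6, 2)])
has_overlap = row_id_ranges_are_contiguous([(0, 5), (4, 2)])
```

Sorting before comparing is what keeps a descending input from being rejected, which is the failure mode the audit describes.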
DataFile gains a first_row_id (field 142) accessor and a __copy__ that isolates
its _data list so lineage materialization does not leak into source DataFiles.
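The isolation that `__copy__` provides can be shown with a small list-backed record. This is a sketch under assumptions: `PositionalRecord` and its slot layout are hypothetical and only mimic the idea of a record that stores its fields in a `_data` list.

```python
import copy
from typing import Any, Optional


class PositionalRecord:
    """Hypothetical list-backed record; slot 2 plays the role of first_row_id."""

    FIRST_ROW_ID_SLOT = 2  # assumed position, for illustration only

    def __init__(self, *values: Any) -> None:
        self._data = list(values)

    @property
    def first_row_id(self) -> Optional[int]:
        return self._data[self.FIRST_ROW_ID_SLOT]

    def __copy__(self) -> "PositionalRecord":
        clone = self.__class__.__new__(self.__class__)
        clone.__dict__.update(self.__dict__)
        # Give the clone its own list so writes do not reach the source record.
        clone._data = list(self._data)
        return clone


source = PositionalRecord("data/a.parquet", 10, None)
survivor = copy.copy(source)
survivor._data[PositionalRecord.FIRST_ROW_ID_SLOT] = 100  # materialize lineage on the copy
```

Without the explicit list copy, `copy.copy` would share `_data` between both objects and the write on `survivor` would also change `source`.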
Address skeptic re-audit LOW nit: from_args always allocates the full v3-width record, so the length-based "non-v3" guard never fires for from_args-built files. Clarify the comment; the guard is retained only for directly-constructed short records.
Surface per-row v3 row-lineage metadata columns at read time, completing the
read half of row lineage (the write path already persists first_row_id / next_row_id).
- schema.py: reserved metadata column constants (_row_id = 2147483540,
  _last_updated_sequence_number = 2147483539, both optional LongType) + name lookup.
- table/__init__.py:
  - TableScan.projection now lets callers opt into the reserved columns by name
    (case-sensitive or not); select("*") does NOT auto-include them.
  - FileScanTask carries the manifest entry's data_sequence_number; planner wires it.
  - _open_manifest materializes each data file's inherited first_row_id from the
    manifest base per the v3 "First Row ID Inheritance" rule, counting deleted
    null-first_row_id entries toward the running base (fetch discard_deleted=False,
    drop deleted entries only after the base is computed).
- io/pyarrow.py: _task_to_record_batches computes _row_id = first_row_id + in-file
  position and _last_updated_sequence_number = data sequence number, with per-row
  coalesce over any physically-materialized column. Row filter is applied after scan
  (not pushed down) when _row_id is requested so positions stay physical; positions
  ride through positional-delete take() and the post-scan filter to stay aligned.
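The inheritance step attributed to `_open_manifest` can be sketched as follows. The `Entry` type and `inherit_first_row_ids` are hypothetical names, and the sketch follows the behavior as this commit message states it: every entry with a null first_row_id, deleted or not, consumes part of the running base, and deleted entries are dropped only afterwards.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical manifest entry reduced to the fields the rule needs."""

    record_count: int
    deleted: bool = False
    first_row_id: Optional[int] = None


def inherit_first_row_ids(entries: List[Entry], manifest_first_row_id: int) -> List[Entry]:
    """Resolve null first_row_id values from the manifest base, then drop deleted entries."""
    base = manifest_first_row_id
    resolved = []
    for entry in entries:
        if entry.first_row_id is None:
            entry = replace(entry, first_row_id=base)
            # Deleted entries also advance the base, so later files keep their ids.
            base += entry.record_count
        resolved.append(entry)
    # Filter deleted entries only after the running base has been computed.
    return [entry for entry in resolved if not entry.deleted]


live = inherit_first_row_ids(
    [Entry(10), Entry(5, deleted=True), Entry(7), Entry(3, first_row_id=500)],
    manifest_first_row_id=100,
)
live_first_row_ids = [entry.first_row_id for entry in live]
```

Here the deleted 5-row entry still occupies ids 110 to 114, so the third file starts at 115 rather than 110.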
Tests: tests/table/test_v3_row_id_read.py (exact-value _row_id across two appends,
projection alongside real columns, positional-delete+filter alignment, v2 raises,
select("*") excludes metadata, inheritance counts deleted entries).
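The positional-delete and filter alignment that these tests cover comes down to carrying each row's physical position through every step that drops rows. The plain-Python sketch below stands in for the Arrow-based code path; `materialize_lineage` and its parameters are hypothetical and do not reflect the PR's actual function signatures.

```python
from typing import Callable, List, Set, Tuple


def materialize_lineage(
    rows: List[str],
    first_row_id: int,
    data_sequence_number: int,
    deleted_positions: Set[int],
    row_filter: Callable[[str], bool],
) -> List[Tuple[int, int, str]]:
    """Return (_row_id, _last_updated_sequence_number, value) for surviving rows."""
    # Positions are taken from the unfiltered file, so they stay physical.
    positions = range(len(rows))
    # Step 1: apply positional deletes, keeping the original positions.
    kept = [pos for pos in positions if pos not in deleted_positions]
    # Step 2: apply the row filter after the scan, again keeping positions.
    kept = [pos for pos in kept if row_filter(rows[pos])]
    return [(first_row_id + pos, data_sequence_number, rows[pos]) for pos in kept]


result = materialize_lineage(
    rows=["a", "b", "c", "d", "e"],
    first_row_id=100,
    data_sequence_number=7,
    deleted_positions={1},
    row_filter=lambda value: value != "d",
)
```

If the filter were pushed down before positions were assigned, row "e" would be numbered by its rank among survivors instead of its physical offset 4, which is the misalignment the post-scan filter avoids.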
STRETCH (partial-delete _row_id preservation) deferred: lifting the foundation's
copy-on-write loud-fail requires physically writing a _row_id column into rewritten
data files, which the schema-bound write path does not support yet. The read-side
coalesce that would consume such a column is already implemented.
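A per-row coalesce of this kind might look like the sketch below: a stored _row_id wins when present, and the inherited value fills the nulls. `coalesce_row_ids` is a hypothetical helper written with plain lists rather than Arrow arrays, so it only illustrates the rule.

```python
from typing import List, Optional, Sequence


def coalesce_row_ids(
    stored: Optional[Sequence[Optional[int]]],
    first_row_id: int,
    positions: Sequence[int],
) -> List[int]:
    """Prefer a physically stored _row_id; otherwise use first_row_id + position."""
    if stored is None:
        # The data file has no _row_id column at all: every value is inherited.
        return [first_row_id + pos for pos in positions]
    return [
        value if value is not None else first_row_id + pos
        for value, pos in zip(stored, positions)
    ]


mixed = coalesce_row_ids([42, None, 44], first_row_id=1000, positions=[0, 1, 2])
inherited_only = coalesce_row_ids(None, first_row_id=1000, positions=[0, 1, 2])
```

A rewritten data file that carried forward the ids 42 and 44 would keep them, while the row with no stored id falls back to 1001.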