[python] Support schema evolution of nested struct sub-fields#8187
Open
TheR1sing3un wants to merge 3 commits into
Read-time schema evolution previously aligned only top-level columns by field id; sub-fields inside a ROW (and a ROW nested in an ARRAY/MAP) could not evolve: adding one silently created a top-level column, and rename/drop/update-type raised because the schema manager only handled the last path element.
- Assign globally-unique ids to nested sub-fields at create time and compute highestFieldId recursively, so nested ids never collide with top-level ones.
- Recurse schema changes along the dotted field-name path (transparently through ARRAY/MAP wrappers) for add/rename/drop/update-type/update-nullability/update-comment, allocating new ids from the persisted highestFieldId.
- Validate update-column-type against the cast-support rules.
- Align nested sub-fields by field id at read time: reorder, pad missing with NULL, follow renames, and cast changed types, recursing into struct/array/map.
Add tests covering nested add/rename/drop/update-type round-trips (append-only and primary-key), ARRAY<ROW>/MAP<.,ROW> sub-fields, the id model, and the cast rules.
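The id model in the first bullet can be illustrated with a minimal sketch. The `Field` class, `assign_ids`, and `highest_field_id` below are hypothetical stand-ins rather than the project's real classes, and the numbering order (top-level fields first, then sub-fields) is an assumption; the only property taken from the description above is that every field at any depth draws from one shared counter and that the highest id is found recursively.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class Field:
    """Hypothetical schema field; ARRAY/MAP wrappers are treated as transparent,
    so `children` holds the sub-fields of the ROW the column (or its element) carries."""
    name: str
    type_name: str
    children: List["Field"] = field(default_factory=list)
    id: int = -1


def assign_ids(fields: List[Field], ids: Optional[Iterator[int]] = None) -> None:
    """Give every field, at any depth, an id drawn from one shared counter."""
    ids = count(0) if ids is None else ids
    # Assumption: number the current level first, then descend into sub-fields.
    for f in fields:
        f.id = next(ids)
    for f in fields:
        assign_ids(f.children, ids)


def highest_field_id(fields: List[Field]) -> int:
    """Largest id found anywhere in the tree; -1 for an empty schema."""
    best = -1
    for f in fields:
        best = max(best, f.id, highest_field_id(f.children))
    return best


schema = [
    Field("id", "INT"),
    Field("mv", "ROW", [Field("version", "BIGINT"), Field("value", "STRING")]),
]
assign_ids(schema)

# A sub-field added later takes the next id after the recursive maximum,
# so it cannot collide with any existing top-level or nested id.
next_id = highest_field_id(schema) + 1
```

With a counter that only looked at top-level fields, the new sub-field would have received id 2 here, which `mv.version` already holds.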
JingsongLi
reviewed
Jun 10, 2026
Nested-leaf projection on append-only reads pushed the leaf path down by the LATEST name, bypassing the per-file field-id normalization: after a sub-field rename the old file's leaf read NULL, and after a sub-field type change old and new batches carried different types and failed to concatenate.
Mirror the merge path instead: widen the projection to the full top-level columns so the field-id normalization applies (rename follows the id, missing sub-fields pad NULL, types are cast), then extract the requested leaf paths back to the user's flat schema, either batch-level via NestedLeafBatchReader or row-level via OuterProjectionRecordReader when a post-read filter is involved.
Add regression tests projecting a renamed and a type-changed sub-field across old and new files.
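The widen-then-extract idea can be sketched in a few lines. This is an illustration over plain Python dicts, not the code behind NestedLeafBatchReader or OuterProjectionRecordReader; `widen`, `extract_leaf`, and `project_leaves` are hypothetical names, and the rows are assumed to have already passed through the per-file field-id normalization, so they carry the latest sub-field names.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def widen(leaf_paths: List[str]) -> List[str]:
    """Top-level columns needed to serve the requested dotted leaf paths."""
    columns: List[str] = []
    for path in leaf_paths:
        top = path.split(".", 1)[0]
        if top not in columns:
            columns.append(top)
    return columns


def extract_leaf(row: Row, path: str) -> Any:
    """Walk a dotted path; a NULL struct anywhere along it yields NULL."""
    value: Any = row
    for part in path.split("."):
        if value is None:
            return None
        value = value.get(part)
    return value


def project_leaves(rows: List[Row], leaf_paths: List[str]) -> List[Tuple[Any, ...]]:
    """Flatten each normalized row down to the requested leaves, in request order."""
    return [tuple(extract_leaf(row, p) for p in leaf_paths) for row in rows]


# Rows as they might look after normalization: the old file's `value`
# sub-field has already been mapped to its new name `val` by field id.
normalized_rows = [
    {"id": 1, "mv": {"version": 7, "val": "a"}},
    {"id": 2, "mv": None},
]
flat = project_leaves(normalized_rows, ["id", "mv.val"])
```

Because extraction runs after normalization, the leaf lookup never has to know which name or type a sub-field had in any particular file.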
JingsongLi
reviewed
Jun 11, 2026
update_column_type from ROW/ARRAY/MAP to STRING passes validation (the
cast rules allow constructed types to character strings), but reading an
old file failed with ArrowNotImplementedError because struct/list/map
cannot be cast to utf8 directly.
Render the string form during per-file alignment instead, matching the
engine's cast rules: ROW as '{v1, v2}', ARRAY as '[e1, e2]', MAP as
'{k1 -> v1, k2 -> v2}', with sub-values rendered recursively, NULL
sub-values as the literal 'null', and NULL containers staying NULL.
Add round-trip tests for ROW/ARRAY/MAP to STRING, NULL semantics, and a
nested sub-field changed to STRING.
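The rendering rules listed above can be sketched as a small recursive function. This is an illustration only, not the PR's implementation: it models a ROW as a Python tuple, an ARRAY as a list, and a MAP as a dict (all assumptions of this sketch), and it renders scalars with Python's `str()`, which may differ from the engine's formatting for some types such as booleans or decimals.

```python
from typing import Any, Optional


def render(value: Any) -> str:
    """Render a sub-value; a NULL sub-value becomes the literal 'null'."""
    if value is None:
        return "null"
    if isinstance(value, tuple):   # ROW, modelled as a tuple
        return "{" + ", ".join(render(v) for v in value) + "}"
    if isinstance(value, list):    # ARRAY, modelled as a list
        return "[" + ", ".join(render(v) for v in value) + "]"
    if isinstance(value, dict):    # MAP, modelled as a dict
        entries = (f"{render(k)} -> {render(v)}" for k, v in value.items())
        return "{" + ", ".join(entries) + "}"
    return str(value)              # scalar: assumption, plain str()


def cast_to_string(value: Any) -> Optional[str]:
    """Top-level cast: a NULL container stays NULL instead of becoming 'null'."""
    return None if value is None else render(value)


row_text = cast_to_string((1, "a"))                # "{1, a}"
array_text = cast_to_string([1, None, 3])          # "[1, null, 3]"
map_text = cast_to_string({"k1": (1, None)})       # "{k1 -> {1, null}}"
null_container = cast_to_string(None)              # None
```

The split between `render` and `cast_to_string` is what separates the two NULL cases: only the outermost value is allowed to stay NULL.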
Purpose
Follow-up to #8126, which made read-time schema evolution align top-level columns by field id. This extends the same id-based alignment to sub-fields inside a ROW (including a ROW nested in an ARRAY/MAP).
Before this PR, nested sub-field evolution didn't work: adding a sub-field silently created a top-level column, and rename/drop/update-type failed, because only the last name in the path was matched.
Now a dotted path like mv.value is resolved recursively, so for a column mv ROW<version BIGINT, value STRING>:
- ... NULL for it;
How it works:
- Nested sub-fields are assigned globally-unique ids at create time, and highestFieldId is computed recursively so nested and top-level ids never collide.
- Schema changes recurse along the dotted field-name path, transparently through ARRAY/MAP wrappers.
- update column type is validated against the cast-support rules.
- At read time, nested sub-fields are aligned by field id: reorder, pad missing with NULL, follow renames, cast changed types, recursing into struct / array / map.
Tests
New cases in schema_evolution_nested_read_test.py:
- ... ARRAY<ROW> and MAP<.,ROW>;
- the id model (... highestFieldId, duplicate detection);
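As an illustration of the read-time alignment described under "How it works", here is a minimal sketch over plain dicts. The `Field` class, the `CASTS` table, and `align` are hypothetical and much simpler than the real reader: only ROW nesting is handled (ARRAY/MAP recursion is left out), and the cast table lists just the two conversions the example needs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Field:
    """Hypothetical schema field; `children` is non-empty only for a ROW."""
    id: int
    name: str
    type_name: str
    children: List["Field"] = field(default_factory=list)


# Assumption: a tiny stand-in for the cast-support rules.
CASTS = {("INT", "BIGINT"): int, ("BIGINT", "STRING"): str}


def align(value: Optional[Dict[str, Any]],
          file_fields: List[Field],
          table_fields: List[Field]) -> Optional[Dict[str, Any]]:
    """Reshape a struct written with `file_fields` into the `table_fields` layout."""
    if value is None:
        return None
    by_id = {f.id: f for f in file_fields}
    out: Dict[str, Any] = {}
    for target in table_fields:            # output order follows the table schema
        source = by_id.get(target.id)
        if source is None:                 # added after the file was written
            out[target.name] = None
            continue
        v = value.get(source.name)         # look up by the OLD name, emit the new one
        if target.children:
            v = align(v, source.children, target.children)
        elif v is not None and source.type_name != target.type_name:
            v = CASTS[(source.type_name, target.type_name)](v)
        out[target.name] = v
    return out                             # fields dropped from the table are omitted


file_schema = [
    Field(0, "id", "INT"),
    Field(1, "mv", "ROW", [Field(2, "version", "INT"), Field(3, "value", "STRING")]),
]
table_schema = [
    Field(0, "id", "INT"),
    Field(1, "mv", "ROW", [
        Field(3, "val", "STRING"),         # renamed from `value`, moved first
        Field(2, "version", "BIGINT"),     # type changed from INT
        Field(4, "extra", "STRING"),       # newly added
    ]),
]
aligned = align({"id": 1, "mv": {"version": 7, "value": "a"}}, file_schema, table_schema)
```

Matching on `id` rather than on name is what lets the rename, the reorder, and the added sub-field all resolve in a single pass.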