[python] support read after data evolution updating by shard#7157

Open
XiaoHongbo-Hope wants to merge 20 commits into apache:master from XiaoHongbo-Hope:shards_read

XiaoHongbo-Hope commented Jan 30, 2026

Problem

When the user updates a column for only one shard (e.g. ShardTableUpdator runs shard 0 only and writes new column d), full table read fails:

pyarrow.lib.ArrowInvalid: Schema at index 1 was different: d: int32 vs d: null

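Only that shard's files contain the new column; the other files do not. When the reader concatenates the batches, the schemas differ (d: int32 in the updated files, d: null in the rest) and the read crashes. To fix this, the PR supports reading after a data evolution update by shard.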

Tests

API and Format

Documentation

XiaoHongbo-Hope marked this pull request as ready for review January 30, 2026 08:19
XiaoHongbo-Hope changed the title from "[python/hotfix] fix data-evolution read after partial shard update" to "[python] support data evolution shard read" Jan 31, 2026
XiaoHongbo-Hope marked this pull request as draft January 31, 2026 07:48
XiaoHongbo-Hope marked this pull request as ready for review January 31, 2026 08:54
XiaoHongbo-Hope changed the title from "[python] support data evolution shard read" to "[python] support read after update by shard of data evolution table" Feb 1, 2026
  row_tracking_enabled: bool,
- system_fields: dict):
+ system_fields: dict,
+ requested_field_names: Optional[List[str]] = None):
Contributor

You should just use fields: List[DataField]?

Contributor Author

> You should just use fields: List[DataField]?

Updated

"""Ensure _ROW_ID and _SEQUENCE_NUMBER are not null (per SpecialFields)."""
fields = []
for field in schema:
    if field.name == SpecialFields.ROW_ID.name or field.name == SpecialFields.SEQUENCE_NUMBER.name:
Contributor

Why can it be nullable?

Contributor Author

XiaoHongbo-Hope commented Feb 2, 2026

> Why can it be nullable?

There is a bug here: the nullability of the row-tracking system fields is lost during _assign_row_tracking. I opened a separate PR, #7174, to fix it.

XiaoHongbo-Hope force-pushed the shards_read branch 2 times, most recently from 72ffd99 to 7e5f55a February 2, 2026 11:33
XiaoHongbo-Hope marked this pull request as draft February 3, 2026 03:51
XiaoHongbo-Hope marked this pull request as ready for review February 3, 2026 08:02
XiaoHongbo-Hope marked this pull request as draft February 3, 2026 08:46
XiaoHongbo-Hope changed the title from "[python] support read after update by shard of data evolution table" to "[python] support read after data evolution shard updating" Feb 3, 2026
XiaoHongbo-Hope changed the title from "[python] support read after data evolution shard updating" to "[python] support read after data evolution updating by shard" Feb 3, 2026
XiaoHongbo-Hope force-pushed the shards_read branch 4 times, most recently from c00cafa to cb28d6b February 8, 2026 10:35
XiaoHongbo-Hope force-pushed the shards_read branch 5 times, most recently from 55289ad to 277fef4 February 27, 2026 09:36
XiaoHongbo-Hope marked this pull request as ready for review February 27, 2026 10:29
same order for that shard.
- **Parallelism**: run multiple shards by calling `new_shard_updator(shard_idx, num_shards)` for each shard.

## Read After Partial Shard Update
Contributor

I feel like this document doesn't make much sense

Contributor Author

> I feel like this document doesn't make much sense

Removed

    ).slice(0, min_rows)
    columns.append(column)
else:
    columns.append(pa.nulls(min_rows, type=self.schema.field(i).type))
Contributor

This work should be done in DataFileBatchReader?

else:
    field = self.schema_map.get(name)
    inter_arrays.append(
        pa.nulls(num_rows, type=field.type) if field is not None else pa.nulls(num_rows)
Contributor

I don't get it. FormatPyArrowReader has already handled read_fields.

Contributor

JingsongLi left a comment

I ran your test and tried to fix it. All I needed to do was modify `out_fields.append(pa.field(field_name, pa.null(), nullable=True))` in FormatPyArrowReader: do not pass the null type there, pass the correct type instead.

XiaoHongbo-Hope marked this pull request as draft February 28, 2026 08:26
XiaoHongbo-Hope marked this pull request as ready for review February 28, 2026 15:28
Contributor Author

XiaoHongbo-Hope commented Feb 28, 2026

> I ran your test and tried to fix it. All I needed to do was modify `out_fields.append(pa.field(field_name, pa.null(), nullable=True))` in FormatPyArrowReader: do not pass the null type there, pass the correct type instead.

My bad. Fixed by changing read_fields to List[DataField].

XiaoHongbo-Hope marked this pull request as draft February 28, 2026 15:44
XiaoHongbo-Hope marked this pull request as ready for review February 28, 2026 16:08