fix(arrow,maintenance): keep dict columns in batch reader; opt-in expire_snapshots file GC #15

Open
abnobdoss wants to merge 3 commits into main from v3/w3-bugfixes

Conversation

@abnobdoss

Owner

No description provided.

Abanoub Doss added 3 commits June 23, 2026 21:27
Fixes apache#3540. Previously, to_arrow_batch_reader(dictionary_columns=...) cast each batch to
a target schema built from schema_to_pyarrow(projection()), which has no
dictionary types, silently decoding dictionary-encoded columns back to plain
arrays (to_arrow does not, because it concatenates with permissive promotion).

Derive the reader's target schema from the first scan batch so requested
columns that ArrowScan actually dictionary-encodes (strings) stay dictionary
typed, while columns it leaves plain (ints, ORC, etc.) stay plain - matching
to_arrow. The trailing cast still conforms later batches. Adds regression
tests covering string, non-string, and mixed dictionary_columns.

Fixes apache#2604. ExpireSnapshots was metadata-only and leaked the data, delete,
manifest, manifest-list, and statistics files of expired snapshots forever.

Add opt-in ExpireSnapshots.clean_expired_files(). On the autocommit path,
collect files reachable from the expiring snapshots before the metadata
commit, then after the commit collect files still reachable from every
surviving snapshot (which covers all branches and tags), and best-effort
delete the difference. Surviving-file resolution is strict: if it cannot be
fully resolved, the cleanup aborts rather than risk deleting a live file.
Statistics and partition-statistics puffin files are cleaned the same way.
Cleanup is off by default and skipped for non-autocommit transactions.
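
The cleanup rule above amounts to a set difference with a strict abort on the surviving side. The sketch below models it in plain Python and is not the PR's implementation: `clean_expired_files` here is a standalone function, and `reachable` and `delete` are hypothetical callbacks standing in for manifest traversal and file deletion.

```python
from typing import Callable, Iterable, Set


def clean_expired_files(
    expired_files: Set[str],
    surviving_ids: Iterable[int],
    reachable: Callable[[int], Set[str]],
    delete: Callable[[str], None],
) -> Set[str]:
    """Delete files only expired snapshots referenced; return what was deleted.

    expired_files: files collected from the expiring snapshots before commit.
    reachable:     hypothetical resolver for the files of one surviving snapshot.
    """
    live: Set[str] = set()
    try:
        # Strict: every surviving snapshot must resolve, or nothing is deleted.
        for snapshot_id in surviving_ids:
            live |= reachable(snapshot_id)
    except Exception:
        return set()

    deleted: Set[str] = set()
    for path in sorted(expired_files - live):
        try:
            delete(path)
            deleted.add(path)
        except OSError:
            # Best effort: a failed delete leaves an orphan but is not fatal.
            continue
    return deleted


# Usage: snapshot 1 expired; snapshots 2 and 3 survive and still share "b".
files_by_snapshot = {2: {"b", "c"}, 3: {"c", "d"}}
removed = []
result = clean_expired_files(
    {"a", "b", "m1"}, [2, 3], files_by_snapshot.__getitem__, removed.append
)
print(sorted(result))
```

Resolving the surviving set after the commit, rather than before, means a file is kept whenever any remaining snapshot still references it, whichever branch or tag that snapshot belongs to.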
@abnobdoss abnobdoss closed this Jun 25, 2026
@abnobdoss abnobdoss reopened this Jun 25, 2026