Skip to content

fix: pass file_size to ParquetObjectReader to avoid page-index panic#5

Merged
shefeek-jinnah merged 1 commit into
mainfrom
shefeek/fix-parquet-page-index-panic
May 26, 2026
Merged

fix: pass file_size to ParquetObjectReader to avoid page-index panic#5
shefeek-jinnah merged 1 commit into
mainfrom
shefeek/fix-parquet-page-index-panic

Conversation

@shefeek-jinnah

Copy link
Copy Markdown
Collaborator

Root cause

ParquetObjectReader::get_metadata (parquet 58.1, arrow/async_reader/store.rs:238) picks one of two metadata-loading paths:

file_size Path taken
Some(_) ParquetMetaDataReader::load_and_finish (size-aware)
None ParquetMetaDataReader::load_via_suffix_and_finish (suffix fallback)

Today we never call with_file_size, so the suffix fallback is always chosen. That path has a bug in 58.1:

  • load_metadata_via_suffix (file/metadata/reader.rs:706) returns the remainder as Some((0, suffix.slice(..metadata_start))). The 0 is the claimed remainder_start, but the bytes are from the file's
    tail (absolute offset file_size - suffix_len), not offset 0.
  • For comparison, the size-aware load_metadata (reader.rs:653) correctly returns Some((footer_start as usize, ...)).

Then in load_page_index_with_remainder (reader.rs:528):

Some((remainder_start, remainder)) if *remainder_start as u64 <= range.start => {
    let offset = usize::try_from(range.start - remainder_start)?;
    let end    = usize::try_from(range.end   - remainder_start)?;
    assert!(end <= remainder.len());   // panics
    remainder.slice(offset..end)
}

With remainder_start = 0, the guard 0 <= range.start is always true, so the branch is entered even though range.start is an absolute file offset far past remainder.len(). The assert! then trips for any
non-trivial file with page indexes.

Fix

Pass partitioned_file.object_meta.size to with_file_size so the inner ParquetObjectReader takes the size-aware path. This matches the pattern shown in parquet's own doc example (store.rs:49):

ParquetObjectReader::new(storage_container, meta.location).with_file_size(meta.size)

object_meta.size is a u64 populated at listing time, so no extra fetch is required.

Without an explicit file_size, ParquetObjectReader falls back to the
suffix-fetch metadata path, which mislabels `remainder_start` as 0 in
parquet 58.1 and panics in `load_page_index_with_remainder` for any
file containing page indexes.
@shefeek-jinnah shefeek-jinnah merged commit aa4062e into main May 26, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant