feat: enforce physical column ordering in Parquet files for two-GET streaming merge #6281

Open
g-talbot wants to merge 3 commits into gtt/docs-claude-md from gtt/parquet-column-ordering

g-talbot (Contributor) commented Apr 9, 2026

Summary

  • Sort schema columns are written first in their configured order, followed by remaining data columns alphabetically
  • This physical layout enables a future two-GET streaming merge during compaction: one GET of the footer for the schema and offsets, then a single streaming GET that delivers the sort columns first so the merge order can be computed before the data columns arrive
  • No changes to compaction (not yet implemented); this only prepares the indexing pipeline
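
To make the two-GET idea concrete, here is a minimal sketch of how a future compactor might plan the second GET for one row group. Everything in it is hypothetical and not part of this PR: `ColumnChunk`, `StreamPlan`, and `plan_streaming_get` are illustrative names, and the sketch assumes the chunk offsets and lengths have already been read from the footer and are listed in physical order.

```rust
/// Hypothetical summary of one column chunk, as it might be derived from
/// the Parquet footer. Chunks are assumed to be listed in physical order.
#[derive(Debug, Clone)]
struct ColumnChunk {
    offset: u64, // byte offset of the chunk within the file
    length: u64, // compressed length of the chunk in bytes
}

/// Byte positions for the single streaming GET over one row group.
/// The GET covers the half-open range [start, end).
#[derive(Debug)]
struct StreamPlan {
    start: u64,
    /// Once the stream has passed this position, every sort column has
    /// arrived and the merge order can be computed.
    sort_columns_end: u64,
    end: u64,
}

/// Plans the streaming GET, assuming the first `num_sort_columns` chunks
/// are the sort columns. Returns `None` if the input does not match that
/// assumption (no sort columns, too few chunks, or overlapping offsets).
fn plan_streaming_get(chunks: &[ColumnChunk], num_sort_columns: usize) -> Option<StreamPlan> {
    if num_sort_columns == 0 || num_sort_columns > chunks.len() {
        return None;
    }
    // Chunks must appear at non-overlapping, ascending offsets.
    let ascending = chunks
        .windows(2)
        .all(|pair| pair[0].offset + pair[0].length <= pair[1].offset);
    if !ascending {
        return None;
    }
    let first = chunks.first()?;
    let last_sort = &chunks[num_sort_columns - 1];
    let last = chunks.last()?;
    Some(StreamPlan {
        start: first.offset,
        sort_columns_end: last_sort.offset + last_sort.length,
        end: last.offset + last.length,
    })
}
```

Because the sort columns form a prefix of the row group, a consumer of this plan would only need to buffer bytes up to `sort_columns_end` before it can start ordering rows; the remaining data columns can be processed as they stream in.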

Implementation

  • New reorder_columns() method in ParquetWriter that reorders a RecordBatch's columns before writing
  • Called in prepare_write() after sorting rows but before building WriterProperties
  • SortingColumn metadata indices automatically reflect the reordered schema since they're computed on the reordered batch
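
The ordering rule can be illustrated on column names alone. The following is a minimal sketch, not the actual `reorder_columns()` code: `physical_column_order` is a hypothetical helper, "alphabetically" is assumed to mean byte-wise string order, and sort columns missing from the batch are assumed to be skipped.

```rust
/// Computes the physical column order for a batch: sort-schema columns
/// first, in their configured order, then every remaining column in
/// ascending (byte-wise) order.
///
/// Assumption: a sort column that is absent from the batch is skipped
/// rather than treated as an error.
fn physical_column_order<'a>(
    batch_columns: &[&'a str],
    sort_columns: &[&'a str],
) -> Vec<String> {
    let mut ordered: Vec<&'a str> = Vec::with_capacity(batch_columns.len());

    // 1. Sort columns, in configured order, if present in the batch.
    for &name in sort_columns {
        if batch_columns.contains(&name) && !ordered.contains(&name) {
            ordered.push(name);
        }
    }

    // 2. All other columns, sorted by name.
    let mut rest: Vec<&'a str> = batch_columns
        .iter()
        .copied()
        .filter(|name| !ordered.contains(name))
        .collect();
    rest.sort_unstable();

    ordered.extend(rest);
    ordered.into_iter().map(String::from).collect()
}
```

With a sort schema of `service`, `host`, `timestamp` (the `service`-before-`host` order comes from the commit message in this PR; the other column names are made up), a batch whose columns arrive as `timestamp, host, value, service, env` would be written as `service, host, timestamp, env, value`. Mapping these names back to indices and projecting the batch is left out of the sketch.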

Test plan

  • test_column_ordering_sort_columns_first_then_alphabetical — verifies in-memory reordering logic
  • test_column_ordering_preserved_in_parquet_file — reads back a written Parquet file and verifies physical column order from the schema descriptor
  • All 154 existing parquet-engine tests pass
  • Clippy clean
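
For readers who want to check a file's layout independently, the invariant the tests verify can be expressed as a small predicate over the column names read back from the schema descriptor. This is a sketch under assumptions: `has_expected_layout` is a hypothetical helper, not one of the tests above, and it assumes byte-wise ordering for the non-sort columns.

```rust
/// Returns true if `physical` (column names in file order) starts with the
/// sort columns that are present, in configured order, and the remaining
/// columns are in strictly ascending (byte-wise) order.
fn has_expected_layout<'a>(physical: &[&'a str], sort_columns: &[&'a str]) -> bool {
    // Sort columns that actually exist in the file, in configured order.
    let expected_prefix: Vec<&'a str> = sort_columns
        .iter()
        .copied()
        .filter(|name| physical.contains(name))
        .collect();

    if !physical.starts_with(&expected_prefix) {
        return false;
    }

    // Everything after the prefix must be sorted; strict comparison also
    // rejects duplicate column names.
    physical[expected_prefix.len()..]
        .windows(2)
        .all(|pair| pair[0] < pair[1])
}
```

The predicate deliberately checks the layout without rebuilding the expected order, so it does not share logic with the reordering code it is meant to validate.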

🤖 Generated with Claude Code

g-talbot and others added 2 commits April 9, 2026 07:23
Sort schema columns are written first (in their configured sort order),
followed by all remaining data columns in alphabetical order. This
physical layout enables a two-GET streaming merge during compaction:
the footer GET provides the schema and offsets, then a single streaming
GET from the start of the row group delivers sort columns first —
allowing the compactor to compute the global merge order before data
columns arrive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sanity check previously asserted only presence, not ordering. It now
verifies that host appears before service in the scrambled input, which
is the opposite of the sort-schema order (service before host).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
g-talbot requested a review from mattmkim April 9, 2026 11:28
mattmkim (Contributor) commented Apr 9, 2026

@codex review

@chatgpt-codex-connector
Codex Review: Didn't find any major issues. Can't wait for the next one!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>