Skip to content

Feat/parquet read options #150#168

Merged
mchav merged 3 commits intoDataHaskell:mainfrom
Sao-Ali:feat/parquet-read-options-150
Feb 28, 2026
Merged

Feat/parquet read options #150#168
mchav merged 3 commits intoDataHaskell:mainfrom
Sao-Ali:feat/parquet-read-options-150

Conversation

@Sao-Ali
Copy link
Contributor

@Sao-Ali Sao-Ali commented Feb 26, 2026

Problem: Parquet reads were all-or-nothing. Users could not subset columns at read-time, control timestamp-to-day conversion, or subset rows while loading. This issue also required preserving current behavior for existing callers.

Solution:

  • introduce ParquetReadOptions (selectedColumns, timestampPolicy, rowRange) plus defaultParquetReadOptions.
  • Add readParquetWithOpts/readParquetFilesWithOpts and keep readParquet/readParquetFiles as default-option wrappers.
  • Wire selectedColumns into decode-time filtering with fail-fast ColumnNotFoundException for missing requested columns.
  • Wire timestampPolicy with PreserveTimestampPrecision and CoerceTimestampToDay behaviors, including fallback coercion for already-decoded UTCTime columns.
  • Wire rowRange through the reader and apply global rowRange semantics for readParquetFilesWithOpts after concatenation.

Tradeoffs and rationale:

  • chose an options record instead of multiple specialized APIs to keep extension points coherent and avoid API sprawl.
  • Kept legacy conversion wrappers/helpers (applyLogicalType and UTC helpers) to reduce compatibility risk for existing/internal call paths.
  • read-time projection improves performance by skipping unselected chunk decode; rowRange currently uses post-read slicing semantics (start inclusive, end exclusive) for correctness and consistency with existing range behavior.

Verification: add focused Parquet tests for selectedColumns, rowRange, timestampPolicy coercion, and missing selected column errors; run full suite successfully via cabal test (all passing).

…nd row range

Problem: Parquet reads were all-or-nothing. Users could not subset columns at read-time, control timestamp-to-day conversion, or subset rows while loading. This issue also required preserving current behavior for existing callers.

Solution: introduce ParquetReadOptions (selectedColumns, timestampPolicy, rowRange) plus defaultParquetReadOptions. Add readParquetWithOpts/readParquetFilesWithOpts and keep readParquet/readParquetFiles as default-option wrappers. Wire selectedColumns into decode-time filtering with fail-fast ColumnNotFoundException for missing requested columns. Wire timestampPolicy with PreserveTimestampPrecision and CoerceTimestampToDay behaviors, including fallback coercion for already-decoded UTCTime columns. Wire rowRange through the reader and apply global rowRange semantics for readParquetFilesWithOpts after concatenation.

Tradeoffs and rationale: chose an options record instead of multiple specialized APIs to keep extension points coherent and avoid API sprawl. Kept legacy conversion wrappers/helpers (applyLogicalType and UTC helpers) to reduce compatibility risk for existing/internal call paths. read-time projection improves performance by skipping unselected chunk decode; rowRange currently uses post-read slicing semantics (start inclusive, end exclusive) for correctness and consistency with existing range behavior.

Verification: add focused Parquet tests for selectedColumns, rowRange, timestampPolicy coercion, and missing selected column errors; run full suite successfully via cabal test (all passing).
Apply formatter-driven layout updates in Parquet read-options code and related tests.

No behavior change; this commit is formatting-only after lint/format checks.
@Sao-Ali
Copy link
Contributor Author

Sao-Ali commented Feb 26, 2026

I tried to implemented read options with backward compatibility as a core constraint, so existing functions and the default behavior should remain unchanged while adding the new feature. Let me know if the test cases make sense or if I need more.


data ParquetTimestampPolicy
= PreserveTimestampPrecision
| CoerceTimestampToDay
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem as fundamental. Let's hold off on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also revert this back to old implementationg for the time converstion. I saw you guys talked about it in a different issue so let me know if we want to circle back to this.

Nothing -> True
Just selected ->
let fullPath = T.intercalate "." (map T.pack colPath)
in colName `S.member` selected || fullPath `S.member` selected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not worry about nested fields for now. The reader doesn't even have a good way to support them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the nested field for now. Made it so its leaf name only

pure (applyRowRange opts (mconcat dfs))

applyRowRange :: ParquetReadOptions -> DataFrame -> DataFrame
applyRowRange opts df = case rowRange opts of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

fmap (DS.range df) (rowRange opts)

Or use <$>.

tests/Parquet.hs Outdated
TestCase
( assertEqual
"rowRangeWithOpts"
(D.range (2, 5) allTypes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is circular since if the range function is broken/produces the wrong result this will still pass. Should just test against the expect dimensions.

  - apply parquet read options in order: predicate filtering, column projection, then row range
  - auto-include predicate-referenced columns during decode when  is set, then project back to requested columns
  - restrict selected-column matching to leaf names only (drop full-path nested matching)
  - remove  and revert timestamp conversion to default behavior
  - update row-range helper implementation style in
  - revise parquet option tests: make row-range assertion non-circular and add predicate-focused cases
@Sao-Ali Sao-Ali requested a review from mchav February 26, 2026 21:58
@mchav
Copy link
Member

mchav commented Feb 28, 2026

Thanks @Sao-Ali - this is great!

@mchav mchav merged commit 7a67f04 into DataHaskell:main Feb 28, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants