Feat/parquet read options #150 by Sao-Ali · Pull Request #168 · DataHaskell/dataframe

Sao-Ali · 2026-02-26T00:50:29Z

Problem: Parquet reads were all-or-nothing. Users could not subset columns at read-time, control timestamp-to-day conversion, or subset rows while loading. This issue also required preserving current behavior for existing callers.

Solution:

Introduce ParquetReadOptions (selectedColumns, timestampPolicy, rowRange) plus defaultParquetReadOptions.
Add readParquetWithOpts/readParquetFilesWithOpts and keep readParquet/readParquetFiles as default-option wrappers.
Wire selectedColumns into decode-time filtering with fail-fast ColumnNotFoundException for missing requested columns.
Wire timestampPolicy with PreserveTimestampPrecision and CoerceTimestampToDay behaviors, including fallback coercion for already-decoded UTCTime columns.
Wire rowRange through the reader and apply global rowRange semantics for readParquetFilesWithOpts after concatenation.
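To illustrate the options-record-plus-default-wrapper pattern described in the list above, here is a minimal sketch. It is not the code from this PR: `Frame`, `ReadOpts`, `readWithOpts`, and `readDefault` are hypothetical stand-ins, the field types are assumptions, and the "reader" works on an in-memory table rather than a Parquet file.

```haskell
import qualified Data.Set as S

-- Hypothetical stand-in for a DataFrame: named columns of Ints.
type Frame = [(String, [Int])]

-- Assumed shape of an options record; the real field types may differ.
data ReadOpts = ReadOpts
    { selectedColumns :: Maybe (S.Set String) -- Nothing means "all columns"
    , rowRange :: Maybe (Int, Int) -- start inclusive, end exclusive
    }

-- Defaults reproduce the old all-or-nothing behavior.
defaultReadOpts :: ReadOpts
defaultReadOpts = ReadOpts{selectedColumns = Nothing, rowRange = Nothing}

-- Option-aware entry point: project columns, then slice rows.
readWithOpts :: ReadOpts -> Frame -> Frame
readWithOpts opts = sliceRows . project
  where
    project = case selectedColumns opts of
        Nothing -> id
        Just wanted -> filter ((`S.member` wanted) . fst)
    sliceRows = case rowRange opts of
        Nothing -> id
        Just (start, end) -> map (fmap (take (end - start) . drop start))

-- Legacy entry point kept as a thin wrapper over the defaults.
readDefault :: Frame -> Frame
readDefault = readWithOpts defaultReadOpts
```

Callers that need only one option can override a single field with record update syntax, for example `defaultReadOpts{rowRange = Just (1, 3)}`, which is what keeps the API from sprawling as options are added.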

Tradeoffs and rationale:

Chose an options record instead of multiple specialized APIs to keep extension points coherent and avoid API sprawl.
Kept legacy conversion wrappers/helpers (applyLogicalType and UTC helpers) to reduce compatibility risk for existing/internal call paths.
Read-time projection improves performance by skipping the decode of unselected column chunks; rowRange currently uses post-read slicing semantics (start inclusive, end exclusive) for correctness and consistency with existing range behavior.
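The "global rowRange after concatenation" semantics for multi-file reads can be contrasted with per-file slicing in a small sketch. This assumes rows can be modeled as a plain list; `sliceRows`, `readFilesGlobal`, and `readFilesPerFile` are hypothetical names used only for illustration.

```haskell
-- Hypothetical stand-in: each "file" is just a list of row ids.
type Rows = [Int]

-- Half-open slice: start inclusive, end exclusive.
sliceRows :: (Int, Int) -> Rows -> Rows
sliceRows (start, end) = take (end - start) . drop start

-- Global semantics: concatenate all files first, then slice once.
readFilesGlobal :: (Int, Int) -> [Rows] -> Rows
readFilesGlobal range = sliceRows range . concat

-- Alternative (not what the PR describes): slice each file, then concatenate.
readFilesPerFile :: (Int, Int) -> [Rows] -> Rows
readFilesPerFile range = concatMap (sliceRows range)
```

With two files of three rows each and a range of `(2, 5)`, the global version yields rows 2 to 4 of the combined table, while the per-file version yields only the last row of each file. The global reading matches what a caller would get by reading everything and slicing afterwards.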

Verification: add focused Parquet tests for selectedColumns, rowRange, timestampPolicy coercion, and missing selected column errors; run full suite successfully via cabal test (all passing).

Sao-Ali · 2026-02-26T00:55:27Z

I tried to implement read options with backward compatibility as a core constraint, so existing functions and the default behavior should remain unchanged while adding the new feature. Let me know if the test cases make sense or if I need more.

mchav · 2026-02-26T05:15:22Z

src/DataFrame/IO/Parquet.hs


+data ParquetTimestampPolicy
+    = PreserveTimestampPrecision
+    | CoerceTimestampToDay


This doesn't seem as fundamental. Let's hold off on it.

Also reverted this back to the old implementation for the time conversion. I saw you guys talked about it in a different issue, so let me know if we want to circle back to this.

mchav · 2026-02-26T05:26:26Z

src/DataFrame/IO/Parquet.hs

+                Nothing -> True
+                Just selected ->
+                    let fullPath = T.intercalate "." (map T.pack colPath)
+                     in colName `S.member` selected || fullPath `S.member` selected


Let's not worry about nested fields for now. The reader doesn't even have a good way to support them.

Removed the nested-field matching for now. Made it so it matches on the leaf name only.
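A rough sketch of leaf-name-only matching combined with a fail-fast check for missing columns might look like the following. The names `ColumnPath`, `leafName`, `shouldDecode`, and `selectColumns` are hypothetical, and the sketch returns `Left` with the missing names instead of throwing the library's ColumnNotFoundException.

```haskell
import qualified Data.Set as S

-- Hypothetical representation of a schema path, e.g. ["user", "name"].
type ColumnPath = [String]

-- The leaf is the last path component; an empty path has no leaf name.
leafName :: ColumnPath -> String
leafName [] = ""
leafName path = last path

-- Decode everything when no selection is given, else match on leaf name only.
shouldDecode :: Maybe (S.Set String) -> ColumnPath -> Bool
shouldDecode Nothing _ = True
shouldDecode (Just selected) path = leafName path `S.member` selected

-- Requested names that match no leaf in the schema.
missingColumns :: S.Set String -> [ColumnPath] -> [String]
missingColumns selected paths =
    S.toList (selected `S.difference` S.fromList (map leafName paths))

-- Fail fast (Left) before decoding if any requested column is absent.
selectColumns :: Maybe (S.Set String) -> [ColumnPath] -> Either [String] [ColumnPath]
selectColumns selection paths =
    case fmap (`missingColumns` paths) selection of
        Just missing@(_ : _) -> Left missing
        _ -> Right (filter (shouldDecode selection) paths)
```

Under this scheme a dotted name such as `"user.name"` is reported as missing rather than resolved as a nested path, which is consistent with dropping full-path matching.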

mchav · 2026-02-26T05:29:38Z

src/DataFrame/IO/Parquet.hs

+            pure (applyRowRange opts (mconcat dfs))
+
+applyRowRange :: ParquetReadOptions -> DataFrame -> DataFrame
+applyRowRange opts df = case rowRange opts of


nit:

fmap (DS.range df) (rowRange opts)

Or use <$>.
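For reference, one way to collapse the explicit `case` into a combinator while keeping the "no range means unchanged" fallback is `maybe`. This is only a sketch over a list stand-in; `Rows`, `range`, and both helper names are hypothetical, and the argument order of `range` is assumed to be range first, data second, as in the test snippet `D.range (2, 5) allTypes`.

```haskell
-- Hypothetical stand-in for a DataFrame: a list of row ids.
type Rows = [Int]

-- Half-open slice, assumed to mirror the library's range argument order.
range :: (Int, Int) -> Rows -> Rows
range (start, end) = take (end - start) . drop start

-- Explicit case analysis, in the style of the original diff.
applyRowRangeCase :: Maybe (Int, Int) -> Rows -> Rows
applyRowRangeCase mRange rows = case mRange of
    Nothing -> rows
    Just r -> range r rows

-- Same behavior with `maybe`: identity when no range is set.
applyRowRange :: Maybe (Int, Int) -> Rows -> Rows
applyRowRange = maybe id range
```

A bare `fmap` over the `Maybe` would produce a `Maybe` result, so some fallback such as `maybe` or `fromMaybe` is still needed to return the unsliced frame when no range is given.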

mchav · 2026-02-26T05:31:07Z

tests/Parquet.hs

+    TestCase
+        ( assertEqual
+            "rowRangeWithOpts"
+            (D.range (2, 5) allTypes)


This is circular: if the range function is broken or produces the wrong result, this test will still pass. It should just test against the expected dimensions.
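A non-circular check compares against literal expected values instead of calling the function under test on both sides. The sketch below uses a hypothetical in-memory `Frame` and a hypothetical `dimensions` helper returning (rows, columns); the real test would assert on the dimensions of the frame read from the Parquet fixture.

```haskell
-- Hypothetical stand-in for a DataFrame: named columns of Ints.
type Frame = [(String, [Int])]

-- (row count, column count); assumes all columns have equal length.
dimensions :: Frame -> (Int, Int)
dimensions [] = (0, 0)
dimensions cols@((_, values) : _) = (length values, length cols)

-- Half-open row slice applied to every column.
sliceRows :: (Int, Int) -> Frame -> Frame
sliceRows (start, end) = map (fmap (take (end - start) . drop start))

-- Ten rows, two columns.
sample :: Frame
sample = [("id", [0 .. 9]), ("score", [10, 20 .. 100])]

-- Expected shape is written out literally: 5 - 2 = 3 rows, 2 columns.
rowRangeHasExpectedShape :: Bool
rowRangeHasExpectedShape = dimensions (sliceRows (2, 5) sample) == (3, 2)

-- Optionally pin the actual rows too, which also checks the half-open bounds.
rowRangeHasExpectedRows :: Bool
rowRangeHasExpectedRows = lookup "id" (sliceRows (2, 5) sample) == Just [2, 3, 4]
```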

- apply parquet read options in order: predicate filtering, column projection, then row range
- auto-include predicate-referenced columns during decode when is set, then project back to requested columns
- restrict selected-column matching to leaf names only (drop full-path nested matching)
- remove and revert timestamp conversion to default behavior
- update row-range helper implementation style in
- revise parquet option tests: make row-range assertion non-circular and add predicate-focused cases
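The ordering in this commit message (predicate filtering, then column projection, then row range, with predicate columns auto-included during decode) can be sketched as a small pipeline. Everything here is a hypothetical stand-in: rows are association lists, and `Predicate`, `Opts`, `decodeSet`, and `applyOpts` are illustrative names, not the library's API.

```haskell
import qualified Data.Set as S

-- Hypothetical row representation: column name paired with an Int value.
type Row = [(String, Int)]

-- A predicate plus the set of columns it needs to read.
data Predicate = Predicate
    { predColumns :: S.Set String
    , predTest :: Row -> Bool
    }

data Opts = Opts
    { optColumns :: Maybe (S.Set String)
    , optPredicate :: Maybe Predicate
    , optRowRange :: Maybe (Int, Int)
    }

-- Keep only the named columns; Nothing keeps everything.
project :: Maybe (S.Set String) -> [Row] -> [Row]
project Nothing = id
project (Just keep) = map (filter ((`S.member` keep) . fst))

-- Columns to decode: the requested ones plus any the predicate references.
decodeSet :: Opts -> Maybe (S.Set String)
decodeSet opts = fmap (S.union extra) (optColumns opts)
  where
    extra = maybe S.empty predColumns (optPredicate opts)

-- Decode, filter by predicate, project back to the request, then slice rows.
applyOpts :: Opts -> [Row] -> [Row]
applyOpts opts =
    sliceRows . project (optColumns opts) . filterRows . project (decodeSet opts)
  where
    filterRows = maybe id (filter . predTest) (optPredicate opts)
    sliceRows = maybe id (\(s, e) -> take (e - s) . drop s) (optRowRange opts)
```

Because the row range is applied last, it counts rows that survived the predicate rather than rows in the file, and the final projection removes any column that was decoded only to evaluate the predicate.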

mchav · 2026-02-28T01:33:41Z

Thanks @Sao-Ali - this is great!

Sao-Ali added 2 commits February 25, 2026 15:47

chore(parquet): format reader and Parquet tests

d66b064

Apply formatter-driven layout updates in Parquet read-options code and related tests. No behavior change; this commit is formatting-only after lint/format checks.

mchav requested changes Feb 26, 2026

Sao-Ali requested a review from mchav February 26, 2026 21:58

mchav approved these changes Feb 28, 2026

mchav merged commit 7a67f04 into DataHaskell:main Feb 28, 2026
7 checks passed

mchav merged 3 commits into DataHaskell:main from Sao-Ali:feat/parquet-read-options-150

2 participants
