Skip to content

DataHaskell/dataframe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1,197 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

User guide | Discord

DataFrame

Tabular data analysis in Haskell. Read CSV, Parquet, and JSON files, transform columns with a typed expression DSL, and optionally lock down your entire schema at the type level for compile-time safety.

The library ships three API layers — all operating on the same underlying DataFrame type at runtime:

  • Untyped (import qualified DataFrame as D) — string-based column names, great for exploration and scripting.
  • Typed (import qualified DataFrame.Typed as T) — phantom-type schema tracking with compile-time column validation.
  • Monadic API — write your transformation as a self contained pipeline.

This README is a runnable scripths notebook. Every Haskell block runs top-to-bottom in one shared session against the datasets in ./data. Reproduce every output below with scripths docs/base_scripts/base_readme.md -o README.md from the repo root.

Why this library?

  • Concise, declarative, composable data pipelines using the |> pipe operator.
  • Choose your level of type safety: keep it lightweight for quick analysis, or lock it down for production pipelines.
  • High performance from Haskell's optimizing compiler and an efficient columnar memory model with bitmap-backed nullability.
  • Designed for interactivity: a custom REPL, IHaskell notebook support, terminal and web plotting, and helpful error messages.

Install

cabal update
cabal install dataframe

To use as a dependency in a project:

build-depends: base >= 4, dataframe

Works with GHC 9.4 through 9.12. A custom REPL with all imports pre-loaded is available after installing:

dataframe

Quick Start

Group sales by product and compute totals. The first block carries the scripths cabal directives and the imports shared by the rest of the document; you can also drop the same code into an Example.hs and run it with cabal run Example.hs after adding a #!/usr/bin/env cabal header.

-- cabal: build-depends: dataframe, text
-- cabal: default-extensions: OverloadedStrings, TypeApplications, TemplateHaskell, DataKinds, TypeFamilies, FlexibleInstances, FlexibleContexts, ScopedTypeVariables, DeriveGeneric, UndecidableInstances
import qualified DataFrame as D
import qualified DataFrame.Functions as F
import qualified DataFrame.Typed as DT
import DataFrame.Operators
import Data.Text (Text)
import Data.Int (Int64)

sales = D.fromNamedColumns
    [ ("product", D.fromList [1, 1, 2, 2, 3, 3 :: Int])
    , ("amount",  D.fromList [100, 120, 50, 20, 40, 30 :: Int])
    ]

-- Group by product and compute totals
sales
    |> D.groupBy ["product"]
    |> D.aggregate [ F.sum (F.col @Int "amount") `as` "total"
                   , F.count (F.col @Int "amount") `as` "orders"
                   ]
    |> D.toMarkdown'
product
Int
total
Int
orders
Int
1 220 2
3 70 2
2 70 2

Reading from files works the same way:

fileDf <- D.readCsv "./data/housing.csv"
fileDf <- D.readParquet "./data/mtcars.parquet"

-- Hugging Face datasets (needs network access):
-- fileDf <- D.readParquet "hf://datasets/scikit-learn/iris/default/train/0000.parquet"

D.dimensions fileDf

(32,12)

Interactive REPL

The dataframe REPL comes with all imports pre-loaded. Here's a typical exploration session (each block runs as a cell):

df <- D.readCsv "./data/housing.csv"
D.dimensions df

(20640,10)

D.describeColumns df |> D.toMarkdown'
Column Name
Text
# Non-null Values
Int
# Null Values
Int
Type
Text
total_bedrooms 20433 207 Maybe Double
ocean_proximity 20640 0 Text
median_house_value 20640 0 Double
median_income 20640 0 Double
households 20640 0 Double
population 20640 0 Double
total_rooms 20640 0 Double
housing_median_age 20640 0 Double
latitude 20640 0 Double
longitude 20640 0 Double

The :declareColumns macro ($(D.declareColumns df) outside the REPL) generates typed column references from a dataframe, so you can use column names directly in expressions instead of writing F.col @Double "median_income" every time:

$(D.declareColumns df)
df |> D.groupBy ["ocean_proximity"]
   |> D.aggregate [F.mean median_house_value `as` "avg_value"]
   |> D.toMarkdown'
ocean_proximity
Text
avg_value
Double
NEAR BAY 259212.31179039303
NEAR OCEAN 249433.97742663656
INLAND 124805.39200122119
<1H OCEAN 240084.28546409807
ISLAND 380440.0

Create new columns from existing ones:

df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 3 |> D.toMarkdown'
longitude
Double
latitude
Double
housing_median_age
Double
total_rooms
Double
total_bedrooms
Maybe Double
population
Double
households
Double
median_income
Double
median_house_value
Double
ocean_proximity
Text
rooms_per_household
Double
-122.23 37.88 41.0 880.0 Just 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 6.984126984126984
-122.22 37.86 21.0 7099.0 Just 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 6.238137082601054
-122.24 37.85 52.0 1467.0 Just 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 8.288135593220339

Type mismatches are caught as compile errors — adding a Double column to a Text column won't silently produce garbage:

dataframe> df |> D.derive "nonsense" (latitude + ocean_proximity)

<interactive>:14:47: error: [GHC-83865]
    • Couldn't match type 'Text' with 'Double'
        Expected: Expr Double
          Actual: Expr Text
    • In the second argument of '(+)', namely 'ocean_proximity'
      In the second argument of 'derive', namely
        '(latitude + ocean_proximity)'

Template Haskell

For scripts and projects, Template Haskell can generate column bindings at compile time.

Generate column references from a CSV

declareColumnsFromCsvFile (in DataFrame.TH, also re-exported from DataFrame) reads your CSV at compile time and generates typed Expr bindings for every column:

-- Reads housing.csv at compile time and generates:
--   latitude :: Expr Double
--   total_rooms :: Expr Double
--   ocean_proximity :: Expr Text
--   ... one binding per column
$(D.declareColumnsFromCsvFile "./data/housing.csv")

df <- D.readCsv "./data/housing.csv"
df |> D.derive "rooms_per_household" (total_rooms / households)
   |> D.filterWhere (median_income .>. 5)
   |> D.groupBy ["ocean_proximity"]
   |> D.aggregate [F.mean median_house_value `as` "avg_value"]
   |> D.toMarkdown'
ocean_proximity
Text
avg_value
Double
NEAR BAY 361441.9354304636
NEAR OCEAN 380041.63071895426
INLAND 234817.86695906433
<1H OCEAN 333411.75125531096

Compare this to the manual version which requires spelling out every column name and type:

-- Without TH — every column needs its name and type spelled out
df |> D.derive "rooms_per_household"
        (F.col @Double "total_rooms" / F.col @Double "households")
   |> D.filterWhere (F.col @Double "median_income" .>. F.lit 5)
   |> D.take 5
   |> D.toMarkdown'
longitude
Double
latitude
Double
housing_median_age
Double
total_rooms
Double
total_bedrooms
Maybe Double
population
Double
households
Double
median_income
Double
median_house_value
Double
ocean_proximity
Text
rooms_per_household
Double
-122.23 37.88 41.0 880.0 Just 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 6.984126984126984
-122.22 37.86 21.0 7099.0 Just 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 6.238137082601054
-122.24 37.85 52.0 1467.0 Just 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 8.288135593220339
-122.25 37.85 52.0 1274.0 Just 235.0 558.0 219.0 5.6431000000000004 341300.0 NEAR BAY 5.8173515981735155
-122.29 37.82 49.0 135.0 Just 29.0 86.0 23.0 6.1183 75000.0 NEAR BAY 5.869565217391305

Generate a schema type from a CSV

deriveSchemaFromCsvFile generates a type synonym for use with the typed API — instead of manually writing out every column name and type:

-- Generates:
-- type HousingSchema = '[ DT.Column "longitude" Double
--                       , DT.Column "latitude" Double
--                       , DT.Column "total_rooms" Double
--                       , ...
--                       ]
$(DT.deriveSchemaFromCsvFile "HousingSchema" "./data/housing.csv")

Generate a schema (and a row bridge) from a record ADT

When the canonical row shape lives in your code as a Haskell record, deriveSchemaFromType produces both the typed schema and a HasSchema instance that converts between [Order] and a DataFrame (or TypedDataFrame OrderSchema) at runtime:

data Order = Order
    { orderId :: Int64
    , region  :: Text
    , amount  :: Double
    } deriving (Show, Eq)

$(DT.deriveSchemaFromType ''Order)
-- expands to:
--   type OrderSchema =
--     '[DT.Column "order_id" Int64, DT.Column "region" Text, DT.Column "amount" Double]
--   instance DT.HasSchema Order where
--     type Schema Order = OrderSchema
--     toColumns   = ...
--     fromColumns = ...

xs :: [Order]
xs = [Order 1 "us" 10.0, Order 2 "eu" 20.5]

-- Untyped: [Order] -> DataFrame
ordersDf :: D.DataFrame
ordersDf = D.fromRecords xs

ordersDf |> D.toMarkdown'
order_id
Int64
region
Text
amount
Double
1 us 10.0
2 eu 20.5

The runtime-checked round-trip back to records:

D.toRecords ordersDf :: Either Text [Order]

Right [Order {orderId = 1, region = "us", amount = 10.0},Order {orderId = 2, region = "eu", amount = 20.5}]

And the typed bridge — [Order] to TypedDataFrame OrderSchema and back:

DT.thaw (DT.fromRecordsTyped xs :: DT.TypedDataFrame OrderSchema) |> D.toMarkdown'
order_id
Int64
region
Text
amount
Double
1 us 10.0
2 eu 20.5

Field names are translated camelCase → snake_case by default; override the translation with deriveSchemaFromTypeWith defaultSchemaOptions{nameTransform = id} (or any String -> String).

If all you need is a runtime Schema to drive readCsvWithSchema (no typed-dataframe machinery), there's a companion splice in DataFrame.Internal.Schema (re-exported from DataFrame):

$(D.deriveSchema ''Order)
-- emits:
--   orderSchema     :: Schema
--   orderSchema     = makeSchema [("order_id", schemaType @Int64), ...]
--   orderOrderId    :: Expr Int64
--   orderOrderId    = col "order_id"
--   orderRegion     :: Expr Text
--   orderRegion     = col "region"
--   orderAmount     :: Expr Double
--   orderAmount     = col "amount"

orders :: IO D.DataFrame
orders = do
    raw <- D.readCsvWithSchema orderSchema "./data/orders.csv"
    pure (D.filter orderAmount (> 100) raw)

Each record field gets a typed accessor named <lower-first TyConName><UpperFirst FieldName>, so data Order { customerId :: Int } yields orderCustomerId :: Expr Int = col "customer_id". That's the same shape as $(D.declareColumns df) produces from a runtime DataFrame, but driven off the ADT instead of an existing frame.

If you'd rather not depend on Template Haskell, the same schema is available via GHC.Generics (shown here on an equivalent record):

import GHC.Generics (Generic)
import DataFrame.Typed (Schema)

data OrderG = OrderG
    { orderGId :: Int64
    , regionG  :: Text
    , amountG  :: Double
    } deriving (Generic)

type OrderGSchema = DT.SchemaOf OrderG

instance DT.HasSchema OrderG where
    type Schema OrderG = OrderGSchema
    toColumns   = DT.genericToColumns
    fromColumns = DT.genericFromColumns

Typed API

When you want compile-time guarantees that column names exist and types match, wrap your DataFrame in a TypedDataFrame:

type EmployeeSchema =
    '[ DT.Column "name"       Text
     , DT.Column "department" Text
     , DT.Column "salary"     Double
     ]

employees <- D.readCsv "./data/employees.csv"
case DT.freeze @EmployeeSchema employees of
    Nothing  -> "Schema mismatch!"
    Just tdf -> tdf
        |> DT.derive @"bonus" (DT.col @"salary" * DT.lit 0.1)
        |> DT.filterWhere (DT.col @"salary" DT..>. DT.lit 50000)
        |> DT.select @'["name", "bonus"]
        |> DT.thaw
        |> D.toMarkdown'
name
Text
bonus
Double
Alice 8500.0
Carol 12000.0
Dave 5200.0
Frank 6700.0

DT.freeze validates the runtime DataFrame against your schema once at the boundary. After that, every column access is checked at compile time:

-- Typo in column name -> compile error
tdf |> DT.filterWhere (DT.col @"slary" DT..>. DT.lit 50000)
-- error: Column "slary" not found in schema

-- Wrong type -> compile error
tdf |> DT.filterWhere (DT.col @"name" DT..>. DT.lit 50000)
-- error: Couldn't match type 'Text' with 'Double'

filterAllJust goes further — it strips Maybe from every column in the schema type, so downstream code can't accidentally treat cleaned columns as nullable:

type ScoreSchema = '[ DT.Column "name" Text, DT.Column "score" (Maybe Double) ]

scoresDf = D.fromNamedColumns
    [ ("name",  D.fromList ["a", "b", "c" :: Text])
    , ("score", D.fromList [Just 1.0, Nothing, Just 3.0 :: Maybe Double])
    ]

Just stdf = DT.freeze @ScoreSchema scoresDf

-- filterAllJust drops the null row and changes the column type from
-- (Maybe Double) to Double, so `scaled` can multiply it directly.
DT.thaw (DT.filterAllJust stdf |> DT.derive @"scaled" (DT.col @"score" * DT.lit 100)) |> D.toMarkdown'
name
Text
score
Double
scaled
Double
a 1.0 100.0
c 3.0 300.0

Features

I/O: CSV, TSV, Parquet (Snappy, ZSTD, Gzip), JSON. Read Parquet from HTTP URLs and Hugging Face datasets (hf:// URIs). Column projection and predicate pushdown for Parquet reads.

Operations: filter, select, derive, groupBy, aggregate, joins (inner, left, right, full outer), sort, sample, stratified sample, distinct, k-fold splits.

Expressions: typed column references (F.col @Double "x"), arithmetic, comparisons, logical operators, nullable-aware three-valued logic (.==, .&&), string matching (like, regex), casting, and user-defined functions via lift/lift2.

Statistics: mean, median, mode, variance, standard deviation, percentiles, inter-quartile range, correlation, skewness, frequency tables, imputation.

Plotting: terminal plots (histogram, scatter, line, bar, box, pie, heatmap, stacked bar, correlation matrix) and interactive HTML plots.

Lazy engine: streaming query execution for files that don't fit in memory. Rule-based optimizer with filter fusion, predicate pushdown, and dead column elimination. Pull-based executor with configurable batch sizes.

Interop: Arrow C Data Interface for zero-copy round-trips with Python and Polars.

ML: decision trees (TAO algorithm), feature synthesis, k-fold cross-validation, stratified sampling.

Notebooks: IHaskell integration with pre-built Binder examples.

Lazy Queries

For files too large to fit in memory, DataFrame.Lazy provides a streaming query engine. Declare a schema, build a query plan with the same familiar operations, and runDataFrame runs it through an optimizer before streaming results batch-by-batch:

import qualified DataFrame.Lazy as L
import DataFrame.Internal.Schema (schemaType, makeSchema)

housingSchema = makeSchema
    [ ("longitude",          schemaType @Double)
    , ("latitude",           schemaType @Double)
    , ("housing_median_age", schemaType @Double)
    , ("total_rooms",        schemaType @Double)
    , ("total_bedrooms",     schemaType @(Maybe Double))
    , ("population",         schemaType @Double)
    , ("households",         schemaType @Double)
    , ("median_income",      schemaType @Double)
    , ("median_house_value", schemaType @Double)
    , ("ocean_proximity",    schemaType @Text)
    ]

lazyResult <- L.runDataFrame $
    L.scanCsv housingSchema "./data/housing.csv"
    |> L.filter  (F.col @Double "median_income" .>. F.lit 5)
    |> L.derive  "value_per_income"
                 (F.col @Double "median_house_value" / F.col @Double "median_income")
    |> L.select  ["ocean_proximity", "median_house_value", "value_per_income"]
    |> L.take 1000

D.take 10 lazyResult |> D.toMarkdown'
ocean_proximity
Text
median_house_value
Double
value_per_income
Double
NEAR BAY 452600.0 54365.06029885168
NEAR BAY 358500.0 43185.48678536151
NEAR BAY 352100.0 48515.997464656764
NEAR BAY 341300.0 60480.94132657581
NEAR BAY 75000.0 12258.307046074891
NEAR BAY 262500.0 51554.49064163246
NEAR BAY 327600.0 55908.25312308007
NEAR BAY 347600.0 65748.65703260951
NEAR BAY 366100.0 61467.42780389524
NEAR BAY 373600.0 58895.860264211624

The optimizer pushes the filter into the scan, drops unreferenced columns before reading, and stops pulling batches once 1000 rows have been collected.

Documentation

About

A fast, safe, and intuitive DataFrame library.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors