Unified market data acquisition and storage for quantitative research workflows.
This library is one of six interconnected libraries supporting the machine learning for trading workflow described in Machine Learning for Trading.
Together they cover data infrastructure, feature engineering, modeling, signal evaluation, strategy backtesting, and live deployment.
Quantitative research requires consistent, reproducible access to market data from multiple sources. ml4t-data provides:
- `DataManager` as the unified interface: fetch, store, update, and query across all providers
- 20+ provider adapters covering equities, crypto, futures, forex, macro, prediction markets, and factors
- Automated storage in Hive-partitioned Parquet format with metadata tracking
- Incremental updates, gap detection, and backfill via CLI
- Built-in data validation (OHLC invariants, deduplication, anomaly detection)
- Futures module for CME/ICE bulk downloads with continuous contract construction
- COT module for CFTC Commitment of Traders weekly reports
- Resilience: rate limiting, retry with exponential backoff, gap detection
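
The OHLC invariants named in the list above can be illustrated with a few lines of plain Python. This is a minimal sketch, not the library's validator: the `Bar` dataclass and the `ohlc_violations` helper are hypothetical names introduced only for this example, and the prices are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    """One OHLC bar; a hypothetical container used only in this sketch."""
    open: float
    high: float
    low: float
    close: float


def ohlc_violations(bar: Bar) -> list[str]:
    """Return the names of the OHLC invariants that the bar breaks."""
    problems = []
    if bar.high < bar.low:
        problems.append("high < low")
    if bar.high < max(bar.open, bar.close):
        problems.append("high < open/close")
    if bar.low > min(bar.open, bar.close):
        problems.append("low > open/close")
    return problems


# Made-up bars: the second has a high below both its low and its close.
good = Bar(open=100.0, high=105.0, low=99.0, close=104.0)
bad = Bar(open=100.0, high=98.0, low=99.0, close=104.0)

print(ohlc_violations(good))  # []
print(ohlc_violations(bad))   # ['high < low', 'high < open/close']
```

Returning the list of broken invariants, rather than a single boolean, makes it easier to report which check failed for which row.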
The goal is to support an ongoing research workflow rather than one-off downloads. Data is stored locally, tracked for freshness, and queryable with tools like DuckDB or Polars.
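
To make the idea of gap detection concrete, the sketch below scans a set of stored daily dates and reports the weekdays that are missing. It is an illustration under simplifying assumptions, not the library's implementation: `missing_weekdays` is a hypothetical helper, and it treats every Monday to Friday as a trading day, so exchange holidays would show up as false gaps unless a real trading calendar is used.

```python
from datetime import date, timedelta


def missing_weekdays(stored: list[date]) -> list[date]:
    """Return the weekdays absent between the first and last stored date."""
    if not stored:
        return []
    have = set(stored)
    day, last = min(stored), max(stored)
    missing = []
    while day <= last:
        # weekday() is 0 for Monday through 4 for Friday
        if day.weekday() < 5 and day not in have:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Stored bars for Tue 2 Jan, Wed 3 Jan, Fri 5 Jan, and Mon 8 Jan 2024
stored = [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 5), date(2024, 1, 8)]

print(missing_weekdays(stored))  # [datetime.date(2024, 1, 4)]

# An incremental update would start fetching the day after the last stored bar
next_start = max(stored) + timedelta(days=1)
print(next_start)  # 2024-01-09
```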
```bash
pip install ml4t-data
```

```python
from ml4t.data import DataManager

dm = DataManager()

# Fetch and store
dm.fetch("AAPL", "2020-01-01", "2024-12-31", provider="yahoo")

# Load from local storage
data = dm.load("AAPL", "2020-01-01", "2024-12-31")

# Batch load multiple symbols
prices = dm.batch_load(["AAPL", "MSFT", "GOOGL"], "2020-01-01", "2024-12-31")

# Incremental update
dm.update("AAPL")

# List what's stored
symbols = dm.list_symbols()
metadata = dm.get_metadata("AAPL")
```

All providers implement the same interface:

```python
from ml4t.data.providers import YahooFinanceProvider, CoinGeckoProvider, FREDProvider

# Equities
provider = YahooFinanceProvider()
data = provider.fetch_ohlcv("AAPL", "2020-01-01", "2024-12-31")

# Crypto
crypto = CoinGeckoProvider().fetch_ohlcv("bitcoin", "2024-01-01", "2024-12-31")

# Economic data
fred = FREDProvider().fetch_series("GDP", "2020-01-01", "2024-12-31")
```

| Provider | Coverage |
|---|---|
| Yahoo Finance | US/global equities, ETFs, crypto, forex |
| CoinGecko | 10,000+ cryptocurrencies |
| FRED | 850,000 economic series |
| Fama-French | Academic factor data |
| AQR | Research factors (QMJ, BAB, HML Devil, VME, more) |
| Wiki Prices | Frozen US equities history (1962-2018) |
| Kalshi | Prediction market contracts |
| Polymarket | Prediction market history/order book snapshots |
| Binance Public | Bulk crypto data downloads |
| NASDAQ ITCH Sample | Tick-level sample data |

| Provider | Coverage |
|---|---|
| EODHD | 60+ global exchanges |
| Tiingo | US equities with quality focus |
| Twelve Data | Multi-asset coverage |
| Databento | CME, CBOE, ICE futures/options |
| Massive | US equities, options, futures, forex, crypto |
| Finnhub | 70+ global exchanges |
| Binance | Crypto exchange data |
| OKX | Crypto perpetuals and funding rates |
| CryptoCompare | Crypto market data |
| OANDA | Forex broker data |
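
The statement that all providers share one interface can be pictured as a structural type. The sketch below is an assumption-laden illustration, not the library's base class: `OHLCVProvider`, `InMemoryProvider`, and `total_volume` are hypothetical names, only the `fetch_ohlcv(symbol, start, end)` call shape is taken from the examples above, the rows are made-up toy data, and plain dictionaries stand in for the Polars DataFrames the library returns.

```python
from typing import Protocol


class OHLCVProvider(Protocol):
    """Hypothetical structural type; not the library's actual base class."""

    def fetch_ohlcv(self, symbol: str, start: str, end: str) -> list[dict]:
        ...


class InMemoryProvider:
    """Toy provider that serves canned rows instead of calling an API."""

    def __init__(self, rows: dict[str, list[dict]]) -> None:
        self._rows = rows

    def fetch_ohlcv(self, symbol: str, start: str, end: str) -> list[dict]:
        # ISO-8601 date strings compare correctly as plain strings
        return [r for r in self._rows.get(symbol, []) if start <= r["date"] <= end]


def total_volume(provider: OHLCVProvider, symbol: str, start: str, end: str) -> int:
    """Works with any object exposing fetch_ohlcv, whatever the data source."""
    return sum(row["volume"] for row in provider.fetch_ohlcv(symbol, start, end))


# Made-up rows for a fictional symbol
provider = InMemoryProvider({
    "TEST": [
        {"date": "2024-01-02", "close": 10.0, "volume": 100},
        {"date": "2024-01-03", "close": 11.0, "volume": 150},
        {"date": "2024-02-01", "close": 12.0, "volume": 200},
    ]
})

print(total_volume(provider, "TEST", "2024-01-01", "2024-01-31"))  # 250
```

Because downstream code depends only on the method signature, swapping one data source for another should not require changes outside the provider itself.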
Bulk download and continuous contract construction for CME/ICE products:

```python
from ml4t.data.futures import FuturesDownloader, ContinuousContractBuilder

# Bulk download via Databento (parent symbology)
# `config` is a download configuration prepared beforehand (not shown)
downloader = FuturesDownloader(config)
downloader.download()  # Downloads ES, NQ, CL, GC, etc.

# Build continuous contracts with configurable roll logic
# `contracts_df` holds the individual contract data (not shown)
builder = ContinuousContractBuilder()
continuous = builder.build(contracts_df, roll_method="volume")
```

Book-focused interface with profiling:

```python
from ml4t.data.futures import FuturesDataManager

fm = FuturesDataManager.from_config("config.yaml")
fm.download_all()
data = fm.load_ohlcv("ES")
profile = fm.generate_profile("ES")
```

CFTC weekly positioning data for futures markets:

```python
from ml4t.data.cot import COTFetcher, create_cot_features, combine_cot_ohlcv_pit

# `config` is a COT configuration prepared beforehand (not shown)
fetcher = COTFetcher(config)
cot_data = fetcher.fetch_product("ES", start_year=2015, end_year=2024)

# Point-in-time combination with OHLCV (no look-ahead)
# `ohlcv_data` is price data loaded earlier (not shown)
combined = combine_cot_ohlcv_pit(cot_data, ohlcv_data)

# Generate features from COT data
features = create_cot_features(cot_data)
```

Simplified interfaces for the ML4T book workflow:

```python
from ml4t.data.etfs import ETFDataManager
from ml4t.data.crypto import CryptoDataManager

# 50 diversified ETFs via Yahoo Finance
etf_dm = ETFDataManager.from_config("config.yaml")
etf_dm.download_all()
aapl = etf_dm.load_ohlcv("AAPL")

# Crypto premium index via Binance Public
crypto_dm = CryptoDataManager.from_config("config.yaml")
crypto_dm.download_premium_index()
```

Command-line interface:

```bash
# Fetch specific symbols
ml4t-data fetch -s AAPL -s MSFT -s GOOGL --provider yahoo --start 2020-01-01

# Incremental update
ml4t-data update --symbol AAPL

# Validate data quality
ml4t-data validate --symbol AAPL --anomalies

# Check storage status
ml4t-data status --detailed

# List available data
ml4t-data list-data

# Export to CSV/JSON/Excel
ml4t-data export --symbol AAPL --format-type csv --output aapl.csv

# Get symbol info
ml4t-data info --symbol AAPL
```

Configuration-driven batch updates:

```yaml
storage:
  path: ~/data/market

datasets:
  sp500_daily:
    provider: yahoo
    symbols_file: symbols/sp500.txt
    frequency: daily
    start_date: 2015-01-01
  crypto:
    provider: coingecko
    symbols: [bitcoin, ethereum, solana]
    frequency: daily
    start_date: 2020-01-01
```

Data is stored in Hive-partitioned Parquet:

```text
~/data/market/
├── yahoo/daily/symbol=AAPL/data.parquet
├── yahoo/daily/symbol=MSFT/data.parquet
└── coingecko/daily/symbol=bitcoin/data.parquet
```
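
The layout above follows the Hive convention of encoding a column value in a `key=value` directory name, which is how query engines can expose `symbol` as a column even though it lives in the path. The sketch below shows the mapping in both directions. `partition_path` and `partition_keys` are hypothetical helpers written for this example and are not part of the library.

```python
from pathlib import PurePosixPath


def partition_path(root: str, provider: str, frequency: str, symbol: str) -> PurePosixPath:
    """Build the file path for one symbol, mirroring the layout shown above."""
    return PurePosixPath(root) / provider / frequency / f"symbol={symbol}" / "data.parquet"


def partition_keys(path: PurePosixPath) -> dict[str, str]:
    """Extract key=value pairs from Hive-style directory names in a path."""
    return dict(part.split("=", 1) for part in path.parts if "=" in part)


p = partition_path("~/data/market", "yahoo", "daily", "AAPL")

print(p)                  # ~/data/market/yahoo/daily/symbol=AAPL/data.parquet
print(partition_keys(p))  # {'symbol': 'AAPL'}
```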
Query with DuckDB or Polars:

```python
import duckdb

result = duckdb.execute("""
    SELECT * FROM read_parquet('~/data/market/yahoo/daily/**/*.parquet')
    WHERE symbol IN ('AAPL', 'MSFT')
      AND date >= '2024-01-01'
""").pl()
```

Data validation:

```python
from ml4t.data.validation import OHLCVValidator, ValidationReport

validator = OHLCVValidator()
report = validator.validate(data)
# Checks: high >= low, high >= open/close, low <= open/close
# Detects: duplicates, gaps, anomalies
```

Anomaly detection:

```python
from ml4t.data.anomaly import AnomalyManager, ReturnOutlierDetector, VolumeSpikeDetector

manager = AnomalyManager([
    ReturnOutlierDetector(),
    VolumeSpikeDetector(),
])
report = manager.detect(data)
```

Documentation:

- Getting Started: quick start guide
- Configuration: YAML config reference
- Storage: Hive partitioning and backends
- Incremental Updates: update strategies and gap detection
- Data Quality: validation and anomaly detection
- CLI Reference: command-line interface
- Provider Selection Guide: choosing providers
- Creating a Provider: extending with new sources

Key features:

- Polars-based: Native Polars DataFrames throughout
- Consistent schema: All providers return the same column structure
- Async support: Async providers and batch operations for parallel downloads
- Metadata tracking: Last update timestamps, row counts, date ranges
- Resilience: Rate limiting, retry with exponential backoff, gap detection
- Multiple backends: File system, S3, and in-memory storage
- Type-safe: Full type annotations throughout
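
The retry behavior listed under resilience can be sketched in a few lines. This is a generic illustration of exponential backoff, not the library's code: `retry_with_backoff` and its parameters are hypothetical, it retries only on `ConnectionError`, and the `sleep` argument is injectable so the example runs without actually waiting. Production implementations often add random jitter to each delay, which is omitted here to keep the output deterministic.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, doubling the wait after each ConnectionError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base, 2*base, 4*base, ... capped at max_delay
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky() -> str:
    """Fail twice, then succeed, to simulate a temporary outage."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays: list[float] = []
print(retry_with_backoff(flaky, sleep=delays.append))  # ok
print(delays)  # [0.5, 1.0]
```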

Related libraries:

- ml4t-engineer: Feature engineering and technical indicators
- ml4t-diagnostic: Signal evaluation and statistical validation
- ml4t-backtest: Event-driven backtesting
- ml4t-live: Live trading with broker integration

Development setup:

```bash
git clone https://github.com/ml4t/data.git
cd data
uv sync
uv run pytest tests/ -q
uv run ty check
```

MIT License - see LICENSE for details.

