Skip to content

ml4t/engineer

Repository files navigation

ml4t-engineer

Python 3.12+ PyPI License: MIT

Feature engineering for financial machine learning: validated features, labeling methods, alternative bars, and leakage-safe dataset preparation.

Part of the ML4T Library Ecosystem

This library is one of six interconnected libraries supporting the machine learning for trading workflow described in Machine Learning for Trading:

ML4T Library Ecosystem

Together they cover data infrastructure, feature engineering, modeling, signal evaluation, strategy backtesting, and live deployment.

What This Library Does

Transforming raw price data into predictive features is a core task in quantitative research. ml4t-engineer provides:

  • 120 registry features across 11 categories (momentum, volatility, trend, microstructure, and more)
  • Triple-barrier, ATR-based, percentile, trend-scanning, and meta-labeling methods from Advances in Financial Machine Learning
  • Alternative bar sampling (volume bars, dollar bars, tick imbalance bars)
  • Dataset building, preprocessing, and feature discovery for leakage-safe ML workflows

The library is built on Polars with Numba JIT compilation for numerical operations. 60 features are validated against TA-Lib at 1e-6 tolerance.

ml4t-engineer Architecture

Installation

pip install ml4t-engineer

Optional dependencies:

pip install ml4t-engineer[ta]        # TA-Lib backend
pip install ml4t-engineer[viz]       # Visualization
pip install ml4t-engineer[calendars] # Trading calendars

Quick Start

import polars as pl
from ml4t.engineer import compute_features

df = pl.read_parquet("ohlcv.parquet")

# Compute features with default parameters
result = compute_features(df, ["rsi", "macd", "atr", "obv"])

# Or with custom parameters
result = compute_features(df, [
    {"name": "rsi", "params": {"period": 20}},
    {"name": "bollinger_bands", "params": {"period": 20, "std_dev": 2.0}},
])

Feature Registry

from ml4t.engineer.core.registry import get_registry

registry = get_registry()
print(registry.list_all())                    # All 120 features
print(registry.list_by_category("momentum"))  # 31 momentum indicators
print(registry.list_ta_lib_compatible())      # 60 TA-Lib validated features
print(registry.list_normalized())             # 37 bounded (0-100, -1 to 1)

Feature Categories

Category Count Examples
Momentum 31 RSI, MACD, Stochastic, CCI, ADX, MFI
Microstructure 15 Kyle Lambda, VPIN, Amihud, Roll spread
Volatility 15 ATR, Bollinger, Yang-Zhang, Parkinson
Statistics 14 Variance, Linear Regression, Correlation
ML 14 Fractional Diff, Entropy, Lag features
Trend 10 SMA, EMA, WMA, DEMA, TEMA, KAMA
Risk 6 Max Drawdown, Sortino, CVaR
Price Transform 5 Typical Price, Weighted Close
Regime 4 Hurst Exponent, Choppiness Index
Volume 3 OBV, AD, ADOSC
Math 3 MAX, MIN, SUM

Triple-Barrier Labeling

from ml4t.engineer.config import LabelingConfig
from ml4t.engineer.labeling import triple_barrier_labels, atr_triple_barrier_labels

# Fixed barriers
tb_config = LabelingConfig.triple_barrier(
    upper_barrier=0.02,    # 2% profit target
    lower_barrier=0.01,    # 1% stop loss
    max_holding_period=20, # 20 bars
)
labels = triple_barrier_labels(
    df,
    config=tb_config,
)

# ATR-based dynamic barriers
atr_config = LabelingConfig.atr_barrier(
    atr_tp_multiple=2.0,
    atr_sl_multiple=1.0,
    atr_period=14,
    max_holding_period=20,
)
labels = atr_triple_barrier_labels(
    df,
    config=atr_config,
)

# Time-based horizons
tb_time_config = LabelingConfig.triple_barrier(
    upper_barrier=0.02,
    lower_barrier=0.01,
    max_holding_period="4h",  # 4 hours
)
labels = triple_barrier_labels(
    df,
    config=tb_time_config,
)

Alternative Bars

from ml4t.engineer.bars import VolumeBarSampler, DollarBarSampler, TickImbalanceBarSampler

# Volume bars (equal volume per bar)
vbars = VolumeBarSampler(volume_per_bar=1000).sample(tick_data)

# Dollar bars (equal dollar volume per bar)
dbars = DollarBarSampler(dollars_per_bar=1_000_000).sample(tick_data)

# Tick imbalance bars (information-driven)
ibars = TickImbalanceBarSampler(expected_ticks_per_bar=100).sample(tick_data)

Documentation

  • Docs Home - library overview and workflow map
  • Quickstart - first working feature and labeling workflow
  • Features - 120 features across 11 registry categories
  • Labeling - 7 labeling methods for supervised learning
  • Dataset Builder - leakage-safe train/test preparation
  • Examples - runnable scripts for complete workflows and focused features

Technical Characteristics

  • Polars-native: All computations use Polars expressions
  • Numba-accelerated: JIT compilation for numerical kernels
  • TA-Lib validated: 60 features validated at 1e-6 tolerance
  • AFML-compliant: Labeling methods verified against Advances in Financial Machine Learning
  • ML-ready outputs: 37 features produce bounded outputs (0-100, -1 to 1) for direct model input; remaining features work with standard preprocessing (returns, z-scores, robust scaling)

Related Libraries

  • ml4t-specs: Shared feed and artifact schema definitions across the ML4T stack
  • ml4t-data: Market data acquisition and storage
  • ml4t-diagnostic: Signal evaluation and statistical validation
  • ml4t-backtest: Event-driven backtesting
  • ml4t-live: Live trading with broker integration

Development

git clone https://github.com/ml4t/engineer.git
cd ml4t-engineer
uv sync
uv run pytest tests/ -q
uv run ty check

References

  • Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
  • Lopez de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge.

License

MIT License - see LICENSE for details.