Neerajdec2005/Unsupervised-Feature-Extraction-for-Systematic-Trading-in-S-P-500

S&P 500 Unsupervised Machine Learning — Stock Clustering & Systematic Trading

Overview

This project implements a systematic trading strategy using unsupervised machine learning on the S&P 500 universe. By applying KMeans clustering to technical indicators and Fama-French factor exposures, we identify stocks with similar risk/return characteristics and construct optimized monthly rebalanced portfolios.

Key Results: The ML-driven strategy is backtested from 2018-2025 and benchmarked against VOO (Vanguard S&P 500 ETF) buy-and-hold, demonstrating the effectiveness of factor-based clustering combined with mean-variance optimization.


Strategy Performance

The ML-driven strategy vs the VOO buy-and-hold benchmark from 2018-2025. The systematic approach shows distinct performance patterns during market regime changes, with notable divergence during the 2020 COVID crash recovery and subsequent market cycles.


Project Pipeline

S&P 500 Tickers (Wikipedia)
      ↓
Download 10-Year OHLCV Data (yfinance)
      ↓
Compute Technical Features (GKV, RSI, BBands, ATR, MACD, Dollar Volume)
      ↓
Resample to Monthly & Filter Top 150 Liquid Stocks
      ↓
Compute Multi-Horizon Returns (1m, 2m, 3m, 6m, 9m, 12m)
      ↓
Download Fama-French 5-Factor Data
      ↓
Rolling OLS Regression (5-year window) → Factor Betas
      ↓
KMeans Clustering (4 clusters, RSI-seeded centroids)
      ↓
Select Cluster 3 Stocks (Overbought/High RSI)
      ↓
Efficient Frontier Optimization (Max Sharpe, 12-month window)
      ↓
Monthly Rebalancing with Optimized Weights
      ↓
Performance Evaluation vs VOO Benchmark
      ↓
Cumulative Returns Visualization

Data Sources

| Source | Data |
|---|---|
| Wikipedia | S&P 500 constituent ticker list |
| Yahoo Finance (yfinance) | 10 years of daily OHLCV price data |
| Kenneth French's Data Library | Fama-French 5-Factor monthly returns |

Requirements

Install all dependencies via:

pip install -r "SP500 Unsupervised ML/requirements.txt"

Key libraries:

  • yfinance — market data retrieval
  • pandas, numpy — data manipulation and numerical operations
  • statsmodels — rolling OLS regressions for factor exposure estimation
  • pandas_ta — technical indicators (RSI, BBands, ATR, MACD)
  • scikit-learn — unsupervised ML (KMeans clustering)
  • pypfopt — portfolio optimization (Efficient Frontier, Max Sharpe)
  • cvxpy, clarabel, scs — convex optimization solvers
  • matplotlib — visualization
  • requests, beautifulsoup4 — web scraping for S&P 500 tickers

Features Computed

1. Garman-Klass Volatility (GKV)

A more efficient estimator of daily volatility than close-to-close volatility, using all four price points (Open, High, Low, Close):

$$ \sigma^2_{GK} = \frac{1}{2}\left(\ln\frac{H}{L}\right)^2 - (2\ln 2 - 1)\left(\ln\frac{C}{O}\right)^2 $$

Where:

  • $H$ = High price
  • $L$ = Low price
  • $C$ = Close price
  • $O$ = Open price

It captures the intraday price range and the open-to-close move, which makes it more informative than simple close-to-close volatility. Overnight gaps (previous close to open) are not part of this formula.
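
As a minimal sketch of the formula above, the function below evaluates the Garman-Klass variance for a single OHLC bar. The function name and the sample bar are hypothetical and only illustrate the arithmetic; they are not taken from the notebook.

```python
import math

def garman_klass_variance(open_, high, low, close):
    """Single-bar Garman-Klass variance estimate from O, H, L, C."""
    range_term = 0.5 * math.log(high / low) ** 2
    open_close_term = (2 * math.log(2) - 1) * math.log(close / open_) ** 2
    return range_term - open_close_term

# Hypothetical bar: open 100, high 104, low 99, close 102.
gk_var = garman_klass_variance(100.0, 104.0, 99.0, 102.0)
print(f"Garman-Klass variance: {gk_var:.6f}")
```

A bar with no range and no open-to-close move gives a variance of exactly zero, which is a quick sanity check for any implementation.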


2. Relative Strength Index (RSI)

RSI measures the speed and magnitude of recent price changes to evaluate overbought or oversold conditions. It ranges from 0 to 100.

$$ RSI = 100 - \frac{100}{1 + RS} $$

Where:

$$ RS = \frac{\text{Average Gain over } n \text{ periods}}{\text{Average Loss over } n \text{ periods}} $$

  • Period used: 20 days
  • RSI > 70 → typically overbought
  • RSI < 30 → typically oversold
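
The RSI definition above can be sketched in plain Python as follows. This version uses unsmoothed averages exactly as written in the formula; library implementations such as pandas_ta smooth the averages (Wilder's method), so their values will generally differ. The function name and the sample series are assumptions for illustration.

```python
def simple_rsi(closes, n=20):
    """RSI over the last n price changes, using plain averages."""
    if len(closes) < n + 1:
        raise ValueError("need at least n + 1 closing prices")
    # Pair each close with the next one to get the last n changes.
    changes = [b - a for a, b in zip(closes[-n - 1:-1], closes[-n:])]
    avg_gain = sum(c for c in changes if c > 0) / n
    avg_loss = sum(-c for c in changes if c < 0) / n
    if avg_loss == 0:
        return 100.0  # no losses in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)

# Hypothetical series that alternates +1 / -1, so gains equal losses.
alternating = [100.0 + (i % 2) for i in range(21)]
rsi_value = simple_rsi(alternating)
```

When average gains equal average losses, RS is 1 and the RSI sits at the midpoint of its range.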

3. Bollinger Bands (BBands)

Bollinger Bands define a volatility envelope around a simple moving average (SMA):

$$ \text{Upper Band} = SMA_n + k \cdot \sigma_n $$

$$ \text{Middle Band} = SMA_n $$

$$ \text{Lower Band} = SMA_n - k \cdot \sigma_n $$

Where:

  • $n$ = 20-period window
  • $k$ = 2 (standard deviations)
  • $\sigma_n$ = rolling standard deviation of log(1 + Close)

Applied on log-transformed prices (log1p) to normalize scale across stocks.
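
A minimal sketch of the band calculation on log1p-transformed closes is shown below. It computes the bands for the most recent window only and uses the population standard deviation; that choice, along with the function name and sample prices, is an assumption rather than something the notebook confirms.

```python
import math
import statistics

def bollinger_bands(closes, n=20, k=2.0):
    """Return (lower, middle, upper) bands for the last n log1p(close) values."""
    if len(closes) < n:
        raise ValueError("need at least n closing prices")
    window = [math.log1p(c) for c in closes[-n:]]
    middle = statistics.fmean(window)
    sigma = statistics.pstdev(window)  # population std (assumption)
    return middle - k * sigma, middle, middle + k * sigma

# Hypothetical closing prices drifting upward.
prices = [100.0 + 0.5 * i for i in range(20)]
lower, middle, upper = bollinger_bands(prices)
```

Because the bands are computed on log1p(Close), their values are on a log scale and are comparable across stocks with very different price levels.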


4. Average True Range (ATR)

ATR measures market volatility by averaging the true range, which extends the daily high-low range to account for gaps from the previous close:

$$ TR = \max\left(H - L,\ |H - C_{prev}|,\ |L - C_{prev}|\right) $$

$$ ATR_n = \frac{1}{n} \sum_{i=1}^{n} TR_i $$

  • Period used: 14 days
  • Normalized per stock: $(ATR - \mu_{ATR}) / \sigma_{ATR}$
  • Higher ATR → higher volatility / wider price swings
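
The true range and its simple average, as written in the two formulas above, can be sketched like this. The helper names and sample bars are hypothetical, and the plain mean follows the formula shown here rather than any smoothed variant a library might apply.

```python
def true_range(high, low, prev_close):
    """Largest of the three candidate ranges for one bar."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))

def average_true_range(highs, lows, closes, n=14):
    """Simple mean of the last n true ranges."""
    # The first bar has no previous close, so true ranges start at bar 1.
    trs = [true_range(h, l, pc)
           for h, l, pc in zip(highs[1:], lows[1:], closes[:-1])]
    if len(trs) < n:
        raise ValueError("need at least n + 1 bars")
    return sum(trs[-n:]) / n

# Hypothetical flat market: every bar ranges from 9 to 11 and closes at 10.
atr = average_true_range([11.0] * 15, [9.0] * 15, [10.0] * 15)

# A gap up: yesterday closed at 98, today trades between 100 and 105.
gap_tr = true_range(105.0, 100.0, 98.0)
```

In the gap example the high-low range is 5, but the true range is 7 because the distance from the previous close to the high is larger.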

5. MACD (Moving Average Convergence Divergence)

MACD captures trend momentum by comparing two exponential moving averages:

$$ MACD = EMA_{12} - EMA_{26} $$

$$ \text{Signal Line} = EMA_9(\text{MACD}) $$

$$ \text{Histogram} = MACD - \text{Signal Line} $$

  • Period used: 20-day fast EMA
  • Normalized per stock: $(MACD - \mu) / \sigma$
  • Positive MACD → bullish momentum; Negative → bearish
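
The three MACD formulas above can be sketched with a recursive EMA. This example uses the 12/26/9 spans from the formulas and seeds each EMA with the first observation; the seeding rule and the function names are assumptions, so early values may differ from a library implementation.

```python
def ema(values, span):
    """Recursive EMA with smoothing factor 2 / (span + 1), seeded at values[0]."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def macd(closes, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) as lists."""
    macd_line = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram

# Hypothetical steadily rising prices: the fast EMA stays above the slow EMA.
rising = [100.0 + i for i in range(60)]
macd_line, signal_line, histogram = macd(rising)
```

On a steadily rising series the fast EMA tracks price more closely than the slow EMA, so the MACD line is positive, matching the bullish reading described above.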

6. Dollar Volume

A liquidity measure representing the monetary value of shares traded:

$$ \text{Dollar Volume} = \frac{Close \times Volume}{10^6} \quad \text{(in millions)} $$

Used to rank stocks by liquidity and filter the top 150 most liquid stocks each month.


Monthly Return Features

Multi-horizon compounded monthly returns are computed to capture momentum across different lookback windows:

$$ r_{\text{lag}} = \left(\frac{P_t}{P_{t - \text{lag}}} \right)^{1/\text{lag}} - 1 $$

Computed for lags: 1, 2, 3, 6, 9, 12 months

Outliers are clipped at the 0.5th and 99.5th percentiles before compounding to reduce the effect of extreme price moves.
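
As a worked example of the return formula, the sketch below computes the geometric average monthly return over a given lag. The function name and price series are hypothetical, and the percentile clipping step is omitted to keep the arithmetic visible.

```python
def lagged_monthly_return(prices, lag):
    """Geometric average monthly return over the last `lag` months."""
    if len(prices) < lag + 1:
        raise ValueError("need at least lag + 1 monthly prices")
    gross = prices[-1] / prices[-1 - lag]  # P_t / P_{t - lag}
    return gross ** (1.0 / lag) - 1.0

# Hypothetical month-end prices growing 10% per month.
month_end = [100.0, 110.0, 121.0]
r_1m = lagged_monthly_return(month_end, 1)
r_2m = lagged_monthly_return(month_end, 2)
```

Taking the lag-th root puts every horizon on a per-month scale, so the 1-month and 12-month features can be compared directly.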


Fama-French 5-Factor Model

The Fama-French 5-Factor Model extends the classic 3-factor model by adding profitability and investment factors:

$$ R_i - R_f = \alpha + \beta_1 (R_m - R_f) + \beta_2 SMB + \beta_3 HML + \beta_4 RMW + \beta_5 CMA + \epsilon $$

| Factor | Description |
|---|---|
| $R_m - R_f$ | Market excess return (Market Risk Premium) |
| $SMB$ | Small Minus Big: size factor |
| $HML$ | High Minus Low: value factor (book-to-market) |
| $RMW$ | Robust Minus Weak: profitability factor |
| $CMA$ | Conservative Minus Aggressive: investment factor |

A 5-year rolling OLS regression is run for each stock monthly to estimate time-varying factor exposures (betas), which are used as features for clustering.
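
The rolling regression can be sketched with NumPy least squares as a stand-in for the statsmodels rolling OLS listed in the requirements. The factor data here is synthetic and noiseless, and the function name is hypothetical; the point is only to show how a 60-month window slides over the sample and yields one beta vector per month.

```python
import numpy as np

def rolling_factor_betas(excess_returns, factors, window=60):
    """OLS betas for each trailing window; returns shape (T - window + 1, K)."""
    y = np.asarray(excess_returns, dtype=float)
    # Prepend a column of ones so the regression includes an intercept (alpha).
    X = np.column_stack([np.ones(len(y)), np.asarray(factors, dtype=float)])
    betas = []
    for end in range(window, len(y) + 1):
        coef, *_ = np.linalg.lstsq(X[end - window:end], y[end - window:end],
                                   rcond=None)
        betas.append(coef[1:])  # drop alpha, keep the factor loadings
    return np.array(betas)

# Synthetic stand-in for 72 months of Mkt-RF, SMB, HML, RMW, CMA.
rng = np.random.default_rng(0)
factors = rng.normal(0.0, 0.03, size=(72, 5))
true_betas = np.array([1.1, 0.3, -0.2, 0.1, 0.05])
excess = 0.001 + factors @ true_betas  # noiseless, so OLS recovers the betas
betas = rolling_factor_betas(excess, factors)
```

With 72 months of data and a 60-month window there are 13 estimates; stocks with less than one full window of history produce none.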


Liquidity Filter

Each month, stocks are ranked by their 5-year rolling average dollar volume:

$$ \overline{DV}_{t} = \frac{1}{60} \sum_{i=t-59}^{t} DV_i $$

Only the top 150 stocks by this rank are retained each month, ensuring we focus on the most tradable universe.
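
A minimal sketch of the ranking step is shown below, using plain dictionaries in place of the notebook's data frames. Ticker names and volumes are made up, and averaging over whatever history is available when fewer than 60 months exist is an assumption.

```python
def dollar_volume(close, volume):
    """Dollar volume in millions."""
    return close * volume / 1e6

def rolling_mean_dollar_volume(monthly_dv, window=60):
    """Mean of the last `window` monthly values (fewer if history is short)."""
    recent = monthly_dv[-window:]
    return sum(recent) / len(recent)

def top_liquid(dv_by_ticker, top_n=150, window=60):
    """Tickers ranked by rolling mean dollar volume, highest first."""
    averages = {t: rolling_mean_dollar_volume(dv, window)
                for t, dv in dv_by_ticker.items() if dv}
    return sorted(averages, key=averages.get, reverse=True)[:top_n]

# Hypothetical monthly dollar volumes (in millions) for three tickers.
history = {
    "AAA": [500.0] * 60,
    "BBB": [50.0] * 60,
    "CCC": [900.0] * 30 + [300.0] * 30,
}
selected = top_liquid(history, top_n=2)
```

Using a long rolling average rather than the latest month keeps the universe stable, so a single unusually busy month does not move a stock in or out.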


Unsupervised Learning: KMeans Clustering

After computing all features and factor loadings, we apply KMeans clustering to group stocks with similar characteristics each month.

Clustering Strategy

  • Algorithm: KMeans with n_clusters=4
  • Features: All 18 computed features (technical indicators + Fama-French factor betas + multi-horizon returns)
  • Initial Centroids: Seeded based on target RSI values [30, 45, 55, 70] to create interpretable clusters:
    • Cluster 0: Oversold stocks (RSI ≈ 30)
    • Cluster 1: Mildly bearish (RSI ≈ 45)
    • Cluster 2: Mildly bullish (RSI ≈ 55)
    • Cluster 3: Overbought stocks (RSI ≈ 70)

Initial Centroids Matrix

import numpy as np

initial_centroids = np.zeros((4, 18))       # 4 clusters x 18 features, all zero except RSI
initial_centroids[:, 6] = [30, 45, 55, 70]  # RSI is the 7th feature (column index 6)

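Seeding the centroids on RSI keeps cluster labels comparable across months (Cluster 3 is always the one seeded at RSI ≈ 70), although KMeans is still free to move the centroids away from their starting values during fitting.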
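
To illustrate why seeding matters, here is a small NumPy version of the KMeans iteration (Lloyd's algorithm) started from the RSI-seeded centroids. It is a stand-in written for this example, not the notebook's code; scikit-learn's `KMeans` accepts an array for its `init` argument and can be used the same way. The data is synthetic and the function name is hypothetical.

```python
import numpy as np

def seeded_kmeans(X, initial_centroids, n_iter=50):
    """Lloyd's algorithm starting from the supplied centroids."""
    centroids = np.array(initial_centroids, dtype=float)
    for _ in range(n_iter):
        # Distance from every row to every centroid, then nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic month: 200 stocks, 18 features, RSI in column 6.
rng = np.random.default_rng(0)
n_features, rsi_col = 18, 6
X = rng.normal(0.0, 1.0, size=(200, n_features))
X[:, rsi_col] = np.repeat([30.0, 45.0, 55.0, 70.0], 50) + rng.normal(0.0, 2.0, 200)

initial_centroids = np.zeros((4, n_features))
initial_centroids[:, rsi_col] = [30, 45, 55, 70]
labels, centroids = seeded_kmeans(X, initial_centroids)
```

Because the labels inherit the order of the seeds, the fitted centroids stay sorted by RSI and label 3 keeps pointing at the high-RSI group. Note that RSI is on a 0 to 100 scale, so it tends to dominate the distance unless the other features are on a comparable scale.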

Cluster Interpretation

Each cluster represents distinct market regimes and stock behaviors:

  • Cluster 0 (RSI ≈ 30): Oversold, potentially undervalued stocks showing recent price weakness
  • Cluster 1 (RSI ≈ 45): Neutral-to-bearish momentum, below midpoint
  • Cluster 2 (RSI ≈ 55): Neutral-to-bullish momentum, above midpoint
  • Cluster 3 (RSI ≈ 70): Overbought, strong recent performance, high momentum

The strategy selects Cluster 3 based on momentum continuation hypothesis: stocks with strong recent performance tend to exhibit short-term persistence before mean reversion.


Portfolio Construction Strategy

The trading strategy focuses on Cluster 3 stocks (high RSI, typically overbought), implementing a momentum approach:

Monthly Rebalancing Process

  1. Select Cluster 3 stocks from the current month
  2. Shift forward one month to avoid look-ahead bias
  3. Optimization Window: Use 12 months of historical prices
  4. Weight Optimization: Apply Efficient Frontier Max Sharpe Ratio optimization
    • Constraints: Each stock weight between [1/(2N), 10%] where N = number of stocks
    • Solver: SCS (Splitting Conic Solver)
  5. Fallback: If optimization fails, use equal weights
  6. Calculate Returns: Compute log returns for the month using optimized weights
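
The weight bounds and the equal-weight fallback from steps 4 and 5 can be sketched as follows. The optimizer is passed in as a plain function so the example stays self-contained; all names here are hypothetical, and the real optimization is delegated to pypfopt in the notebook.

```python
def weight_bounds(n_stocks, upper=0.10):
    """Per-stock (lower, upper) bounds: [1/(2N), 10%]."""
    return 1.0 / (2 * n_stocks), upper

def bounds_are_feasible(n_stocks, upper=0.10):
    """Weights can only sum to 1 if N * lower <= 1 <= N * upper."""
    lower, upper = weight_bounds(n_stocks, upper)
    return n_stocks * lower <= 1.0 <= n_stocks * upper + 1e-12

def choose_weights(tickers, optimize):
    """Try the supplied optimizer; fall back to equal weights on any failure."""
    n = len(tickers)
    try:
        if not bounds_are_feasible(n):
            raise ValueError("weight bounds cannot sum to 1")
        return optimize(tickers, weight_bounds(n))
    except Exception:
        return {t: 1.0 / n for t in tickers}

def failing_optimizer(tickers, bounds):
    raise RuntimeError("solver did not converge")  # simulated failure

weights = choose_weights([f"T{i}" for i in range(12)], failing_optimizer)
```

One consequence of the 10% cap is that a month with fewer than 10 selected stocks cannot be fully invested within the bounds, so the fallback is the only option in that case.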

Optimization Objective

$$ \max \frac{E[R_p] - R_f}{\sigma_p} $$

Where:

  • $R_p$ = portfolio return
  • $R_f$ = risk-free rate (assumed 0)
  • $\sigma_p$ = portfolio volatility

Expected returns and the covariance matrix are estimated from the trailing 12-month window of daily prices and annualized assuming 252 trading days per year.
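
To make the objective concrete, this sketch evaluates the Sharpe ratio for a given weight vector. It does not perform the maximization itself, and the expected returns, covariance matrix, and function name are made-up illustrations.

```python
import numpy as np

def sharpe_ratio(weights, expected_returns, cov_matrix, risk_free=0.0):
    """(E[R_p] - R_f) / sigma_p for a fixed weight vector."""
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(expected_returns, dtype=float)
    cov = np.asarray(cov_matrix, dtype=float)
    portfolio_return = w @ mu
    portfolio_vol = np.sqrt(w @ cov @ w)
    return (portfolio_return - risk_free) / portfolio_vol

# Hypothetical two-asset example with uncorrelated returns (annualized).
mu = [0.10, 0.20]
cov = [[0.04, 0.00],
       [0.00, 0.09]]
sr_equal = sharpe_ratio([0.5, 0.5], mu, cov)
sr_first_only = sharpe_ratio([1.0, 0.0], mu, cov)
```

In this example the equal-weight mix has a higher Sharpe ratio than holding the first asset alone, which is the diversification effect the optimizer exploits when it searches over weights.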


Performance Results

Benchmark Comparison

The strategy is benchmarked against VOO (Vanguard S&P 500 ETF) on a buy-and-hold basis from 2018 to 2025.

Key Metrics

| Metric | ML Strategy | VOO Buy & Hold |
|---|---|---|
| Cumulative Return | Displayed in plot | Displayed in plot |
| Rebalancing Frequency | Monthly | None (buy-and-hold) |
| Universe | Top 150 liquid S&P 500 stocks (cluster 3) | Full S&P 500 |

Visualization

The notebook generates a cumulative returns comparison plot:

Title: Systematic Trading with Unsupervised Learning: Returns Comparison

Features:

  • Time series from 2018-01 to 2025-12
  • Two lines:
    • Strategy Return (blue): ML-driven monthly rebalanced portfolio
    • VOO Buy&Hold (orange): Passive S&P 500 benchmark
  • Y-axis formatted as percentage returns
  • Plot size: 16" × 6"

The plot shows how the strategy's cumulative return tracks or diverges from the benchmark over the backtest; it does not by itself quantify risk-adjusted performance.


File Structure

QUANT/
├── SP500 Unsupervised ML/
│   ├── code.ipynb           # Main analysis notebook
│   └── requirements.txt     # Python dependencies
├── quant-env/               # Python 3.11 virtual environment
├── quant-env-3.13/          # Python 3.13 virtual environment
├── sp500.xlsx               # Generated S&P 500 ticker list
├── README.md                # This file
└── LICENSE

Usage

  1. Install dependencies:

    pip install -r "SP500 Unsupervised ML/requirements.txt"
  2. Run the notebook:

    jupyter notebook "SP500 Unsupervised ML/code.ipynb"
  3. Execute cells sequentially to:

    • Fetch S&P 500 tickers
    • Download historical price data
    • Compute technical features
    • Run Fama-French factor regressions
    • Apply KMeans clustering
    • Optimize portfolio weights
    • Generate performance comparison plot

Notes

  • pandas_datareader is incompatible with Python 3.13. Fama-French data is fetched directly from Kenneth French's website instead.
  • Passing literal HTML strings to pd.read_html() is deprecated since pandas 2.1; wrap them in io.StringIO instead.
  • The strategy uses log returns for compounding, with cumulative returns calculated as: $\exp(\sum \log(1 + r_t)) - 1$
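
The cumulative return formula in the last note can be checked with a few lines of Python. The function name and the monthly returns are hypothetical; the input is assumed to be simple (not log) returns.

```python
import math

def cumulative_return(simple_returns):
    """exp(sum(log(1 + r_t))) - 1, i.e. the compounded return."""
    return math.exp(sum(math.log1p(r) for r in simple_returns)) - 1.0

# Hypothetical monthly returns: +10%, -5%, +2%.
total = cumulative_return([0.10, -0.05, 0.02])

# Same result by multiplying growth factors directly.
direct = 1.10 * 0.95 * 1.02 - 1.0
```

Summing log returns and exponentiating is equivalent to multiplying the growth factors, but it is more convenient when returns are accumulated as a running sum.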

Future Enhancements

  • Add risk metrics (Sharpe ratio, max drawdown, Sortino ratio)
  • Implement alternative clustering algorithms (DBSCAN, Hierarchical)
  • Test different cluster selection strategies (rotate between clusters)
  • Add transaction cost modeling
  • Backtest on extended time periods (2010-2025)
  • Incorporate additional alternative data sources

License

See LICENSE file for details.

