S&P 500 Unsupervised Machine Learning — Stock Clustering & Systematic Trading

Overview

This project implements a systematic trading strategy using unsupervised machine learning on the S&P 500 universe. By applying KMeans clustering to technical indicators and Fama-French factor exposures, we identify stocks with similar risk/return characteristics and construct optimized monthly rebalanced portfolios.

Key Results: The ML-driven strategy is backtested from 2018-2025 and benchmarked against VOO (Vanguard S&P 500 ETF) buy-and-hold to assess how factor-based clustering combined with mean-variance optimization performs relative to a passive index position.

Strategy Performance

The ML-driven strategy (red) vs the VOO buy-and-hold benchmark (blue) from 2018-2025. The systematic approach shows distinct performance patterns during market regime changes, with notable divergence during the 2020 COVID crash recovery and subsequent market cycles.

Project Pipeline

S&P 500 Tickers (Wikipedia)
      ↓
Download 10-Year OHLCV Data (yfinance)
      ↓
Compute Technical Features (GKV, RSI, BBands, ATR, MACD, Dollar Volume)
      ↓
Resample to Monthly & Filter Top 150 Liquid Stocks
      ↓
Compute Multi-Horizon Returns (1m, 2m, 3m, 6m, 9m, 12m)
      ↓
Download Fama-French 5-Factor Data
      ↓
Rolling OLS Regression (5-year window) → Factor Betas
      ↓
KMeans Clustering (4 clusters, RSI-seeded centroids)
      ↓
Select Cluster 3 Stocks (Overbought/High RSI)
      ↓
Efficient Frontier Optimization (Max Sharpe, 12-month window)
      ↓
Monthly Rebalancing with Optimized Weights
      ↓
Performance Evaluation vs VOO Benchmark
      ↓
Cumulative Returns Visualization

Data Sources

Source	Data
Wikipedia	S&P 500 constituent ticker list
Yahoo Finance (yfinance)	10 years of daily OHLCV price data
Kenneth French's Data Library	Fama-French 5-Factor monthly returns

Requirements

Install all dependencies via:

pip install -r "SP500 Unsupervised ML/requirements.txt"

Key libraries:

yfinance — market data retrieval
pandas, numpy — data manipulation and numerical operations
statsmodels — rolling OLS regressions for factor exposure estimation
pandas_ta — technical indicators (RSI, BBands, ATR, MACD)
scikit-learn — unsupervised ML (KMeans clustering)
pypfopt — portfolio optimization (Efficient Frontier, Max Sharpe)
cvxpy, clarabel, scs — convex optimization solvers
matplotlib — visualization
requests, beautifulsoup4 — web scraping for S&P 500 tickers

Features Computed

1. Garman-Klass Volatility (GKV)

A more efficient estimator of daily volatility than close-to-close volatility, using all four price points (Open, High, Low, Close):

$$ \sigma^2_{GK} = \frac{1}{2}\left(\ln\frac{H}{L}\right)^2 - (2\ln 2 - 1)\left(\ln\frac{C}{O}\right)^2 $$

Where:

$H$ = High price
$L$ = Low price
$C$ = Close price
$O$ = Open price

It captures the intraday price range and the open-to-close move, making it more informative than simple close-to-close volatility. It does not account for overnight (close-to-open) gaps.
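As a rough illustration of the formula above, the sketch below evaluates the Garman-Klass variance for a single OHLC bar. The function name and the sample bar are hypothetical and are not taken from the notebook.

```python
import math

def garman_klass_variance(open_, high, low, close):
    """Single-bar Garman-Klass variance from Open, High, Low, Close."""
    hl = math.log(high / low)      # log high-low range
    co = math.log(close / open_)   # log open-to-close move
    return 0.5 * hl ** 2 - (2 * math.log(2) - 1) * co ** 2

# Hypothetical bar, chosen only for illustration.
gk_var = garman_klass_variance(open_=100.0, high=102.0, low=99.0, close=101.0)
```

A bar with no range (all four prices equal) gives a variance of zero, which is a quick sanity check on the sign conventions.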

2. Relative Strength Index (RSI)

RSI measures the speed and magnitude of recent price changes to evaluate overbought or oversold conditions. It ranges from 0 to 100.

$$ RSI = 100 - \frac{100}{1 + RS} $$

Where:

$$ RS = \frac{\text{Average Gain over } n \text{ periods}}{\text{Average Loss over } n \text{ periods}} $$

Period used: 20 days
RSI > 70 → typically overbought
RSI < 30 → typically oversold
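The RSI formula can be sketched directly with plain averages of gains and losses, as written above. This is an assumption about the averaging: libraries such as pandas_ta typically apply Wilder-style smoothing, so values from the notebook will generally differ from this simplified version. The function name is hypothetical.

```python
import numpy as np

def rsi_simple(close, n=20):
    """RSI over the last n price changes using plain (unsmoothed) averages."""
    close = np.asarray(close, dtype=float)
    if len(close) < n + 1:
        raise ValueError("need at least n + 1 prices")
    delta = np.diff(close)[-n:]                  # last n one-period changes
    avg_gain = np.clip(delta, 0, None).mean()    # losses count as 0
    avg_loss = np.clip(-delta, 0, None).mean()   # gains count as 0
    if avg_loss == 0:                            # no losses: RS is unbounded
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

A series that only rises returns 100, one that only falls returns 0, and equal-sized alternating moves return 50.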

3. Bollinger Bands (BBands)

Bollinger Bands define a volatility envelope around a simple moving average (SMA):

$$ \text{Upper Band} = SMA_n + k \cdot \sigma_n $$

$$ \text{Middle Band} = SMA_n $$

$$ \text{Lower Band} = SMA_n - k \cdot \sigma_n $$

Where:

$n$ = 20-period window
$k$ = 2 (standard deviations)
$\sigma_n$ = rolling standard deviation of log(1 + Close)

Applied on log-transformed prices (log1p) to normalize scale across stocks.
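A minimal sketch of the band calculation on log1p-transformed closes, for the most recent window only. The function name is hypothetical, and the use of the population standard deviation (ddof=0) is an assumption rather than something stated in this README.

```python
import numpy as np

def bollinger_bands_log(close, n=20, k=2.0):
    """Return (lower, middle, upper) for the latest n closes on a log1p scale."""
    window = np.log1p(np.asarray(close, dtype=float)[-n:])
    mid = window.mean()        # SMA_n of log(1 + Close)
    sd = window.std(ddof=0)    # assumption: population standard deviation
    return mid - k * sd, mid, mid + k * sd

# Hypothetical price path, for illustration only.
lower, middle, upper = bollinger_bands_log(np.linspace(50.0, 60.0, 30))
```

Because the bands are built on log-scaled prices, their width is comparable across stocks trading at very different price levels.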

4. Average True Range (ATR)

ATR measures market volatility as the average of the true range, which extends the daily high-low range to include gaps from the previous close:

$$ TR = \max\left(H - L,\ |H - C_{prev}|,\ |L - C_{prev}|\right) $$

$$ ATR_n = \frac{1}{n} \sum_{i=1}^{n} TR_i $$

Period used: 14 days
Normalized per stock: $(ATR - \mu_{ATR}) / \sigma_{ATR}$
Higher ATR → higher volatility / wider price swings
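The true range, its simple n-day average, and the per-stock normalization can be sketched as follows. This follows the simple-mean formula as written above; library implementations often use a smoothed average instead, so this is an approximation with hypothetical function names.

```python
import numpy as np

def true_range(high, low, close):
    """True range per day, starting from the second day (needs a prior close)."""
    high, low, close = (np.asarray(x, dtype=float) for x in (high, low, close))
    prev_close = close[:-1]
    h, l = high[1:], low[1:]
    return np.maximum.reduce([h - l, np.abs(h - prev_close), np.abs(l - prev_close)])

def atr_simple(high, low, close, n=14):
    """Rolling simple mean of the true range over n days."""
    tr = true_range(high, low, close)
    if len(tr) < n:
        raise ValueError("not enough data for the requested window")
    return np.convolve(tr, np.ones(n) / n, mode="valid")

def zscore(x):
    """Normalize a series to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```

Applying zscore to each stock's ATR series separately gives the normalized feature described above.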

5. MACD (Moving Average Convergence Divergence)

MACD captures trend momentum by comparing two exponential moving averages:

$$ MACD = EMA_{12} - EMA_{26} $$

$$ \text{Signal Line} = EMA_9(\text{MACD}) $$

$$ \text{Histogram} = MACD - \text{Signal Line} $$

Period used: 20-day fast EMA (the 12, 26, and 9 in the formulas above are the conventional default periods)
Normalized per stock: $(MACD - \mu) / \sigma$
Positive MACD → bullish momentum; Negative → bearish
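The three MACD series can be sketched with a recursive EMA. The periods are exposed as arguments because this section lists both the conventional 12/26/9 values and a 20-day fast EMA; seeding the EMA with the first observation is an assumption, and the function names are hypothetical.

```python
import numpy as np

def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1)."""
    values = np.asarray(values, dtype=float)
    alpha = 2.0 / (span + 1.0)
    out = np.empty_like(values)
    out[0] = values[0]                       # assumption: seed with first value
    for i in range(1, len(values)):
        out[i] = alpha * values[i] + (1.0 - alpha) * out[i - 1]
    return out

def macd(close, fast=12, slow=26, signal=9):
    """Return (MACD line, signal line, histogram)."""
    line = ema(close, fast) - ema(close, slow)
    signal_line = ema(line, signal)
    return line, signal_line, line - signal_line
```

In a steadily rising series the fast EMA sits above the slow EMA, so the MACD line is positive, matching the bullish reading above.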

6. Dollar Volume

A liquidity measure representing the monetary value of shares traded:

$$ \text{Dollar Volume} = \frac{Close \times Volume}{10^6} \quad \text{(in millions)} $$

Used to rank stocks by liquidity and filter the top 150 most liquid stocks each month.

Monthly Return Features

Multi-horizon compounded monthly returns are computed to capture momentum across different lookback windows:

$$ r_{\text{lag}} = \left(\frac{P_t}{P_{t - \text{lag}}} \right)^{1/\text{lag}} - 1 $$

Computed for lags: 1, 2, 3, 6, 9, 12 months

Outliers are clipped at the 0.5th and 99.5th percentiles before compounding to reduce the effect of extreme price moves.
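One way to read the formula and the clipping step together is sketched below: compute the raw lag-month return, clip it at the stated percentiles, then convert it to a per-month compounded rate. The function name is hypothetical, and clipping across a single price series (rather than across the whole panel) is an assumption.

```python
import numpy as np

def horizon_return(prices, lag, lower_q=0.005, upper_q=0.995):
    """Per-month compounded return over a lag-month lookback, with clipping."""
    prices = np.asarray(prices, dtype=float)
    raw = prices[lag:] / prices[:-lag] - 1.0        # total return over lag months
    lo, hi = np.quantile(raw, [lower_q, upper_q])   # clipping thresholds
    clipped = np.clip(raw, lo, hi)
    return (1.0 + clipped) ** (1.0 / lag) - 1.0     # geometric monthly rate

# Hypothetical monthly prices growing at exactly 1% per month.
prices = 100.0 * 1.01 ** np.arange(36)
returns_by_lag = {lag: horizon_return(prices, lag) for lag in (1, 2, 3, 6, 9, 12)}
```

For a price path growing at a constant 1% per month, every horizon yields the same 1% monthly rate, which is what makes the lags comparable as features.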

Fama-French 5-Factor Model

The Fama-French 5-Factor Model extends the classic 3-factor model by adding profitability and investment factors:

$$ R_i - R_f = \alpha + \beta_1 (R_m - R_f) + \beta_2 SMB + \beta_3 HML + \beta_4 RMW + \beta_5 CMA + \epsilon $$

Factor	Description
$R_m - R_f$	Market excess return (Market Risk Premium)
$SMB$	Small Minus Big — size factor
$HML$	High Minus Low — value factor (book-to-market)
$RMW$	Robust Minus Weak — profitability factor
$CMA$	Conservative Minus Aggressive — investment factor

A 5-year rolling OLS regression is run for each stock monthly to estimate time-varying factor exposures (betas), which are used as features for clustering.
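The rolling regression can be sketched with ordinary least squares on a sliding 60-month window. The README lists statsmodels for this step; the sketch below swaps in NumPy's least-squares solver to stay dependency-light, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def rolling_betas(excess_returns, factors, window=60):
    """Factor betas from OLS on each trailing window; NaN until the window fills."""
    y = np.asarray(excess_returns, dtype=float)
    X = np.asarray(factors, dtype=float)
    X = np.column_stack([np.ones(len(X)), X])          # intercept (alpha) first
    betas = np.full((len(y), X.shape[1] - 1), np.nan)
    for end in range(window, len(y) + 1):
        coef, *_ = np.linalg.lstsq(X[end - window:end], y[end - window:end], rcond=None)
        betas[end - 1] = coef[1:]                      # drop alpha, keep betas
    return betas

# Noise-free synthetic data so the recovered betas can be checked exactly.
rng = np.random.default_rng(0)
factor_returns = rng.normal(scale=0.03, size=(80, 5))  # Mkt-RF, SMB, HML, RMW, CMA
true_betas = np.array([1.1, 0.3, -0.2, 0.5, 0.0])
stock_excess = 0.001 + factor_returns @ true_betas
betas = rolling_betas(stock_excess, factor_returns)
```

Each row of the result is the set of five betas estimated from the 60 months ending at that row, so the rows can be joined to the other monthly features by date.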

Liquidity Filter

Each month, stocks are ranked by their 5-year rolling average dollar volume:

$$ \overline{DV}_{t} = \frac{1}{60} \sum_{i=t-59}^{t} DV_i $$

Only the top 150 stocks by this rank are retained each month, ensuring we focus on the most tradable universe.
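The ranking step can be sketched on a months-by-stocks array of dollar volume. The function name and the tiny example matrix are hypothetical, and requiring a full window before any stock is selected is an assumption.

```python
import numpy as np

def top_liquid(dollar_volume, window=60, top_n=150):
    """Boolean mask (months x stocks): True where a stock is in the top_n
    by trailing-window average dollar volume for that month."""
    dv = np.asarray(dollar_volume, dtype=float)
    mask = np.zeros(dv.shape, dtype=bool)
    for t in range(window - 1, dv.shape[0]):
        avg = dv[t - window + 1 : t + 1].mean(axis=0)   # rolling mean per stock
        keep = np.argsort(-avg)[:top_n]                 # largest averages first
        mask[t, keep] = True
    return mask

# Three hypothetical stocks over three months, with a 2-month window.
dv = [[1.0, 5.0, 3.0],
      [1.0, 5.0, 3.0],
      [10.0, 5.0, 3.0]]
mask = top_liquid(dv, window=2, top_n=2)
```

In the example, the first stock enters the selection in the last month once its trailing average overtakes the third stock's.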

Unsupervised Learning: KMeans Clustering

After computing all features and factor loadings, we apply KMeans clustering to group stocks with similar characteristics each month.

Clustering Strategy

Algorithm: KMeans with n_clusters=4
Features: All 18 computed features (technical indicators + Fama-French factor betas + multi-horizon returns)
Initial Centroids: Seeded based on target RSI values [30, 45, 55, 70] to create interpretable clusters:
- Cluster 0: Oversold stocks (RSI ≈ 30)
- Cluster 1: Mildly bearish (RSI ≈ 45)
- Cluster 2: Mildly bullish (RSI ≈ 55)
- Cluster 3: Overbought stocks (RSI ≈ 70)

Initial Centroids Matrix

import numpy as np
initial_centroids = np.zeros((4, 18))  # 4 clusters x 18 features, all zeros
initial_centroids[:, 6] = [30, 45, 55, 70]  # RSI is the 7th feature (column index 6)

This ensures consistent cluster interpretation across time periods.
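A minimal sketch of how such seeded centroids could be passed to scikit-learn's KMeans. The synthetic feature matrix is invented for illustration (all non-RSI features are near zero), so it only demonstrates that cluster labels follow the order of the seeded RSI values.

```python
import numpy as np
from sklearn.cluster import KMeans

N_FEATURES, RSI_COL = 18, 6
initial_centroids = np.zeros((4, N_FEATURES))
initial_centroids[:, RSI_COL] = [30, 45, 55, 70]

# Synthetic month: 25 stocks around each target RSI, other features near 0.
rng = np.random.default_rng(0)
target_rsi = np.repeat([30, 45, 55, 70], 25)
features = rng.normal(scale=0.1, size=(100, N_FEATURES))
features[:, RSI_COL] = target_rsi + rng.normal(scale=1.0, size=100)

# n_init=1 because the starting centroids are supplied explicitly.
model = KMeans(n_clusters=4, init=initial_centroids, n_init=1, random_state=0)
labels = model.fit_predict(features)
```

With explicit initial centroids, label k starts from the k-th seeded row, which is what keeps label 3 attached to the high-RSI group from month to month. Features on very different scales will still dominate the distance metric, so the relative scaling of RSI against the other columns matters.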

Cluster Interpretation

Each cluster groups stocks with a distinct momentum profile:

Cluster 0 (RSI ≈ 30): Oversold, potentially undervalued stocks showing recent price weakness
Cluster 1 (RSI ≈ 45): Neutral-to-bearish momentum, below midpoint
Cluster 2 (RSI ≈ 55): Neutral-to-bullish momentum, above midpoint
Cluster 3 (RSI ≈ 70): Overbought, strong recent performance, high momentum

The strategy selects Cluster 3 based on the momentum continuation hypothesis: stocks with strong recent performance tend to exhibit short-term persistence before mean reversion.

Portfolio Construction Strategy

The trading strategy focuses on Cluster 3 stocks (high RSI, typically overbought), implementing a momentum-following approach:

Monthly Rebalancing Process

Select Cluster 3 stocks from the current month
Shift the selection forward one month so that each month's portfolio uses only the previous month's cluster assignments (avoids look-ahead bias)
Optimization Window: Use 12 months of historical prices
Weight Optimization: Apply Efficient Frontier Max Sharpe Ratio optimization
- Constraints: Each stock weight between [1/(2N), 10%] where N = number of stocks
- Solver: SCS (Splitting Conic Solver)
Fallback: If optimization fails, use equal weights
Calculate Returns: Compute log returns for the month using optimized weights
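The weight bounds and the equal-weight fallback from the steps above can be sketched without any optimizer installed. Here `optimizer` is a hypothetical callable standing in for the Max Sharpe step; it is not the pypfopt API, and the ticker names are placeholders.

```python
def weight_bounds(n_stocks, max_weight=0.10):
    """Per-stock (lower, upper) bounds: lower is half the equal weight."""
    return 1.0 / (2 * n_stocks), max_weight

def bounds_feasible(n_stocks, max_weight=0.10):
    """Weights can only sum to 1 if n_stocks * max_weight >= 1."""
    return n_stocks * max_weight >= 1.0 - 1e-12

def weights_with_fallback(tickers, optimizer):
    """Try the optimizer; fall back to equal weights if it raises.

    `optimizer` is a hypothetical callable taking (tickers, bounds) and
    returning a {ticker: weight} dict.
    """
    try:
        return optimizer(tickers, weight_bounds(len(tickers)))
    except Exception:
        equal = 1.0 / len(tickers)
        return {t: equal for t in tickers}

def failing_optimizer(tickers, bounds):
    raise ValueError("solver did not converge")  # simulated failure

fallback = weights_with_fallback(["AAA", "BBB", "CCC", "DDD"], failing_optimizer)
```

One consequence of the 10% cap is that a cluster with fewer than 10 stocks cannot satisfy the bounds and sum to 1, which is one situation where the equal-weight fallback would be expected to apply.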

Optimization Objective

$$ \max \frac{E[R_p] - R_f}{\sigma_p} $$

Where:

$R_p$ = portfolio return
$R_f$ = risk-free rate (assumed 0)
$\sigma_p$ = portfolio volatility

Expected returns and the covariance matrix are estimated from the trailing 12-month window of daily prices and annualized assuming 252 trading days per year.
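To make the objective concrete, the sketch below evaluates the annualized Sharpe ratio for a given weight vector. Using the arithmetic mean of daily returns is an assumption (optimization libraries may default to a compounded mean), and the function name and random data are hypothetical.

```python
import numpy as np

def annualized_sharpe(weights, daily_returns, risk_free=0.0, periods=252):
    """Sharpe ratio of a fixed-weight portfolio from daily asset returns."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(daily_returns, dtype=float)       # shape (days, assets)
    mu = r.mean(axis=0) * periods                    # annualized mean returns
    cov = np.cov(r, rowvar=False) * periods          # annualized covariance
    port_ret = w @ mu                                # E[R_p]
    port_vol = np.sqrt(w @ cov @ w)                  # sigma_p
    return (port_ret - risk_free) / port_vol

# Hypothetical daily returns for two assets over one year.
rng = np.random.default_rng(1)
sample_returns = rng.normal(0.001, 0.01, size=(252, 2))
sharpe_equal = annualized_sharpe([0.5, 0.5], sample_returns)
```

The Max Sharpe optimizer searches over the weights, within the stated bounds, for the vector that maximizes this quantity.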

Performance Results

Benchmark Comparison

The strategy is benchmarked against VOO (Vanguard S&P 500 ETF) on a buy-and-hold basis from 2018 to 2025.

Key Metrics

Metric	ML Strategy	VOO Buy & Hold
Cumulative Return	Displayed in plot	Displayed in plot
Rebalancing Frequency	Monthly	None (buy-and-hold)
Universe	Top 150 liquid S&P 500 stocks (cluster 3)	Full S&P 500

Visualization

The notebook generates a cumulative returns comparison plot:

Title: Systematic Trading with Unsupervised Learning: Returns Comparison

Features:

Time series from 2018-01 to 2025-12
Two lines:
- Strategy Return (blue): ML-driven monthly rebalanced portfolio
- VOO Buy&Hold (orange): Passive S&P 500 benchmark
Y-axis formatted as percentage returns
Plot size: 16" × 6"

The plot shows how the strategy's cumulative return tracks or diverges from the benchmark over the backtest period.

Files Structure

QUANT/
├── SP500 Unsupervised ML/
│   ├── code.ipynb           # Main analysis notebook
│   └── requirements.txt     # Python dependencies
├── quant-env/               # Python 3.11 virtual environment
├── quant-env-3.13/          # Python 3.13 virtual environment
├── sp500.xlsx               # Generated S&P 500 ticker list
├── README.md                # This file
└── LICENSE

Usage

Install dependencies:

pip install -r "SP500 Unsupervised ML/requirements.txt"

Run the notebook:

jupyter notebook "SP500 Unsupervised ML/code.ipynb"

Execute cells sequentially to:
- Fetch S&P 500 tickers
- Download historical price data
- Compute technical features
- Run Fama-French factor regressions
- Apply KMeans clustering
- Optimize portfolio weights
- Generate performance comparison plot

Notes

pandas_datareader is incompatible with Python 3.13. Fama-French data is fetched directly from Kenneth French's website instead.
Passing a literal HTML string to pd.read_html() is deprecated as of pandas 2.1; wrap the string in io.StringIO instead.
The strategy uses log returns for compounding, with cumulative returns calculated as: $\exp(\sum_t \log(1 + r_t)) - 1$, where $\log(1 + r_t)$ is the log return corresponding to the period return $r_t$
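The cumulative-return expression can be checked numerically: summing log(1 + r_t) and exponentiating gives the same result as multiplying the (1 + r_t) terms. The function name and sample returns below are hypothetical.

```python
import numpy as np

def cumulative_return(period_returns):
    """Running cumulative return: exp(cumsum(log(1 + r_t))) - 1."""
    r = np.asarray(period_returns, dtype=float)
    return np.exp(np.cumsum(np.log1p(r))) - 1.0

# Hypothetical monthly returns.
monthly = [0.10, -0.05, 0.02]
cum = cumulative_return(monthly)
```

The last element of the result is the total return over the whole period, which is the quantity plotted for each series.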

Future Enhancements

Add risk metrics (Sharpe ratio, max drawdown, Sortino ratio)
Implement alternative clustering algorithms (DBSCAN, Hierarchical)
Test different cluster selection strategies (rotate between clusters)
Add transaction cost modeling
Backtest on extended time periods (2010-2025)
Incorporate additional alternative data sources

License

See LICENSE file for details.
