This project implements a systematic trading strategy using unsupervised machine learning on the S&P 500 universe. By applying KMeans clustering to technical indicators and Fama-French factor exposures, we identify stocks with similar risk/return characteristics and construct optimized monthly rebalanced portfolios.
Key Results: The ML-driven strategy is backtested from 2018 to 2025 and benchmarked against VOO (Vanguard S&P 500 ETF) buy-and-hold. Results are reported as a cumulative-returns comparison between the benchmark and the factor-based clustering strategy with mean-variance optimization.
The ML-driven strategy (red) vs the VOO buy-and-hold benchmark (blue) from 2018 to 2025. The systematic approach diverges from the benchmark around market regime changes, most visibly during the recovery from the 2020 COVID crash and in the market cycles that followed.
S&P 500 Tickers (Wikipedia)
↓
Download 10-Year OHLCV Data (yfinance)
↓
Compute Technical Features (GKV, RSI, BBands, ATR, MACD, Dollar Volume)
↓
Resample to Monthly & Filter Top 150 Liquid Stocks
↓
Compute Multi-Horizon Returns (1m, 2m, 3m, 6m, 9m, 12m)
↓
Download Fama-French 5-Factor Data
↓
Rolling OLS Regression (5-year window) → Factor Betas
↓
KMeans Clustering (4 clusters, RSI-seeded centroids)
↓
Select Cluster 3 Stocks (Overbought/High RSI)
↓
Efficient Frontier Optimization (Max Sharpe, 12-month window)
↓
Monthly Rebalancing with Optimized Weights
↓
Performance Evaluation vs VOO Benchmark
↓
Cumulative Returns Visualization
| Source | Data |
|---|---|
| Wikipedia | S&P 500 constituent ticker list |
| Yahoo Finance (yfinance) | 10 years of daily OHLCV price data |
| Kenneth French's Data Library | Fama-French 5-Factor monthly returns |
Install all dependencies via:

    pip install -r "SP500 Unsupervised ML/requirements.txt"

Key libraries:
- `yfinance`: market data retrieval
- `pandas`, `numpy`: data manipulation and numerical operations
- `statsmodels`: rolling OLS regressions for factor exposure estimation
- `pandas_ta`: technical indicators (RSI, BBands, ATR, MACD)
- `scikit-learn`: unsupervised ML (KMeans clustering)
- `pypfopt`: portfolio optimization (Efficient Frontier, Max Sharpe)
- `cvxpy`, `clarabel`, `scs`: convex optimization solvers
- `matplotlib`: visualization
- `requests`, `beautifulsoup4`: web scraping for S&P 500 tickers
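Garman-Klass volatility (GKV) is a more efficient estimator of daily volatility than close-to-close volatility, using all four price points (Open, High, Low, Close):

$$\sigma^2_{GK} = \frac{1}{2}\left(\ln H - \ln L\right)^2 - (2\ln 2 - 1)\left(\ln C - \ln O\right)^2$$

Where:
- $H$ = High price
- $L$ = Low price
- $C$ = Close price
- $O$ = Open price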
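It captures the intraday high-low range and the open-to-close move, which makes it statistically more efficient than simple return-based volatility. It does not account for overnight gaps between one day's close and the next day's open.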
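As a minimal sketch of this estimator, assuming the standard Garman-Klass form $\frac{1}{2}(\ln H - \ln L)^2 - (2\ln 2 - 1)(\ln C - \ln O)^2$ applied bar by bar, the function below works on plain NumPy arrays. The function name and the sample prices are illustrative and are not taken from the notebook.

```python
import numpy as np

def garman_klass_variance(open_, high, low, close):
    """Per-bar Garman-Klass variance estimate from OHLC arrays (illustrative helper)."""
    open_, high, low, close = (
        np.asarray(x, dtype=float) for x in (open_, high, low, close)
    )
    # Range term: half the squared log high-low range.
    range_term = 0.5 * (np.log(high) - np.log(low)) ** 2
    # Open-to-close term, weighted by (2 ln 2 - 1).
    oc_term = (2.0 * np.log(2.0) - 1.0) * (np.log(close) - np.log(open_)) ** 2
    return range_term - oc_term

# Two made-up daily bars.
gk = garman_klass_variance(
    open_=[100.0, 102.0], high=[105.0, 103.0], low=[99.0, 100.0], close=[104.0, 101.0]
)
```

When the open equals the close, the second term vanishes and the estimate reduces to half the squared log range.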
RSI measures the speed and magnitude of recent price changes to evaluate overbought or oversold conditions. It ranges from 0 to 100.

$$RSI = 100 - \frac{100}{1 + RS}, \qquad RS = \frac{\text{average gain}}{\text{average loss}}$$

Where:
- Period used: 20 days
- RSI > 70 → typically overbought
- RSI < 30 → typically oversold
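The following sketch computes a 20-day RSI from closing prices using Wilder-style smoothing of average gains and losses. The notebook is described as using `pandas_ta` for this indicator, so this NumPy version is only an approximation of that behavior; the function name and the handling of a zero average loss (RSI set to 100) are assumptions.

```python
import numpy as np

def rsi(close, period=20):
    """RSI with Wilder smoothing; entries before index `period` are NaN."""
    close = np.asarray(close, dtype=float)
    delta = np.diff(close)
    gains = np.clip(delta, 0.0, None)
    losses = np.clip(-delta, 0.0, None)
    out = np.full(close.shape, np.nan)
    if len(delta) < period:
        return out
    # Seed the averages with a simple mean over the first `period` changes.
    avg_gain = gains[:period].mean()
    avg_loss = losses[:period].mean()
    for i in range(period, len(close)):
        if i > period:
            # Wilder update: blend the previous average with the newest change.
            avg_gain = (avg_gain * (period - 1) + gains[i - 1]) / period
            avg_loss = (avg_loss * (period - 1) + losses[i - 1]) / period
        if avg_loss == 0.0:
            out[i] = 100.0  # assumption: no losses in the window maps to RSI = 100
        else:
            out[i] = 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    return out

rising = np.arange(1.0, 40.0)
rsi_up = rsi(rising)          # strictly rising prices
rsi_down = rsi(rising[::-1])  # strictly falling prices
```

A strictly rising series has no losses and sits at the top of the range, while a strictly falling one sits at the bottom.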
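Bollinger Bands define a volatility envelope around a simple moving average (SMA):

$$\text{Upper} = SMA_n + k\,\sigma_n, \qquad \text{Middle} = SMA_n, \qquad \text{Lower} = SMA_n - k\,\sigma_n$$

Where:
- $n$ = 20-period window
- $k$ = 2 (standard deviations)
- $\sigma_n$ = rolling standard deviation of log(1 + Close)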
Applied on log-transformed prices (log1p) to normalize scale across stocks.
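A short sketch of the band calculation on log1p-transformed closes is shown below, with $n = 20$ and $k = 2$ as stated above. Using the population standard deviation (`ddof=0`) is an assumption, and the function name and the synthetic price series are illustrative.

```python
import numpy as np

def bollinger_bands(close, n=20, k=2.0):
    """Lower, middle, and upper bands computed on log1p(close)."""
    x = np.log1p(np.asarray(close, dtype=float))
    lower = np.full(x.shape, np.nan)
    middle = np.full(x.shape, np.nan)
    upper = np.full(x.shape, np.nan)
    for t in range(n - 1, len(x)):
        window = x[t - n + 1 : t + 1]
        sma = window.mean()
        sigma = window.std()  # population std (ddof=0); an assumption
        lower[t], middle[t], upper[t] = sma - k * sigma, sma, sma + k * sigma
    return lower, middle, upper

# Synthetic oscillating prices, for illustration only.
prices = 50.0 + 5.0 * np.sin(np.linspace(0.0, 6.0, 60))
bb_low, bb_mid, bb_high = bollinger_bands(prices)
```

Because the bands live in log space, a stock trading at 50 and one trading at 500 produce bands on a comparable scale.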
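ATR measures market volatility as a moving average of the true range ($TR$), the largest of three daily price distances:

$$TR_t = \max\left(H_t - L_t,\ |H_t - C_{t-1}|,\ |L_t - C_{t-1}|\right)$$

- Period used: 14 days
- Normalized per stock: $(ATR - \mu_{ATR}) / \sigma_{ATR}$
- Higher ATR → higher volatility / wider price swings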
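The sketch below computes the true range, averages it over 14 bars, and then z-scores the result per stock as described above. It assumes a simple moving average for the ATR (libraries often use Wilder smoothing instead), and the function names and random test data are illustrative.

```python
import numpy as np

def true_range(high, low, close):
    """True range per bar; the first bar has no previous close, so it uses H - L."""
    high, low, close = (np.asarray(x, dtype=float) for x in (high, low, close))
    prev_close = np.concatenate(([np.nan], close[:-1]))
    candidates = np.vstack(
        [high - low, np.abs(high - prev_close), np.abs(low - prev_close)]
    )
    return np.nanmax(candidates, axis=0)

def atr_zscore(high, low, close, n=14):
    """Simple-average ATR over n bars, standardized over the stock's own history."""
    tr = true_range(high, low, close)
    atr = np.full(tr.shape, np.nan)
    for t in range(n - 1, len(tr)):
        atr[t] = tr[t - n + 1 : t + 1].mean()
    return (atr - np.nanmean(atr)) / np.nanstd(atr)

# Random-walk prices for one hypothetical stock.
rng = np.random.default_rng(0)
close = 100.0 + np.cumsum(rng.normal(0.0, 1.0, 120))
high = close + rng.uniform(0.5, 2.0, 120)
low = close - rng.uniform(0.5, 2.0, 120)
z = atr_zscore(high, low, close)
```

Standardizing per stock puts a volatile small cap and a calm large cap on the same scale before clustering.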
MACD captures trend momentum by comparing two exponential moving averages (EMAs) of the close:

$$MACD = EMA_{fast}(C) - EMA_{slow}(C)$$

- Period used: 20-day fast EMA
- Normalized per stock: $(MACD - \mu) / \sigma$
- Positive MACD → bullish momentum; Negative → bearish
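To make the EMA difference and the per-stock normalization concrete, here is a small sketch. The 20-day fast span comes from the text; the slow span of 26 is only a placeholder assumption, since the slow period is not stated here, and all function names are illustrative.

```python
import numpy as np

def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    values = np.asarray(values, dtype=float)
    alpha = 2.0 / (span + 1.0)
    out = np.empty_like(values)
    out[0] = values[0]
    for t in range(1, len(values)):
        out[t] = alpha * values[t] + (1.0 - alpha) * out[t - 1]
    return out

def macd_line(close, fast=20, slow=26):
    """Fast EMA minus slow EMA; slow=26 is an assumed placeholder."""
    return ema(close, fast) - ema(close, slow)

def zscore(x):
    """Standardize a series by its own mean and standard deviation."""
    return (x - x.mean()) / x.std()

# A steadily rising synthetic price series.
close = np.linspace(100.0, 130.0, 120)
macd_z = zscore(macd_line(close))
```

In an uptrend the fast EMA tracks price more closely than the slow one, so the raw MACD line is positive.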
A liquidity measure representing the monetary value of shares traded:

$$\text{Dollar Volume} = \text{Close} \times \text{Volume}$$

Used to rank stocks by liquidity and filter the top 150 most liquid stocks each month.
Multi-horizon compounded monthly returns are computed for lags of 1, 2, 3, 6, 9, and 12 months to capture momentum across different lookback windows.
Outliers are clipped at the 0.5th and 99.5th percentile before compounding to reduce the effect of extreme price moves.
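One plausible reading of this step, sketched below, is to take the raw $k$-month return, clip it at the 0.5th and 99.5th percentiles, and convert it to a per-month compounded rate via $(1 + R^{(k)})^{1/k} - 1$. That conversion formula is an assumption about what "compounded monthly" means here, and the function name is illustrative.

```python
import numpy as np

def horizon_returns(monthly_close, lags=(1, 2, 3, 6, 9, 12), cutoff=0.005):
    """Per-month compounded return over each lag, with outlier clipping."""
    prices = np.asarray(monthly_close, dtype=float)
    out = {}
    for lag in lags:
        raw = np.full(prices.shape, np.nan)
        raw[lag:] = prices[lag:] / prices[:-lag] - 1.0  # raw lag-month return
        lo, hi = np.nanquantile(raw, [cutoff, 1.0 - cutoff])
        clipped = np.clip(raw, lo, hi)                  # clip before compounding
        out[lag] = (1.0 + clipped) ** (1.0 / lag) - 1.0
    return out

# A price series growing exactly 1% per month.
prices = 100.0 * 1.01 ** np.arange(36)
rets = horizon_returns(prices)
```

For a series that grows at a constant 1% per month, every horizon reports the same 1% monthly rate, which makes the horizons directly comparable as features.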
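The Fama-French 5-Factor Model extends the classic 3-factor model by adding profitability and investment factors:

$$R_{i,t} - R_{f,t} = \alpha_i + \beta_i^{MKT}(R_{m,t} - R_{f,t}) + \beta_i^{SMB}\,SMB_t + \beta_i^{HML}\,HML_t + \beta_i^{RMW}\,RMW_t + \beta_i^{CMA}\,CMA_t + \varepsilon_{i,t}$$

| Factor | Description |
|---|---|
| $R_m - R_f$ | Market excess return (Market Risk Premium) |
| SMB | Small Minus Big (size factor) |
| HML | High Minus Low (value factor, book-to-market) |
| RMW | Robust Minus Weak (profitability factor) |
| CMA | Conservative Minus Aggressive (investment factor) |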
A 5-year rolling OLS regression is run for each stock monthly to estimate time-varying factor exposures (betas), which are used as features for clustering.
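The rolling regression can be sketched with ordinary least squares over a 60-month window (5 years of monthly data). The notebook is described as using `statsmodels` for this; the version below uses `numpy.linalg.lstsq` instead as a dependency-free stand-in, and the function name and synthetic factor data are illustrative.

```python
import numpy as np

def rolling_betas(excess_returns, factors, window=60):
    """Rolling OLS betas of one stock's excess returns on K factors."""
    y = np.asarray(excess_returns, dtype=float)   # shape (T,)
    X = np.asarray(factors, dtype=float)          # shape (T, K)
    T, K = X.shape
    betas = np.full((T, K), np.nan)
    for t in range(window - 1, T):
        rows = slice(t - window + 1, t + 1)
        design = np.column_stack([np.ones(window), X[rows]])  # add intercept
        coef, *_ = np.linalg.lstsq(design, y[rows], rcond=None)
        betas[t] = coef[1:]                       # drop the intercept (alpha)
    return betas

# Synthetic data: 120 months, 5 factors, known betas, no noise.
rng = np.random.default_rng(1)
factors = rng.normal(0.0, 0.03, size=(120, 5))
true_betas = np.array([1.1, 0.3, -0.2, 0.1, 0.05])
stock_excess = 0.002 + factors @ true_betas
betas = rolling_betas(stock_excess, factors)
```

Each row of `betas` holds the five exposures estimated from the trailing window ending in that month, which is the shape needed to join them to the other monthly features.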
Each month, stocks are ranked by their 5-year rolling average dollar volume.
Only the top 150 stocks by this rank are retained each month, which keeps the universe focused on the most tradable names.
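A minimal sketch of the liquidity filter follows, assuming monthly dollar volume is available as a months-by-stocks array. Using a shorter window for the first months, before 60 months of history exist, is an assumption, and the function name and toy numbers are illustrative.

```python
import numpy as np

def top_liquid_mask(dollar_volume, window=60, top_n=150):
    """Boolean (months, stocks) mask marking the top_n stocks by rolling mean dollar volume."""
    dv = np.asarray(dollar_volume, dtype=float)
    months, _ = dv.shape
    mask = np.zeros(dv.shape, dtype=bool)
    for t in range(months):
        start = max(0, t - window + 1)          # partial window early on (assumption)
        rolling_avg = dv[start : t + 1].mean(axis=0)
        keep = np.argsort(-rolling_avg)[:top_n]  # indices of the most liquid stocks
        mask[t, keep] = True
    return mask

# Toy example: 3 months, 3 stocks, keep the 2 most liquid using a 2-month window.
close = np.array([[1.0, 5.0, 3.0], [1.0, 5.0, 3.0], [10.0, 5.0, 3.0]])
volume = np.ones((3, 3))
dollar_volume = close * volume
mask = top_liquid_mask(dollar_volume, window=2, top_n=2)
```

In the toy data the first stock enters the selected set only in the last month, once its rolling average overtakes the others.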
After computing all features and factor loadings, we apply KMeans clustering to group stocks with similar characteristics each month.
- Algorithm: KMeans with `n_clusters=4`
- Features: All 18 computed features (technical indicators + Fama-French factor betas + multi-horizon returns)
- Initial Centroids: Seeded based on target RSI values `[30, 45, 55, 70]` to create interpretable clusters:
  - Cluster 0: Oversold stocks (RSI ≈ 30)
  - Cluster 1: Mildly bearish (RSI ≈ 45)
  - Cluster 2: Mildly bullish (RSI ≈ 55)
  - Cluster 3: Overbought stocks (RSI ≈ 70)

The seed centroids are zero everywhere except the RSI column:

    initial_centroids = np.zeros((4, 18))
    initial_centroids[:, 6] = [30, 45, 55, 70]  # RSI is the 7th feature

This ensures consistent cluster interpretation across time periods.
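To illustrate why seeding keeps cluster labels stable, the sketch below runs a plain Lloyd iteration in NumPy starting from RSI-seeded centroids. It is a stand-in for scikit-learn's `KMeans`, which accepts an array of starting centroids through its `init` parameter, and is not the notebook's code; the function name and the synthetic feature matrix are illustrative.

```python
import numpy as np

def kmeans_seeded(X, init_centroids, n_iter=100):
    """Lloyd's algorithm started from caller-supplied centroids."""
    X = np.asarray(X, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    labels = None
    for _ in range(n_iter):
        # Assign each row to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Synthetic month: 100 stocks, 18 features, RSI in column 6.
rng = np.random.default_rng(0)
targets = np.array([30.0, 45.0, 55.0, 70.0])
X = rng.normal(0.0, 0.5, size=(100, 18))
X[:, 6] = np.repeat(targets, 25) + np.tile(np.linspace(-3.0, 3.0, 25), 4)

initial_centroids = np.zeros((4, 18))
initial_centroids[:, 6] = targets
labels, centroids = kmeans_seeded(X, initial_centroids)
```

Because the starting centroids are ordered by RSI, label 3 ends up on the high-RSI group here rather than on an arbitrary cluster, which is what lets the strategy refer to "Cluster 3" month after month.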
Each cluster represents distinct market regimes and stock behaviors:
- Cluster 0 (RSI ≈ 30): Oversold, potentially undervalued stocks showing recent price weakness
- Cluster 1 (RSI ≈ 45): Neutral-to-bearish momentum, below midpoint
- Cluster 2 (RSI ≈ 55): Neutral-to-bullish momentum, above midpoint
- Cluster 3 (RSI ≈ 70): Overbought, strong recent performance, high momentum
The strategy selects Cluster 3 based on the momentum-continuation hypothesis: stocks with strong recent performance tend to show short-term persistence before mean reversion sets in.
The trading strategy focuses on Cluster 3 stocks (high RSI, typically overbought), implementing a momentum-following approach:
- Select Cluster 3 stocks from the current month
- Shift forward one month to avoid look-ahead bias
- Optimization Window: Use 12 months of historical prices
- Weight Optimization: Apply Efficient Frontier Max Sharpe Ratio optimization
  - Constraints: Each stock weight between `[1/(2N), 10%]`, where N = number of stocks
  - Solver: SCS (Splitting Conic Solver)
- Fallback: If optimization fails, use equal weights
- Calculate Returns: Compute log returns for the month using optimized weights
The Max Sharpe objective is:

$$\max_{w}\ \frac{R_p - R_f}{\sigma_p}$$

Where:
- $R_p$ = portfolio return
- $R_f$ = risk-free rate (assumed 0)
- $\sigma_p$ = portfolio volatility
Expected returns and covariance matrix are estimated from the trailing 12-month window (annualized to 252 trading days).
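The pieces around the optimizer can be sketched without the solver itself: the per-stock weight bounds, the equal-weight fallback, and the annualized Sharpe ratio of a given weight vector. The optimizer is represented by a hypothetical callable, `optimizer`, standing in for the PyPortfolioOpt call, which is not reproduced here; all function names are illustrative.

```python
import numpy as np

def weight_bounds(n_stocks, upper=0.10):
    """Lower and upper weight bounds: [1 / (2N), 10%]."""
    return 1.0 / (2.0 * n_stocks), upper

def equal_weights(n_stocks):
    """Fallback allocation: the same weight for every stock."""
    return np.full(n_stocks, 1.0 / n_stocks)

def weights_or_fallback(optimizer, n_stocks):
    """Call the (hypothetical) optimizer; fall back to equal weights if it raises."""
    try:
        return np.asarray(optimizer(), dtype=float)
    except Exception:
        return equal_weights(n_stocks)

def annualized_sharpe(weights, daily_returns, risk_free=0.0, periods=252):
    """Sharpe ratio from annualized mean returns and covariance."""
    r = np.asarray(daily_returns, dtype=float)
    mu = r.mean(axis=0) * periods
    cov = np.cov(r, rowvar=False) * periods
    port_ret = weights @ mu
    port_vol = np.sqrt(weights @ cov @ weights)
    return (port_ret - risk_free) / port_vol

def failing_optimizer():
    raise ValueError("solver failed")  # simulates an optimization failure

rng = np.random.default_rng(2)
daily = rng.normal(0.0005, 0.01, size=(252, 12))  # synthetic daily returns
w = weights_or_fallback(failing_optimizer, 12)
sharpe = annualized_sharpe(w, daily)
```

Note that the 10% cap can only be satisfied when the cluster holds at least 10 stocks, since the weights must sum to 1; with fewer names the bounded problem is infeasible and the equal-weight fallback applies.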
The strategy is benchmarked against VOO (Vanguard S&P 500 ETF) on a buy-and-hold basis from 2018 to 2025.
| Metric | ML Strategy | VOO Buy & Hold |
|---|---|---|
| Cumulative Return | Displayed in plot | Displayed in plot |
| Rebalancing Frequency | Monthly | None (buy-and-hold) |
| Universe | Top 150 liquid S&P 500 stocks (cluster 3) | Full S&P 500 |
The notebook generates a cumulative returns comparison plot:
Title: Systematic Trading with Unsupervised Learning: Returns Comparison
Features:
- Time series from 2018-01 to 2025-12
- Two lines:
  - Strategy Return (blue): ML-driven monthly rebalanced portfolio
  - VOO Buy&Hold (orange): Passive S&P 500 benchmark
- Y-axis formatted as percentage returns
- Plot size: 16" × 6"
The plot compares the factor-based clustering and optimization strategy with the passive benchmark over the full backtest period.
QUANT/
├── SP500 Unsupervised ML/
│ ├── code.ipynb # Main analysis notebook
│ └── requirements.txt # Python dependencies
├── quant-env/ # Python 3.11 virtual environment
├── quant-env-3.13/ # Python 3.13 virtual environment
├── sp500.xlsx # Generated S&P 500 ticker list
├── README.md # This file
└── LICENSE
- Install dependencies: `pip install -r "SP500 Unsupervised ML/requirements.txt"`
- Run the notebook: `jupyter notebook "SP500 Unsupervised ML/code.ipynb"`
- Execute cells sequentially to:
  - Fetch S&P 500 tickers
  - Download historical price data
  - Compute technical features
  - Run Fama-French factor regressions
  - Apply KMeans clustering
  - Optimize portfolio weights
  - Generate performance comparison plot
- `pandas_datareader` is incompatible with Python 3.13. Fama-French data is fetched directly from Kenneth French's website instead.
- Since pandas 2.1, passing a literal HTML string to `pd.read_html()` is deprecated; wrap the string in `io.StringIO` instead.
- The strategy uses log returns for compounding, with cumulative returns calculated as: $\exp(\sum \log(1 + r_t)) - 1$
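As a quick check of the cumulative-return formula, the sketch below treats $r_t$ as simple per-period returns (an assumption) and shows that summing $\log(1 + r_t)$ and exponentiating matches multiplying the growth factors directly. The function name is illustrative.

```python
import math

def cumulative_return(period_returns):
    """exp(sum(log(1 + r_t))) - 1 for a sequence of per-period returns."""
    return math.exp(sum(math.log1p(r) for r in period_returns)) - 1.0

monthly = [0.10, -0.05, 0.02]
total = cumulative_return(monthly)

# Same quantity by direct compounding of the growth factors.
direct = 1.10 * 0.95 * 1.02 - 1.0
```

Working in log space turns the running product into a running sum, which is why a cumulative sum of log returns followed by `exp` gives the cumulative return series.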
- Add risk metrics (Sharpe ratio, max drawdown, Sortino ratio)
- Implement alternative clustering algorithms (DBSCAN, Hierarchical)
- Test different cluster selection strategies (rotate between clusters)
- Add transaction cost modeling
- Backtest on extended time periods (2010-2025)
- Incorporate additional alternative data sources
See LICENSE file for details.
