Machine learning system for predicting ATP tennis match outcomes using XGBoost, ELO ratings, and advanced feature engineering. Surface-specific models achieve 80.71-84.28% accuracy with 0.918-0.937 AUC-ROC on temporal holdout data (2000-2024).
- Overview
- Performance Metrics
- Quick Start
- Project Structure
- CLI Commands
- Feature Engineering
- ELO Rating System
- Dataset Information
- Advanced Features
- GPU Acceleration
- Testing
- Configuration
- Troubleshooting
- Acknowledgments
- Contributing
- License & Citation
- Author
This project implements a tennis match prediction system with the following capabilities:
- Machine Learning: XGBoost classifier with Bayesian hyperparameter optimization
- ELO Rating System: Dynamic player ratings (global and surface-specific)
- Feature Engineering: 40+ features including win rates, form, rankings, H2H records, and serve statistics
- Temporal Validation: Time-series cross-validation to prevent data leakage
- Surface-Specific Models: Separate models for Hard, Clay, and Grass courts
- Model Interpretability: SHAP analysis and feature importance visualization
- Symmetric Predictions: Bidirectional predictions for improved robustness
- GPU Acceleration: CUDA support for faster training
- Command-Line Interface: 10 commands for training, prediction, evaluation, and analysis
- Comprehensive Testing: Full test suite with pytest (>80% coverage)
- Tennis match outcome prediction with probability scores
- Player performance analysis and statistics
- Head-to-head matchup analysis
- Feature importance investigation for tennis analytics
- Academic research in sports prediction and machine learning
- Tournament forecasting with batch predictions
Note: The system uses three surface-specific models (Hard, Clay, Grass) trained independently for optimal performance on each surface type.
| Metric | Value | Interpretation |
|---|---|---|
| Accuracy | 84.08% | Excellent for sports prediction |
| AUC-ROC | 0.937 | Excellent discrimination capability |
| Log Loss | 0.497 | Strong probabilistic calibration |
| Brier Score | 0.156 | High probability reliability |
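For readers unfamiliar with the metrics in the table above, the following is a minimal pure-Python sketch of how accuracy, log loss, and Brier score are conventionally computed from true outcomes and predicted win probabilities. The function names and the toy data are illustrative assumptions, not the project's evaluation code, and AUC-ROC is omitted for brevity.

```python
import math
from typing import Sequence


def accuracy(y_true: Sequence[int], p_pred: Sequence[float], threshold: float = 0.5) -> float:
    """Share of matches where the thresholded probability picks the actual winner."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


def log_loss(y_true: Sequence[int], p_pred: Sequence[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def brier_score(y_true: Sequence[int], p_pred: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


# Toy data: 1 means player 1 won; p_pred is the predicted probability of that.
y_true = [1, 0, 1, 1]
p_pred = [0.9, 0.2, 0.6, 0.4]
print(accuracy(y_true, p_pred), log_loss(y_true, p_pred), brier_score(y_true, p_pred))
```

Lower is better for log loss and Brier score, while higher is better for accuracy and AUC-ROC.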
- Average AUC: 0.942 ± 0.008 (range: 0.931-0.956)
- Average Accuracy: 85.06% ± 1.15%
- Best Fold: 86.91% accuracy (Fold 6)
- Log Loss: 0.491 Β± 0.010
| Surface | AUC-ROC | Accuracy | Log Loss | Holdout Matches |
|---|---|---|---|---|
| Hard | 0.937 | 84.08% | 0.497 | 8,174 |
| Clay | 0.918 | 80.71% | 0.431 | 4,848 |
| Grass | 0.936 | 84.28% | 0.331 | 1,539 |
- Python: 3.8 or higher
- Operating System: Linux, macOS, Windows (with WSL)
- Hardware: CPU (minimum), NVIDIA GPU with CUDA (optional, for acceleration)
# 1. Clone the repository
git clone https://github.com/ulpati/atp-tennis-predictor.git
cd atp-tennis-predictor
# 2. Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt
# 4. Verify installation
python3 atp.py --help
# Download match data for specific years (example: 2020-2024)
mkdir -p data
for year in {2020..2024}; do
wget -P data https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_${year}.csv
done
Data Source: Jeff Sackmann's ATP Tennis Repository (CC BY-NC-SA 4.0)
# Train a model (standard)
python3 atp.py train --data-dir data
# Train with optimization and surface-specific ELO
python3 atp.py train --data-dir data --optimize --elo-per-surface
# Predict a match
python3 atp.py predict "Novak Djokovic" "Carlos Alcaraz" Hard
# Evaluate model performance
python3 atp.py evaluate
# List top players
python3 atp.py list --top 20
# View player statistics
python3 atp.py player "Rafael Nadal"
# Head-to-head analysis
python3 atp.py h2h "Roger Federer" "Rafael Nadal"
atp-tennis-predictor/
├── atp.py                      # Main CLI application
├── atp_core/                   # Core library modules
│   ├── __init__.py             # Package initialization
│   ├── config.py               # Configuration constants
│   ├── elo.py                  # ELO rating engine
│   ├── features.py             # Feature extraction logic
│   ├── model.py                # Training and prediction
│   ├── shap_analysis.py        # SHAP interpretability
│   ├── utils.py                # Utility functions
│   └── exceptions.py           # Custom exceptions
├── data/                       # ATP match CSVs (gitignored)
│   └── atp_matches_YYYY.csv    # Annual match data
├── models/                     # Trained models (gitignored)
│   ├── artifact_xgb_*.joblib   # XGBoost models
│   ├── players_stats_*.joblib  # Player statistics
│   └── elo_dict_*.json         # ELO ratings
├── tests/                      # Test suite
│   ├── test_elo_engine.py      # ELO system tests
│   ├── test_prepare_feature.py # Feature extraction tests
│   ├── test_model.py           # Model training tests
│   └── test_utils.py           # Utility function tests
├── requirements.txt            # Python dependencies
├── LICENSE                     # CC BY-NC-SA 4.0 license
├── CITATION.md                 # Citation guidelines
└── README.md                   # This file
The system provides 10 commands accessible via python3 atp.py <command>:
Train an XGBoost model with optional hyperparameter optimization.
python3 atp.py train --data-dir data [OPTIONS]
Options:
| Option | Description | Default |
|---|---|---|
| --data-dir PATH | Directory containing ATP CSV files | data/ |
| --optimize | Enable Bayesian hyperparameter optimization | False |
| --use-gpu | Use NVIDIA GPU for acceleration | False |
| --elo-per-surface | Train separate ELO ratings per surface | False |
| --elo-k FLOAT | ELO K-factor (rating volatility) | 20.0 |
| --recent-n INT | Number of recent matches for form calculation | 30 |
| --surface-specific | Train separate models per surface (Hard/Clay/Grass) | False |
| --time-decay FLOAT | Exponential time decay factor (0=disabled, 0.95=recommended) | 0.0 |
| --stacking | Train stacking ensemble (XGBoost + LogisticRegression) | False |
Examples:
# Basic training
python3 atp.py train --data-dir data
# Recommended: surface-specific ELO
python3 atp.py train --data-dir data --elo-per-surface
# Advanced: optimization + surface-specific + time decay
python3 atp.py train --data-dir data --optimize --elo-per-surface --time-decay 0.95
# Full training with GPU
python3 atp.py train --data-dir data --optimize --surface-specific --use-gpu --time-decay 0.95
Predict the outcome of a match between two players on a specific surface.
python3 atp.py predict "Player A" "Player B" Surface [OPTIONS]
Arguments:
- Player A: First player name (fuzzy matching supported)
- Player B: Second player name (fuzzy matching supported)
- Surface: Court surface (Hard, Clay, or Grass)
Options:
- --artifact PATH: Path to trained model (auto-detects latest if not specified)
- --players PATH: Path to player stats file (auto-detects if not specified)
- --json: Output results in JSON format
Example:
python3 atp.py predict "Novak Djokovic" "Carlos Alcaraz" Hard
Output:
======================================================================
MATCH PREDICTION: Novak Djokovic vs Carlos Alcaraz
Surface: Hard | Confidence: LOW
======================================================================
Predicted Winner: Carlos Alcaraz
• Novak Djokovic win probability: 47.80%
• Carlos Alcaraz win probability: 52.20%
Raw probabilities (before averaging):
- Novak Djokovic -> Carlos Alcaraz: 56.03%
- Carlos Alcaraz -> Novak Djokovic: 60.44%
======================================================================
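The README does not spell out how the two raw probabilities are combined. One rule that reproduces the final figures shown above is to average the direct estimate for a player with the complement of the estimate obtained when the player order is swapped. The sketch below assumes that rule; the function name is hypothetical.

```python
def symmetric_win_probability(p1_listed_first: float, p2_listed_first: float) -> float:
    """Average player 1's direct estimate with the complement of the swapped estimate.

    p1_listed_first: model output for player 1 when player 1 is passed first.
    p2_listed_first: model output for player 2 when player 2 is passed first.
    """
    return 0.5 * (p1_listed_first + (1.0 - p2_listed_first))


# Raw values taken from the example output (56.03% and 60.44%).
p_djokovic = symmetric_win_probability(0.5603, 0.6044)
print(f"{p_djokovic:.1%} {1.0 - p_djokovic:.1%}")  # 47.8% 52.2%
```

Averaging both orderings guarantees that the two reported win probabilities sum to 100%, even when the raw model outputs do not.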
Evaluate model performance on holdout data with cross-validation metrics.
python3 atp.py evaluate [--artifact PATH] [--json]
Output includes:
- Holdout metrics (Accuracy, AUC-ROC, Log Loss, Brier Score)
- 10-fold temporal cross-validation results
- Performance interpretation
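Temporal cross-validation differs from ordinary k-fold in that every fold trains only on matches that precede its test block. The following is a minimal expanding-window sketch over chronologically sorted rows; the function name and the equal-sized blocks are assumptions, and the project's own fold construction may differ.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(n_samples: int, n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs over rows sorted from oldest to newest.

    Each fold trains on everything before its test block, so no future
    match is visible during training.
    """
    block = n_samples // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for fold in range(1, n_folds + 1):
        train_end = fold * block
        # The last fold absorbs any remainder rows.
        test_end = n_samples if fold == n_folds else train_end + block
        yield list(range(train_end)), list(range(train_end, test_end))


for train_idx, test_idx in expanding_window_splits(n_samples=12, n_folds=3):
    print(len(train_idx), test_idx)
```

With 12 rows and 3 folds, the training window grows from 3 to 6 to 9 rows while the test block always lies strictly after it.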
Display top players ranked by total match wins.
python3 atp.py list [--top N] [--players PATH]
Example:
python3 atp.py list --top 10
Output:
Rank Player Record Win% Surfaces
----------------------------------------------------------------------
1 Roger Federer 1250W-260L 82.8% Carpet:44, Clay:227, Grass:194, Hard:785
2 Novak Djokovic 1139W-224L 83.6% Carpet:9, Clay:291, Grass:121, Hard:718
3 Rafael Nadal 1091W-235L 82.3% Carpet:2, Clay:488, Grass:76, Hard:525
4 Andy Murray 748W-269L 73.5% Carpet:8, Clay:110, Grass:120, Hard:510
5 David Ferrer 740W-379L 66.1% Carpet:9, Clay:337, Grass:44, Hard:350
View detailed statistics for a specific player.
python3 atp.py player "Player Name" [--players PATH]
Example:
python3 atp.py player "Rafael Nadal"
Output:
PLAYER PROFILE: Rafael Nadal
Overall Record: 1091W-235L (82.3%)
Surface Breakdown:
• Carpet : 2W-6 L ( 25.0%)
• Clay : 488W-53 L ( 90.2%)
• Grass : 76W-21 L ( 78.4%)
• Hard : 525W-155L ( 77.2%)
Recent Form (last 10): W L L W W W W L W L
Analyze head-to-head record between two players.
python3 atp.py h2h "Player A" "Player B" [--players PATH]
Example:
python3 atp.py h2h "Roger Federer" "Rafael Nadal"
Output:
HEAD-TO-HEAD: Roger Federer vs Rafael Nadal
Overall Record:
Roger Federer: 17W
Rafael Nadal: 24W
Total: 41 matches
Win Percentage:
Roger Federer: 41.5%
Rafael Nadal: 58.5%
Surface Breakdown:
Hard : 12W-9L (57.1% for Roger Federer)
Clay : 2W-14L (12.5% for Roger Federer)
Grass : 3W-1L (75.0% for Roger Federer)
Predict multiple matches from a CSV file.
python3 atp.py batch matches.csv [OPTIONS]
CSV Format:
player1,player2,surface
Novak Djokovic,Rafael Nadal,Clay
Roger Federer,Andy Murray,Grass
Carlos Alcaraz,Jannik Sinner,Hard
Display XGBoost feature importance rankings.
python3 atp.py importance [--top N] [--artifact PATH]
Example:
python3 atp.py importance --top 10
Output:
TOP 10 MOST IMPORTANT FEATURES
Rank Feature Importance
----------------------------------------------------------------------
1 points_diff 0.225201 ██████████████████████
2 points_ratio 0.216739 █████████████████████
3 elo_exp_p1 0.126393 ████████████
4 elo_diff 0.116776 ███████████
5 rank_ratio 0.095208 █████████
6 rank_diff 0.016479 █
7 consistency_diff 0.016412 █
8 recent_diff 0.013811 █
9 elo_p2_before 0.009583
10 winrate_diff 0.008949
Compute SHAP values for model interpretability.
python3 atp.py shap [--top N] [--max-samples N] [--artifact PATH] [--json]
Purpose: Analyze feature contributions using SHAP (SHapley Additive exPlanations) values for more accurate attribution.
Automatically select important features using SHAP values.
python3 atp.py select-features [--threshold FLOAT] [--max-features N] [--artifact PATH]
Purpose: Identify and remove low-impact features to reduce overfitting and improve generalization.
The model uses 40+ engineered features across 8 categories:
| Category | Features | Description |
|---|---|---|
| ELO Ratings | 4 | Player ELO scores, differences, and expected probabilities |
| Win Rates | 6 | Global and surface-specific win percentages |
| ATP Rankings | 8 | Current rankings, points, differences, and ratios |
| Recent Form | 3 | Performance in last N matches (with optional time decay) |
| Momentum | 4 | Performance trends and consistency (standard deviation) |
| Serve Statistics | 6 | Aces, double faults, break points saved |
| Head-to-Head | 2 | Historical matchup records |
| Surface | 3 | One-hot encoded (Hard, Clay, Grass) |
Key Features:
- elo_diff: Difference in ELO ratings between players
- points_diff: Difference in ATP ranking points
- rank_ratio: Ratio of ATP rankings (handles unranked players)
- recent_diff: Recent form difference (last 30 matches)
- surf_winrate_diff: Surface-specific win rate difference
- h2h_p1_over_p2_prior: Head-to-head win rate
Important: All features are computed incrementally during temporal validation to prevent data leakage.
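To make the difference and ratio features concrete, here is a small sketch that derives elo_diff, points_diff, and rank_ratio from two per-player stat dictionaries. The dictionary keys, the function name, and the UNRANKED_RANK placeholder are hypothetical; the README only says that rank_ratio handles unranked players, not how.

```python
from typing import Dict, Optional

# Assumed placeholder rank for players without an ATP ranking (not from the project).
UNRANKED_RANK = 2000

PlayerStats = Dict[str, Optional[float]]


def pairwise_features(p1: PlayerStats, p2: PlayerStats) -> Dict[str, float]:
    """Build a few player-1-versus-player-2 features from pre-match statistics."""
    rank1 = p1.get("rank") or UNRANKED_RANK
    rank2 = p2.get("rank") or UNRANKED_RANK
    return {
        "elo_diff": p1["elo"] - p2["elo"],
        "points_diff": (p1.get("points") or 0.0) - (p2.get("points") or 0.0),
        "rank_ratio": rank1 / rank2,
    }


ranked = {"elo": 2100.0, "rank": 1, "points": 9000.0}
unranked = {"elo": 1950.0, "rank": None, "points": None}
print(pairwise_features(ranked, unranked))
```

Swapping the two arguments flips the sign of every difference feature and inverts the ratio, which is what makes predicting both player orderings meaningful. The inputs must reflect only matches played before the one being predicted, in line with the leakage note above.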
The ELO rating system calculates relative skill levels of players. Each player starts with a base rating of 1500, and ratings update after each match based on expected vs. actual outcomes.
expected_prob = 1 / (1 + 10^((elo_B - elo_A) / 400))
new_elo = old_elo + K * (actual_result - expected_prob)
- Base Rating: 1500 (for new players)
- K-Factor: 20.0 (default, controls rating volatility)
- Surface-Specific: Optional (separate ELO for Hard/Clay/Grass)
The K-factor controls how much ratings change after each match:
- K=10 (Conservative): More stable, slower adaptation
- K=20 (Balanced): Standard for most sports - recommended
- K=32 (Aggressive): Faster adaptation to form changes
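The update rule and the effect of the K-factor can be sketched in a few lines. This is an illustration of the standard ELO formulas given above, with hypothetical function names; it is not the contents of atp_core/elo.py.

```python
from typing import Tuple


def expected_score(elo_a: float, elo_b: float) -> float:
    """Probability that player A beats player B under the ELO model."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


def update_elo(winner_elo: float, loser_elo: float, k: float = 20.0) -> Tuple[float, float]:
    """Return the (winner, loser) ratings after one match."""
    delta = k * (1.0 - expected_score(winner_elo, loser_elo))
    # The loser's change is k * (0 - expected_loser), which equals -delta.
    return winner_elo + delta, loser_elo - delta


# A 1600-rated favourite beats a 1500-rated opponent at three K-factors.
for k in (10.0, 20.0, 32.0):
    winner, loser = update_elo(1600.0, 1500.0, k)
    print(k, round(winner, 1), round(loser, 1))
```

Because both players use the same K, the update is zero-sum: the points gained by the winner equal the points lost by the loser, and a larger K simply scales that exchange.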
Usage:
# Standard K-factor
python3 atp.py train --data-dir data --elo-k 20.0
# Conservative ratings
python3 atp.py train --data-dir data --elo-k 10.0
# Aggressive ratings
python3 atp.py train --data-dir data --elo-k 32.0
# Before match
Djokovic: 2100 ELO (Hard)
Alcaraz: 1950 ELO (Hard)
# Expected probability
P(Djokovic wins) = 1 / (1 + 10^((1950-2100)/400)) = 0.703
# If Djokovic wins:
new_elo_djokovic = 2100 + 20 * (1 - 0.703) = 2105.9
new_elo_alcaraz = 1950 + 20 * (0 - 0.297) = 1944.1
ATP match data is obtained from Jeff Sackmann's public repository:
github.com/JeffSackmann/tennis_atp
- ATP matches from 1968 to present (tour-level, challenger, futures)
- Historical rankings from 1973
- Detailed match stats (from 1991 for tour-level, 2008 for challenger)
- Player biographical data (name, birth date, country, height, handedness)
Each atp_matches_YYYY.csv file contains:
tourney_id, tourney_name, surface, draw_size, tourney_level, tourney_date,
winner_id, winner_name, winner_rank, winner_rank_points, winner_age, winner_ht,
loser_id, loser_name, loser_rank, loser_rank_points, loser_age, loser_ht,
score, best_of, round, minutes,
w_ace, w_df, w_svpt, w_1stIn, w_1stWon, w_2ndWon, w_SvGms, w_bpSaved, w_bpFaced,
l_ace, l_df, l_svpt, l_1stIn, l_1stWon, l_2ndWon, l_SvGms, l_bpSaved, l_bpFaced
See matches_data_dictionary.txt in the original repository for complete field descriptions.
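Since every feature depends on processing matches in chronological order, a loader typically reads the annual files and sorts them by tourney_date. The sketch below uses only the standard library; the function name is hypothetical and it is not the project's loader. The glob pattern is restricted to four-digit years so that other files in the upstream repository, such as challenger or futures files, are not picked up.

```python
import csv
import glob
import os
from typing import Dict, List


def load_matches(data_dir: str) -> List[Dict[str, str]]:
    """Read every atp_matches_YYYY.csv in data_dir, oldest tournament first."""
    pattern = os.path.join(data_dir, "atp_matches_[0-9][0-9][0-9][0-9].csv")
    rows: List[Dict[str, str]] = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    # tourney_date is stored as YYYYMMDD, so a string sort is chronological.
    rows.sort(key=lambda row: row["tourney_date"])
    return rows


matches = load_matches("data")
print(f"loaded {len(matches)} matches")
```

All values come back as strings, so numeric columns such as winner_rank or w_ace need explicit conversion, and empty strings should be treated as missing.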
# Method 1: Direct download (recommended for specific years)
mkdir -p data
for year in {2000..2024}; do
wget -P data https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_${year}.csv
done
# Method 2: Clone entire repository (all data)
git clone https://github.com/JeffSackmann/tennis_atp.git temp_data
mv temp_data/atp_matches_*.csv data/
rm -rf temp_data
ATP match data is © Jeff Sackmann / Tennis Abstract and is licensed under CC BY-NC-SA 4.0.
Terms:
- Attribution required: Must cite Jeff Sackmann / Tennis Abstract
- Non-commercial use only: Cannot use data for commercial purposes
- ShareAlike: Modifications must be shared under the same license
Time decay gives more weight to recent matches when calculating player statistics.
Formula: weight = decay_factor ^ days_ago
Decay Factor Examples:
- 0.90: Aggressive - strong focus on very recent form (half-life: 6.6 days)
- 0.95: Recommended - balanced recency bias (half-life: 13.5 days)
- 0.98: Conservative - slower adaptation (half-life: 34.3 days)
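The half-lives quoted above follow directly from the weight formula: the weight reaches 0.5 when days_ago = ln(0.5) / ln(decay_factor). The sketch below checks those figures and shows one way such weights could feed a win rate; the function names are hypothetical, and how the project applies the weights inside each feature is an assumption.

```python
import math
from typing import Iterable, Tuple


def half_life_days(decay_factor: float) -> float:
    """Days until a match's weight falls to 0.5 under weight = decay_factor ** days_ago."""
    if not 0.0 < decay_factor < 1.0:
        raise ValueError("decay_factor must be strictly between 0 and 1")
    return math.log(0.5) / math.log(decay_factor)


def decayed_win_rate(results: Iterable[Tuple[int, bool]], decay_factor: float) -> float:
    """Weighted share of wins, where results holds (days_ago, won) pairs."""
    weighted = [(decay_factor ** days_ago, won) for days_ago, won in results]
    total = sum(weight for weight, _ in weighted)
    return sum(weight for weight, won in weighted if won) / total if total else 0.0


for factor in (0.90, 0.95, 0.98):
    print(factor, round(half_life_days(factor), 1))  # 6.6, 13.5, 34.3
```

For example, a win today and a loss two weeks ago give a decayed win rate above 50% at a factor of 0.95, because the older loss carries roughly half the weight.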
Usage:
python3 atp.py train --data-dir data --time-decay 0.95
Impact: Recent matches weighted more heavily in features like recent_form, winrate, and consistency.
Train separate XGBoost models for each court surface (Hard, Clay, Grass).
Benefits:
- Captures surface-specific patterns
- Better accuracy for surface specialists
- Improved predictions for surface-specific tournaments
Usage:
python3 atp.py train --data-dir data --surface-specific --optimize
Meta-learning approach combining multiple XGBoost models with different hyperparameters.
Architecture:
- Base Model 1: Conservative (depth=5, lr=0.02, high regularization)
- Base Model 2: Balanced (depth=7, lr=0.03, moderate)
- Base Model 3: Aggressive (depth=9, lr=0.01, low regularization)
- Meta-Learner: LogisticRegression with class_weight='balanced'
Usage:
python3 atp.py train --stacking --optimize --use-gpu
Expected Benefits:
- +1-2% accuracy improvement
- Better calibration and confidence estimates
- Reduced variance through model averaging
Automatic feature pruning using SHAP (SHapley Additive exPlanations) values.
Algorithm:
- Train model on full feature set
- Compute SHAP values for sample of test data
- Calculate mean absolute SHAP value per feature
- Remove features below threshold (default: 0.01)
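The last two steps of the algorithm reduce to a column-wise mean of absolute SHAP values followed by a threshold filter. The sketch below works on a plain list-of-rows SHAP matrix, so it runs without the shap library; the function name, the toy values, and the noise_feature column are illustrative assumptions.

```python
from typing import List, Sequence


def select_by_mean_abs_shap(
    shap_values: Sequence[Sequence[float]],
    feature_names: Sequence[str],
    threshold: float = 0.01,
) -> List[str]:
    """Keep features whose mean |SHAP| over the sampled rows reaches the threshold."""
    n_rows = len(shap_values)
    kept = []
    for col, name in enumerate(feature_names):
        mean_abs = sum(abs(row[col]) for row in shap_values) / n_rows
        if mean_abs >= threshold:
            kept.append(name)
    return kept


# Two sampled rows, three features (toy SHAP values).
shap_rows = [
    [0.30, -0.002, 0.05],
    [-0.20, 0.004, -0.03],
]
print(select_by_mean_abs_shap(shap_rows, ["points_diff", "noise_feature", "elo_diff"]))
# ['points_diff', 'elo_diff']
```

Taking the absolute value before averaging matters: a feature that pushes predictions strongly in both directions would otherwise average out to near zero and be dropped.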
Usage:
python3 atp.py select-features --threshold 0.01 --max-samples 1000
Benefits:
- Reduces overfitting by removing noise features
- Faster inference with fewer features
- Improved model interpretability
This project supports NVIDIA GPU acceleration for faster training using CUDA.
- NVIDIA GPU with CUDA support (compute capability 3.5+)
- NVIDIA Driver installed (verify with nvidia-smi)
- XGBoost 3.x with CUDA support (included in requirements.txt)
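In XGBoost 2.0 and later, the training device is chosen with the device parameter, while tree_method="hist" works on both CPU and GPU. How this project maps --use-gpu to XGBoost parameters is not shown in this README, so the helper below is a hypothetical sketch of one such mapping.

```python
from typing import Any, Dict


def xgb_device_params(use_gpu: bool) -> Dict[str, Any]:
    """Return XGBoost parameters that select CPU or CUDA training."""
    # `device` replaces the older `gpu_hist` tree method from XGBoost 2.0 onward.
    return {"tree_method": "hist", "device": "cuda" if use_gpu else "cpu"}


print(xgb_device_params(use_gpu=True))
```

The resulting dictionary can be merged into the other hyperparameters and passed to xgboost.XGBClassifier as keyword arguments.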
# Check NVIDIA GPU
nvidia-smi
# Test XGBoost GPU support
python3 -c "import xgboost as xgb; print(f'CUDA: {xgb.build_info()[\"USE_CUDA\"]}')"
# Enable GPU with --use-gpu flag
python3 atp.py train --data-dir data --use-gpu --optimize
Note: GPU acceleration is optional. The system works perfectly with CPU-only training.
# Activate virtual environment
source venv/bin/activate
# Run all tests
PYTHONPATH=$PWD:$PYTHONPATH pytest tests/ -v
# Test with coverage
PYTHONPATH=$PWD:$PYTHONPATH pytest tests/ --cov=atp_core --cov-report=html
# Run specific test files
PYTHONPATH=$PWD:$PYTHONPATH pytest tests/test_elo_engine.py -v
PYTHONPATH=$PWD:$PYTHONPATH pytest tests/test_prepare_feature.py -v
PYTHONPATH=$PWD:$PYTHONPATH pytest tests/test_model.py -v
PYTHONPATH=$PWD:$PYTHONPATH pytest tests/test_utils.py -v
- ELO rating system tests
- Feature extraction and time decay tests
- Model training and SHAP analysis tests
- Utility function tests
- Expected: >80% coverage on core modules
Note: PYTHONPATH=$PWD:$PYTHONPATH is required to allow tests to import the atp_core and atp modules.
# Paths
export DATA_DIR="data"
export MODEL_DIR="models"
export LOG_LEVEL="INFO"
# ELO
export ELO_K_FACTOR="20.0"
export ELO_PER_SURFACE="true"
# XGBoost
export XGBOOST_LEARNING_RATE="0.02"
export XGBOOST_MAX_DEPTH="7"
export USE_GPU="false"
# Features
export RECENT_N_MATCHES="30"
Modify default parameters in atp_core/config.py:
# Default configuration constants
DEFAULT_ELO_K = 20.0
DEFAULT_BASE_ELO = 1500
DEFAULT_RECENT_N = 30
DEFAULT_TIME_DECAY = 0.0
# Verify data directory
ls data/atp_matches_*.csv
# Download data
mkdir -p data
wget -P data https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2024.csv
# Train model first
python3 atp.py train --data-dir data
# Use fuzzy matching (automatic)
python3 atp.py predict "Fedrer" "Nadal" Hard # Finds "Federer"
# List available players
python3 atp.py list --top 100
# Verify NVIDIA driver
nvidia-smi
# Check CUDA in XGBoost
python3 -c "import xgboost as xgb; print(xgb.build_info())"
# Reinstall XGBoost with CUDA support
pip uninstall xgboost
pip install xgboost --no-cache-dir
# Ensure virtual environment is activated
source venv/bin/activate
# Reinstall dependencies
pip install -r requirements.txt
# For tests, use PYTHONPATH
PYTHONPATH=$PWD:$PYTHONPATH pytest tests/ -v
- Sackmann, Jeff. ATP Tennis Rankings, Results, and Stats. GitHub repository. github.com/JeffSackmann/tennis_atp. Licensed under CC BY-NC-SA 4.0.
- Tennis Abstract by Jeff Sackmann
- ATP Official Website
- XGBoost Documentation
- Scikit-learn Documentation
- SHAP (SHapley Additive exPlanations)
- ELO Rating System
- Temporal Cross-Validation
This project would not be possible without:
- Jeff Sackmann (Tennis Abstract): For maintaining the comprehensive ATP tennis dataset for over a decade. His dedication to open tennis data has enabled countless research projects and applications. Repository: JeffSackmann/tennis_atp
- XGBoost Development Team: For creating and maintaining an exceptional gradient boosting library that powers the prediction engine
- Scikit-learn Contributors: For the robust machine learning framework and tools used throughout this project
- SHAP Development Team: For the interpretability tools that make model decisions transparent
- Tennis Analytics Community: Including researchers and practitioners on Kaggle and elsewhere who have shared insights and best practices in sports prediction
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) - see the LICENSE file for details.
If you use this project in research or academic work, please cite both the software and the ATP dataset.
See CITATION.md for detailed citation formats (BibTeX, APA, IEEE, Chicago).
Quick Citation:
ulpati. (2025). ATP Tennis Match Predictor: Machine learning system for match
prediction (Version 1.0) [Computer software]. GitHub.
https://github.com/ulpati/atp-tennis-predictor
Sackmann, J. (2025). ATP tennis rankings, results, and stats [Data set].
Tennis Abstract. https://github.com/JeffSackmann/tennis_atp
Author: ulpati
If you find this project useful, please consider starring the repository!
Questions? Open an issue at github.com/ulpati/atp-tennis-predictor/issues