End-to-end ML system to predict, explain, and act on customer churn using Random Forest, SHAP Explainable AI, CLV analysis, and a Streamlit dashboard.
The ChurnIQ interface — real-time churn prediction, SHAP explanations, CLV analysis, and What-If simulations in one place.
| Feature | Description |
|---|---|
| 🤖 Multi-Model Training | Random Forest (Optuna-tuned), Gradient Boosting, and Logistic Regression trained and compared automatically |
| 🧠 Explainable AI (XAI) | SHAP bar charts, feature impact tables, and per-prediction explanations — not just what, but why |
| 💰 CLV & Revenue Risk | Estimates Customer Lifetime Value, revenue at risk, and retention ROI per customer |
| 🔄 What-If Simulator | Re-scores 7 intervention scenarios instantly (e.g. "switch to 2-year contract") |
| 📊 Batch Prediction | Upload a CSV to score hundreds of customers at once with downloadable results |
| ⚖️ Threshold Optimisation | F1-optimal classification threshold via sweep — prioritises recall on the minority churn class |
| 🔁 SMOTE Balancing | Synthetic oversampling applied only to training data to handle the 73/27 class imbalance |
ChurnIQ uses SHAP (SHapley Additive exPlanations) to make every prediction fully interpretable. This is critical for business trust — a churn score alone isn't actionable; knowing why a customer is at risk is.
SHAP assigns each feature a contribution value (positive = pushes toward churn, negative = pushes toward retention) based on Shapley values from cooperative game theory.
| Model | SHAP Explainer Used | Why |
|---|---|---|
| Logistic Regression | `LinearExplainer` | Exact, fast — leverages linear structure |
| Random Forest | `TreeExplainer` | Exact, leverages tree paths directly |
| Gradient Boosting | `TreeExplainer` | Same — compatible with boosted trees |
| Any other model | `KernelExplainer` | Model-agnostic fallback (slower) |
The app auto-detects the winning model type and selects the correct explainer — no manual config needed.
- SHAP Bar Chart — Top 12 features ranked by absolute impact, coloured red (increases churn) or blue (reduces churn)
- Feature Impact Table — Full ranked table with SHAP values and direction arrows (↑ / ↓) for every encoded feature
- Retention Strategies — Generated from the top-5 SHAP features, so recommendations are grounded in the actual drivers of this customer's risk
```
tenure          ████████████░░░░   −0.18   ↓ Reduces Churn   (long-tenured = loyal)
Contract_Month  ░░░░░████████████  +0.21   ↑ Increases Churn (no commitment)
TechSupport_No  ░░░░░░░████████░░  +0.14   ↑ Increases Churn (unresolved issues)
MonthlyCharges  ░░░░░░░░░████████  +0.11   ↑ Increases Churn (high cost sensitivity)
```
Background Note:
`LinearExplainer` uses a zero-vector background distribution to compute SHAP values for single-row inference — this gives each feature a proper baseline to compare against and avoids the zero-variance collapse that occurs when passing the input row as its own background.
```
telco-churn-prediction/
├── data/
│   └── WA_Fn-UseC_-Telco-Customer-Churn.csv   # dataset
├── demo/
│   ├── churniq_demo.webm                      # demo video
│   └── dash.jpeg                              # dashboard screenshot
├── evaluations/                               # model evaluation plots (ROC, PR, F1, confusion matrix, distribution)
├── models/                                    # auto-generated after training
│   ├── best_model.pkl
│   ├── preprocessor.pkl
│   ├── feature_names.pkl
│   ├── column_info.pkl
│   ├── model_comparison.pkl
│   ├── raw_columns.pkl
│   └── meta.pkl
├── src/
│   ├── __init__.py                            # enables module imports across the project
│   ├── preprocess.py                          # cleaning, OHE, scaling
│   ├── train.py                               # SMOTE + Optuna + model comparison
│   ├── clv.py                                 # Customer Lifetime Value logic
│   ├── evaluates.py                           # model evaluation script
│   └── retention.py                           # retention strategy recommender
├── .gitignore                                 # files excluded from version control
├── Dockerfile                                 # containerisation (not yet implemented)
├── app.py                                     # Streamlit dashboard (4 tabs)
├── info.py                                    # requirements check script
├── requirements.txt                           # Python dependencies
└── README.md                                  # project documentation (this file)
```
| Property | Value |
|---|---|
| Source | IBM / Kaggle |
| Rows | 7,043 customers |
| Features | 19 input columns + 1 target (Churn) |
| Class Split | ~73.5% No Churn / ~26.5% Churn |
| Known Issue | TotalCharges stored as string — 11 rows contain whitespace instead of a number |
Feature Groups:
- Demographics — gender, SeniorCitizen, Partner, Dependents
- Services — PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
- Billing — Contract, PaperlessBilling, PaymentMethod
- Numeric — tenure, MonthlyCharges, TotalCharges
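The `TotalCharges` quirk can be handled with a two-line coercion, sketched here on toy data:

```python
# Coerce TotalCharges to numeric; whitespace rows become NaN and are dropped.
import pandas as pd

df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"])
```

`errors="coerce"` turns any unparseable string into `NaN`, which is exactly what the 11 whitespace rows become before they are dropped.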
- Load raw CSV with `pandas.read_csv()`
- Coerce `TotalCharges` to numeric — 11 whitespace rows become `NaN` and are dropped
- Drop `customerID` (no predictive value) and map `Churn` → 0/1
- Clip outliers on numeric columns using the IQR method (Q1 − 1.5×IQR to Q3 + 1.5×IQR)
- Encode categoricals with `OneHotEncoder` (not `LabelEncoder` — avoids false ordinal ranking)
- Scale numerics with `StandardScaler` via `ColumnTransformer`
- Pass through `SeniorCitizen` as-is (already 0/1)
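A minimal sketch of the transformer above, with an assumed subset of columns (the real `src/preprocess.py` covers all 19 features):

```python
# OHE for categoricals, StandardScaler for numerics,
# SeniorCitizen passed through unchanged.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_cols = ["Contract", "PaymentMethod"]   # subset for illustration
passthrough_cols = ["SeniorCitizen"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("bin", "passthrough", passthrough_cols),
])

df = pd.DataFrame({
    "tenure": [2, 58],
    "MonthlyCharges": [95.0, 48.0],
    "TotalCharges": [190.0, 2784.0],
    "Contract": ["Month-to-month", "Two year"],
    "PaymentMethod": ["Electronic check", "Bank transfer (automatic)"],
    "SeniorCitizen": [0, 1],
})
X = preprocessor.fit_transform(df)   # matrix ready for training
```

`handle_unknown="ignore"` keeps inference from crashing on category values unseen at fit time.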
The fitted `ColumnTransformer` is saved to `models/preprocessor.pkl` and reused at inference — ensuring identical encoding every time.
SMOTE (Synthetic Minority Over-sampling Technique) is applied only to the training set after the 80/20 stratified split, balancing the churn class from ~26% to 50% without touching test data.
Optuna runs 30 trials of Bayesian optimisation, each scored by 5-fold stratified CV (ROC-AUC). Tuned parameters: n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features.
Three models are trained and compared on the same SMOTE-balanced data:
| Model | Strength |
|---|---|
| Random Forest (Optuna-tuned) | Robust, handles mixed types, SHAP-compatible |
| Gradient Boosting | Often highest accuracy |
| Logistic Regression | Fast, linear baseline |
The best model by Test AUC is auto-selected and saved.
Rather than keeping the default 0.5, the classification threshold is swept from 0.2 → 0.8 to find the value that maximises F1 score — prioritising recall on the minority churn class over raw accuracy.
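The sweep can be sketched as follows (an assumed implementation, not the project's code):

```python
# Try cutoffs from 0.2 to 0.8 and keep the one with the highest F1.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob):
    """Return the probability cutoff in [0.2, 0.8] that maximises F1."""
    thresholds = np.arange(0.2, 0.81, 0.01)
    scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in thresholds]
    return float(thresholds[int(np.argmax(scores))])
```

On an imbalanced class, lowering the cutoff below 0.5 usually trades a little precision for much better recall on churners, which is the trade-off F1 rewards here.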
Run with `streamlit run app.py` → opens at http://localhost:8501
| Tab | What It Shows |
|---|---|
| 🔮 Prediction & SHAP | Churn verdict, confidence, risk badge, SHAP bar chart explaining why, feature impact table, retention strategy cards |
| 💰 CLV & Revenue Risk | Estimated customer lifetime value, revenue at risk, retention ROI, customer tier (Bronze→Platinum) |
| 🔄 What-If Simulator | Re-scores 7 scenarios (e.g. "switch to 2-year contract") to show churn probability change |
| 📊 Model Comparison & Batch | AUC/F1 comparison chart + CSV upload for bulk scoring with download |
```bash
git clone https://github.com/nabakrishna/telco-churn-prediction.git
cd telco-churn-prediction
python -m venv venv

# Activate on macOS/Linux
source venv/bin/activate

# Activate on Windows
venv\Scripts\activate

pip install -r requirements.txt
```

Download the CSV from Kaggle — Telco Customer Churn and place it at:

```
data/WA_Fn-UseC_-Telco-Customer-Churn.csv
```
Run the training script from the project root:

```bash
python src/train.py
```

Training takes approximately 3–5 minutes. The console will print:
- SMOTE class counts before and after balancing
- Optuna trial progress (30 trials)
- Model comparison table (AUC + F1 for all 3 models)
- Optimal classification threshold
- Final classification report (precision, recall, F1 per class)
After training, the models/ folder will be populated with 7 artifact files ready for the app.
After training, launch the app from the project root:

```bash
streamlit run app.py
```

The app will open in your browser at http://localhost:8501. If port 8501 is busy, use:

```bash
streamlit run app.py --server.port 8502
```
Fill in the customer's attributes across four sections:
| Section | Fields |
|---|---|
| Demographics | Gender, Senior Citizen, Partner, Dependents |
| Services | Phone, Internet type, Security, Backup, Streaming, etc. |
| Billing | Contract type, Payment method, Paperless billing |
| Charges | Tenure (months), Monthly & Total charges via sliders |
Hit the button at the bottom of the sidebar to run inference.
| Tab | What You See |
|---|---|
| 🔮 Prediction & SHAP | Churn verdict, confidence score, risk badge, SHAP bar chart explaining why, feature impact table, retention strategy cards with priority levels |
| 💰 CLV & Revenue Risk | Customer lifetime value, revenue at risk if they churn, retention ROI, customer tier (Bronze → Platinum) |
| 🔄 What-If Simulator | 7 pre-scored scenarios (e.g. "switch to 2-year contract") ranked by resulting churn probability |
| 📊 Model Comparison & Batch | AUC/F1 chart for all 3 models + CSV upload to score hundreds of customers at once |
| Field | Value |
|---|---|
| Contract | Month-to-month |
| Tenure | 2 months |
| Internet Service | Fiber optic |
| Tech Support | No |
| Payment Method | Electronic check |
| Monthly Charges | $95 |
Expected: ⚠️ LIKELY TO CHURN
| Field | Value |
|---|---|
| Contract | Two year |
| Tenure | 58 months |
| Internet Service | DSL |
| Tech Support | Yes |
| Payment Method | Bank transfer (automatic) |
| Monthly Charges | $48 |
Expected: ✅ LIKELY TO STAY (~85–92% confidence)
| Error | Fix |
|---|---|
| `FileNotFoundError` on dataset | Check the CSV filename matches exactly: `WA_Fn-UseC_-Telco-Customer-Churn.csv` |
| Model artifacts not found in app | Run `python src/train.py` before launching the app |
| `ModuleNotFoundError` | Make sure your virtual environment is activated |
| Port already in use | Run `streamlit run app.py --server.port 8502` |
| SHAP chart showing all zeros | Ensure `LinearExplainer` uses a zero-vector background, not the input row itself |
| SHAP not showing | Install with `pip install shap`, then restart the app |
| Package | Purpose |
|---|---|
| `pandas` | Data loading and manipulation |
| `scikit-learn` | Preprocessing, models, evaluation |
| `imbalanced-learn` | SMOTE oversampling |
| `optuna` | Bayesian hyperparameter tuning |
| `shap` | Explainable AI — `LinearExplainer`, `TreeExplainer`, `KernelExplainer` |
| `streamlit` | Interactive web dashboard |
| `matplotlib` | Charts and visualisations |
| `numpy` | Numerical operations |
| `scipy` | Sparse matrix handling for SHAP compatibility |
