Team Name: Red Devils | Team Number: 298 | Members: Siddarth Gottumukkula, M P Samartha, Shlok Sand, Vedant Pahariya, Shreyas Kasture
Enterprise-grade customer retention system combining Machine Learning, Explainable AI, Counterfactual Analysis, and Behavioral Economics for the auto insurance industry.
Built for Megathon'25 Innovation Challenge
- Overview
- Key Features
- System Architecture
- Technology Stack
- Installation
- Usage Guide
- Project Structure
- Methodology
- Results
- Future Enhancements
- Team
The Churn Retention Intelligence Platform is an AI-powered solution designed to help insurance companies predict, understand, and prevent customer churn. Unlike traditional black-box ML models, our system provides:
- Predictive Analytics: Churn prediction with a tuned XGBoost model (88.3% test-set accuracy; see Results)
- Explainable AI: SHAP analysis reveals why each customer might leave
- Actionable Strategies: DiCE counterfactuals show exactly what to change
- Behavioral Nudges: Psychology-based retention tactics beyond discounts
- Interactive Simulation: Real-time "what-if" analysis for retention strategies
Insurance companies lose 15-25% of customers annually to churn, costing billions in revenue. Traditional approaches:
- ❌ Can't predict which customers will leave
- ❌ Don't understand why customers churn
- ❌ Offer generic discounts that hurt margins
- ❌ Lack personalized retention strategies
A comprehensive AI system that:
- Predicts churn risk for every customer (full metrics in Results)
- Explains the top risk factors for each customer (SHAP)
- Prescribes minimal changes needed to retain them (DiCE)
- Suggests behavioral nudges proven to reduce churn
- Simulates impact of different retention strategies
- Real-time churn risk monitoring across 336K+ customers
- Global feature importance analysis
- Priority action list for high-risk customers
- Revenue-at-risk calculations
- Individual churn probability (0-100%)
- SHAP waterfall plots showing risk drivers
- Top 5 factors contributing to churn
- AI-generated customer personas
- Behavioral nudge recommendations
- Shows 3 different retention scenarios
- Minimal changes needed to reduce churn
- Only modifies actionable features (premiums, not age/tenure)
- Calculates expected risk reduction
- Real-time premium adjustment sliders
- Instant churn probability recalculation
- Business impact metrics (ROI, retention cost)
- Ready-to-implement action plans
- Loss Aversion: "You'll lose your $847 accident-free bonus"
- Social Proof: "92% of customers in Indore renewed"
- Reciprocity: "Free vehicle health check-up"
- Commitment: "Lock in your rate for 2 years"
- Scarcity: "Offer expires in 48 hours"
- Anchoring: "You've saved $2,340 over 3 years"
- Qwen 2.5 (3B parameters) via LM Studio
- Synthesizes SHAP + DiCE insights
- Generates customer personas
- Creates ready-to-send message templates
- Works offline (no API costs)
```
┌─────────────────────────────────────────────────────────────┐
│                        DATA PIPELINE                        │
│       Raw CSV → Feature Engineering → Train/Test Split      │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      ML MODELING LAYER                      │
│         XGBoost (CPU) → Optuna Tuning → Evaluation          │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     EXPLAINABILITY LAYER                    │
│  ┌──────────────┐  ┌───────────────┐  ┌──────────────┐      │
│  │     SHAP     │  │     DiCE      │  │  LLM (Qwen)  │      │
│  │ (Why churn?) │  │ (How to fix?) │  │ (Synthesize) │      │
│  └──────────────┘  └───────────────┘  └──────────────┘      │
└─────────────────────────────┬───────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     PRESENTATION LAYER                      │
│                     Streamlit Dashboard                     │
│  ┌───────────┐  ┌────────────┐  ┌──────────────────┐        │
│  │ Portfolio │  │ Customer   │  │ Retention        │        │
│  │ Overview  │  │ Deep Dive  │  │ Simulator        │        │
│  └───────────┘  └────────────┘  └──────────────────┘        │
└─────────────────────────────────────────────────────────────┘
```
- Python 3.9+: Primary language
- Pandas: Data manipulation (336K+ rows, 28 features)
- NumPy: Numerical computing
- Scikit-learn: Preprocessing, evaluation
- XGBoost: Gradient boosting classifier (tuned with Optuna; metrics in Results)
- Optuna: Hyperparameter optimization
- SHAP (SHapley Additive exPlanations): Model interpretability
- DiCE-ML: Diverse Counterfactual Explanations
- Matplotlib/Seaborn: Visualizations
- LM Studio: Local LLM inference
- Qwen 2.5 (3B): Language model for insights
- Requests: API communication
- Streamlit: Interactive web interface
- Custom CSS: Professional styling
- Joblib: Model serialization
- Git: Version control
Verify prerequisites:

```bash
# Check Python version (3.9+ required)
python --version

# Check pip
pip --version
```

Clone the repository:

```bash
git clone https://github.com/Geekonatrip123/Megathon.git
cd Megathon
```

Create and activate a virtual environment:

```bash
python -m venv venv

# Activate it
# Windows:
venv\Scripts\activate

# macOS/Linux:
source venv/bin/activate
```

Install all required packages:

```bash
pip install -r requirements.txt
```

requirements.txt:

```
# Core Data Science
pandas==2.0.3
numpy==1.24.3
scikit-learn==1.3.0

# Machine Learning
xgboost==2.0.3
optuna==3.3.0

# Explainable AI
shap==0.42.1
dice-ml==0.11

# Visualization
matplotlib==3.7.2
seaborn==0.12.2

# Dashboard
streamlit==1.28.0

# Utilities
joblib==1.3.2
requests==2.31.0
```

Place your dataset in the project root. All the CSV and PKL files can be downloaded from: https://drive.google.com/drive/folders/16kiTYnvHENG4nyMVJsIt0IXIahScT62D?usp=sharing
Dataset Requirements:
- Format: CSV
- Size: ~336,000 rows
- Key columns: `individual_id`, `Churn`, demographics, premium info
1. Download LM Studio: https://lmstudio.ai/
2. Install Qwen from the model library
3. Start the local server:
   - Open LM Studio
   - Load `qwen/qwen3-4b-thinking-2507`
   - Click "Start Server" (default: `localhost:1234`)
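Once the server is running, the pipeline can talk to it over LM Studio's OpenAI-compatible REST endpoint. The sketch below is our own illustration (the helper names are not project code; the endpoint path is LM Studio's default):

```python
# Minimal client for LM Studio's OpenAI-compatible local server.
# Assumes the server from the setup steps above is listening on localhost:1234.
LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_llm_request(prompt: str, model: str = "qwen/qwen3-4b-thinking-2507") -> dict:
    """Build the JSON payload for a chat-completion call."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a customer-retention analyst."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
        "max_tokens": 512,
    }

def ask_llm(prompt: str) -> str:
    """POST the prompt to the local server and return the reply text."""
    import requests  # listed in requirements.txt
    resp = requests.post(LM_STUDIO_URL, json=build_llm_request(prompt), timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Because inference runs locally, there are no per-token API costs and the dashboard keeps working offline.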
Run all scripts in sequence:

```bash
# 1. Data preprocessing & feature engineering
python 01_eda_and_preprocessing.py

# 2. Model training & evaluation
python 02_modeling_pipeline.py

# 3. SHAP explainability analysis
python 03_shap_explainability_with_llm.py

# 4. DiCE counterfactual generation (optional)
python dice_counterfactuals.py

# 5. Launch dashboard
streamlit run 04_dashboard.py
```

If you have pre-trained models:

```bash
# Just launch the dashboard
streamlit run 04_dashboard.py
```

Required files in project root:

- `final_xgboost_model.pkl`
- `scaler.pkl`
- `feature_names.pkl`
- `shap_data_package.pkl`
- `shap_explainer.pkl`
- `test_data_with_predictions.csv`
- `shap_global_importance.csv`
1. Portfolio Overview (Churn Dashboard tab)
   - View global metrics
   - Identify high-risk customers
   - Analyze top churn drivers

2. Customer Deep Dive (Customer Deep Dive tab)
   - Search by Customer ID or Risk Level
   - Click "ANALYZE CUSTOMER"
   - View SHAP explanation + AI insights + DiCE strategies

3. Retention Simulator (⚙️ Retention Simulator tab)
   - Select customer
   - Adjust annual premium slider
   - Click "RUN SIMULATION"
   - View predicted impact
```
Megathon/
│
├── 01_eda_and_preprocessing.py         # Data cleaning & feature engineering
├── 02_modeling_pipeline.py             # Model training & evaluation
├── 03_shap_explainability_with_llm.py  # SHAP analysis & LLM integration
├── dice_counterfactuals.py             # DiCE counterfactual generation
├── 04_dashboard.py                     # Streamlit interactive dashboard
│
├── llm_utils.py                        # LLM helper functions
├── requirements.txt                    # Python dependencies
├── README.md                           # This file
│
├── autoinsurance_churn.csv             # Raw dataset (input)
├── autoinsurance_churn_engineered.csv  # Processed dataset
│
├── final_xgboost_model.pkl             # Trained XGBoost model
├── scaler.pkl                          # StandardScaler for features
├── feature_names.pkl                   # List of feature names
├── label_encoders.pkl                  # Categorical encoders
│
├── shap_explainer.pkl                  # SHAP explainer object
├── shap_data_package.pkl               # SHAP values & customer data
├── shap_global_importance.csv          # Global feature importance
│
├── dice_explainer.pkl                  # DiCE explainer (optional)
├── test_data_with_predictions.csv      # Test set with predictions
│
├── results/                            # Generated plots
│   ├── shap_summary_plot.png
│   ├── shap_feature_importance_bar.png
│   ├── confusion_matrix.png
│   ├── roc_curve.png
│   └── feature_importance.png
│
└── shap_force_plots/                   # Individual SHAP plots
    └── shap_waterfall_customer_*.png
```
Feature Engineering:
- `premium_to_income_ratio = curr_ann_amt / income`
- `monthly_premium = curr_ann_amt / 12`
- `premium_affordability = (premium / income) * 100`
- `tenure_years = days_tenure / 365`
- `age_group` (binning)
- `income_bracket` (categorical)
- `customer_quality_score` (composite)
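The derived features above can be computed in pandas roughly as follows. Column names come from the list above; the age-bin edges are illustrative assumptions (the real edges live in `01_eda_and_preprocessing.py`):

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the derived churn features described above to a copy of df."""
    out = df.copy()
    out["premium_to_income_ratio"] = out["curr_ann_amt"] / out["income"]
    out["monthly_premium"] = out["curr_ann_amt"] / 12
    out["premium_affordability"] = (out["curr_ann_amt"] / out["income"]) * 100
    out["tenure_years"] = out["days_tenure"] / 365
    # Illustrative bins, not the project's exact edges
    out["age_group"] = pd.cut(out["age_in_years"],
                              bins=[0, 25, 40, 60, 120],
                              labels=["<25", "25-40", "40-60", "60+"])
    return out

df = pd.DataFrame({
    "curr_ann_amt": [1200.0, 2400.0],
    "income": [60000.0, 48000.0],
    "days_tenure": [730, 365],
    "age_in_years": [30, 55],
})
features = engineer_features(df)
```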
Handling Missing Values:
- Numerical: Median imputation
- Categorical: Mode imputation
Encoding:
- Label Encoding for categorical features
- StandardScaler for numerical features
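The real pipeline uses scikit-learn's `LabelEncoder` and `StandardScaler`; the pandas-only sketch below reproduces the same arithmetic (median/mode imputation, label codes, z-scaling) so the transformations are easy to inspect:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute, encode, and scale every column, mirroring the steps above."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numerical: median imputation, then standard scaling
            out[col] = out[col].fillna(out[col].median())
            out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
        else:
            # Categorical: mode imputation, then label encoding
            out[col] = out[col].fillna(out[col].mode().iloc[0])
            out[col] = out[col].astype("category").cat.codes
    return out

raw = pd.DataFrame({
    "income": [40000.0, None, 60000.0, 50000.0],
    "marital_status": ["single", "married", None, "married"],
})
clean = preprocess(raw)
```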
Algorithm: XGBoost Classifier
Hyperparameter Optimization (Optuna):
- 100 trials
- Objective: Maximize F1-score
- Cross-validation: 5-fold
Best Parameters:

```python
{
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'scale_pos_weight': 3.5
}
```

Global Interpretation:
- Feature importance rankings
- Impact direction (positive/negative)
- Summary plots across all customers
Local Interpretation:
- Waterfall plots for individual customers
- Top 5 risk factors per customer
- Expected value vs actual prediction
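Given one customer's row of SHAP values, extracting the top local risk factors is a short helper. This is our own illustration (not project code) of what the waterfall plots display:

```python
import numpy as np

def top_risk_factors(shap_values, feature_names, k=5):
    """Return the k features with the largest absolute SHAP contribution."""
    order = np.argsort(np.abs(shap_values))[::-1][:k]
    return [(feature_names[i], float(shap_values[i])) for i in order]

# Hypothetical SHAP values for one customer (positive pushes toward churn)
names = ["days_tenure", "curr_ann_amt", "income", "age_in_years",
         "home_market_value", "length_of_residence"]
vals = np.array([0.82, 0.41, -0.05, 0.12, -0.30, 0.02])
top = top_risk_factors(vals, names, k=5)
```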
Top Global Churn Drivers:
1. `days_tenure` (low tenure = high risk)
2. `length_of_residence`
3. `home_market_value`
4. `age_in_years`
5. `curr_ann_amt` (high premium = high risk)
Actionable Features (can be changed):
- `curr_ann_amt` (annual premium)
- `monthly_premium` (derived)
- `premium_to_income_ratio`
- `premium_affordability`
Immutable Features (cannot be changed):
- Demographics: age, marital status, children
- Location: state, latitude, longitude
- History: tenure, origin date
- Background: income, education, credit
Constraints:
- Premium can only decrease (discounts, not increases)
- Maximum 50% discount
- Minimum 3 diverse counterfactuals
- Desired outcome: Churn = 0 (retention)
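The premium constraints can be enforced with a small guard before a counterfactual is surfaced. This is our own illustration of the "discounts only, capped at 50%" rule, not the DiCE API itself:

```python
def clamp_counterfactual_premium(current: float, proposed: float,
                                 max_discount: float = 0.50) -> float:
    """Enforce the constraints above: the premium may only decrease,
    and never by more than max_discount of the current premium."""
    floor = current * (1.0 - max_discount)   # deepest allowed discount
    return min(current, max(proposed, floor))

# Increases are rejected, over-deep discounts are capped
clamp_counterfactual_premium(1000.0, 1200.0)  # increase -> stays at 1000.0
clamp_counterfactual_premium(1000.0, 300.0)   # 70% off -> capped at 500.0
clamp_counterfactual_premium(1000.0, 800.0)   # 20% off -> allowed, 800.0
```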
Nudge Categories & Application:
| Nudge Type | Psychology Principle | When to Use | Example |
|---|---|---|---|
| Loss Aversion | People hate losing more than gaining | High tenure customers | "Lose $847 accident-free bonus" |
| Social Proof | Follow peer behavior | Average customers | "92% in your area renewed" |
| Reciprocity | Return favors | Price-sensitive | "Free vehicle check-up" |
| Commitment | Lock in decisions | Risk-averse | "Lock rate for 2 years" |
| Scarcity | Fear of missing out | Fence-sitters | "Expires in 48 hours" |
| Anchoring | Reference point bias | Long tenure | "Saved $2,340 over 3 years" |
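The "When to Use" column maps naturally onto a lookup table. A hypothetical selector (segment names and default are our own assumptions) might look like:

```python
# Maps a coarse customer segment to the nudge type from the table above.
NUDGE_BY_SEGMENT = {
    "high_tenure": "Loss Aversion",
    "average": "Social Proof",
    "price_sensitive": "Reciprocity",
    "risk_averse": "Commitment",
    "fence_sitter": "Scarcity",
    "long_tenure": "Anchoring",
}

def pick_nudge(segment: str) -> str:
    """Return the recommended nudge type, defaulting to Social Proof."""
    return NUDGE_BY_SEGMENT.get(segment, "Social Proof")
```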
| Metric | Value |
|---|---|
| Accuracy | 0.883292 |
| Precision | 0.490100 |
| Recall | 0.348040 |
| F1-Score | 0.407031 |
| ROC-AUC | 0.694800 |
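As a sanity check, the reported F1 is consistent with the precision and recall above via the harmonic mean F1 = 2PR / (P + R):

```python
precision = 0.490100
recall = 0.348040

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # ≈ 0.407031, matching the table
```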
Portfolio Metrics:
- Total Customers: 336,182
- High Risk (>70%): 1,472 (0.4%)
- Medium Risk (40-70%): 31,591 (9.4%)
- Low Risk (<40%): 303,119 (90.2%)
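The tier boundaries above (<40%, 40-70%, >70%) are simple thresholds on the predicted churn probability; the exact handling of boundary values is our assumption:

```python
import pandas as pd

def risk_tier(p: float) -> str:
    """Bucket a churn probability into the dashboard's three tiers."""
    if p > 0.70:
        return "High"
    if p >= 0.40:
        return "Medium"
    return "Low"

probs = pd.Series([0.91, 0.55, 0.10, 0.72, 0.39])
tiers = probs.map(risk_tier)
counts = tiers.value_counts().to_dict()
```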
Customer ID: 63533
- Original Churn Risk: 91.0%
- Top Risk Factor: `days_tenure` (low)
- DiCE Recommendation: Reduce premium by 25%
- New Churn Risk: 32.4%
- Risk Reduction: 58.6%
- SHAP > Feature Importance: SHAP provides direction and magnitude
- DiCE Constraints Critical: Without proper constraints, recommendations are unrealistic
- Local LLM Viable: 3B parameter models sufficient for synthesis tasks
- CPU vs GPU: For inference, CPU is more stable in production
- Tenure ≠ Loyalty: Low tenure is the #1 churn driver
- Premium Sweet Spot: 20-30% discount optimal for retention
- Behavioral Nudges Work: 15-20% better than pure discounts
- Timing Matters: 30-60 days before renewal is key window
Built with ❤️ for Megathon'25
"From prediction to action - AI that retains customers"