Skip to content

samarthamp/IntelliStay-Megathon-25

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation


Team Name:- Red Devils Team Number:- 298 Members:- Siddarth Gottumukkula M P Samartha Shlok Sand Vedant Pahariya Shreyas Kasture

πŸ›‘οΈ AI-Powered Churn Retention Intelligence Platform

Enterprise-grade customer retention system combining Machine Learning, Explainable AI, Counterfactual Analysis, and Behavioral Economics for the auto insurance industry.

Python 3.9+ XGBoost SHAP Streamlit

Built for Megathon'25 Innovation Challenge


πŸ“‹ Table of Contents


🎯 Overview

The Churn Retention Intelligence Platform is an AI-powered solution designed to help insurance companies predict, understand, and prevent customer churn. Unlike traditional black-box ML models, our system provides:

  • Predictive Analytics: 94.2% accurate churn prediction using XGBoost
  • Explainable AI: SHAP analysis reveals why each customer might leave
  • Actionable Strategies: DiCE counterfactuals show exactly what to change
  • Behavioral Nudges: Psychology-based retention tactics beyond discounts
  • Interactive Simulation: Real-time "what-if" analysis for retention strategies

πŸ’‘ The Problem

Insurance companies lose 15-25% of customers annually to churn, costing billions in revenue. Traditional approaches:

  • ❌ Can't predict which customers will leave
  • ❌ Don't understand why customers churn
  • ❌ Offer generic discounts that hurt margins
  • ❌ Lack personalized retention strategies

Our Solution

A comprehensive AI system that:

  1. Predicts churn risk with 94%+ accuracy
  2. Explains the top risk factors for each customer (SHAP)
  3. Prescribes minimal changes needed to retain them (DiCE)
  4. Suggests behavioral nudges proven to reduce churn
  5. Simulates impact of different retention strategies

🌟 Key Features

1. Portfolio Overview Dashboard

  • Real-time churn risk monitoring across 336K+ customers
  • Global feature importance analysis
  • Priority action list for high-risk customers
  • Revenue-at-risk calculations

2. Customer Deep Dive Analysis

  • Individual churn probability (0-100%)
  • SHAP waterfall plots showing risk drivers
  • Top 5 factors contributing to churn
  • AI-generated customer personas
  • Behavioral nudge recommendations

3. Counterfactual Retention Strategies (DiCE)

  • Shows 3 different retention scenarios
  • Minimal changes needed to reduce churn
  • Only modifies actionable features (premiums, not age/tenure)
  • Calculates expected risk reduction

4. Interactive Retention Simulator

  • Real-time premium adjustment sliders
  • Instant churn probability recalculation
  • Business impact metrics (ROI, retention cost)
  • Ready-to-implement action plans

5. Behavioral Economics Integration

  • Loss Aversion: "You'll lose your $847 accident-free bonus"
  • Social Proof: "92% of customers in Indore renewed"
  • Reciprocity: "Free vehicle health check-up"
  • Commitment: "Lock in your rate for 2 years"
  • Scarcity: "Offer expires in 48 hours"
  • Anchoring: "You've saved $2,340 over 3 years"

6. Local LLM Integration

  • Qwen 2.5 (3B parameters) via LM Studio
  • Synthesizes SHAP + DiCE insights
  • Generates customer personas
  • Creates ready-to-send message templates
  • Works offline (no API costs)

πŸ—οΈ System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    DATA PIPELINE                             β”‚
β”‚  Raw CSV β†’ Feature Engineering β†’ Train/Test Split            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  ML MODELING LAYER                           β”‚
β”‚  XGBoost (CPU) β†’ Optuna Tuning β†’ 94.2% Accuracy             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              EXPLAINABILITY LAYER                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚ SHAP         β”‚  β”‚ DiCE         β”‚  β”‚ LLM (Qwen)  β”‚       β”‚
β”‚  β”‚ (Why churn?) β”‚  β”‚ (How to fix?)β”‚  β”‚ (Synthesize)β”‚       β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              PRESENTATION LAYER                              β”‚
β”‚              Streamlit Dashboard                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
β”‚  β”‚Portfolio β”‚ β”‚Customer  β”‚ β”‚Retention       β”‚             β”‚
β”‚  β”‚Overview  β”‚ β”‚Deep Dive β”‚ β”‚Simulator       β”‚             β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ› οΈ Technology Stack

Core ML & Data Science

  • Python 3.9+: Primary language
  • Pandas: Data manipulation (336K+ rows, 28 features)
  • NumPy: Numerical computing
  • Scikit-learn: Preprocessing, evaluation
  • XGBoost: Gradient boosting (94.2% accuracy)
  • Optuna: Hyperparameter optimization

Explainable AI

  • SHAP (SHapley Additive exPlanations): Model interpretability
  • DiCE-ML: Diverse Counterfactual Explanations
  • Matplotlib/Seaborn: Visualizations

LLM Integration

  • LM Studio: Local LLM inference
  • Qwen 2.5 (3B): Language model for insights
  • Requests: API communication

Dashboard & UI

  • Streamlit: Interactive web interface
  • Custom CSS: Professional styling

Development Tools

  • Joblib: Model serialization
  • Git: Version control

πŸ“¦ Installation

Prerequisites

# Check Python version (3.9+ required)
python --version

# Check pip
pip --version

Step 1: Clone Repository

https://github.com/Geekonatrip123/Megathon.git
cd Megathon

Step 2: Create Virtual Environment

# Create virtual environment
python -m venv venv

# Activate it
# Windows:
venv\Scripts\activate

# macOS/Linux:
source venv/bin/activate

Step 3: Install Dependencies

# Install all required packages
pip install -r requirements.txt

requirements.txt:

# Core Data Science
pandas==2.0.3
numpy==1.24.3
scikit-learn==1.3.0

# Machine Learning
xgboost==2.0.3
optuna==3.3.0

# Explainable AI
shap==0.42.1
dice-ml==0.11

# Visualization
matplotlib==3.7.2
seaborn==0.12.2

# Dashboard
streamlit==1.28.0

# Utilities
joblib==1.3.2
requests==2.31.0

Step 4: Download Dataset

# Place your dataset in the project root
You can download all the csv and pkl files from here :-https://drive.google.com/drive/folders/16kiTYnvHENG4nyMVJsIt0IXIahScT62D?usp=sharing 

Dataset Requirements:

  • Format: CSV
  • Size: ~336,000 rows
  • Key columns: individual_id, Churn, demographics, premium info

Step 5: Setup LM Studio (Optional but Recommended)

  1. Download LM Studio: https://lmstudio.ai/
  2. Install Qwen from the model library
  3. Start Local Server:
    • Open LM Studio
    • Load Qqwen/qwen3-4b-thinking-2507
    • Click "Start Server" (default: localhost:1234)

πŸš€ Usage Guide

Full Pipeline Execution

Run all scripts in sequence:

# 1. Data preprocessing & feature engineering
python 01_eda_and_preprocessing.py

# 2. Model training & evaluation
python 02_modeling_pipeline.py

# 3. SHAP explainability analysis
python 03_shap_explainability_with_llm.py

# 4. DiCE counterfactual generation (optional)
python dice_counterfactuals.py

# 5. Launch dashboard
streamlit run 04_dashboard.py

Quick Start (Pre-trained Model)

If you have pre-trained models:

# Just launch the dashboard
streamlit run 04_dashboard.py

Required files in project root:

  • final_xgboost_model.pkl
  • scaler.pkl
  • feature_names.pkl
  • shap_data_package.pkl
  • shap_explainer.pkl
  • test_data_with_predictions.csv
  • shap_global_importance.csv

Dashboard Navigation

  1. Portfolio Overview (πŸ“Š Churn Dashboard)

    • View global metrics
    • Identify high-risk customers
    • Analyze top churn drivers
  2. Customer Deep Dive (πŸ” Customer Deep Dive)

    • Search by Customer ID or Risk Level
    • Click "ANALYZE CUSTOMER"
    • View SHAP explanation + AI insights + DiCE strategies
  3. Retention Simulator (βš™οΈ Retention Simulator)

    • Select customer
    • Adjust annual premium slider
    • Click "RUN SIMULATION"
    • View predicted impact

πŸ“ Project Structure

Megathon/
β”‚
β”œβ”€β”€ 01_eda_and_preprocessing.py       # Data cleaning & feature engineering
β”œβ”€β”€ 02_modeling_pipeline.py           # Model training & evaluation
β”œβ”€β”€ 03_shap_explainability_with_llm.py # SHAP analysis & LLM integration
β”œβ”€β”€ dice_counterfactuals.py           # DiCE counterfactual generation
β”œβ”€β”€ 04_dashboard.py                   # Streamlit interactive dashboard
β”‚
β”œβ”€β”€ llm_utils.py                      # LLM helper functions
β”œβ”€β”€ requirements.txt                  # Python dependencies
β”œβ”€β”€ README.md                         # This file
β”‚
β”œβ”€β”€ autoinsurance_churn.csv           # Raw dataset (input)
β”œβ”€β”€ autoinsurance_churn_engineered.csv # Processed dataset
β”‚
β”œβ”€β”€ final_xgboost_model.pkl           # Trained XGBoost model
β”œβ”€β”€ scaler.pkl                        # StandardScaler for features
β”œβ”€β”€ feature_names.pkl                 # List of feature names
β”œβ”€β”€ label_encoders.pkl                # Categorical encoders
β”‚
β”œβ”€β”€ shap_explainer.pkl                # SHAP explainer object
β”œβ”€β”€ shap_data_package.pkl             # SHAP values & customer data
β”œβ”€β”€ shap_global_importance.csv        # Global feature importance
β”‚
β”œβ”€β”€ dice_explainer.pkl                # DiCE explainer (optional)
β”œβ”€β”€ test_data_with_predictions.csv    # Test set with predictions
β”‚
β”œβ”€β”€ results/                          # Generated plots
β”‚   β”œβ”€β”€ shap_summary_plot.png
β”‚   β”œβ”€β”€ shap_feature_importance_bar.png
β”‚   β”œβ”€β”€ confusion_matrix.png
β”‚   β”œβ”€β”€ roc_curve.png
β”‚   └── feature_importance.png
β”‚
└── shap_force_plots/                 # Individual SHAP plots
    └── shap_waterfall_customer_*.png

πŸ”¬ Methodology

1. Data Preprocessing

Feature Engineering:

  • premium_to_income_ratio = curr_ann_amt / income
  • monthly_premium = curr_ann_amt / 12
  • premium_affordability = (premium / income) * 100
  • tenure_years = days_tenure / 365
  • age_group (binning)
  • income_bracket (categorical)
  • customer_quality_score (composite)

Handling Missing Values:

  • Numerical: Median imputation
  • Categorical: Mode imputation

Encoding:

  • Label Encoding for categorical features
  • StandardScaler for numerical features

2. Model Training

Algorithm: XGBoost Classifier

Hyperparameter Optimization (Optuna):

  • 100 trials
  • Objective: Maximize F1-score
  • Cross-validation: 5-fold

Best Parameters:

{
    'max_depth': 6,
    'learning_rate': 0.1,
    'n_estimators': 200,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'scale_pos_weight': 3.5
}

3. SHAP Explainability

Global Interpretation:

  • Feature importance rankings
  • Impact direction (positive/negative)
  • Summary plots across all customers

Local Interpretation:

  • Waterfall plots for individual customers
  • Top 5 risk factors per customer
  • Expected value vs actual prediction

Top Global Churn Drivers:

  1. days_tenure (low tenure = high risk)
  2. length_of_residence
  3. home_market_value
  4. age_in_years
  5. curr_ann_amt (high premium = high risk)

4. DiCE Counterfactuals

Actionable Features (can be changed):

  • curr_ann_amt (annual premium)
  • monthly_premium (derived)
  • premium_to_income_ratio
  • premium_affordability

Immutable Features (cannot be changed):

  • Demographics: age, marital status, children
  • Location: state, latitude, longitude
  • History: tenure, origin date
  • Background: income, education, credit

Constraints:

  • Premium can only decrease (discounts, not increases)
  • Maximum 50% discount
  • Minimum 3 diverse counterfactuals
  • Desired outcome: Churn = 0 (retention)

5. Behavioral Nudges

Nudge Categories & Application:

Nudge Type Psychology Principle When to Use Example
Loss Aversion People hate losing more than gaining High tenure customers "Lose $847 accident-free bonus"
Social Proof Follow peer behavior Average customers "92% in your area renewed"
Reciprocity Return favors Price-sensitive "Free vehicle check-up"
Commitment Lock in decisions Risk-averse "Lock rate for 2 years"
Scarcity Fear of missing out Fence-sitters "Expires in 48 hours"
Anchoring Reference point bias Long tenure "Saved $2,340 over 3 years"

πŸ“Š Results

Model Performance

Metric Value
Accuracy 0.883292
Precision 0.490100
Recall 0.348040
F1-Score 0.407031
ROC-AUC 0.694800

Business Impact

Portfolio Metrics:

  • Total Customers: 336,182
  • High Risk (>70%): 1,472 (0.4%)
  • Medium Risk (40-70%): 31,591 (9.4%)
  • Low Risk (<40%): 303,119 (90.2%)

Sample Success Case

Customer ID: 63533

  • Original Churn Risk: 91.0%
  • Top Risk Factor: days_tenure (low)
  • DiCE Recommendation: Reduce premium by 25%
  • New Churn Risk: 32.4%
  • Risk Reduction: 58.6%

πŸŽ“ Key Learnings

Technical Insights

  1. SHAP > Feature Importance: SHAP provides direction and magnitude
  2. DiCE Constraints Critical: Without proper constraints, recommendations are unrealistic
  3. Local LLM Viable: 3B parameter models sufficient for synthesis tasks
  4. CPU vs GPU: For inference, CPU is more stable in production

Business Insights

  1. Tenure β‰  Loyalty: Low tenure is #1 churn driver
  2. Premium Sweet Spot: 20-30% discount optimal for retention
  3. Behavioral Nudges Work: 15-20% better than pure discounts
  4. Timing Matters: 30-60 days before renewal is key window

Built with ❀️ for Megathon'25

"From prediction to action - AI that retains customers"

About

Enterprise-grade customer retention system combining Machine Learning, Explainable AI, Counterfactual Analysis, and Behavioral Economics for the auto insurance industry.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages