
End-to-End Credit Risk Analytics & Prediction System

An end-to-end machine learning project for predicting credit default risk, which I designed, implemented, and iteratively refined to understand real-world credit risk modeling.


📌 Problem Statement

Why does this matter?

Banks and financial institutions lose billions annually to loan defaults. Even a small improvement in predicting who will default can save millions. But it's a balancing act: if you're too strict, you reject good borrowers and lose revenue; too lenient, and you approve risky loans that won't be repaid.

This project builds a practical credit risk assessment system that can:

  • Predict the likelihood of loan default
  • Explain which factors drive the prediction
  • Help make informed lending decisions

Think of it as a "first-pass filter" that flags high-risk applications for human review, while fast-tracking low-risk ones.


📊 Dataset Overview

Source: Simulated credit bureau data based on realistic lending scenarios

Size: 32,581 loan applications

Features:

  • Personal Info: Age, income, employment length, home ownership
  • Loan Details: Amount, interest rate, purpose, grade
  • Credit History: Credit history length, previous defaults

Target: loan_status (0 = Repaid, 1 = Default)

Key Challenges I Faced:

  1. Imbalanced data: Only 22% of loans defaulted (but that's what we care about!)
  2. Missing values: ~7% missing in employment length and interest rate
  3. Mixed data types: Numerical + categorical features
  4. Real-world messiness: Data wasn't clean, which made it more realistic
  5. Metric selection: Initially underestimated how misleading accuracy can be for imbalanced datasets

🔄 My Approach

1. Data Exploration (EDA)

Notebook: 01_data_exploration.ipynb

What I did:

  • Loaded 32K+ records and checked for issues
  • Analyzed missing values (found ~7% missing in 2 columns)
  • Discovered class imbalance (78% repaid vs 22% default)
  • Created visualizations to understand patterns
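
Roughly what those first checks look like in code (the file path is illustrative; the notebook has the full version):

import pandas as pd

# Load the raw applications (path is an assumption for illustration)
df = pd.read_csv("data/raw/credit_risk_dataset.csv")

# Share of missing values per column, highest first
print((df.isna().mean() * 100).round(1).sort_values(ascending=False).head())

# Class balance of the target (0 = repaid, 1 = default)
print(df["loan_status"].value_counts(normalize=True))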

Key Findings:

  • Loan-to-income ratio is a huge predictor (people borrowing 60%+ of income are risky)
  • Previous defaults strongly predict future defaults
  • Interest rates already reflect risk (banks price it in)
  • Surprisingly, age doesn't matter much

What surprised me:

  • Medical and venture loans have much higher default rates than personal loans
  • Some people are approved for loans worth 80%+ of their annual income (scary!)
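
The loan-purpose finding can be checked with a one-line group-by (loan_intent as the purpose column is an assumption):

# Default rate by loan purpose, highest first
print(df.groupby("loan_intent")["loan_status"].mean().sort_values(ascending=False))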

2. Feature Engineering

Notebook: 02_feature_engineering.ipynb

Missing Values:

  • Used median imputation (more robust than mean for outliers)
  • Considered dropping rows but 7% is too much to lose
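
A minimal sketch of the imputation step, assuming the two affected columns are person_emp_length and loan_int_rate:

# Fill gaps with the column median (ideally computed on the training split to avoid leakage)
for col in ["person_emp_length", "loan_int_rate"]:
    df[col] = df[col].fillna(df[col].median())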

New Features Created:

  • income_to_loan_ratio: How much income relative to loan size
  • age_group: Binned age into Young/Mid/Senior
  • high_risk_flag: Combo of previous default + high loan-to-income
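
In code, the three features look roughly like this (source column names and cut points are assumptions):

import pandas as pd

# Annual income relative to the requested loan amount
df["income_to_loan_ratio"] = df["person_income"] / df["loan_amnt"]

# Age bins: Young / Mid / Senior (illustrative cut points)
df["age_group"] = pd.cut(df["person_age"], bins=[0, 30, 50, 120],
                         labels=["Young", "Mid", "Senior"])

# Previous default combined with a high loan-to-income share (0.4 is an illustrative threshold)
df["high_risk_flag"] = ((df["cb_person_default_on_file"] == "Y") &
                        (df["loan_percent_income"] > 0.4)).astype(int)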

Encoding:

  • Label encoding for ordinal features (loan_grade: A→1, B→2, etc.)
  • One-hot encoding for nominal features (loan purpose, home ownership)

Scaling:

  • Used StandardScaler to normalize features (mean=0, std=1)
  • Critical: Fit on training data only, then transform test (avoid data leakage!)
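
A minimal sketch of the split-encode-scale order, continuing from the dataframe above (column names are illustrative; the key point is fitting the scaler on the training split only):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Ordinal: map loan grade A..G to 1..7
df["loan_grade"] = df["loan_grade"].map(
    {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7})

# Nominal: one-hot encode every remaining non-numeric column
X = pd.get_dummies(df.drop(columns=["loan_status"]))
y = df["loan_status"]

# Stratified split keeps the ~78/22 class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training data only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)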

Why these choices?

  • Median over mean: Less affected by outliers
  • Stratified split: Keeps class balance in both train/test
  • One-hot encoding: Doesn't assume false orderings (RENT isn't "less than" OWN)

3. Model Training & Evaluation

Notebook: 03_model_training.ipynb

Models Tested:

  1. Logistic Regression (Baseline)

    • Simple, fast, interpretable
    • Used class_weight='balanced' for imbalanced data
    • Good starting point
  2. Random Forest (Final Model)

    • Captures non-linear patterns
    • Handles feature interactions
    • Gives feature importance
    • Better performance overall
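
A sketch of both models with scikit-learn, continuing from the split above (hyperparameters are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Baseline: logistic regression with class weighting for the 78/22 imbalance
log_reg = LogisticRegression(class_weight="balanced", max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Final model: a class-weighted random forest (trees don't need scaled features)
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                            random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)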

Why not just use accuracy?

If I predict "no one defaults", I get 78% accuracy but catch ZERO defaulters. Useless!

Metrics I focused on:

  • Precision: Of predicted defaults, how many were correct? (Don't reject too many good borrowers)
  • Recall: Of actual defaults, how many did we catch? (Don't miss bad loans)
  • F1 Score: Balance between precision and recall
  • ROC-AUC: Overall ability to distinguish classes
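
All four come straight from scikit-learn; a sketch assuming the fitted random forest and test split above:

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]  # predicted probability of default

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_test, y_prob))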

Results:

  • Random Forest achieved ~70-75% recall (catching most defaults)
  • ROC-AUC stabilized around 0.78 on the test set (decent predictive power)
  • Better than random guessing, realistic for real-world scenario

📈 Results & Evaluation

Model Performance

Metric     | Logistic Regression | Random Forest (Final)
-----------|---------------------|----------------------
Accuracy   | ~70-75%             | ~72-77%
Precision  | ~45-55%             | ~50-60%
Recall     | ~65-70%             | ~70-75%
F1 Score   | ~52-60%             | ~57-65%
ROC-AUC    | ~0.72-0.76          | ~0.75-0.80

What Do These Numbers Mean?

In plain English:

  • The model catches about 70-75% of defaults (recall)
  • When it predicts "default", it's right 50-60% of the time (precision)
  • Overall, it's significantly better than random guessing

Why isn't it 99% accurate?

Because credit risk is genuinely hard to predict! Real-world factors we don't have:

  • Unexpected life events (medical emergencies, job loss)
  • Economic conditions
  • Detailed payment history
  • Full credit bureau data

Banks with billion-dollar ML teams don't get 99% either. This is realistic.

Feature Importance

Top 5 most important factors:

  1. Loan interest rate (banks already price in risk)
  2. Loan-to-income percentage
  3. Previous default history
  4. Income level
  5. Credit history length
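
The ranking can be read straight off the fitted random forest; a minimal sketch:

import pandas as pd

# Impurity-based importances, highest first
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))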

🖥️ Application Demo

Built a Streamlit web app where users can:

  1. Input loan applicant details
  2. Get instant risk assessment (Low/Medium/High)
  3. See default probability
  4. Receive recommendation (Approve/Review/Deny)
  5. Understand key risk factors

How to run it:

cd credit-risk-analytics
streamlit run app/app.py

Then open your browser to http://localhost:8501

App Features:

  • ✅ Clean, intuitive interface
  • ✅ Real-time prediction
  • ✅ Risk level with color coding
  • ✅ Explanation of key factors
  • ✅ Actionable recommendations
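
A stripped-down sketch of how an app like app/app.py can be wired up (the real app collects all applicant fields and builds the full preprocessed feature vector; the two inputs here are placeholders):

import pickle
import streamlit as st

# Load the trained model saved by the training notebook
with open("src/credit_risk_model.pkl", "rb") as f:
    model = pickle.load(f)

st.title("Credit Risk Assessment")

income = st.number_input("Annual income", min_value=0, value=50000)
loan_amount = st.number_input("Loan amount", min_value=0, value=10000)
# ... remaining applicant fields are collected the same way

if st.button("Assess risk"):
    features = [[income, loan_amount]]  # placeholder; the real app uses every feature
    prob_default = model.predict_proba(features)[0][1]

    if prob_default < 0.30:
        st.success(f"Low risk ({prob_default:.0%}) - Approve")
    elif prob_default < 0.60:
        st.warning(f"Medium risk ({prob_default:.0%}) - Human review")
    else:
        st.error(f"High risk ({prob_default:.0%}) - Deny or require co-signer")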

Screenshots: (See screenshots/ folder)


🗂️ Project Structure

credit-risk-analytics/
│
├── data/
│   ├── raw/                    # Original dataset
│   └── processed/              # Processed, ready-for-modeling data
│
├── notebooks/
│   ├── 01_data_exploration.ipynb       # EDA with visualizations
│   ├── 02_feature_engineering.ipynb    # Data preprocessing
│   └── 03_model_training.ipynb         # Model development
│
├── src/
│   ├── preprocessing.py        # Data cleaning functions
│   ├── model.py                # Model training utilities
│   ├── evaluation.py           # Evaluation metrics
│   └── credit_risk_model.pkl   # Saved trained model
│
├── app/
│   └── app.py                  # Streamlit application
│
├── screenshots/                # App screenshots
│
├── requirements.txt            # Python dependencies
└── README.md                   # This file

🚀 How to Run This Project

1. Clone and Setup

# Clone the repository
git clone <repo-url>
cd credit-risk-analytics

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Explore the Notebooks

jupyter notebook

Open and run the notebooks in order:

  1. 01_data_exploration.ipynb
  2. 02_feature_engineering.ipynb
  3. 03_model_training.ipynb

3. Run the Web App

streamlit run app/app.py

💡 What I Learned

Technical Skills:

  • Data preprocessing: Handling missing values, encoding, scaling
  • Dealing with imbalanced data: class_weight, stratified splitting
  • Model selection: When to use simple vs. complex models
  • Evaluation: Why accuracy is misleading, importance of recall for imbalanced problems
  • Deployment: Building user-facing applications with Streamlit

Real-World Insights:

  • Perfect models don't exist: 70-75% recall is actually pretty good for credit risk
  • Explainability matters: Banks need to explain why they reject someone (regulations!)
  • Business context > metrics: A false negative (a missed default) costs more than a false positive (a rejected good borrower)
  • Data quality: Missing values and imbalance are normal in real datasets

Mistakes I Made (and fixed):

  1. First tried mean imputation → Switched to median (better for outliers)
  2. Initially forgot stratification → Class distributions didn't match in train/test
  3. Scaled before splitting → Data leakage! Had to redo it properly
  4. Only looked at accuracy → Realized it's meaningless for imbalanced data

What I'd Do Differently:

  • Try SMOTE or ADASYN for handling imbalance
  • Experiment with XGBoost (probably better than Random Forest)
  • Hyperparameter tuning with GridSearchCV or RandomizedSearchCV
  • Add more domain features (debt-to-income ratio, FICO score equivalent)
  • Build an API instead of just Streamlit (FastAPI + Docker)
  • A/B testing for real-world validation (measuring business impact in production)
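
None of this is implemented yet, but the tuning step might look roughly like this (parameter grid and scoring choice are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [200, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}

# Optimize for recall, since missed defaults are the expensive mistake
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions=param_dist, n_iter=10, scoring="recall",
    cv=5, random_state=42, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)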

🎯 Business Impact

How this could be used in practice:

  1. Automated first-pass screening:

    • Low risk (< 30% probability) → Auto-approve
    • Medium risk (30-60%) → Human review
    • High risk (> 60%) → Likely deny or require co-signer
  2. Risk-based pricing:

    • Adjust interest rates based on predicted default probability
    • More accurate pricing = more profit + fairer to borrowers
  3. Portfolio management:

    • Monitor aggregate risk across all loans
    • Identify high-risk segments for intervention

Estimated value (hypothetical):

  • If this model prevents just 1% more defaults on a $100M loan portfolio
  • With an average loss of 50% of principal on each default
  • That's 0.01 × $100M × 0.5 = $500K saved per year

⚠️ Limitations & Future Work

Current Limitations:

  • Limited features: Real credit bureaus have 100+ features
  • Simulated data: Not from actual bank records
  • No temporal component: Doesn't account for changing economic conditions
  • Regulatory compliance: Real models need to pass fair lending audits
  • No production infrastructure: Just a demo app

Future Improvements:

  • Integrate with credit bureau APIs
  • Add temporal features (seasonality, economic indicators)
  • Build an explainability layer (SHAP values, LIME)
  • Deploy to cloud (AWS/GCP)
  • Add monitoring & retraining pipeline
  • A/B testing framework
  • Mobile-responsive UI

📚 Technologies Used

Languages & Frameworks:

  • Python 3.8+
  • Pandas, NumPy (data manipulation)
  • Matplotlib, Seaborn (visualization)
  • Scikit-learn (machine learning)
  • Streamlit (web app)

Tools:

  • Jupyter Notebook (analysis)
  • Git (version control)
  • VS Code (development)

🤝 Acknowledgments

  • Dataset inspired by real-world credit risk scenarios
  • Learned a ton from Kaggle notebooks and Stack Overflow
  • Thanks to the open-source community for amazing libraries

📬 Contact

Built by a passionate junior data scientist learning by doing!

Questions or feedback? I'd love to hear from you:

  • Open an issue on GitHub
  • Connect on LinkedIn
  • Email: [your-email]

📄 License

This project is for educational and portfolio purposes. Feel free to use it as a learning resource!


⭐ If this project helped you learn something, consider giving it a star!


Last updated: January 2024
