An end-to-end machine learning project for predicting credit default risk, which I designed, implemented, and iteratively refined to understand real-world credit risk modeling.
Why does this matter?
Banks and financial institutions lose billions annually from loan defaults. Even a small improvement in predicting who will default can save millions. But it's a balancing act - if you're too strict, you reject good borrowers and lose revenue. Too lenient, and you approve risky loans that won't be repaid.
This project builds a practical credit risk assessment system that can:
- Predict the likelihood of loan default
- Explain which factors drive the prediction
- Help make informed lending decisions
Think of it as a "first-pass filter" that flags high-risk applications for human review, while fast-tracking low-risk ones.
Source: Simulated credit bureau data based on realistic lending scenarios
Size: 32,581 loan applications
Features:
- Personal Info: Age, income, employment length, home ownership
- Loan Details: Amount, interest rate, purpose, grade
- Credit History: Credit history length, previous defaults
Target: loan_status (0 = Repaid, 1 = Default)
Key Challenges I Faced:
- Imbalanced data: Only 22% of loans defaulted (but that's what we care about!)
- Missing values: ~7% missing in employment length and interest rate
- Mixed data types: Numerical + categorical features
- Real-world messiness: Data wasn't clean, which made it more realistic
- Metric selection: Initially underestimated how misleading accuracy can be for imbalanced datasets
Notebook: 01_data_exploration.ipynb
What I did:
- Loaded 32K+ records and checked for issues
- Analyzed missing values (found ~7% missing in 2 columns)
- Discovered class imbalance (78% repaid vs 22% default)
- Created visualizations to understand patterns
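The missing-value and class-balance checks above boil down to a couple of pandas calls. A minimal sketch on a toy frame (column names are hypothetical stand-ins for the real dataset's columns):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-sample standing in for the 32K-row dataset
df = pd.DataFrame({
    "person_emp_length": [2.0, np.nan, 5.0, 10.0, np.nan, 3.0],
    "loan_int_rate": [11.5, 7.9, np.nan, 13.2, 9.4, 10.1],
    "loan_status": [1, 0, 0, 1, 0, 0],  # 1 = default, 0 = repaid
})

# Missing-value share per column (the real data had ~7% in two columns)
missing_pct = df.isna().mean() * 100
print(missing_pct.round(1))

# Class balance check (real data: ~78% repaid vs ~22% default)
print(df["loan_status"].value_counts(normalize=True))
```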
Key Findings:
- Loan-to-income ratio is a huge predictor (people borrowing 60%+ of income are risky)
- Previous defaults strongly predict future defaults
- Interest rates already reflect risk (banks price it in)
- Surprisingly, age doesn't matter much
What surprised me:
- Medical and venture loans have much higher default rates than personal loans
- Some people are approved for loans worth 80%+ of their annual income (scary!)
Notebook: 02_feature_engineering.ipynb
Missing Values:
- Used median imputation (more robust than mean for outliers)
- Considered dropping rows but 7% is too much to lose
New Features Created:
- `income_to_loan_ratio`: How much income relative to loan size
- `age_group`: Binned age into Young/Mid/Senior
- `high_risk_flag`: Combo of previous default + high loan-to-income
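The three engineered features above can be built with plain pandas. A sketch on a toy frame (column names like `prev_default` and the age-bin edges are illustrative assumptions, not the project's exact values):

```python
import pandas as pd

# Toy applicant data; column names are hypothetical
df = pd.DataFrame({
    "person_age": [23, 35, 58, 29],
    "person_income": [40000, 85000, 60000, 30000],
    "loan_amnt": [10000, 20000, 12000, 24000],
    "prev_default": [0, 0, 1, 1],
})

# income_to_loan_ratio: income relative to loan size
df["income_to_loan_ratio"] = df["person_income"] / df["loan_amnt"]

# age_group: bin age into Young/Mid/Senior (bin edges assumed)
df["age_group"] = pd.cut(df["person_age"], bins=[0, 30, 50, 120],
                         labels=["Young", "Mid", "Senior"])

# high_risk_flag: previous default AND loan > 60% of income (threshold assumed)
df["high_risk_flag"] = ((df["prev_default"] == 1) &
                        (df["loan_amnt"] / df["person_income"] > 0.6)).astype(int)

print(df[["income_to_loan_ratio", "age_group", "high_risk_flag"]])
```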
Encoding:
- Label encoding for ordinal features (loan_grade: A→1, B→2, etc.)
- One-hot encoding for nominal features (loan purpose, home ownership)
Scaling:
- Used `StandardScaler` to normalize features (mean=0, std=1)
- Critical: Fit on training data only, then transform test (avoid data leakage!)
Why these choices?
- Median over mean: Less affected by outliers
- Stratified split: Keeps class balance in both train/test
- One-hot encoding: Doesn't assume false orderings (RENT isn't "less than" OWN)
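These choices fit naturally into a scikit-learn `ColumnTransformer`, which also enforces the fit-on-train-only rule automatically. A minimal sketch on toy data (column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the same column roles as the real dataset (names hypothetical)
df = pd.DataFrame({
    "person_income": [40000, 85000, 30000, 60000, 52000, 95000, 41000, 70000],
    "loan_amnt": [10000, 20000, 9000, 15000, 12000, 30000, 8000, 18000],
    "loan_int_rate": [11.5, np.nan, 13.2, 9.4, 10.1, np.nan, 12.0, 8.8],
    "home_ownership": ["RENT", "OWN", "RENT", "MORTGAGE",
                       "RENT", "OWN", "RENT", "MORTGAGE"],
    "loan_status": [1, 0, 1, 0, 0, 0, 1, 0],
})
X, y = df.drop(columns="loan_status"), df["loan_status"]

num_cols = ["person_income", "loan_amnt", "loan_int_rate"]
cat_cols = ["home_ownership"]

preprocess = ColumnTransformer([
    # Median imputation then scaling for numerics; one-hot for nominals
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Stratified split keeps the class ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

# fit_transform on train only, then transform test: no leakage
X_tr_t = preprocess.fit_transform(X_tr)
X_te_t = preprocess.transform(X_te)
```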
Notebook: 03_model_training.ipynb
Models Tested:
1. Logistic Regression (Baseline)
   - Simple, fast, interpretable
   - Used `class_weight='balanced'` for imbalanced data
   - Good starting point
2. Random Forest (Final Model)
   - Captures non-linear patterns
   - Handles feature interactions
   - Gives feature importance
   - Better performance overall
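Both models fit in a few lines of scikit-learn. A sketch on synthetic data with roughly the same 78/22 class ratio (all numbers illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~22% positive class, like the real default rate
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.78], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Baseline: class_weight='balanced' upweights the rare default class
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
baseline.fit(X_tr, y_tr)

# Final model: random forest captures non-linear patterns and interactions
forest = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                random_state=42)
forest.fit(X_tr, y_tr)
```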
Why not just use accuracy?
If I predict "no one defaults", I get 78% accuracy but catch ZERO defaulters. Useless!
Metrics I focused on:
- Precision: Of predicted defaults, how many were correct? (Don't reject too many good borrowers)
- Recall: Of actual defaults, how many did we catch? (Don't miss bad loans)
- F1 Score: Balance between precision and recall
- ROC-AUC: Overall ability to distinguish classes
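All four metrics come straight from `sklearn.metrics`. A sketch on ten hypothetical predictions, showing how a decent-looking accuracy can hide imperfect precision and recall:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical predictions for 10 loans (1 = default)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.2, 0.15, 0.7, 0.3, 0.25, 0.1, 0.8, 0.65, 0.4])

# Accuracy can look fine even when defaults slip through; report all four
print("accuracy :", accuracy_score(y_true, y_pred))   # 0.8
print("precision:", precision_score(y_true, y_pred))  # 2/3
print("recall   :", recall_score(y_true, y_pred))     # 2/3
print("f1       :", f1_score(y_true, y_pred))
print("roc_auc  :", roc_auc_score(y_true, y_prob))
```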
Results:
- Random Forest achieved ~70-75% recall (catching most defaults)
- ROC-AUC stabilized around 0.78 on the test set (decent predictive power)
- Noticeably better than random guessing, and realistic for a real-world scenario
| Metric | Logistic Regression | Random Forest (Final) |
|---|---|---|
| Accuracy | ~70-75% | ~72-77% |
| Precision | ~45-55% | ~50-60% |
| Recall | ~65-70% | ~70-75% |
| F1 Score | ~52-60% | ~57-65% |
| ROC-AUC | ~0.72-0.76 | ~0.75-0.80 |
In plain English:
- The model catches about 70-75% of defaults (recall)
- When it predicts "default", it's right 50-60% of the time (precision)
- Overall, it's significantly better than random guessing
Why isn't it 99% accurate?
Because credit risk is genuinely hard to predict! Real-world factors we don't have:
- Unexpected life events (medical emergencies, job loss)
- Economic conditions
- Detailed payment history
- Full credit bureau data
Banks with billion-dollar ML teams don't get 99% either. This is realistic.
Top 5 most important factors:
- Loan interest rate (banks already price in risk)
- Loan-to-income percentage
- Previous default history
- Income level
- Credit history length
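Rankings like this come directly from the fitted forest's impurity-based importances. A minimal sketch on synthetic data (the feature names are hypothetical stand-ins for the project's columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names mirroring the project's columns
feature_names = ["loan_int_rate", "loan_percent_income", "prev_default",
                 "person_income", "cb_cred_hist_length"]
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort largest first to rank features
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```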
Built a Streamlit web app where users can:
- Input loan applicant details
- Get instant risk assessment (Low/Medium/High)
- See default probability
- Receive recommendation (Approve/Review/Deny)
- Understand key risk factors
How to run it:
```bash
cd credit-risk-analytics
streamlit run app/app.py
```

Then open your browser to http://localhost:8501
- ✅ Clean, intuitive interface
- ✅ Real-time prediction
- ✅ Risk level with color coding
- ✅ Explanation of key factors
- ✅ Actionable recommendations
Screenshots: (See screenshots/ folder)
```
credit-risk-analytics/
│
├── data/
│   ├── raw/                         # Original dataset
│   └── processed/                   # Processed, ready-for-modeling data
│
├── notebooks/
│   ├── 01_data_exploration.ipynb    # EDA with visualizations
│   ├── 02_feature_engineering.ipynb # Data preprocessing
│   └── 03_model_training.ipynb      # Model development
│
├── src/
│   ├── preprocessing.py             # Data cleaning functions
│   ├── model.py                     # Model training utilities
│   ├── evaluation.py                # Evaluation metrics
│   └── credit_risk_model.pkl        # Saved trained model
│
├── app/
│   └── app.py                       # Streamlit application
│
├── screenshots/                     # App screenshots
│
├── requirements.txt                 # Python dependencies
└── README.md                        # This file
```
```bash
# Clone the repository
git clone <repo-url>
cd credit-risk-analytics

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

To explore the analysis, start Jupyter and run the notebooks in order:

```bash
jupyter notebook
```

1. 01_data_exploration.ipynb
2. 02_feature_engineering.ipynb
3. 03_model_training.ipynb

To launch the web app:

```bash
streamlit run app/app.py
```

- Data preprocessing: Handling missing values, encoding, scaling
- Dealing with imbalanced data: class_weight, stratified splitting
- Model selection: When to use simple vs. complex models
- Evaluation: Why accuracy is misleading, importance of recall for imbalanced problems
- Deployment: Building user-facing applications with Streamlit
- Perfect models don't exist: 70-75% recall is actually pretty good for credit risk
- Explainability matters: Banks need to explain why they reject someone (regulations!)
- Business context > metrics: A False Negative costs more than a False Positive
- Data quality: Missing values and imbalance are normal in real datasets
- First tried mean imputation → Switched to median (better for outliers)
- Initially forgot stratification → Class distributions didn't match in train/test
- Scaled before splitting → Data leakage! Had to redo it properly
- Only looked at accuracy → Realized it's meaningless for imbalanced data
- Try SMOTE or ADASYN for handling imbalance
- Experiment with XGBoost (gradient boosting often outperforms Random Forest on tabular data)
- Hyperparameter tuning with GridSearchCV or RandomizedSearchCV
- Add more domain features (debt-to-income ratio, FICO score equivalent)
- Build an API instead of just Streamlit (FastAPI + Docker)
- A/B testing for real-world validation (measuring business impact in production)
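One of the future directions above, hyperparameter tuning, is a small amount of code with scikit-learn. A sketch on synthetic data (the grid values and the choice of recall as the scoring metric are illustrative, not tuned choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with roughly the project's 78/22 class ratio
X, y = make_classification(n_samples=400, weights=[0.78], random_state=1)

# Score on recall, since a missed default costs more than a false alarm
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=1),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, None]},
    scoring="recall", cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```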
How this could be used in practice:
1. Automated first-pass screening:
   - Low risk (< 30% probability) → Auto-approve
   - Medium risk (30-60%) → Human review
   - High risk (> 60%) → Likely deny or require co-signer
2. Risk-based pricing:
   - Adjust interest rates based on predicted default probability
   - More accurate pricing = more profit + fairer to borrowers
3. Portfolio management:
   - Monitor aggregate risk across all loans
   - Identify high-risk segments for intervention
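The first-pass screening tiers can be expressed as a tiny decision function (the 30% and 60% cutoffs are the hypothetical thresholds used in this README, not calibrated values):

```python
def risk_tier(p_default):
    """Map a predicted default probability to a screening decision.

    Thresholds (0.30 and 0.60) are illustrative cutoffs, not tuned values.
    """
    if p_default < 0.30:
        return ("Low", "Auto-approve")
    if p_default < 0.60:
        return ("Medium", "Human review")
    return ("High", "Deny or require co-signer")

print(risk_tier(0.12))  # ('Low', 'Auto-approve')
print(risk_tier(0.45))  # ('Medium', 'Human review')
print(risk_tier(0.81))  # ('High', 'Deny or require co-signer')
```

In production these thresholds would be tuned against the relative cost of false positives (rejected good borrowers) and false negatives (approved defaults).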
Estimated value (hypothetical):
- If this model prevents just 1% more defaults on a $100M loan portfolio
- With average loss of 50% on defaults
- That's $500K saved per year
- Limited features: Real credit bureaus have 100+ features
- Simulated data: Not from actual bank records
- No temporal component: Doesn't account for changing economic conditions
- Regulatory compliance: Real models need to pass fair lending audits
- No production infrastructure: Just a demo app
- Integrate with credit bureau APIs
- Add temporal features (seasonality, economic indicators)
- Build an explainability layer (SHAP values, LIME)
- Deploy to cloud (AWS/GCP)
- Add monitoring & retraining pipeline
- A/B testing framework
- Mobile-responsive UI
Languages & Frameworks:
- Python 3.8+
- Pandas, NumPy (data manipulation)
- Matplotlib, Seaborn (visualization)
- Scikit-learn (machine learning)
- Streamlit (web app)
Tools:
- Jupyter Notebook (analysis)
- Git (version control)
- VS Code (development)
- Dataset inspired by real-world credit risk scenarios
- Learned a ton from Kaggle notebooks and Stack Overflow
- Thanks to the open-source community for amazing libraries
Built by a passionate junior data scientist learning by doing!
Questions or feedback? I'd love to hear from you:
- Open an issue on GitHub
- Connect on LinkedIn
- Email: [your-email]
This project is for educational and portfolio purposes. Feel free to use it as a learning resource!
β If this project helped you learn something, consider giving it a star!
Last updated: January 2024