An end-to-end machine learning project for predicting credit default risk, which I designed, implemented, and iteratively refined to understand real-world credit risk modeling.
Why does this matter?
Banks and financial institutions lose billions annually from loan defaults. Even a small improvement in predicting who will default can save millions. But it's a balancing act - if you're too strict, you reject good borrowers and lose revenue. Too lenient, and you approve risky loans that won't be repaid.
This project builds a practical credit risk assessment system that can:
- Predict the likelihood of loan default
- Explain which factors drive the prediction
- Help make informed lending decisions
Think of it as a "first-pass filter" that flags high-risk applications for human review, while fast-tracking low-risk ones.
Source: Simulated credit bureau data based on realistic lending scenarios
Size: 32,581 loan applications
Features:
- Personal Info: Age, income, employment length, home ownership
- Loan Details: Amount, interest rate, purpose, grade
- Credit History: Credit history length, previous defaults
Target: loan_status (0 = Repaid, 1 = Default)
Key Challenges I Faced:
- Imbalanced data: Only 22% of loans defaulted (but that's what we care about!)
- Missing values: ~7% missing in employment length and interest rate
- Mixed data types: Numerical + categorical features
- Real-world messiness: Data wasn't clean, which made it more realistic
- Metric selection: Initially underestimated how misleading accuracy can be for imbalanced datasets
Notebook: 01_data_exploration.ipynb
What I did:
- Loaded 32K+ records and checked for issues
- Analyzed missing values (found ~7% missing in 2 columns)
- Discovered class imbalance (78% repaid vs 22% default)
- Created visualizations to understand patterns
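The missing-value and class-balance checks above boil down to a couple of pandas calls. A minimal sketch on a toy frame (column names are hypothetical stand-ins for the real dataset's columns):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-sample standing in for the 32K-row dataset
df = pd.DataFrame({
    "person_emp_length": [2.0, np.nan, 5.0, 10.0, np.nan, 3.0],
    "loan_int_rate": [11.5, 7.9, np.nan, 13.2, 9.4, 10.1],
    "loan_status": [1, 0, 0, 1, 0, 0],  # 1 = default, 0 = repaid
})

# Missing-value share per column (the real data had ~7% in two columns)
missing_pct = df.isna().mean() * 100
print(missing_pct.round(1))

# Class balance check (real data: ~78% repaid vs ~22% default)
print(df["loan_status"].value_counts(normalize=True))
```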
Key Findings:
- Loan-to-income ratio is a huge predictor (people borrowing 60%+ of income are risky)
- Previous defaults strongly predict future defaults
- Interest rates already reflect risk (banks price it in)
- Surprisingly, age doesn't matter much
What surprised me:
- Medical and venture loans have much higher default rates than personal loans
- Some people are approved for loans worth 80%+ of their annual income (scary!)
Notebook: 02_feature_engineering.ipynb
Missing Values:
- Used median imputation (more robust than mean for outliers)
- Considered dropping rows but 7% is too much to lose
New Features Created:
- `income_to_loan_ratio`: How much income relative to loan size
- `age_group`: Binned age into Young/Mid/Senior
- `high_risk_flag`: Combo of previous default + high loan-to-income
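The three engineered features above can be built with plain pandas. A sketch on a toy frame (column names like `prev_default` and the age-bin edges are illustrative assumptions, not the project's exact values):

```python
import pandas as pd

# Toy applicant data; column names are hypothetical
df = pd.DataFrame({
    "person_age": [23, 35, 58, 29],
    "person_income": [40000, 85000, 60000, 30000],
    "loan_amnt": [10000, 20000, 12000, 24000],
    "prev_default": [0, 0, 1, 1],
})

# income_to_loan_ratio: income relative to loan size
df["income_to_loan_ratio"] = df["person_income"] / df["loan_amnt"]

# age_group: bin age into Young/Mid/Senior (bin edges assumed)
df["age_group"] = pd.cut(df["person_age"], bins=[0, 30, 50, 120],
                         labels=["Young", "Mid", "Senior"])

# high_risk_flag: previous default AND loan > 60% of income (threshold assumed)
df["high_risk_flag"] = ((df["prev_default"] == 1) &
                        (df["loan_amnt"] / df["person_income"] > 0.6)).astype(int)

print(df[["income_to_loan_ratio", "age_group", "high_risk_flag"]])
```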
Encoding:
- Label encoding for ordinal features (loan_grade: A→1, B→2, etc.)
- One-hot encoding for nominal features (loan purpose, home ownership)
Scaling:
- Used `StandardScaler` to normalize features (mean=0, std=1)
- Critical: Fit on training data only, then transform test (avoid data leakage!)
Why these choices?
- Median over mean: Less affected by outliers
- Stratified split: Keeps class balance in both train/test
- One-hot encoding: Doesn't assume false orderings (RENT isn't "less than" OWN)
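These choices fit naturally into a scikit-learn `ColumnTransformer`, which also enforces the fit-on-train-only rule automatically. A minimal sketch on toy data (column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the same column roles as the real dataset (names hypothetical)
df = pd.DataFrame({
    "person_income": [40000, 85000, 30000, 60000, 52000, 95000, 41000, 70000],
    "loan_amnt": [10000, 20000, 9000, 15000, 12000, 30000, 8000, 18000],
    "loan_int_rate": [11.5, np.nan, 13.2, 9.4, 10.1, np.nan, 12.0, 8.8],
    "home_ownership": ["RENT", "OWN", "RENT", "MORTGAGE",
                       "RENT", "OWN", "RENT", "MORTGAGE"],
    "loan_status": [1, 0, 1, 0, 0, 0, 1, 0],
})
X, y = df.drop(columns="loan_status"), df["loan_status"]

num_cols = ["person_income", "loan_amnt", "loan_int_rate"]
cat_cols = ["home_ownership"]

preprocess = ColumnTransformer([
    # Median imputation then scaling for numerics; one-hot for nominals
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Stratified split keeps the class ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

# fit_transform on train only, then transform test: no leakage
X_tr_t = preprocess.fit_transform(X_tr)
X_te_t = preprocess.transform(X_te)
```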
Notebook: 03_model_training.ipynb
Models Tested:
1. Logistic Regression (Baseline)
   - Simple, fast, interpretable
   - Used `class_weight='balanced'` for imbalanced data
   - Good starting point
2. Random Forest (Final Model)
   - Captures non-linear patterns
   - Handles feature interactions
   - Gives feature importance
   - Better performance overall
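Both models fit in a few lines of scikit-learn. A sketch on synthetic data with roughly the same 78/22 class ratio (all numbers illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~22% positive class, like the real default rate
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.78], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Baseline: class_weight='balanced' upweights the rare default class
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
baseline.fit(X_tr, y_tr)

# Final model: random forest captures non-linear patterns and interactions
forest = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                random_state=42)
forest.fit(X_tr, y_tr)
```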
Why not just use accuracy?
If I predict "no one defaults", I get 78% accuracy but catch ZERO defaulters. Useless!
Metrics I focused on:
- Precision: Of predicted defaults, how many were correct? (Don't reject too many good borrowers)
- Recall: Of actual defaults, how many did we catch? (Don't miss bad loans)
- F1 Score: Balance between precision and recall
- ROC-AUC: Overall ability to distinguish classes
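All four metrics come straight from `sklearn.metrics`. A sketch on ten hypothetical predictions, showing how a decent-looking accuracy can hide imperfect precision and recall:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical predictions for 10 loans (1 = default)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.2, 0.15, 0.7, 0.3, 0.25, 0.1, 0.8, 0.65, 0.4])

# Accuracy can look fine even when defaults slip through; report all four
print("accuracy :", accuracy_score(y_true, y_pred))   # 0.8
print("precision:", precision_score(y_true, y_pred))  # 2/3
print("recall   :", recall_score(y_true, y_pred))     # 2/3
print("f1       :", f1_score(y_true, y_pred))
print("roc_auc  :", roc_auc_score(y_true, y_prob))
```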
Results:
- Random Forest achieved ~70-75% recall (catching most defaults)
- ROC-AUC stabilized around 0.78 on the test set (decent predictive power)
- Noticeably better than random guessing, and realistic for a real-world scenario
| Metric | Logistic Regression | Random Forest (Final) |
|---|---|---|
| Accuracy | ~70-75% | ~72-77% |
| Precision | ~45-55% | ~50-60% |
| Recall | ~65-70% | ~70-75% |
| F1 Score | ~52-60% | ~57-65% |
| ROC-AUC | ~0.72-0.76 | ~0.75-0.80 |
In plain English:
- The model catches about 70-75% of defaults (recall)
- When it predicts "default", it's right 50-60% of the time (precision)
- Overall, it's significantly better than random guessing
Why isn't it 99% accurate?
Because credit risk is genuinely hard to predict! Real-world factors we don't have:
- Unexpected life events (medical emergencies, job loss)
- Economic conditions
- Detailed payment history
- Full credit bureau data
Banks with billion-dollar ML teams don't get 99% either. This is realistic.
Top 5 most important factors:
- Loan interest rate (banks already price in risk)
- Loan-to-income percentage
- Previous default history
- Income level
- Credit history length
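Rankings like this come directly from the fitted forest's impurity-based importances. A minimal sketch on synthetic data (the feature names are hypothetical stand-ins for the project's columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names mirroring the project's columns
feature_names = ["loan_int_rate", "loan_percent_income", "prev_default",
                 "person_income", "cb_cred_hist_length"]
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort largest first to rank features
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```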
Built a Streamlit web app where users can:
- Input loan applicant details
- Get instant risk assessment (Low/Medium/High)
- See default probability
- Receive recommendation (Approve/Review/Deny)
- Understand key risk factors
How to run it:
```bash
cd credit-risk-analytics
streamlit run app/app.py
```

Then open your browser to http://localhost:8501
- ✅ Clean, intuitive interface
- ✅ Real-time prediction
- ✅ Risk level with color coding
- ✅ Explanation of key factors
- ✅ Actionable recommendations
Screenshots: (See screenshots/ folder)
```
credit-risk-analytics/
│
├── data/
│   ├── raw/                         # Original dataset
│   └── processed/                   # Processed, ready-for-modeling data
│
├── notebooks/
│   ├── 01_data_exploration.ipynb    # EDA with visualizations
│   ├── 02_feature_engineering.ipynb # Data preprocessing
│   └── 03_model_training.ipynb      # Model development
│
├── src/
│   ├── preprocessing.py             # Data cleaning functions
│   ├── model.py                     # Model training utilities
│   ├── evaluation.py                # Evaluation metrics
│   └── credit_risk_model.pkl        # Saved trained model
│
├── app/
│   └── app.py                       # Streamlit application
│
├── screenshots/                     # App screenshots
│
├── requirements.txt                 # Python dependencies
└── README.md                        # This file
```
```bash
# Clone the repository
git clone <repo-url>
cd credit-risk-analytics

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

To explore the analysis, start Jupyter and run the notebooks in order:

```bash
jupyter notebook
```

1. 01_data_exploration.ipynb
2. 02_feature_engineering.ipynb
3. 03_model_training.ipynb

To launch the web app:

```bash
streamlit run app/app.py
```

- Data preprocessing: Handling missing values, encoding, scaling
- Dealing with imbalanced data: class_weight, stratified splitting
- Model selection: When to use simple vs. complex models
- Evaluation: Why accuracy is misleading, importance of recall for imbalanced problems
- Deployment: Building user-facing applications with Streamlit
- Perfect models don't exist: 70-75% recall is actually pretty good for credit risk
- Explainability matters: Banks need to explain why they reject someone (regulations!)
- Business context > metrics: A False Negative costs more than a False Positive
- Data quality: Missing values and imbalance are normal in real datasets
- First tried mean imputation → Switched to median (better for outliers)
- Initially forgot stratification → Class distributions didn't match in train/test
- Scaled before splitting → Data leakage! Had to redo it properly
- Only looked at accuracy → Realized it's meaningless for imbalanced data
- Try SMOTE or ADASYN for handling imbalance
- Experiment with XGBoost (gradient boosting often outperforms Random Forest on tabular data)
- Hyperparameter tuning with GridSearchCV or RandomizedSearchCV
- Add more domain features (debt-to-income ratio, FICO score equivalent)
- Build an API instead of just Streamlit (FastAPI + Docker)
- A/B testing for real-world validation (measuring business impact in production)
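One of the future directions above, hyperparameter tuning, is a small amount of code with scikit-learn. A sketch on synthetic data (the grid values and the choice of recall as the scoring metric are illustrative, not tuned choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with roughly the project's 78/22 class ratio
X, y = make_classification(n_samples=400, weights=[0.78], random_state=1)

# Score on recall, since a missed default costs more than a false alarm
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=1),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, None]},
    scoring="recall", cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```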
How this could be used in practice:
1. Automated first-pass screening:
   - Low risk (< 30% probability) → Auto-approve
   - Medium risk (30-60%) → Human review
   - High risk (> 60%) → Likely deny or require co-signer
2. Risk-based pricing:
   - Adjust interest rates based on predicted default probability
   - More accurate pricing = more profit + fairer to borrowers
3. Portfolio management:
   - Monitor aggregate risk across all loans
   - Identify high-risk segments for intervention
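The first-pass screening tiers can be expressed as a tiny decision function (the 30% and 60% cutoffs are the hypothetical thresholds used in this README, not calibrated values):

```python
def risk_tier(p_default):
    """Map a predicted default probability to a screening decision.

    Thresholds (0.30 and 0.60) are illustrative cutoffs, not tuned values.
    """
    if p_default < 0.30:
        return ("Low", "Auto-approve")
    if p_default < 0.60:
        return ("Medium", "Human review")
    return ("High", "Deny or require co-signer")

print(risk_tier(0.12))  # ('Low', 'Auto-approve')
print(risk_tier(0.45))  # ('Medium', 'Human review')
print(risk_tier(0.81))  # ('High', 'Deny or require co-signer')
```

In production these thresholds would be tuned against the relative cost of false positives (rejected good borrowers) and false negatives (approved defaults).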
Estimated value (hypothetical):
- If this model prevents just 1% more defaults on a $100M loan portfolio
- With average loss of 50% on defaults
- That's $500K saved per year
- Limited features: Real credit bureaus have 100+ features
- Simulated data: Not from actual bank records
- No temporal component: Doesn't account for changing economic conditions
- Regulatory compliance: Real models need to pass fair lending audits
- No production infrastructure: Just a demo app
- Integrate with credit bureau APIs
- Add temporal features (seasonality, economic indicators)
- Build an explainability layer (SHAP values, LIME)
- Deploy to cloud (AWS/GCP)
- Add monitoring & retraining pipeline
- A/B testing framework
- Mobile-responsive UI
Languages & Frameworks:
- Python 3.8+
- Pandas, NumPy (data manipulation)
- Matplotlib, Seaborn (visualization)
- Scikit-learn (machine learning)
- Streamlit (web app)
Tools:
- Jupyter Notebook (analysis)
- Git (version control)
- VS Code (development)
- Dataset inspired by real-world credit risk scenarios
- Learned a ton from Kaggle notebooks and Stack Overflow
- Thanks to the open-source community for amazing libraries
Built by a passionate junior data scientist learning by doing!
Questions or feedback? I'd love to hear from you:
- Open an issue on GitHub
- Connect on LinkedIn
- Email: [your-email]
This project is for educational and portfolio purposes. Feel free to use it as a learning resource!
β If this project helped you learn something, consider giving it a star!
Last updated: January 2024