Handling Imbalanced Datasets

This folder contains techniques for addressing class imbalance in classification problems, which can lead to biased models that favor the majority class.

Overview

Class imbalance occurs when one class has significantly more samples than another in a classification problem. This can cause machine learning models to be biased toward the majority class, leading to poor performance on the minority class. This folder provides various techniques to address this issue.
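
Before choosing a technique, it helps to quantify how skewed the classes actually are. A minimal sketch, assuming a pandas DataFrame df with a binary target column (both names are placeholders):

import pandas as pd

# Inspect the class distribution; a large gap between the two
# proportions indicates an imbalance worth addressing
print(df['target'].value_counts(normalize=True))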

Techniques Covered

1. Upsampling (Oversampling)

  • File: handle_imbalance_dataset.ipynb
  • Method: sklearn.utils.resample with replace=True
  • Description: Increases the number of samples in the minority class by randomly resampling with replacement until it matches the majority class size. This involves duplicating existing minority class samples.
  • Use Case: When you have sufficient data and want to balance by increasing minority samples
  • When to Use:
    • When you don't want to lose any majority class data
    • When you have enough computational resources
    • When minority class samples are representative
  • Advantages:
    • Preserves all original data
    • Simple to implement
    • No data loss
  • Disadvantages:
    • Can lead to overfitting
    • Doesn't add new information
    • May not improve model generalization

2. Downsampling (Undersampling)

  • File: handle_imbalance_dataset.ipynb
  • Method: sklearn.utils.resample with replace=False
  • Description: Reduces the number of samples in the majority class by randomly selecting a subset without replacement until it matches the minority class size. This reduces the overall dataset size.
  • Use Case: When you want to balance by reducing majority samples (may result in data loss)
  • When to Use:
    • When you have a very large dataset
    • When majority class has redundant information
    • When computational resources are limited
  • Advantages:
    • Reduces dataset size (faster training)
    • Simple to implement
    • Can help if majority class has redundancy
  • Disadvantages:
    • Loses potentially valuable data
    • May remove important patterns
    • Not ideal when data is already limited

3. SMOTE (Synthetic Minority Over-sampling Technique)

  • File: handle_imbalance_smote.ipynb
  • Method: imblearn.over_sampling.SMOTE
  • Description: Generates synthetic samples for the minority class by interpolating between existing minority class samples. Creates new examples along the line segments joining nearby minority class samples in feature space.
  • Use Case: Advanced method for handling imbalanced data that creates new synthetic samples rather than just duplicating existing ones
  • When to Use:
    • When you want to add new information (not just duplicates)
    • When minority class samples are limited
    • When you need better generalization
  • Advantages:
    • Creates new synthetic samples
    • Better than simple duplication
    • Can improve model generalization
    • Works well with continuous features
  • Disadvantages:
    • May create unrealistic samples
    • Can be sensitive to outliers
    • Requires installation of imbalanced-learn
    • May not work well with discrete/categorical features

Files

  • handle_imbalance_dataset.ipynb: Upsampling and Downsampling implementations with examples
  • handle_imbalance_smote.ipynb: SMOTE implementation with visualization and comparison

Implementation Examples

Upsampling

from sklearn.utils import resample
import pandas as pd

# Assumes class 0 is the minority and class 1 the majority
minority = df[df['target'] == 0]
majority = df[df['target'] == 1]

# Sample the minority class WITH replacement until it matches the majority size
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=123)

df_balanced = pd.concat([minority_upsampled, majority])
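
After concatenation the rows are grouped by class, so shuffling is a common follow-up. A quick check of the result, reusing df_balanced from above:

# Verify the classes are now balanced
print(df_balanced['target'].value_counts())

# Shuffle, since concat leaves all minority rows before all majority rows
df_balanced = df_balanced.sample(frac=1, random_state=123).reset_index(drop=True)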

Downsampling

from sklearn.utils import resample

# Sample the majority class WITHOUT replacement down to the minority size
majority_downsampled = resample(majority,
                                replace=False,
                                n_samples=len(minority),
                                random_state=123)

df_balanced = pd.concat([minority, majority_downsampled])

SMOTE

from imblearn.over_sampling import SMOTE

# Interpolates between nearby minority samples to create synthetic ones;
# random_state is set for reproducibility
smote = SMOTE(random_state=123)
X_resampled, y_resampled = smote.fit_resample(X, y)
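
To confirm the effect, compare class counts before and after resampling, with X and y as above:

from collections import Counter

print(Counter(y))            # original, imbalanced counts
print(Counter(y_resampled))  # classes equalized by SMOTE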

Best Practices

  1. Evaluate Before Balancing: Check if imbalance is actually a problem for your use case
  2. Use Appropriate Metrics: Use metrics like F1-score, ROC-AUC, Precision-Recall AUC instead of accuracy
  3. Try Multiple Methods: Compare different resampling techniques
  4. Cross-Validation: Resample inside each cross-validation fold rather than before splitting, so synthetic or duplicated samples never leak into the validation data (see the sketch after this list)
  5. Consider Class Weights: Some algorithms support class_weight parameter as an alternative
  6. Domain Knowledge: Consider the cost of false positives vs false negatives
  7. Visualize Results: Use visualizations to understand the effect of resampling
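
The sketch below combines practices 2, 4, and 5: an imblearn Pipeline applies SMOTE inside each cross-validation fold, F1 replaces accuracy as the metric, and a class-weighted model is shown as a resampling-free alternative. X, y, and the choice of classifier are placeholders:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# SMOTE runs only on the training portion of each fold,
# so validation folds contain no synthetic samples
pipeline = Pipeline([
    ('smote', SMOTE(random_state=123)),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(scores.mean())

# Alternative to resampling: reweight the loss by class frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000)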

Additional Techniques (Not in this folder but worth knowing)

  • ADASYN: Adaptive Synthetic Sampling
  • Borderline-SMOTE: Focuses on borderline samples
  • Tomek Links: Removes borderline majority samples
  • Edited Nearest Neighbours: Removes noisy samples
  • Class Weights: Adjust algorithm's class weights instead of resampling
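
Apart from class weights, which are set on the estimator itself, these techniques are all implemented in imbalanced-learn and share the same fit_resample interface. A minimal sketch:

from imblearn.over_sampling import ADASYN, BorderlineSMOTE
from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours

# Oversamplers generate synthetic minority points;
# undersamplers remove borderline or noisy majority points
X_res, y_res = ADASYN(random_state=123).fit_resample(X, y)
X_res, y_res = TomekLinks().fit_resample(X, y)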

Installation

For SMOTE, install the imbalanced-learn library (the import name is imblearn):

pip install imbalanced-learn

Dependencies

  • pandas
  • numpy
  • scikit-learn
  • imbalanced-learn (for SMOTE)
  • matplotlib (for visualizations)
  • seaborn (for visualizations)

Additional Notes

  • Visualizations are included to show the effect of resampling techniques on data distribution
  • Always validate model performance after applying balancing techniques
  • Consider the business context when choosing a technique (cost of errors)
  • SMOTE works best with continuous numerical features
  • For categorical features, consider variants like SMOTE-NC (Nominal and Continuous); see the sketch below
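
A minimal SMOTE-NC sketch; the column indices [0, 2] are hypothetical and should be replaced with the actual positions of the categorical features:

from imblearn.over_sampling import SMOTENC

# categorical_features lists the indices of categorical columns (hypothetical here)
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=123)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)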