The goal of this project is to analyze a retail customer dataset and divide the customer base into distinct groups based on their purchasing behavior and personal data. By understanding these segments, businesses can optimize targeted marketing strategies, improve customer retention, and maximize revenue.
The data used in this project is the Mall Customer Segmentation Data sourced from Kaggle. It contains synthetic data on supermarket customers designed specifically for learning clustering algorithms.
- Features used for clustering:
  - Age
  - Annual Income (k$)
  - Spending Score (1-100): A proprietary score assigned to customers based on purchasing behavior and history.
customer-segmentation/
│
├── data/
│ └── Mall_Customers.csv # The dataset
│
├── notebooks/
│ └── eda_and_clustering.ipynb
│
├── src/
│ ├── eda.py # Exploratory Data Analysis
│ ├── kmeans_2d.py # 2D Clustering (Income vs Spending)
│ └── kmeans_3d.py # 3D Clustering (Age, Income, Spending)
│
├── visuals/
│ ├── elbow_curve.png
│ ├── 3d Scatterplot of Age vs Annual Income vs Spending Score
│ ├── Cluster of Spending Score vs Annual Income
│ ├── Distribution of Age
│ ├── Distribution of Annual Income
│ ├── Distribution of Spending Score
│ ├── Gender Distribution
│ └── Scatterplot of the clusters of Spending Score vs Annual Income
│
├── requirements.txt
└── README.md
This project utilizes K-Means Clustering, an unsupervised machine learning algorithm.
- Exploratory Data Analysis (EDA): Visualized the distributions of age, annual income, spending score, and gender.
- Determining Optimal Clusters: Used the Elbow Method (calculating the Within-Cluster Sum of Squares) to find the optimal number of clusters (k=5).
- 2D Clustering: Grouped customers based on Annual Income and Spending Score.
- 3D Clustering: Added 'Age' as a third dimension for a more granular segmentation.
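The Elbow Method step above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn and NumPy are installed; it is not the code in `src/kmeans_2d.py`. The input is a synthetic stand-in (five made-up blobs in income/spending space) so the snippet runs without the CSV, and `elbow_wcss` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for (Annual Income, Spending Score).
# The blob centers are illustrative assumptions, not values from the dataset.
rng = np.random.default_rng(42)
centers = np.array([[25, 20], [25, 80], [55, 50], [88, 17], [88, 82]])
X = np.vstack([c + rng.normal(scale=5.0, size=(40, 2)) for c in centers])


def elbow_wcss(X, k_max=10, seed=42):
    """Return the Within-Cluster Sum of Squares for k = 1..k_max."""
    wcss = []
    for k in range(1, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # inertia_ is the sum of squared distances to the nearest centroid.
        wcss.append(model.inertia_)
    return wcss


wcss = elbow_wcss(X)
for k, value in enumerate(wcss, start=1):
    print(f"k={k:2d}  WCSS={value:10.1f}")
```

Plotting `wcss` against `k` gives the elbow curve. The "elbow" is the value of `k` after which the WCSS stops dropping sharply, and it is usually read off the plot by eye.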
With k=5, the K-Means algorithm grouped the customer base into five distinct segments:
- Cluster 0 (Target Customers): High Income, High Spending Score. (Prime targets for premium marketing)
- Cluster 1 (Careful Customers): High Income, Low Spending Score. (Target for retention and discount campaigns to boost spending)
- Cluster 2 (Standard Customers): Average Income, Average Spending Score. (The bulk of the customer base)
- Cluster 3 (Sensible Customers): Low Income, Low Spending Score.
- Cluster 4 (Careless Customers): Low Income, High Spending Score.
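K-Means assigns cluster indices arbitrarily, so the numbering above can change between runs or random seeds. One way to keep the segment names stable is to derive them from each centroid's position rather than from its index. The sketch below assumes that approach; the function name, the `band` threshold, and the example centroids are all hypothetical and are not taken from the project's scripts or results.

```python
# Segment names from the list above, keyed by (income level, spending level).
SEGMENT_NAMES = {
    ("High", "High"): "Target",
    ("High", "Low"): "Careful",
    ("Average", "Average"): "Standard",
    ("Low", "Low"): "Sensible",
    ("Low", "High"): "Careless",
}


def level(value, mean, band):
    """Classify a value as High, Low, or Average relative to the mean."""
    if value > mean + band:
        return "High"
    if value < mean - band:
        return "Low"
    return "Average"


def name_segments(centroids, income_mean, score_mean, band=10.0):
    """Map each cluster index to a segment name based on its centroid.

    centroids: list of (annual_income, spending_score) pairs.
    band: hypothetical half-width of the "Average" zone around each mean.
    """
    names = {}
    for index, (income, score) in enumerate(centroids):
        key = (level(income, income_mean, band), level(score, score_mean, band))
        names[index] = SEGMENT_NAMES.get(key, "Unlabeled")
    return names


# Illustrative centroids only; real values come from the fitted model.
example_centroids = [(88.0, 82.0), (88.0, 17.0), (55.0, 50.0),
                     (26.0, 21.0), (26.0, 79.0)]
segments = name_segments(example_centroids, income_mean=60.0, score_mean=50.0)
for index, name in segments.items():
    print(f"Cluster {index}: {name} Customers")
```

With a fitted scikit-learn model, the centroids would come from `model.cluster_centers_`, and the means from the income and spending columns of the data.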
1. Clone the repository
git clone https://github.com/swechchhapatel/Customer-segmentation-analysis.git
cd Customer-segmentation-analysis
2. Install dependencies
Make sure you have Python installed, then run:
pip install -r requirements.txt
3. Run the scripts
To view the Exploratory Data Analysis:
python src/eda.py
To run the 2-Dimensional Clustering model:
python src/kmeans_2d.py
To run the 3-Dimensional Clustering model and print the customer groups:
python src/kmeans_3d.py
Contributions are welcome and appreciated!
If you'd like to improve this project, please follow these steps:
- Fork the repository
- Create a new branch
git checkout -b feature/your-feature-name
- Make changes and commit
git commit -m "Add: your message"
- Push to your branch and open a Pull Request
- Feel free to improve features, UI, or model performance.
Made with ❤️
