This project analyzes the Brazilian Olist e-commerce dataset to uncover business insights across sales performance, customer behavior, and delivery operations.
The project is designed with an end-to-end data workflow:
- raw data ingestion
- data cleaning and feature engineering
- analytical data modeling (fact/dimension tables)
- SQL-based analysis
- dashboard-ready outputs
The goal is to demonstrate strong data analysis fundamentals, including:
- correct metric definition
- proper handling of data granularity
- reproducible data pipelines
- business-oriented analytical thinking
- Build a clean analytical dataset from multiple relational tables
- Ensure correct metric computation by separating item-level and order-level logic
- Analyze revenue trends, customer behavior, and delivery performance
- Provide a structured dataset suitable for BI tools (e.g., Power BI)
Source: Olist Brazilian E-commerce Public Dataset
Main tables:
- orders
- order_items
- order_payments
- order_reviews
- customers
- products
- sellers
- product_category_name_translation
olist-ecommerce-data-analysis/
├── data/
│ ├── raw/ # Original datasets
│ ├── processed/ # Cleaned dataset (item-level)
│ └── marts/ # Analytical tables (fact/dimension)
├── notebooks/
│ ├── 01_eda.ipynb # Data exploration
│ ├── 02_cleaning.ipynb # Cleaning logic walkthrough
│ └── 03_analysis.ipynb # KPI validation & reasoning
├── scripts/
│ ├── build_processed_data.py
│ ├── build_analytics_mart.py
│ ├── run_analysis.py
│ └── notebook_reference_checks.py
├── sql/ # Business analysis queries
├── outputs/
│ ├── figures/ # Generated charts
│ └── summary_tables/ # Aggregated results
├── dashboard/
│ ├── exports/ # Dashboard screenshots
│ └── *_notes.md # Modeling & KPI notes
├── docs/
│ ├── data_quality_notes.md
│ └── metric_modeling_rules.md
├── requirements.txt
└── README.md
The processed dataset is stored at the order-item level:
- 1 row = 1 item within an order
- Orders with multiple items appear multiple times
Because of this:
- Business KPIs must be computed at the order level
- Product/category analysis remains at the item level
fact_sales→ item-level (order item)fact_orders→ order-level (aggregated)
Use:
fact_ordersfor KPIs (revenue, AOV, delivery, reviews)fact_salesfor category/product analysis
Two revenue concepts:
revenue→ item-level (price + freight)order_revenue→ aggregated to order level
Primary KPI rule:
- Use delivered orders only
This avoids overstating business performance with canceled or incomplete orders.
order_total_payment_valueis duplicated across item rows → must not be summed at item level- Some records contain delivery inconsistencies → explicitly flagged
- Missing values are partly structural (not all are data quality issues)
- Monthly revenue trend (delivered orders)
- Order volume and growth rate
- Average order value (AOV)
- Top categories by revenue
- Revenue distribution across product groups
- Pareto contribution (top 20% categories)
- Late delivery rate
- Delivery delay impact on customer reviews
- On-time vs late comparison
- Repeat vs one-time customers
- Customer revenue contribution
- RFM-based segmentation (in marts layer)
- Total Orders
- Total Delivered Revenue
- Average Order Value (AOV)
- Average Review Score
- Late Delivery Rate
All computed from order-level aggregation
pip install -r requirements.txtpython scripts/build_processed_data.pypython scripts/build_analytics_mart.pypython scripts/run_analysis.pypython scripts/notebook_reference_checks.pyHigh-level KPIs including total revenue, total orders, average review score, and late delivery rate, along with monthly revenue and order trends.
Top-performing product categories by revenue, revenue share distribution, and the relationship between category revenue and customer satisfaction.
Delivery performance analysis, including late delivery rates by state, delivery status distribution, and the impact of delivery delays on customer reviews.
Customer segmentation using RFM analysis, highlighting customer distribution, repeat behavior, and revenue contribution by segment.
Notes:
- The dashboard is built on top of the marts layer
fact_ordersis used for KPI calculationsfact_salesis used for product/category visuals
- Revenue is heavily concentrated in a small number of categories (Pareto effect)
- Late deliveries negatively impact customer review scores
- Repeat customers contribute disproportionately to total revenue
- Delivered-order filtering provides a more accurate view of business performance
- Proper separation of item-level vs order-level logic
- End-to-end reproducible data pipeline
- Combination of Python + SQL for analysis
- Star schema modeling for BI readiness
- Explicit handling of data quality and metric assumptions
- Dataset is historical and may not reflect real-time business behavior
- Some delivery-related inconsistencies exist in source data
- Customer retention is approximated using order history (no full lifecycle tracking)
This project demonstrates the importance of:
- correct metric definitions
- understanding data granularity
- building clean analytical datasets
- connecting data work to business meaning
It reflects a practical, production-style approach to data analysis rather than a purely exploratory workflow.



