This repository contains my personal data warehouse project: raw e-commerce data is cleaned, transformed, and structured into fact tables and data marts. The goal is to produce datasets that serve both BI dashboards and machine learning models, providing insight into sales performance, top products, and seller activity across cities and categories.
So far, the project has reached the gold layer, where a customer/order-centric data mart (`mart_sales_performance`) aggregates:
- Total sales per seller, city, and product category
- Average order value and number of items
- Weekly trends for better time-based analysis
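To illustrate what the mart computes, here is a minimal Python sketch of the same aggregations over a few hypothetical order rows (the column names and values are illustrative, not the actual schema — the real logic lives in the dbt model):

```python
from collections import defaultdict
from datetime import date

# Hypothetical order rows; in the project these come from silver-layer tables.
orders = [
    {"seller": "s1", "city": "Sao Paulo", "category": "toys",
     "order_date": date(2018, 1, 3), "items": 2, "revenue": 59.90},
    {"seller": "s1", "city": "Sao Paulo", "category": "toys",
     "order_date": date(2018, 1, 9), "items": 1, "revenue": 19.90},
    {"seller": "s2", "city": "Rio de Janeiro", "category": "books",
     "order_date": date(2018, 1, 4), "items": 3, "revenue": 120.00},
]

# Total sales per seller, city, and product category.
total_sales = defaultdict(float)
for o in orders:
    total_sales[(o["seller"], o["city"], o["category"])] += o["revenue"]

# Average order value and average number of items per order.
avg_order_value = sum(o["revenue"] for o in orders) / len(orders)
avg_items = sum(o["items"] for o in orders) / len(orders)

# Weekly trend: revenue bucketed by ISO year and week.
weekly = defaultdict(float)
for o in orders:
    year, week, _ = o["order_date"].isocalendar()
    weekly[(year, week)] += o["revenue"]
```

In the warehouse these become `GROUP BY` clauses in the gold-layer SQL; the sketch only shows the grain of each aggregate.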
The project follows the medallion architecture, which organizes the data into three stages:
- Bronze Layer → raw ingested data, minimal transformations
- Silver Layer → cleaned and harmonized data ready for analytics
- Gold Layer → aggregated, high-quality data ready for data marts, BI dashboards, and ML datasets
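The three layers can be sketched end to end in a few lines of Python. Everything below is illustrative (the column names are not the actual Kaggle schema, and the real transformations are dbt models), but it shows what each layer contributes:

```python
import csv
import io

# Hypothetical raw CSV as it would land in the bronze layer.
raw_csv = """order_id,price,order_date
1, 59.90 ,2018-01-03
2,,2018-01-04
3,19.90,2018-01-09
"""

# Bronze: ingest as-is; every field stays a string, nothing is dropped.
bronze = list(csv.DictReader(io.StringIO(raw_csv)))

# Silver: clean and harmonize -- trim whitespace, cast types,
# and discard rows that fail a basic quality check.
silver = []
for row in bronze:
    price = row["price"].strip()
    if not price:
        continue  # quality check: drop records with a missing price
    silver.append({"order_id": int(row["order_id"]),
                   "price": float(price),
                   "order_date": row["order_date"].strip()})

# Gold: aggregate the clean rows into an analysis-ready figure.
gold_total_revenue = sum(r["price"] for r in silver)
```

Note how the row count shrinks from bronze to silver: the raw layer keeps every record for auditability, while downstream layers only carry data that passed the quality checks.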
The tools used at each stage:
- Dataset Source: Kaggle (extracted to local CSV files)
- Initial Data Profiling & Insights: Jupyter Notebook
- Temporary Database / Local Engine: DuckDB
- Data Transformation & Quality Checks: dbt
- Querying & Data Integrity Checks: DBeaver
Here's the data warehouse architecture I've designed:

The star schema that consolidates all fact and dimension tables:
