Role: Data Analyst / Python Analyst
Dataset: Synthetic / anonymized demo data created for portfolio use
Stack: Python, pandas, matplotlib, CSV/Excel processing
Raw customer order data contained duplicate rows, inconsistent date formats, text-based numeric values, mixed city names, missing emails, and inconsistent order statuses.
The goal of this project was to build a reproducible data-cleaning pipeline that converts messy operational data into a clean dataset, a quality summary report, and a simple revenue visualization.
Built a Python/pandas cleaning pipeline that:
- standardizes emails, phone numbers, city names, dates, order statuses and revenue fields;
- removes duplicate rows;
- validates and normalizes core fields;
- exports a clean customer-order dataset;
- generates a reusable data-quality summary;
- creates a revenue visualization after cleaning.
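
A minimal sketch of how the steps above might look in pandas. The column names (`email`, `phone`, `city`, `order_date`, `status`, `revenue`), the helper name `clean_orders`, and the specific normalization rules are assumptions for illustration, not necessarily what `src/main.py` does.

```python
import pandas as pd


def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the raw order table (column names are assumed)."""
    df = raw.copy()

    # Emails: trim, lowercase, and treat empty strings as missing.
    email = df["email"].str.strip().str.lower()
    df["email"] = email.where(email != "")

    # Phone numbers: keep digits only (assumes the column was read as text).
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

    # City names and order statuses: trim and unify letter case.
    df["city"] = df["city"].str.strip().str.title()
    df["status"] = df["status"].str.strip().str.lower()

    # Dates: parse value by value so mixed formats are tolerated;
    # anything unparseable becomes NaT instead of raising.
    df["order_date"] = df["order_date"].apply(
        lambda value: pd.to_datetime(value, errors="coerce")
    )

    # Revenue: strip currency symbols and thousands separators, then convert.
    revenue_text = df["revenue"].astype(str).str.replace(r"[^0-9.\-]", "", regex=True)
    df["revenue"] = pd.to_numeric(revenue_text, errors="coerce")

    # Deduplicate after normalization so rows that differed only in
    # formatting collapse into one (an assumed ordering of the steps).
    return df.drop_duplicates().reset_index(drop=True)
```

Running deduplication last is a design choice in this sketch: two rows such as `" A@X.com "` and `"a@x.com"` only become identical once the fields are normalized.
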
| Metric | Result |
|---|---|
| Raw rows | 945 |
| Clean rows | 900 |
| Duplicates removed | 45 |
| Missing emails after cleaning | 28 |
| Paid orders | 395 |
| Total paid revenue | $46,912.19 |
| Clean export generated | Yes |
| Quality summary generated | Yes |
| Revenue visualization generated | Yes |
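
The counts in this table can be derived from the raw and cleaned frames. The following is a sketch only: `summarize` is a hypothetical helper, the column names are assumed, and paid orders are assumed to carry the status `paid` after normalization.

```python
import pandas as pd


def summarize(raw: pd.DataFrame, clean: pd.DataFrame) -> dict:
    """Compute data-quality metrics from the raw and cleaned tables."""
    paid = clean[clean["status"] == "paid"]  # assumed status label
    return {
        "raw_rows": len(raw),
        "clean_rows": len(clean),
        # Valid only if deduplication is the sole step that drops rows.
        "duplicates_removed": len(raw) - len(clean),
        "missing_emails_after_cleaning": int(clean["email"].isna().sum()),
        "paid_orders": len(paid),
        "total_paid_revenue": round(float(paid["revenue"].sum()), 2),
    }
```

Computing duplicates as a row-count difference matches the figures reported here (945 - 900 = 45), but it would overcount if the pipeline also dropped rows for other reasons.
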
- results/clean_customer_orders.csv: cleaned dataset;
- results/cleaning_summary.csv: data quality metrics;
- results/revenue_after_cleaning.png: revenue visualization after cleaning;
- results/run_log.txt: pipeline execution summary.
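
One way these four files could be produced is sketched below. The file names come from the list above; everything else is an assumption: the helper name `write_results`, the column names, the `paid` status label, and the choice to plot paid revenue per month (the actual chart may aggregate differently).

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import pandas as pd


def write_results(clean: pd.DataFrame, summary: dict, out_dir: str = "results") -> list:
    """Write the clean CSV, summary CSV, revenue chart, and run log."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    clean_path = out / "clean_customer_orders.csv"
    clean.to_csv(clean_path, index=False)

    summary_path = out / "cleaning_summary.csv"
    summary_table = pd.DataFrame(list(summary.items()), columns=["metric", "value"])
    summary_table.to_csv(summary_path, index=False)

    # Assumed chart: total paid revenue per calendar month.
    paid = clean[clean["status"] == "paid"]
    monthly = paid.groupby(paid["order_date"].dt.to_period("M"))["revenue"].sum()
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(monthly.index.astype(str), monthly.values)
    ax.set_xlabel("Month")
    ax.set_ylabel("Paid revenue")
    ax.set_title("Paid revenue after cleaning")
    fig.tight_layout()
    plot_path = out / "revenue_after_cleaning.png"
    fig.savefig(plot_path, dpi=150)
    plt.close(fig)

    log_path = out / "run_log.txt"
    log_lines = [f"{name}: {value}" for name, value in summary.items()]
    log_path.write_text("\n".join(log_lines) + "\n", encoding="utf-8")

    return [clean_path, summary_path, plot_path, log_path]
```

The sketch expects `order_date` to already be a datetime column and `revenue` to be numeric, so it should run only after cleaning.
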
I can adapt this project for:
- cleaning client CSV/Excel files;
- building automated data-quality reports;
- creating Google Sheets / Excel dashboards;
- building trading journal dashboards;
- analyzing PnL, win rate, drawdown and strategy performance.
python-data-cleaning-report/
├── README.md
├── requirements.txt
├── data/
│   └── raw_customer_orders.csv
├── src/
│   └── main.py
└── results/
    ├── clean_customer_orders.csv
    ├── cleaning_summary.csv
    ├── revenue_after_cleaning.png
    └── run_log.txt
