🧬 Automated Clinical Data Quality Pipeline (ETL)

📌 The Business Problem

In clinical trials, data integrity is paramount. Raw patient logs often contain duplicate entries, missing critical codes, and biological anomalies (e.g., impossible heart rate readings) due to sensor errors or manual entry mistakes. If this raw data is loaded directly into an analytics warehouse, it corrupts trial reporting and compliance.

🛠️ The Solution

I engineered a modular ETL (Extract, Transform, Load) pipeline in Python to automate the cleaning and validation of daily clinical trial data before it reaches the data warehouse.

Extract: Generates synthetic daily patient logs (10,000+ records) mimicking real-world messiness (injected duplicates, nulls, and outliers).
Transform: Utilizes Pandas to enforce strict data quality rules:
- Automatically detects and drops identical duplicate records.
- Imputes missing error_code values to maintain schema integrity.
- Applies boolean filtering to remove biological anomalies (e.g., filtering out heart rates < 40 or > 200 BPM).
Load: Writes the validated, clean dataset to a local SQLite relational database for downstream analytics.
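
The transform and load steps above can be sketched roughly as follows. This is a minimal illustration, not the code in transform.py or load.py: the column names (patient_id, heart_rate), the "UNKNOWN" placeholder used for imputation, and the table name clinical_logs are all assumptions made for the example. Only the 40 and 200 BPM thresholds and the error_code column come from the description above.

```python
import sqlite3

import pandas as pd

# Thresholds from the README; everything else here is illustrative.
HEART_RATE_MIN = 40
HEART_RATE_MAX = 200


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three quality rules described above."""
    # 1. Drop rows that are identical across every column.
    cleaned = df.drop_duplicates()
    # 2. Fill missing error codes with a placeholder (hypothetical value).
    cleaned = cleaned.assign(error_code=cleaned["error_code"].fillna("UNKNOWN"))
    # 3. Keep only heart rates inside the plausible range (bounds inclusive).
    in_range = cleaned["heart_rate"].between(HEART_RATE_MIN, HEART_RATE_MAX)
    return cleaned[in_range].reset_index(drop=True)


def load(df: pd.DataFrame, conn: sqlite3.Connection, table: str = "clinical_logs") -> int:
    """Write the cleaned frame to SQLite, replacing any existing table."""
    df.to_sql(table, conn, if_exists="replace", index=False)
    return len(df)


# Tiny hand-made sample: one duplicate, one missing code, two out-of-range readings.
raw = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3, 4],
        "heart_rate": [72, 72, 250, 65, 38],
        "error_code": ["E0", "E0", "E1", None, "E2"],
    }
)
clean = transform(raw)
conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
rows_written = load(clean, conn)
```

In this sample, two of the five rows survive: the duplicate of patient 1 is dropped, patients 2 and 4 fall outside the heart rate range, and patient 3 keeps its row with the placeholder error code.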

🏗️ Architecture

The pipeline is split into single-purpose modules so each stage can be changed and tested on its own:

config.py: Centralized configuration for biological thresholds and file paths.
extract.py / transform.py / load.py: Isolated logic modules.
main.py: The pipeline orchestrator.
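
One way this layout could fit together is sketched below in a single file. It is an assumption about the wiring, not the repository's actual code: the Config dataclass, the run_pipeline function, and the fake_* stage functions are hypothetical names, and the toy stages use plain lists of dicts so the example runs without pandas, files, or a database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass(frozen=True)
class Config:
    """Stand-in for config.py: thresholds and paths in one place."""
    heart_rate_min: int = 40
    heart_rate_max: int = 200
    csv_path: str = "daily_clinical_logs.csv"
    db_path: str = "clinical_trials.db"


def run_pipeline(
    config: Config,
    extract: Callable[[Config], List[Record]],
    transform: Callable[[List[Record], Config], List[Record]],
    load: Callable[[List[Record], Config], int],
) -> int:
    """Stand-in for main.py: call each stage in order, return rows loaded."""
    raw = extract(config)
    clean = transform(raw, config)
    return load(clean, config)


# Toy stages so the wiring can be exercised in isolation.
def fake_extract(config: Config) -> List[Record]:
    return [{"heart_rate": 72}, {"heart_rate": 250}]


def fake_transform(rows: List[Record], config: Config) -> List[Record]:
    return [
        r for r in rows
        if config.heart_rate_min <= r["heart_rate"] <= config.heart_rate_max
    ]


def fake_load(rows: List[Record], config: Config) -> int:
    return len(rows)


loaded = run_pipeline(Config(), fake_extract, fake_transform, fake_load)
```

Passing the stages in as functions keeps the orchestrator free of cleaning logic, and reading thresholds from one config object means a changed limit only has to be edited in one place.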

🚀 How to Run Locally

Clone the repository.
Install pandas (sqlite3 is part of the Python standard library and needs no separate install).
Run the pipeline: python main.py
