A modular analytics framework for working with Real World Evidence (RWE) and electronic health records (EHR).
The project is structured as a realistic healthcare data pipeline: from synthetic data generation to ingestion, cohort construction, exploratory analysis, and the foundation for downstream modeling (risk prediction, survival analysis, and causal inference).
Business / Clinical Objective
This project simulates a realistic Real-World Evidence (RWE) workflow using synthetic EHR data to demonstrate how patient-level healthcare data can be transformed into clinically meaningful insights and modelling-ready datasets. It is designed to reflect use cases relevant to HEOR, clinical research, digital health, and observational analytics.
Use Cases / Why it matters
Cohort construction for observational studies
Risk stratification and outcome prediction
Survival analysis for time-to-event endpoints
Treatment pattern and medication utilization analysis
Causal inference and comparative effectiveness research
Synthetic data prototyping for privacy-preserving healthcare analytics
Potential Digital Health Applications
The analytical workflows implemented in this project are relevant for several digital health and precision medicine use cases, including:
- digital biomarker discovery
- remote patient monitoring
- clinical risk stratification
- early disease detection
- decision support systems
These approaches are particularly relevant for chronic disease management contexts such as diabetes, cardiovascular disease and metabolic disorders.
This repository provides a reproducible environment for analysing longitudinal, patient-level healthcare data.
It follows principles commonly used in pharmaceutical R&D, HEOR, and clinical analytics:
- Structured data ingestion and validation
- Cohort definition logic
- Domain-aware exploratory analysis
- Modular organisation under src/
- Reproducible workflows using notebooks
The project currently uses synthetic EHR-style data aligned with realistic clinical patterns, but the architecture supports real datasets with minimal adaptation.
rwe-health-analytics/
│
├── data/
│   ├── raw/ # Synthetic EHR datasets
│   └── processed/ # Cleaned datasets (future)
│
├── notebooks/
│   └── 01_exploratory_analysis.ipynb
│
├── src/
│   └── rwe_health_analytics/
│       ├── data/
│       │   ├── data_loader.py
│       │   └── data_generation/
│       │       └── synthetic_data_generator.py
│       ├── models/ # Survival / ML / causal (future)
│       ├── evaluation/ # Metrics and validation (future)
│       └── visualization/ # Plotting utilities (future)
│
├── tests/ # Test suite
├── docs/ # Methodology and technical notes
├── requirements.txt
└── setup.py
The system follows a structured, modular lifecycle:
Synthetic patient-level data are created using clinically inspired rules and realistic statistical distributions.
HealthcareDataLoader reads and validates the raw tables, offering:
- loading by domain (patients, diagnoses, labs…)
- patient-level queries
- data quality checks
- basic cohort creation logic
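The kind of quality check and cohort logic listed above can be illustrated with a minimal, standard-library sketch. It does not reproduce the actual `HealthcareDataLoader` API, which is not shown here; the function names (`read_rows`, `missing_counts`, `build_cohort`), the column names (`patient_id`, `age`, `icd_code`) and the inline sample data are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical miniature versions of demographics.csv and diagnoses.csv.
# Column names (patient_id, age, sex, icd_code) are assumptions.
DEMOGRAPHICS_CSV = """patient_id,age,sex
P001,67,F
P002,45,M
P003,,F
P004,72,M
"""

DIAGNOSES_CSV = """patient_id,icd_code
P001,E11
P002,I10
P004,E11
P004,I10
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_counts(rows):
    """Data quality check: count empty values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value == "":
                counts[column] = counts.get(column, 0) + 1
    return counts


def build_cohort(demographics, diagnoses, code_prefix, min_age):
    """Return sorted IDs of patients aged >= min_age with a matching diagnosis."""
    with_code = {
        d["patient_id"] for d in diagnoses
        if d["icd_code"].startswith(code_prefix)
    }
    return sorted(
        p["patient_id"] for p in demographics
        if p["age"] != ""                      # skip rows with missing age
        and int(p["age"]) >= min_age
        and p["patient_id"] in with_code
    )


demographics = read_rows(DEMOGRAPHICS_CSV)
diagnoses = read_rows(DIAGNOSES_CSV)
quality = missing_counts(demographics)
cohort = build_cohort(demographics, diagnoses, code_prefix="E11", min_age=65)
```

Here `quality` reports one missing `age` value, and `cohort` contains the two patients aged 65 or older with an `E11` code. Excluding rows with missing age is one possible design choice; a real study protocol might instead impute or flag them.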
Jupyter notebooks provide:
- demographic summaries
- comorbidity patterns
- laboratory distributions
- medication behaviour
- longitudinal event timelines
The architecture is prepared for:
- survival analysis
- risk prediction
- clustering
- causal inference pipelines
Throughout, the design maintains a clear separation between data, logic, and analysis.
git clone https://github.com/Finarosalina/rwe-health-analytics.git
cd rwe-health-analytics
Create environment:
python -m venv venv
Activate:
Windows
venv\Scripts\activate
macOS/Linux
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
pip install -e .
Run:
python src/rwe_health_analytics/data/data_generation/synthetic_data_generator.py
This generates:
data/raw/
demographics.csv
diagnoses.csv
lab_results.csv
medications.csv
outcomes.csv
| Dataset | Description |
|---|---|
| Demographics | Age, sex, race, BMI, smoking, insurance |
| Diagnoses | ICD-like codes, visit types, chronicity |
| Laboratories | Lab measures with normal/abnormal ranges |
| Medications | Prescriptions, dose, duration, specialty |
| Outcomes | Hospitalizations, ER visits, mortality |
Data are generated using:
- normal, log-normal, Poisson, and multinomial distributions
- disease-driven physiological changes
- comorbidity-based risk modeling
- temporal sequencing for visits, prescriptions, labs and outcomes
This enables EDA, ML, causal inference, and survival analysis without handling any real PHI/PII.
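As a rough illustration of the generation approach described above, the following sketch draws one record per patient from normal, log-normal, Poisson, and categorical distributions using only the standard library. It is not the code in `synthetic_data_generator.py`; the field names, the 15% diabetes prevalence, and every distribution parameter are assumptions chosen for the example.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_patient(rng, patient_number):
    """Create one synthetic patient record; all parameters are illustrative."""
    has_diabetes = rng.random() < 0.15  # assumed prevalence
    # Age: normal distribution, rounded and clipped to the 18-95 range.
    age = min(95, max(18, round(rng.normalvariate(55, 15))))
    # Disease-driven shift: diabetic patients get a higher median HbA1c.
    hba1c_median = 7.8 if has_diabetes else 5.4
    hba1c = round(rng.lognormvariate(math.log(hba1c_median), 0.1), 1)
    # Comorbidity-based risk: higher expected visit count with diabetes.
    visits = sample_poisson(rng, 4.0 if has_diabetes else 1.5)
    # Categorical draw (a single-trial multinomial) for insurance type.
    insurance = rng.choices(
        ["public", "private", "uninsured"], weights=[0.5, 0.4, 0.1]
    )[0]
    return {
        "patient_id": f"P{patient_number:04d}",
        "age": age,
        "has_diabetes": has_diabetes,
        "hba1c": hba1c,
        "visits": visits,
        "insurance": insurance,
    }


rng = random.Random(42)  # fixed seed so the output is reproducible
patients = [generate_patient(rng, i) for i in range(1, 1001)]
```

Passing a seeded `random.Random` instance into each function, rather than relying on the module-level generator, keeps runs reproducible and makes the sampling code easy to test.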
Launch notebooks:
jupyter notebook
Notebook included:
notebooks/01_exploratory_analysis.ipynb
The analysis covers:
- demographic distributions
- comorbidity burden
- diagnosis patterns
- medication dynamics
- laboratory value distributions
- outcome timelines
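One of these summaries, comorbidity burden, can be sketched as the number of distinct diagnosis codes per patient. The snippet below is an assumption about how such a measure might be computed, not the notebook's actual code; the `(patient_id, code)` record layout and the sample values are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical diagnosis records as (patient_id, code) pairs.
diagnosis_rows = [
    ("P001", "E11"), ("P001", "I10"), ("P001", "E11"),  # repeated code
    ("P002", "I10"),
    ("P003", "E11"), ("P003", "I10"), ("P003", "E78"),
]

# Collect the distinct codes seen for each patient.
codes_by_patient = defaultdict(set)
for patient_id, code in diagnosis_rows:
    codes_by_patient[patient_id].add(code)

# Burden = number of distinct codes; repeated visits for a code count once.
burden = {pid: len(codes) for pid, codes in codes_by_patient.items()}

# How many patients fall at each burden level.
burden_distribution = Counter(burden.values())
```

Counting distinct codes rather than rows avoids inflating the burden of patients who are seen frequently for the same condition.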
The project is structured as a Python package to support clean expansion.
Future modules may include:
- Cox PH models
- Random Survival Forests
- Gradient-boosted risk models
- Propensity score pipelines
- Streamlit dashboard
- Docker deployment
pytest
pytest --cov=src
MIT License.
Maria Pais Fajín
GitHub: https://github.com/Finarosalina
LinkedIn: https://linkedin.com/in/maria-pais-fajin