Add PTB-XL dataset and MI classification task #950

Draft
zaidalkhatib wants to merge 4 commits into sunlabuiuc:master from zaidalkhatib:ptbxl-mi-task

zaidalkhatib commented Apr 6, 2026

Contributors: Zaid Alkhatib (zaida3@illinois.edu), Anila Narapusetty (anilan2@illinois.edu)

Contribution Type: Dataset + Task

Paper: Data Augmentation for Electrocardiograms
Paper Link: https://arxiv.org/abs/2204.04360


Overview

This PR adds support for the PTB-XL ECG dataset and implements a binary myocardial infarction (MI) classification task as a partial reproduction of the paper "Data Augmentation for Electrocardiograms".

The contribution focuses on the dataset and task components of the pipeline within PyHealth, rather than reproducing the full model and training procedure.


What was implemented

Dataset

  • Added PTBXLDataset in pyhealth/datasets/ptbxl.py
  • Inherits from BaseDataset
  • Loads PTB-XL metadata from ptbxl_database.csv
  • Supports both:
    • records100/ (100 Hz, default)
    • records500/ (500 Hz, optional)
  • Converts ECG records into PyHealth-compatible event format
  • Includes record_path for downstream waveform loading
  • Supports dev=True for fast iteration
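
The metadata-to-event step above can be sketched as follows. This is a minimal illustration, not the code in pyhealth/datasets/ptbxl.py: the function name build_events, the dev_limit parameter, and the flat dict layout are hypothetical. The column names (ecg_id, patient_id, scp_codes, filename_lr, filename_hr) are the ones used in ptbxl_database.csv; the two rows below are a synthetic stand-in.

```python
import csv
import io

# Two-row stand-in for ptbxl_database.csv (synthetic values, real column names).
SAMPLE_CSV = """ecg_id,patient_id,scp_codes,filename_lr,filename_hr
1,15709.0,"{'NORM': 100.0, 'SR': 0.0}",records100/00000/00001_lr,records500/00000/00001_hr
2,13243.0,"{'IMI': 100.0}",records100/00000/00002_lr,records500/00000/00002_hr
"""


def build_events(csv_file, sampling_rate=100, dev=False, dev_limit=1):
    """Turn metadata rows into flat per-record dicts (hypothetical layout)."""
    if sampling_rate not in (100, 500):
        raise ValueError("PTB-XL ships waveforms at 100 Hz and 500 Hz only")
    # filename_lr points into records100/, filename_hr into records500/.
    path_column = "filename_lr" if sampling_rate == 100 else "filename_hr"
    events = []
    for row in csv.DictReader(csv_file):
        events.append({
            # patient_id is stored as a float-like string in the CSV.
            "patient_id": str(int(float(row["patient_id"]))),
            "ecg_id": int(row["ecg_id"]),
            "scp_codes": row["scp_codes"],
            "record_path": row[path_column],
        })
        # dev mode: stop early so iteration stays fast (limit is an assumption).
        if dev and len(events) >= dev_limit:
            break
    return events


events_100 = build_events(io.StringIO(SAMPLE_CSV))
events_500 = build_events(io.StringIO(SAMPLE_CSV), sampling_rate=500, dev=True)
```

Keeping record_path as a plain relative path means the waveform itself is only read later, when a task needs it.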

Task

  • Added PTBXLMIClassificationTask in pyhealth/tasks/ptbxl_mi_classification.py
  • Binary classification:
    • 1 → Myocardial Infarction (MI)
    • 0 → Non-MI
  • Labels derived using scp_statements.csv to map SCP codes to the MI diagnostic class
  • Loads real ECG waveforms using WFDB from PTB-XL records
  • Signal processing:
    • Transposes signals to (channels, time)
    • Normalizes per channel
    • Pads/truncates to fixed length for model compatibility
  • Uses PyHealth schema:
    • signal: "tensor"
    • label: "binary"

Tests

  • Added synthetic unit tests:
    • tests/core/test_ptbxl_dataset.py
    • tests/core/test_ptbxl_mi_classification.py
  • Tests:
    • Use small synthetic data (no external dependency)
    • Run quickly
    • Validate:
      • dataset loading
      • label generation logic
      • sample structure
  • Waveform loading is mocked in tests to keep them lightweight
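
One way to mock waveform loading is sketched below with unittest.mock. ToyMITask and its load_waveform method are hypothetical stand-ins, not the PR's class or method names; the point is only that the disk read is replaced by a fixed in-memory signal so the test needs no PTB-XL files.

```python
from unittest import mock


class ToyMITask:
    """Hypothetical stand-in for the task; not the PR's class."""

    def load_waveform(self, record_path):
        # A real task would read the record from disk here (e.g. via WFDB).
        raise FileNotFoundError(record_path)

    def __call__(self, event):
        signal = self.load_waveform(event["record_path"])
        return {"signal": signal, "label": event["label"]}


def run_with_mocked_waveform():
    fake_signal = [[0.0] * 12 for _ in range(10)]  # 10 samples x 12 leads
    task = ToyMITask()
    # Replace the loader on the class for the duration of the call only.
    with mock.patch.object(
        ToyMITask, "load_waveform", return_value=fake_signal
    ) as loader:
        sample = task({"record_path": "records100/00000/00001_lr", "label": 1})
    loader.assert_called_once_with("records100/00000/00001_lr")
    return sample


sample = run_with_mocked_waveform()
```

Because the patch is scoped to the with block, the original loader is restored afterwards and other tests are unaffected.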

Docs

  • Added dataset documentation:
    • docs/api/datasets/pyhealth.datasets.ptbxl.rst
  • Added task documentation:
    • docs/api/tasks/pyhealth.tasks.ptbxl_mi_classification.rst
  • Updated:
    • docs/api/datasets.rst
    • docs/api/tasks.rst

Example

  • Added:
    • examples/ptbxl_mi_classification_cnn.py

The example demonstrates:

  • loading PTB-XL dataset
  • applying MI classification task
  • generating processed samples
  • using real ECG waveform data

Files to Review

Core implementation

  • pyhealth/datasets/ptbxl.py
  • pyhealth/tasks/ptbxl_mi_classification.py

Registration

  • pyhealth/datasets/__init__.py
  • pyhealth/tasks/__init__.py

Tests

  • tests/core/test_ptbxl_dataset.py
  • tests/core/test_ptbxl_mi_classification.py

Documentation

  • docs/api/datasets/pyhealth.datasets.ptbxl.rst
  • docs/api/tasks/pyhealth.tasks.ptbxl_mi_classification.rst
  • docs/api/datasets.rst
  • docs/api/tasks.rst

Example

  • examples/ptbxl_mi_classification_cnn.py

Notes

  • Tests use synthetic data only (per PyHealth guidelines)
  • Example requires local PTB-XL dataset to run end-to-end
  • This implementation reproduces the dataset and task setup from the paper
  • Model training and augmentation pipeline are not included in this PR
