Add PhysioNet De-Identification dataset, NER task, and TransformerDeID model by mtmckenna · Pull Request #981 · sunlabuiuc/PyHealth

mtmckenna · 2026-04-14T05:58:01Z

Contributor

Contribution Type

Full Pipeline: Dataset + Task + Model (Option 4)

Paper

Johnson, Alistair E.W., et al. "Deidentification of free-text medical records using pre-trained bidirectional
transformers." Proceedings of the ACM Conference on Health, Inference, and Learning (CHIL), 2020.

https://doi.org/10.1145/3368555.3384455

Description

Implements BERT-based clinical text de-identification as a PyHealth pipeline. Given clinical notes with protected health information (PHI), the model performs token-level NER to detect and classify PHI into 7 categories (NAME, DATE, LOCATION, AGE, CONTACT, ID, PROFESSION) using BIO tagging.
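As a rough illustration of the BIO scheme described above, the following sketch turns character-level PHI spans into per-token tags. It is not the code from this PR: the whitespace tokenizer, the `(start, end, category)` span format, and the function names are assumptions made for the example, and the real dataset class may tokenize and align differently.

```python
from typing import List, Optional, Tuple

# Hypothetical span format: (start_char, end_char, category), end exclusive.
Span = Tuple[int, int, str]


def whitespace_tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Split on whitespace, keeping each token's character offsets."""
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((tok, start, end))
        pos = end
    return tokens


def bio_tags(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Assign B-/I-/O tags to whitespace tokens from character-level spans."""
    tagged = []
    prev_span: Optional[Span] = None
    for tok, start, end in whitespace_tokenize(text):
        # A token is assigned to the first span it overlaps, if any.
        hit = next((s for s in spans if start < s[1] and end > s[0]), None)
        if hit is None:
            tagged.append((tok, "O"))
        elif hit is prev_span:
            tagged.append((tok, f"I-{hit[2]}"))  # continues the same span
        else:
            tagged.append((tok, f"B-{hit[2]}"))  # first token of a span
        prev_span = hit
    return tagged


note = "Seen by Dr. Jane Smith on 2020-01-05."
spans = [(12, 22, "NAME"), (26, 36, "DATE")]
tags = bio_tags(note, spans)
for token, tag in tags:
    print(f"{token}\t{tag}")
```

Each multi-token PHI mention gets one `B-` tag followed by `I-` tags, so adjacent mentions of the same category stay distinguishable.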

- Dataset: Parses PhysioNet deidentifiedmedicaltext 1.0 files (id.text + id.res), aligns PHI spans between original and de-identified versions, and produces token-level BIO labels.
- Task: Converts patient events into NER samples. Supports configurable overlapping windows (paper Section 3.3) to handle notes longer than BERT's 512 token limit.
- Model: TransformerDeID - pretrained transformer encoder (BERT or RoBERTa).
- Example: End-to-end training + evaluation script with binary PHI metrics and overlapping window prediction merging.
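The overlapping-window idea and the merge step can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the parameter names `window_size` and `stride`, the use of per-token PHI probabilities, and averaging as the merge rule are all hypothetical choices for the example. The only detail taken from the PR is that windows are merged before deciding whether a token is PHI.

```python
from typing import List, Tuple


def make_windows(n_tokens: int, window_size: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges covering [0, n_tokens) with overlap."""
    if window_size <= 0 or stride <= 0 or stride > window_size:
        raise ValueError("need 0 < stride <= window_size")
    windows = []
    start = 0
    while True:
        end = min(start + window_size, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return windows


def merge_window_scores(
    n_tokens: int,
    windows: List[Tuple[int, int]],
    window_scores: List[List[float]],
) -> List[float]:
    """Average per-token scores over every window that covers the token."""
    totals = [0.0] * n_tokens
    counts = [0] * n_tokens
    for (start, _end), scores in zip(windows, window_scores):
        for offset, score in enumerate(scores):
            totals[start + offset] += score
            counts[start + offset] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy example: 10 tokens, windows of 4 tokens, advancing 3 tokens at a time.
windows = make_windows(10, window_size=4, stride=3)
# Made-up PHI probabilities per window (one value per token in the window).
window_scores = [
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0, 0.0],
]
merged = merge_window_scores(10, windows, window_scores)
# Decide PHI only after merging, using an assumed 0.5 threshold.
is_phi = [score >= 0.5 for score in merged]
```

Tokens 3 and 6 each fall in two windows, so their final score is the mean of two predictions; every other token is scored by a single window.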

Data Access

The test data in test-resources/core/physionet_deid/ is synthetic (fake).
Real data requires PhysioNet credentialed access:
https://physionet.org/content/deidentifiedmedicaltext/1.0/

Ablation Results

Our results are worse than those reported in the original paper. Our hypothesis is that the gap comes from training only on the PhysioNet data, without adding the other datasets.

| Config | Precision | Recall | F1 |
| --- | --- | --- | --- |
| BERT, no window | 95.1% | 70.3% | 80.8% |
| BERT, win=100/60 | 86.9% | 75.7% | 80.9% |
| RoBERTa, no window | 98.1% | 64.7% | 78.0% |
| RoBERTa, win=100/60 | 82.6% | 68.6% | 75.0% |
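For readers unfamiliar with binary PHI metrics, here is a minimal sketch of how precision, recall, and F1 can be computed when every non-`O` tag counts as PHI regardless of category. Token-level counting and the function names are assumptions for illustration; the example script in this PR may count differently (for instance at the entity level).

```python
from typing import Dict, List


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_phi_metrics(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Token-level metrics where any tag other than 'O' counts as PHI."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_phi, p_phi = g != "O", p != "O"
        if g_phi and p_phi:
            tp += 1  # PHI found, even if the category differs
        elif p_phi:
            fp += 1  # predicted PHI on a non-PHI token
        elif g_phi:
            fn += 1  # missed PHI token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1_from(precision, recall)}


gold = ["O", "B-NAME", "I-NAME", "O", "B-DATE"]
pred = ["O", "B-NAME", "O", "B-ID", "B-LOCATION"]
metrics = binary_phi_metrics(gold, pred)
print(metrics)
```

The same `f1_from` helper reproduces the F1 column of the table from its precision and recall columns, e.g. `f1_from(0.951, 0.703)` rounds to 80.8%.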

Files to Review

| File | Description |
| --- | --- |
| `pyhealth/datasets/physionet_deid.py` | Dataset: parsing, PHI classification, BIO tagging |
| `pyhealth/datasets/configs/physionet_deid.yaml` | Dataset YAML config |
| `pyhealth/tasks/deid_ner.py` | Task with windowing support |
| `pyhealth/models/transformer_deid.py` | TransformerDeID model |
| `tests/core/test_physionet_deid.py` | Dataset + task tests (22 tests) |
| `tests/core/test_transformer_deid.py` | Model tests (20 tests) |
| `examples/physionet_deid_ner_transformer_deid.py` | Training + ablation script |
| `docs/api/datasets/pyhealth.datasets.PhysioNetDeIDDataset.rst` | Dataset docs |
| `docs/api/tasks/pyhealth.tasks.DeIDNERTask.rst` | Task docs |
| `docs/api/models/pyhealth.models.TransformerDeID.rst` | Model docs |

mtmckenna added 15 commits April 13, 2026 22:39

Add PhysioNetDeIDDataset

108ae5d

add DeIDNERTask for NER-based de-identification and update tests

3fc1d59

Add BertDeID model for BERT-based clinical text de-identification and corresponding unit tests

40f018c

Rename BertDeID to TransformerDeID and update related documentation and tests

a2e9b3e

Refactor TransformerDeID model: rename file and update tests

74cf0d5

WIP e2e script

5de2b58

add windowing

64d886e

merge windows before deciding if PHI

b397385

fix for roberta

ff6ea37

clarify comments

d80ad93

Add tests for windowing

b9da8a6

add rst files

6288838

remove dead code

238f98b

add deidentify method so i can test it out manually

ee1fb3a

combine tests to make them faster

92dcca5

mtmckenna wants to merge 15 commits into sunlabuiuc:master from mtmckenna:deid-bert
