Add PhysioNet De-Identification dataset, NER task, and TransformerDeID model#981

Open
mtmckenna wants to merge 15 commits into sunlabuiuc:master from mtmckenna:deid-bert

@mtmckenna
Contributor

Matt McKenna (mtm16@illinois.edu)

Contribution Type

Full Pipeline: Dataset + Task + Model (Option 4)

Paper

Johnson, Alistair E.W., et al. "Deidentification of free-text medical records using pre-trained bidirectional
transformers." Proceedings of the ACM Conference on Health, Inference, and Learning (CHIL), 2020.

https://doi.org/10.1145/3368555.3384455

Description

Implements BERT-based clinical text de-identification as a PyHealth pipeline. Given clinical notes with protected health information (PHI), the model performs token-level NER to detect and classify PHI into 7 categories (NAME, DATE, LOCATION, AGE, CONTACT, ID, PROFESSION) using BIO tagging.
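As a rough illustration of the BIO scheme described above, the sketch below tags whitespace-separated tokens from character-level PHI spans. It is not the PR's implementation: the `bio_tag` helper, the `(start, end, category)` span format, and the whitespace tokenization are all assumptions made for the example (a BERT pipeline would normally label subword tokens instead).

```python
import re
from typing import List, Tuple

# Hypothetical span format: (start_char, end_char_exclusive, category).
Span = Tuple[int, int, str]


def bio_tag(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Label whitespace-separated tokens with BIO tags from character spans."""
    tagged = []
    prev_index = None  # index of the span that covered the previous token
    for match in re.finditer(r"\S+", text):
        label, current = "O", None
        for index, (start, end, category) in enumerate(spans):
            # A token belongs to a span if their character ranges overlap.
            if match.start() < end and match.end() > start:
                prefix = "I-" if index == prev_index else "B-"
                label, current = prefix + category, index
                break
        tagged.append((match.group(), label))
        prev_index = current
    return tagged


note = "Seen by Dr. Jane Doe on 2019-03-04."
name_start = note.index("Jane Doe")
date_start = note.index("2019-03-04")
spans = [
    (name_start, name_start + len("Jane Doe"), "NAME"),
    (date_start, date_start + len("2019-03-04"), "DATE"),
]
tags = bio_tag(note, spans)
```

Here "Jane" receives `B-NAME`, "Doe" receives `I-NAME`, the date token receives `B-DATE`, and every other token is `O`.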

  • Dataset: Parses PhysioNet deidentifiedmedicaltext 1.0 files (id.text + id.res), aligns PHI spans between original and de-identified versions, and produces token-level BIO labels.
  • Task: Converts patient events into NER samples. Supports configurable overlapping windows (paper Section 3.3) to handle notes longer than BERT's 512-token limit.
  • Model: TransformerDeID, a pretrained transformer encoder (BERT or RoBERTa) used for token-level PHI classification.
  • Example: End-to-end training + evaluation script with binary PHI metrics and overlapping window prediction merging.
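The overlapping windows and the prediction merging mentioned in the list can be sketched as follows. This is only an illustration under stated assumptions: `make_windows`, `merge_scores`, the `window_size` and `stride` parameters, and the choice to average per-token PHI scores across overlapping windows are hypothetical and may differ from what the task and example script in this PR actually do.

```python
from typing import List, Tuple


def make_windows(n_tokens: int, window_size: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges that cover a note with overlap."""
    if n_tokens <= window_size:
        return [(0, n_tokens)]
    starts = list(range(0, n_tokens - window_size + 1, stride))
    # Add a final window flush with the end so trailing tokens are covered.
    if starts[-1] + window_size < n_tokens:
        starts.append(n_tokens - window_size)
    return [(start, start + window_size) for start in starts]


def merge_scores(
    n_tokens: int,
    windows: List[Tuple[int, int]],
    window_scores: List[List[float]],
) -> List[float]:
    """Average per-token PHI scores over every window containing the token."""
    totals = [0.0] * n_tokens
    counts = [0] * n_tokens
    for (start, _end), scores in zip(windows, window_scores):
        for offset, score in enumerate(scores):
            totals[start + offset] += score
            counts[start + offset] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


windows = make_windows(n_tokens=10, window_size=4, stride=3)
# Pretend the first window scored every token as PHI and the others as non-PHI.
scores = [[1.0] * 4, [0.0] * 4, [0.0] * 4]
merged = merge_scores(10, windows, scores)
```

With these toy inputs, token 3 falls in two windows and ends up with the average of their scores, while tokens seen by only one window keep that window's score.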

Data Access

The test data in test-resources/core/physionet_deid/ is synthetic (fake).
Real data requires PhysioNet credentialed access:
https://physionet.org/content/deidentifiedmedicaltext/1.0/

Ablation Results

Our results are worse than those reported in the original paper. Our hypothesis is that this is because we train only on the PhysioNet data and do not add in the other datasets.

| Config | Precision | Recall | F1 |
| --- | --- | --- | --- |
| BERT, no window | 95.1% | 70.3% | 80.8% |
| BERT, win=100/60 | 86.9% | 75.7% | 80.9% |
| RoBERTa, no window | 98.1% | 64.7% | 78.0% |
| RoBERTa, win=100/60 | 82.6% | 68.6% | 75.0% |
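For readers who want to check how the three columns relate, here is a minimal sketch of binary PHI metrics. It assumes token-level evaluation in which any non-`O` tag counts as PHI regardless of category; the PR's example script may compute its metrics differently, and `binary_phi_metrics` and `f1_from` are hypothetical names.

```python
from typing import List, Tuple


def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_phi_metrics(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Collapse BIO tags to PHI / non-PHI, then compute precision, recall, F1."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gold_phi, pred_phi = g != "O", p != "O"
        if gold_phi and pred_phi:
            tp += 1  # category mismatches still count as detected PHI
        elif pred_phi:
            fp += 1
        elif gold_phi:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


gold = ["O", "B-NAME", "I-NAME", "O", "B-DATE"]
pred = ["O", "B-NAME", "O", "B-AGE", "B-DATE"]
precision, recall, f1 = binary_phi_metrics(gold, pred)

# Consistency check against the "BERT, no window" row (95.1% / 70.3%).
bert_f1 = round(f1_from(0.951, 0.703), 3)  # 0.808
```

The last line reproduces the 80.8% F1 reported for the first row from its precision and recall, which is what the harmonic-mean definition predicts.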

Files to Review

| File | Description |
| --- | --- |
| pyhealth/datasets/physionet_deid.py | Dataset: parsing, PHI classification, BIO tagging |
| pyhealth/datasets/configs/physionet_deid.yaml | Dataset YAML config |
| pyhealth/tasks/deid_ner.py | Task with windowing support |
| pyhealth/models/transformer_deid.py | TransformerDeID model |
| tests/core/test_physionet_deid.py | Dataset + task tests (22 tests) |
| tests/core/test_transformer_deid.py | Model tests (20 tests) |
| examples/physionet_deid_ner_transformer_deid.py | Training + ablation script |
| docs/api/datasets/pyhealth.datasets.PhysioNetDeIDDataset.rst | Dataset docs |
| docs/api/tasks/pyhealth.tasks.DeIDNERTask.rst | Task docs |
| docs/api/models/pyhealth.models.TransformerDeID.rst | Model docs |
