feat(nlp+tasks): Add sentence-level VADER sentiment scoring and MIMIC… #968
Open
vtewari2 wants to merge 1 commit into sunlabuiuc:master
…-III mistrust task
Implements the negative-sentiment mistrust proxy from Boag et al. 2018
"Racial Disparities and Mistrust in End-of-Life Care" (MLHC 2018,
arXiv:1808.03827) using sentence-level VADER averaging to avoid the
full-text saturation problem specific to clinical discharge notes.
pyhealth/nlp/__init__.py [new]
- Initialise pyhealth.nlp as a proper Python package
- Export SentimentScorer and normalize_sentiment_scores
pyhealth/nlp/sentiment_scorer.py [new]
- SentimentScorer: wraps NLTK VADER with sentence-level averaging
(score each sentence, take mean) — avoids full-text compound
saturation (>94% of clinical notes saturate at -1.0)
- score(text): mean sentence-level VADER compound score for a document
- score_batch(texts): batch variant
- negate_and_zscore(raw_scores): applies Boag et al. normalisation:
neg_score = -(raw - mu) / sigma
- normalize_sentiment_scores(sample_dataset): post-task Z-score
normalisation utility for MistrustSentimentMIMIC3 samples
pyhealth/tasks/sentiment_mimic3.py [new]
- MistrustSentimentMIMIC3: extracts discharge summary notes from
NOTEEVENTS, scores with SentimentScorer, returns:
input: neg_sentiment (tensor, 1-element list, raw negated score)
output: noncompliance or autopsy_consent (binary, configurable)
- Lazy NLTK initialisation (no import at module load time)
- output_label param: switch between noncompliance / autopsy_consent
to align with MistrustNoncomplianceMIMIC3 / MistrustAutopsyMIMIC3
pyhealth/tasks/__init__.py
- Export MistrustSentimentMIMIC3
Co-Authored-By: Varun Tewari <vtewari2@illinois.edu>
Summary
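Adds clinical-text sentiment scoring infrastructure and a MIMIC-III task
that implements the negative-sentiment mistrust proxy from Boag et al. 2018,
"Racial Disparities and Mistrust in End-of-Life Care" (MLHC 2018,
arXiv:1808.03827).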
This addresses a gap in PyHealth: no support for sentiment analysis as a
clinical feature, and no mechanism to extract affective signals from
unstructured discharge notes for downstream ML tasks.
The saturation problem with full-text VADER
Standard full-text VADER compound scoring is unsuitable for clinical notes: more than 94% of clinical notes saturate at a compound score of -1.0.
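A rough illustration of the saturation effect, under simplifying assumptions. VADER's compound score squashes a summed valence s through s / sqrt(s^2 + alpha), with alpha = 15 by default, so a long note whose valences sum to a large negative number lands near -1.0. The sketch below uses a toy lexicon with made-up valences and a naive regex sentence splitter (both hypothetical stand-ins, not the PR's code and not VADER's lexicon or rules) to compare full-text scoring with per-sentence averaging.

```python
import math
import re
from statistics import mean

# Hypothetical valences for illustration only; this is not VADER's lexicon.
TOY_LEXICON = {
    "pain": -2.0, "failure": -2.5, "distress": -2.5,
    "improved": 2.0, "stable": 1.5,
}


def compound(valence_sum, alpha=15.0):
    """Squash a summed valence into (-1, 1), VADER-style."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


def toy_compound(text):
    """Sum toy valences over all words, then normalise.

    Simplification: no negation, booster, or punctuation rules.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return compound(sum(TOY_LEXICON.get(w, 0.0) for w in words))


def split_sentences(text):
    """Naive split after ., ! or ?; a stand-in for a real sentence tokenizer."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_mean(text):
    """Score each sentence separately and average the compound scores."""
    sentences = split_sentences(text)
    return mean(toy_compound(s) for s in sentences) if sentences else 0.0


# A long note that mixes negative clinical terms with some positive ones.
base = ("Patient reports pain. Respiratory failure noted. Patient in distress. "
        "Condition improved. Now stable.")
note = " ".join([base] * 20)

full_text_score = toy_compound(note)        # large summed valence, close to -1
sentence_level_score = sentence_mean(note)  # small per-sentence sums, mildly negative
```

In this toy setting the full-text score is pinned near -1.0 regardless of how the note is worded, while the per-sentence mean stays well inside the range and can still distinguish notes from one another.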
Clinical discharge language is lexically negative ("pain", "failure",
"respiratory distress"), causing full-text VADER to bottom out.
Sentence-level averaging avoids saturation and closely approximates
the word-averaged pattern.en approach used in the original paper.
Changes
pyhealth/nlp/__init__.py (new)
Initialises pyhealth.nlp as a proper Python package and exports SentimentScorer and normalize_sentiment_scores.
pyhealth/nlp/sentiment_scorer.py (new)
SentimentScorer
- Wraps NLTK VADER with sentence-level averaging; sent_tokenize is deferred to instantiation time (lazy import)
- score(text) → mean sentence-level VADER compound in [−1.0, +1.0]
- score_batch(texts) → batch variant
- negate_and_zscore(raw_scores) → applies Boag et al. normalisation: neg_score = -(raw - μ) / σ
normalize_sentiment_scores(sample_dataset, feature_key)
- Z-score normalises the neg_sentiment column across all samples in a SampleDataset in-place
- This normalisation needs statistics over the whole dataset, so it cannot be done inside __call__ (which processes one patient at a time); this utility fills that gap
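A minimal sketch of the two normalisation steps described above, using only the standard library. The function names, the use of the population standard deviation, the zero-variance fallback, and the list-of-dicts stand-in for a SampleDataset are all assumptions made for illustration; the PR's actual implementation may differ on each point.

```python
from statistics import mean, pstdev


def negate_and_zscore_sketch(raw_scores):
    """Return -(raw - mu) / sigma for each raw score.

    Assumptions: population standard deviation, and all zeros when sigma is 0.
    """
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)
    if sigma == 0:
        return [0.0 for _ in raw_scores]
    return [-(raw - mu) / sigma for raw in raw_scores]


def normalize_samples_sketch(samples, feature_key="neg_sentiment"):
    """Z-score a 1-element feature list across all samples, in place.

    The stored values are already negated by the task, so no second
    negation is applied here (an assumption about the intended behaviour).
    """
    values = [sample[feature_key][0] for sample in samples]
    mu, sigma = mean(values), pstdev(values)
    for sample in samples:
        value = sample[feature_key][0]
        sample[feature_key] = [(value - mu) / sigma if sigma else 0.0]
    return samples


# Raw mean compound scores for three notes; the most negative note should
# receive the highest normalised score.
z = negate_and_zscore_sketch([-0.2, 0.0, 0.2])

# Stand-in for a SampleDataset: one dict per sample.
samples = [
    {"neg_sentiment": [0.2], "noncompliance": 1},
    {"neg_sentiment": [0.0], "noncompliance": 0},
    {"neg_sentiment": [-0.2], "noncompliance": 0},
]
normalize_samples_sketch(samples)
```

Both helpers need the mean and standard deviation of every score at once, which is why this step sits outside the per-patient __call__.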
pyhealth/tasks/sentiment_mimic3.py (new)
MistrustSentimentMIMIC3
    input_schema = {"neg_sentiment": "tensor"}   # 1-element list [float]
    output_schema = {"noncompliance": "binary"}  # configurable via output_label
For each admission:
- Extract the notes from NOTEEVENTS with CATEGORY = 'Discharge summary'
- Score each note with SentimentScorer.score()
- raw_mean = mean(note_scores)
- raw_neg = -raw_mean (higher → more negative → more mistrust)
- Return {"neg_sentiment": [raw_neg], output_label: 0/1}
The output_label parameter (default "noncompliance") aligns the output schema with MistrustNoncomplianceMIMIC3 or MistrustAutopsyMIMIC3 for direct comparison across all three mistrust proxies.
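The per-admission steps above can be sketched as a small standalone function. Everything here is hypothetical scaffolding rather than the PR's code: notes are plain dicts with "category" and "text" keys, the scorer is passed in as a callable so that a stub can replace SentimentScorer.score(), and returning None for admissions without a discharge summary is an assumption.

```python
from statistics import mean


def build_sample_sketch(notes, score_fn, label, output_label="noncompliance"):
    """Build one sample for an admission, or None if nothing can be scored."""
    # Keep only discharge summaries.
    texts = [n["text"] for n in notes if n["category"] == "Discharge summary"]
    if not texts:
        return None  # assumption: such admissions are skipped

    # Score each note, average, then negate so higher means more negative.
    note_scores = [score_fn(text) for text in texts]
    raw_mean = mean(note_scores)
    raw_neg = -raw_mean

    return {"neg_sentiment": [raw_neg], output_label: int(label)}


def stub_score(text):
    """Hypothetical stand-in for SentimentScorer.score()."""
    return -0.5 if "pain" in text else 0.25


notes = [
    {"category": "Discharge summary", "text": "Patient in pain."},
    {"category": "Discharge summary", "text": "Recovering well."},
    {"category": "Nursing", "text": "pain pain"},  # ignored: wrong category
]

sample = build_sample_sketch(notes, stub_score, label=1)
autopsy_sample = build_sample_sketch(
    notes, stub_score, label=0, output_label="autopsy_consent"
)
```

Keeping the label key configurable is what lets the same feature be paired with either outcome without changing the scoring path.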
NLTK initialisation is lazy — no import at module load time.
pyhealth/tasks/__init__.py
Exports MistrustSentimentMIMIC3.
Usage