This is a repository for the "Using sentiment analysis to quantify the relative desirability and acceptability of drug product attributes" publication.
This repository aims to aid reproduction of the methods described in the publication. Because the publication uses private and protected patient data, we have repeated the methods on an annotated subset of the DrugLib dataset, an open-source dataset of public drug reviews. This dataset is provided in the 'data' folder and is used to demonstrate several features of this repo. Approximately 1500 entries from the public dataset were labeled in total. Of these, 304 were labeled by both reviewers, with an inter-rater reliability (Kappa/IRR) score of 0.91.
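As a rough illustration of what the agreement score measures, the sketch below computes Cohen's kappa for two annotators. It assumes the reported score is Cohen's kappa (a common choice when there are two reviewers). The function name `cohen_kappa` and the example labels are hypothetical and are not taken from this repository or from the DrugLib annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items (illustrative sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each label, the product of the two annotators' label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels, for illustration only.
reviewer_1 = ["efficacy", "side_effect", "efficacy", "cost", "side_effect", "efficacy"]
reviewer_2 = ["efficacy", "side_effect", "efficacy", "cost", "efficacy", "efficacy"]
print(round(cohen_kappa(reviewer_1, reviewer_2), 3))  # 0.714
```

Kappa corrects raw agreement for the agreement expected by chance, so a value of 0.91 indicates that the two reviewers agreed far more often than chance alone would explain.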
Here, we reproduce the two main analyses featured in the publication:
- Frequency Analysis: Review Preprocessing + Training DistilBERT for classification of reviews
- Sentiment Analysis and Visualization
Frequency analysis allows you to view the prevalence of certain complaints or comments within your data. These review classifications can be used to visualize trends and concepts present in drug review data. To avoid manually classifying every review in a large dataset, we fine-tune a DistilBERT model that predicts leading terms/qualifiers. The scripts for frequency analysis and for tuning DistilBERT are primarily located in the 'scripts' folder. You can use 'omni_script.py' to train a model on your own dataset.
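Once each review has a label (assigned manually or predicted by a model), the prevalence computation itself is simple. The following is a minimal sketch, not the repository's implementation: the function name `label_prevalence` and the labels are hypothetical, and it does not use 'omni_script.py' or DistilBERT.

```python
from collections import Counter


def label_prevalence(labels):
    """Return each label's share of all reviews, most common first (illustrative sketch)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # most_common() orders labels by descending count; an empty input yields {}.
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical review labels, for illustration only.
review_labels = ["side_effect", "efficacy", "side_effect", "cost"]
for label, share in label_prevalence(review_labels).items():
    print(f"{label}: {share:.0%}")
```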
Sentiment analysis is a powerful complement to frequency analysis: it shows not only the prevalence of certain opinions, but also the extent to which these opinions matter to patients. The sentiment analysis tools let you determine the sentiment of your text with respect to any given set of words and then visualize the results. These tools are primarily located in the 'tools' folder. You can use 'sentimentSeeker' to calculate the sentiment on your own dataset. Refer to 'examples/tools_demo.ipynb' to see the tool in action.
Getting started:
conda env create -n sentiment_analysis -f environment.yml
conda activate sentiment_analysis
These commands create the environment and install the prerequisite packages. A few additional resources are not installed by this step:
If you are interested in training DistilBERT: You will also need to install NLTK. This is a one-time requirement, and an example is provided in the 'examples/demo.ipynb' notebook.
If you are interested in sentiment analysis: You will need to install spaCy's en_core_web_lg model. This is a one-time requirement, and an example is provided in the 'examples/tools_demo.ipynb' notebook.
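To confirm that these one-time installs are in place, a small check such as the one below may help. It is a sketch using only the standard library. The helper name `missing_prerequisites` is hypothetical, and the check assumes that the spaCy model is installed as an importable Python package named en_core_web_lg, which is how spaCy distributes its models. It only verifies that the nltk package is importable, not that any particular NLTK data resource has been downloaded.

```python
import importlib.util


def missing_prerequisites(modules=("nltk", "spacy", "en_core_web_lg")):
    """Return the names in `modules` that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_prerequisites()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All prerequisites found.")
```

If en_core_web_lg is reported as missing, spaCy's standard download command is `python -m spacy download en_core_web_lg`. The notebooks referenced above remain the authoritative examples for this repository.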