This repository contains code and analysis for identifying microRNA signatures associated with various neurodegenerative diseases using publicly available gene expression datasets.
Note: This is an independent reimplementation of the methods described in the referenced paper, created for educational purposes. All original research credit belongs to Li et al. [1].
The dataset used in this study comes from the Gene Expression Omnibus (GEO), accession number GSE120584.
The dataset contains 1,601 samples covering several neurodegenerative diseases and normal controls, as shown in the table below (see 01B_Data_Exploration.ipynb):
| Disease Case | Sample Size |
|---|---|
| Alzheimer’s disease (AD) | 1,021 |
| Vascular dementia (VaD) | 91 |
| Dementia with Lewy bodies (DLB) | 169 |
| Mild cognitive impairment (MCI) | 32 |
| Normal control (NC) | 288 |
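
The class sizes in the table are heavily skewed towards AD, which matters when reading accuracy figures later on. As a small illustrative check (plain Python, with the counts copied from the table), the snippet below computes each class share and the accuracy a classifier would get by always predicting the largest class:

```python
# Class counts copied from the table above.
class_counts = {
    "AD": 1021,
    "VaD": 91,
    "DLB": 169,
    "MCI": 32,
    "NC": 288,
}

total = sum(class_counts.values())
proportions = {label: n / total for label, n in class_counts.items()}

# A classifier that always predicts the largest class sets an accuracy floor.
majority_label = max(class_counts, key=class_counts.get)
majority_accuracy = class_counts[majority_label] / total

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
print(f"total={total}, majority baseline ({majority_label}) = {majority_accuracy:.4f}")
```

Such a constant classifier has an MCC of 0, which is one reason MCC is a more informative metric than accuracy on imbalanced data like this.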
Data Information:
- GEO Accession: GSE120584
- Technology: Agilent miRNA Microarray
- Samples: 1,601 human serum samples
- Disease Classes: AD, VaD, DLB, MCI, and Normal Control
After data preprocessing, the final expression matrix has shape (2547, 1601):
- miRNAs: 2547 features
- Samples: 1601 observations
- Missing Values: 0 (complete data)
Details on data preprocessing are in 01_DataExploration/README.md.
We use the BorutaPy package to filter the features. Boruta filtering, with RandomForest as the estimator and default values for all Boruta parameters, selected 30 of the 2547 features.
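
The core idea behind Boruta, comparing each real feature against shuffled "shadow" copies, can be sketched in a single round with scikit-learn. This is a simplified illustration on synthetic data, not the BorutaPy API and not the notebook's code: Boruta itself repeats this comparison over many iterations and applies a statistical test before confirming or rejecting a feature. The data, sizes, and variable names below are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the expression matrix: 200 samples x 10 features.
X = rng.normal(size=(200, 10))
# Only features 0 and 1 carry signal about the (hypothetical) binary label.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Shadow features: each column shuffled independently, which breaks any link to y.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(np.hstack([X, X_shadow]), y)

importances = forest.feature_importances_
real_imp = importances[: X.shape[1]]
shadow_imp = importances[X.shape[1]:]

# A real feature scores a "hit" in this round when it beats the best shadow feature.
hits = np.flatnonzero(real_imp > shadow_imp.max())
print("features beating every shadow:", hits.tolist())
```

A noise feature can occasionally beat the shadows in one round by chance, which is why the full algorithm accumulates hits over many rounds.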
To rank features by maximum relevance and minimum redundancy, we applied mRMR feature ranking, which uses the F-test to measure relevance and Pearson correlation to measure redundancy, and returns the features ordered by importance.
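
A minimal NumPy sketch of this kind of ranking follows. It assumes the quotient form of mRMR (F-statistic divided by the mean absolute Pearson correlation with the features already selected); the notebook may combine relevance and redundancy differently. The function names `f_statistic` and `mrmr_rank` and the synthetic data are hypothetical, chosen only for illustration.

```python
import numpy as np

def f_statistic(X, y):
    """One-way ANOVA F statistic of each column of X across the classes in y."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += Xc.shape[0] * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (between / (k - 1)) / (within / (n - k))

def mrmr_rank(X, y, n_select):
    """Greedy ranking: relevance divided by mean |correlation| with selected features."""
    relevance = f_statistic(X, y)
    abs_corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    remaining = [j for j in range(X.shape[1]) if j != selected[0]]
    while remaining and len(selected) < n_select:
        redundancy = abs_corr[np.ix_(remaining, selected)].mean(axis=1)
        scores = relevance[remaining] / np.maximum(redundancy, 1e-12)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: feature 1 is a near-duplicate of feature 0, feature 2 is
# independently informative, and feature 3 is pure noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
signal = y.astype(float)
x0 = signal + rng.normal(scale=1.0, size=y.size)
x1 = x0 + rng.normal(scale=0.05, size=y.size)
x2 = signal + rng.normal(scale=1.0, size=y.size)
x3 = rng.normal(size=y.size)
X = np.column_stack([x0, x1, x2, x3])

ranking = mrmr_rank(X, y, n_select=4)
print("mRMR order:", ranking)
```

Because the redundancy term penalises features that are highly correlated with those already chosen, the near-duplicate pair (features 0 and 1) should not occupy both of the first two positions.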
We then ranked the features with Monte Carlo Feature Selection (MCFS), a Monte Carlo ensemble approach.
When comparing mRMR and MCFS, the two methods showed little agreement, as shown in the scatterplot.
After analyzing the feature rankings, we conducted Incremental Feature Selection (IFS). We used Random Forest and Decision Tree classifiers and identified the best feature set using the Matthews correlation coefficient (MCC) as the evaluation metric. The paper [1] used PART from Weka; we used a Decision Tree as the closest interpretable model available in scikit-learn.
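
The IFS loop can be sketched as follows: for each k, train on the top-k ranked features and keep the k with the best MCC. This is an illustrative sketch on synthetic data, not the notebook's code. The helper name `incremental_feature_selection`, the 5-fold stratified cross-validation, the step size of one feature, and the model settings are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def incremental_feature_selection(X, y, ranked_features, make_model, n_splits=5):
    """Score the top-k ranked features for k = 1..len(ranked_features) by MCC."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    mcc_by_k = {}
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        # Out-of-fold predictions, so every sample is scored by a model that never saw it.
        predictions = cross_val_predict(make_model(), X[:, subset], y, cv=cv)
        mcc_by_k[k] = matthews_corrcoef(y, predictions)
    best_k = max(mcc_by_k, key=mcc_by_k.get)
    return best_k, mcc_by_k

# Synthetic stand-in data: features 0 and 1 are informative, the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
ranked = [0, 1, 2, 3, 4, 5]  # hypothetical ranking, e.g. from mRMR or MCFS

best_k, mcc_by_k = incremental_feature_selection(
    X, y, ranked, lambda: RandomForestClassifier(n_estimators=50, random_state=0)
)
print("best k:", best_k)
print({k: round(v, 3) for k, v in mcc_by_k.items()})
```

Passing a factory such as `lambda: DecisionTreeClassifier(random_state=0)` instead would run the same loop for the Decision Tree case.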
Table: Classification performance of Random Forest (RF) and Decision Tree (DT) models with mRMR and MCFS feature selection, with and without SMOTE
| Model | Accuracy | F1-score (weighted) | MCC |
|---|---|---|---|
| mRMR + RF + SMOTE | 0.6377 | 0.6122 | 0.2813 |
| mRMR + DT + SMOTE | 0.3092 | 0.3691 | 0.1325 |
| MCFS + RF + SMOTE | 0.6446 | 0.6220 | 0.2989 |
| MCFS + DT + SMOTE | 0.2798 | 0.3289 | 0.1266 |
| mRMR + RF + (NO SMOTE) | 0.6952 | 0.6121 | 0.3294 |
| mRMR + DT + (NO SMOTE) | 0.6421 | 0.5666 | 0.1891 |
| MCFS + RF + (NO SMOTE) | 0.6908 | 0.6042 | 0.3170 |
| MCFS + DT + (NO SMOTE) | 0.6452 | 0.5735 | 0.2006 |
The classification results show that:
- Random Forest models outperformed Decision Tree models on all three metrics in every matched configuration
- mRMR + RF without SMOTE achieves the highest MCC in this set of experiments
- MCFS + RF + SMOTE achieves the highest weighted F1-score among the SMOTE models
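
These observations can be checked directly against the table. The snippet below is a small illustrative check in plain Python, with the values copied from the table; the dictionary layout and variable names are only for this example.

```python
# Rows copied from the results table: (ranking, classifier, smote) -> metrics.
results = {
    ("mRMR", "RF", True): {"accuracy": 0.6377, "f1": 0.6122, "mcc": 0.2813},
    ("mRMR", "DT", True): {"accuracy": 0.3092, "f1": 0.3691, "mcc": 0.1325},
    ("MCFS", "RF", True): {"accuracy": 0.6446, "f1": 0.6220, "mcc": 0.2989},
    ("MCFS", "DT", True): {"accuracy": 0.2798, "f1": 0.3289, "mcc": 0.1266},
    ("mRMR", "RF", False): {"accuracy": 0.6952, "f1": 0.6121, "mcc": 0.3294},
    ("mRMR", "DT", False): {"accuracy": 0.6421, "f1": 0.5666, "mcc": 0.1891},
    ("MCFS", "RF", False): {"accuracy": 0.6908, "f1": 0.6042, "mcc": 0.3170},
    ("MCFS", "DT", False): {"accuracy": 0.6452, "f1": 0.5735, "mcc": 0.2006},
}

# Configuration with the highest MCC overall.
best_mcc = max(results, key=lambda key: results[key]["mcc"])

# Configuration with the highest weighted F1 among the SMOTE runs.
smote_rows = {key: row for key, row in results.items() if key[2]}
best_smote_f1 = max(smote_rows, key=lambda key: smote_rows[key]["f1"])

# RF versus DT on every metric, within each (ranking, SMOTE) pairing.
rf_beats_dt = all(
    results[(ranking, "RF", smote)][metric] > results[(ranking, "DT", smote)][metric]
    for ranking in ("mRMR", "MCFS")
    for smote in (True, False)
    for metric in ("accuracy", "f1", "mcc")
)

print(best_mcc, best_smote_f1, rf_beats_dt)
```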
Venn diagram showing the overlap between the top miRNA features selected by the mRMR and MCFS methods:
- 29 miRNA features are identified by both methods.
Feature importance ranking of the miRNA features selected using mRMR and MCFS methods as obtained from Random Forest and Decision Trees.
The repository consists of the following notebooks:
- 01A_Understanding_Raw_Data.ipynb: understanding the raw data
- 01B_Data_Exploration.ipynb: series matrix data exploration
- 02A_BorutaFeatureRanking.ipynb: feature filtering using BorutaPy
- 02B_FeatureRanking.ipynb: mRMR- and MCFS-based feature ranking
- 03A_IncrementalFeatureSelection.ipynb: incremental feature selection and classification
- 04A_ClassificationResults.ipynb: Classification using RF and Decision Trees
Requirements for this project are available in requirements.txt
Most of the experiments in this repository are based on the methods described in the following research paper. While these experiments may not completely follow the original methodology, we gratefully acknowledge the detailed and reproducible description provided in their work:
[1] Li, Z., Guo, W., Ding, S., Chen, L., Feng, K., Huang, T. and Cai, Y.D., 2022. Identifying key MicroRNA signatures for neurodegenerative diseases with machine learning methods. Frontiers in Genetics, 13, p.880997.
[3] Kursa, M. and Rudnicki, W., 2010. Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11).







