MicroRNA Signatures for Neurodegenerative Diseases

This repository contains code and analysis for identifying microRNA signatures associated with various neurodegenerative diseases using publicly available gene expression datasets.

Note: This is an independent reimplementation of the methods described in the referenced paper, created for educational purposes. All original research credit belongs to Li et al. [1].

Dataset

The dataset used in this study is sourced from the Gene Expression Omnibus (GEO) and can be downloaded through the GEO Accession viewer [2] (GEO accession number: GSE120584).

The dataset contains 1,601 samples covering four disease classes and normal controls, as shown in the table below (see 01B_Data_Exploration.ipynb):

Disease Case	Sample Size
Alzheimer’s disease (AD)	1,021
Vascular dementia (VaD)	91
Dementia with Lewy bodies (DLB)	169
Mild cognitive impairment (MCI)	32
Normal control (NC)	288

Exploratory Data Analysis

Data Information:

GEO Accession: GSE120584
Technology: Agilent miRNA Microarray
Samples: 1,601 human serum samples
Disease Classes: AD, VaD, DLB, MCI, and Normal Control

After data preprocessing, we obtained a final expression matrix of shape (2547, 1601):

miRNAs: 2547 features
Samples: 1601 observations
Missing Values: 0 (complete data)
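
As a minimal sketch of how these figures can be checked, the snippet below builds a placeholder matrix with the same shape and verifies its dimensions and missing-value count. The random values and all variable names are assumptions for illustration only; the real matrix comes from the preprocessing notebooks. The matrix is stored as miRNAs × samples, so it is transposed before being passed to estimators that expect samples × features.

```python
import numpy as np

N_MIRNAS, N_SAMPLES = 2547, 1601

# Placeholder values standing in for the real expression matrix (assumption).
rng = np.random.default_rng(0)
expression = rng.normal(size=(N_MIRNAS, N_SAMPLES))

# Count missing entries; the README reports 0 for the real data.
n_missing = int(np.isnan(expression).sum())

# scikit-learn style estimators expect rows = samples, columns = features.
X = expression.T

print("matrix shape (miRNAs, samples):", expression.shape)
print("missing values:", n_missing)
print("model input shape (samples, miRNAs):", X.shape)
```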

Details on data preprocessing are available in 01_DataExploration/README.md.

Feature Ranking

We used the BorutaPy package [3] to filter the features. Boruta filtering with a Random Forest estimator selected 30 of the 2,547 features. All Boruta parameters were left at their default values.
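
The idea behind Boruta can be illustrated with a simplified sketch: each real feature competes against shuffled "shadow" copies, and only features whose Random Forest importance beats the best shadow are kept. This is not the BorutaPy implementation, which runs an iterative statistical test; the function name `shadow_feature_filter`, the `n_rounds` parameter, the majority-vote rule, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_filter(X, y, n_rounds=5, seed=0):
    """Keep features that beat the best shuffled (shadow) feature in most rounds.

    Simplified illustration of the shadow-feature idea, not BorutaPy itself.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for r in range(n_rounds):
        # Shadow features: each column shuffled independently, breaking any
        # relationship with the labels while keeping the marginal distribution.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        forest = RandomForestClassifier(n_estimators=100, random_state=seed + r)
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        # A real feature scores a hit when it beats the strongest shadow.
        hits += importances[:n_features] > importances[n_features:].max()
    return np.flatnonzero(hits > n_rounds / 2)


# Synthetic demo: only features 0 and 1 carry signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 8))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
kept = shadow_feature_filter(X_demo, y_demo)
print("kept feature indices:", kept)
```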

Minimum Redundancy Maximum Relevance (mRMR)

To rank features by maximum relevance and minimum redundancy, we applied mRMR feature ranking, which uses the F-test to measure relevance and Pearson correlation to measure redundancy, and returns the features ordered by importance.
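
A minimal NumPy sketch of this ranking is shown below. It scores each candidate by its ANOVA F-statistic divided by its mean absolute Pearson correlation with the features already selected. The quotient form, the redundancy floor of `1e-3`, and the function names are assumptions; the README does not state how relevance and redundancy are combined in the notebooks.

```python
import numpy as np


def anova_f(X, y):
    """One-way ANOVA F-statistic of every column of X against class labels y."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += Xc.shape[0] * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def mrmr_rank(X, y, n_select):
    """Greedy mRMR: relevance (F-test) divided by redundancy (mean |Pearson r|)."""
    relevance = anova_f(X, y)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    remaining = set(range(X.shape[1])) - set(selected)
    while remaining and len(selected) < n_select:
        cand = np.array(sorted(remaining))
        redundancy = corr[np.ix_(cand, selected)].mean(axis=1)
        score = relevance[cand] / np.maximum(redundancy, 1e-3)
        best = int(cand[np.argmax(score)])
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic demo: feature 1 is a near-copy of feature 0, feature 2 is an
# independent measurement of the same class signal.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 150)
f0 = 2 * y_demo + rng.normal(size=300)
f1 = f0 + 0.05 * rng.normal(size=300)
f2 = 2 * y_demo + rng.normal(size=300)
X_demo = np.column_stack([f0, f1, f2])
ranking = mrmr_rank(X_demo, y_demo, n_select=3)
print("mRMR order:", ranking)
```

In the demo, the near-duplicate feature is pushed down the ranking because it adds little beyond the feature it copies, which is the behaviour the redundancy term is meant to produce.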

Monte Carlo Feature Selection (MCFS) Feature Ranking

We then ranked the features with Monte Carlo Feature Selection (MCFS), a Monte Carlo ensemble approach.
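
The sketch below shows the general shape of such a Monte Carlo ranking: many decision trees are trained on random feature subsets and random train/test splits, and each feature accumulates its tree importance weighted by that tree's test accuracy. This is a simplified stand-in, not the published MCFS relative-importance formula and not necessarily what the notebooks do; `mcfs_scores`, `n_subsets`, and `subset_size` are hypothetical names chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def mcfs_scores(X, y, n_subsets=200, subset_size=3, seed=0):
    """Accumulate accuracy-weighted tree importances over random feature subsets."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(X.shape[1])
    for i in range(n_subsets):
        # Draw a random subset of features and a random stratified split.
        cols = rng.choice(X.shape[1], size=subset_size, replace=False)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[:, cols], y, test_size=0.33, stratify=y, random_state=seed + i
        )
        tree = DecisionTreeClassifier(random_state=seed + i).fit(X_tr, y_tr)
        # Trees that generalise better contribute more to the score.
        weight = tree.score(X_te, y_te)
        scores[cols] += weight * tree.feature_importances_
    return scores


# Synthetic demo: only feature 0 determines the class.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] > 0).astype(int)
scores = mcfs_scores(X_demo, y_demo)
mcfs_order = np.argsort(scores)[::-1]
print("MCFS-style order:", mcfs_order)
```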

When comparing mRMR and MCFS, we found little agreement between the two rankings, as shown in the scatterplot.

Incremental Feature Selection and Classification

After analyzing the feature rankings, we conducted Incremental Feature Selection (IFS). We used Random Forest and Decision Tree classifiers and identified the best feature set using the Matthews correlation coefficient (MCC) as the evaluation metric. Although the paper [1] used PART from Weka, we chose the Decision Tree as the closest interpretable model available in scikit-learn.
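
A minimal sketch of the IFS loop, assuming a ranked list of feature indices is already available: the classifier is evaluated on the top 1, top 2, ..., top k features, and the k with the highest cross-validated MCC is kept. The 5-fold stratified split, the forest size, and the function name are assumptions, and SMOTE is left out to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def incremental_feature_selection(X, y, ranking, max_k=None, seed=0):
    """Return the best top-k size and the MCC obtained for every k."""
    max_k = max_k or len(ranking)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    curve = []
    for k in range(1, max_k + 1):
        cols = list(ranking[:k])  # top-k features in ranked order
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        pred = cross_val_predict(model, X[:, cols], y, cv=cv)
        curve.append(matthews_corrcoef(y, pred))
    best_k = int(np.argmax(curve)) + 1
    return best_k, curve


# Synthetic demo: the class depends on features 0 and 1 together.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
best_k, mcc_curve = incremental_feature_selection(X_demo, y_demo, ranking=[0, 1, 2, 3, 4])
print("best k:", best_k)
print("MCC per k:", np.round(mcc_curve, 3))
```

Swapping `RandomForestClassifier` for `DecisionTreeClassifier` gives the Decision Tree variant of the same loop.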

Incremental Feature Selection using Random Forest

Incremental Feature Selection using Decision Tree

Table: Classification performance of Random Forest (RF) and Decision Tree (DT) models with mRMR and MCFS feature selection methods, with and without SMOTE

Model	Accuracy	F1-score (weighted)	MCC
mRMR + RF + SMOTE	0.6377	0.6122	0.2813
mRMR + DT + SMOTE	0.3092	0.3691	0.1325
MCFS + RF + SMOTE	0.6446	0.6220	0.2989
MCFS + DT + SMOTE	0.2798	0.3289	0.1266
mRMR + RF + (NO SMOTE)	0.6952	0.6121	0.3294
mRMR + DT + (NO SMOTE)	0.6421	0.5666	0.1891
MCFS + RF + (NO SMOTE)	0.6908	0.6042	0.3170
MCFS + DT + (NO SMOTE)	0.6452	0.5735	0.2006

From the classification results, we can see that:

Random Forest models outperformed Decision Tree models on all three metrics in every configuration.
mRMR + RF without SMOTE achieves the highest MCC (0.3294) in this set of experiments.
MCFS + RF + SMOTE achieves the highest weighted F1-score (0.6220), both among the SMOTE models and overall.
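
To make the MCC values easier to interpret, the sketch below computes the multiclass MCC from a confusion matrix and applies it to a majority-class baseline built from the class sizes in the Dataset table. A classifier that labels every sample as AD reaches an accuracy of 1021/1601 ≈ 0.638 on the full dataset while its MCC is 0, which is why MCC is the more informative metric for this imbalanced data. The README does not state whether the reported metrics were computed on the full dataset or on held-out folds, so this baseline is only a reference point; the function name is an assumption.

```python
import numpy as np


def multiclass_mcc(confusion):
    """Multiclass Matthews correlation coefficient from a confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    c = np.asarray(confusion, dtype=float)
    true_totals = c.sum(axis=1)
    pred_totals = c.sum(axis=0)
    n = c.sum()
    numerator = np.trace(c) * n - true_totals @ pred_totals
    denominator = np.sqrt(
        (n**2 - pred_totals @ pred_totals) * (n**2 - true_totals @ true_totals)
    )
    # By convention the score is 0 when only one class is ever predicted.
    return 0.0 if denominator == 0 else float(numerator / denominator)


# Class sizes taken from the Dataset table.
class_sizes = {"AD": 1021, "VaD": 91, "DLB": 169, "MCI": 32, "NC": 288}

# Majority-class baseline: every sample is predicted as AD (column 0).
baseline = np.zeros((5, 5))
baseline[:, 0] = list(class_sizes.values())

baseline_accuracy = float(np.trace(baseline) / baseline.sum())
baseline_mcc = multiclass_mcc(baseline)
print("baseline accuracy:", round(baseline_accuracy, 4))
print("baseline MCC:", baseline_mcc)
```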

Classwise Performance for mRMR features using RandomForest

Classwise Performance for MCFS features using RandomForest

Comparison of Precision, Recall and F1-score of different models

Overlap in features selected by mRMR method and MCFS methods

Venn diagram to show overlap between the top miRNA features selected by mRMR method and MCFS method

29 miRNA features are identified by both methods.
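
The overlap behind a Venn diagram like this can be computed with plain set operations. In the sketch below, the miRNA names are hypothetical placeholders and `ranking_overlap` is an illustrative helper, not a function from the repository.

```python
def ranking_overlap(ranking_a, ranking_b, top_k):
    """Split the top-k features of two rankings into shared and method-specific sets."""
    top_a, top_b = set(ranking_a[:top_k]), set(ranking_b[:top_k])
    return top_a & top_b, top_a - top_b, top_b - top_a


# Hypothetical placeholder names, not real results.
mrmr_top = ["miR-A", "miR-B", "miR-C", "miR-D"]
mcfs_top = ["miR-C", "miR-A", "miR-E", "miR-B"]

shared, only_mrmr, only_mcfs = ranking_overlap(mrmr_top, mcfs_top, top_k=4)
print("shared:", sorted(shared))
print("mRMR only:", sorted(only_mrmr))
print("MCFS only:", sorted(only_mcfs))
```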

Feature Importance

Feature importance ranking of the miRNA features selected using mRMR and MCFS methods as obtained from Random Forest and Decision Trees.

File information:

The repository consists of the following notebooks:

01A_Understanding_Raw_Data.ipynb: understanding the raw data
01B_Data_Exploration.ipynb: series matrix data exploration
02A_BorutaFeatureRanking.ipynb: feature filtering using BorutaPy
02B_FeatureRanking.ipynb: mRMR and MCFS based feature ranking
03A_IncrementalFeatureSelection.ipynb: incremental feature selection and classification
04A_ClassificationResults.ipynb: classification using Random Forest and Decision Trees

Install Requirements

The requirements for this project are listed in requirements.txt.

Acknowledgements

Most of the experiments in this repository are based on the methods described in the following research paper. While these experiments may not completely follow the original methodology, we gratefully acknowledge the detailed and reproducible description provided in their work:

Li, Z., Guo, W., Ding, S., Chen, L., Feng, K., Huang, T. and Cai, Y.D., 2022. Identifying key MicroRNA signatures for neurodegenerative diseases with machine learning methods. Frontiers in Genetics, 13, p.880997.

References

[1] Li, Z., Guo, W., Ding, S., Chen, L., Feng, K., Huang, T. and Cai, Y.D., 2022. Identifying key MicroRNA signatures for neurodegenerative diseases with machine learning methods. Frontiers in Genetics, 13, p.880997.

[2] GEO Accession viewer

[3] Kursa M., Rudnicki W., "Feature Selection with the Boruta Package" Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010

[4] Risk prediction models for dementia constructed by supervised principal component analysis using miRNA expression data

[5] A Guide to NCBI: Gene Expression

