MicroRNA Signatures for Neurodegenerative Diseases


This repository contains code and analysis for identifying microRNA signatures associated with various neurodegenerative diseases using publicly available gene expression datasets.

Note: This is an independent reimplementation of the methods described in the referenced paper, created for educational purposes. All original research credit belongs to Li et al. [1].

Dataset

The dataset used in this study is sourced from the Gene Expression Omnibus (GEO) and was accessed through the GEO Accession viewer (GEO accession number: GSE120584).

The dataset contains 1,601 samples across four disease classes and a normal control group, as shown in the table below (see 01B_Data_Exploration.ipynb):

| Disease | Sample Size |
|---|---|
| Alzheimer’s disease (AD) | 1,021 |
| Vascular dementia (VaD) | 91 |
| Dementia with Lewy bodies (DLB) | 169 |
| Mild cognitive impairment (MCI) | 32 |
| Normal control (NC) | 288 |

Exploratory Data Analysis

Data Information:

  • GEO Accession: GSE120584
  • Technology: Agilent miRNA Microarray
  • Samples: 1,601 human serum samples
  • Disease Classes: AD, VaD, DLB, MCI, and Normal Control

After data preprocessing, we obtained a final expression matrix of shape (2547, 1601):

  • miRNAs: 2547 features
  • Samples: 1601 observations
  • Missing Values: 0 (complete data)

Details on data preprocessing are in 01_DataExploration/README.md.
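As a rough illustration of how an expression matrix can be read from a GEO series matrix file, the sketch below parses the tab-separated table between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers and reports its shape and the number of missing values. The inline sample, the probe names and the sample identifiers are made up for illustration, and `parse_series_matrix` is a hypothetical helper; the actual preprocessing is in the notebooks and may differ.

```python
import io

# Toy stand-in for a series matrix file (all names and values are invented).
SAMPLE = "\n".join([
    '!Series_title\t"toy example"',
    '!series_matrix_table_begin',
    '"ID_REF"\t"GSM_A"\t"GSM_B"\t"GSM_C"',
    '"miR_x"\t1.5\t2.0\t1.8',
    '"miR_y"\t0.3\t0.7\t0.5',
    '!series_matrix_table_end',
])

def parse_series_matrix(handle):
    """Return (sample_ids, {feature: [values]}) from a series matrix stream."""
    in_table = False
    sample_ids, rows = None, {}
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            in_table = False
        elif in_table and line:
            cells = [c.strip('"') for c in line.split("\t")]
            if sample_ids is None:
                sample_ids = cells[1:]          # header row: ID_REF + sample ids
            else:
                # Empty cells are kept as None so they can be counted as missing.
                rows[cells[0]] = [float(c) if c else None for c in cells[1:]]
    return sample_ids, rows

sample_ids, rows = parse_series_matrix(io.StringIO(SAMPLE))
shape = (len(rows), len(sample_ids))            # (features, samples)
n_missing = sum(v is None for values in rows.values() for v in values)
print(shape, n_missing)
```

The same two checks (shape and missing-value count) correspond to the summary bullets above.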

Feature Ranking

We used the BorutaPy package to filter the features. With a Random Forest as the estimator, Boruta selected 30 of the 2,547 features. All Boruta parameters were left at their default values.
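The core idea behind Boruta is to compare each real feature against "shadow" features, which are shuffled copies that keep the value distribution but lose any link to the labels. The sketch below illustrates only that comparison. It is not BorutaPy: the absolute Pearson correlation with the label is swapped in for Random Forest importance so that the example runs with the standard library alone, and the feature names, values and the `shadow_hits` helper are hypothetical. Boruta itself feeds such hit counts into a statistical test before accepting or rejecting a feature.

```python
import random
from statistics import mean

def abs_corr(values, labels):
    """|Pearson r| between a feature and numeric labels (stand-in importance)."""
    mv, ml = mean(values), mean(labels)
    cov = sum((v - mv) * (l - ml) for v, l in zip(values, labels))
    var_v = sum((v - mv) ** 2 for v in values)
    var_l = sum((l - ml) ** 2 for l in labels)
    return abs(cov) / (var_v * var_l) ** 0.5 if var_v and var_l else 0.0

def shadow_hits(features, labels, n_trials=20, seed=0):
    """Count, per feature, the trials in which it beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_trials):
        shadow_scores = []
        for values in features.values():
            shuffled = values[:]
            rng.shuffle(shuffled)               # break the link to the labels
            shadow_scores.append(abs_corr(shuffled, labels))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, labels) > threshold:
                hits[name] += 1
    return hits

# Toy data: two classes of ten samples each (invented values).
labels = [0] * 10 + [1] * 10
features = {
    "miR_informative": [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.0,
                        3.1, 2.9, 3.0, 3.2, 2.8, 3.0, 2.9, 3.1, 3.0, 3.0],
    "miR_constant": [2.0] * 20,
}
hits = shadow_hits(features, labels)
print(hits)
```

A constant feature can never beat a shadow here because its importance is zero and the comparison is strict.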

Minimum Redundancy Maximum Relevance (mRMR)

To rank features by maximum relevance and minimum redundancy, we applied mRMR feature ranking. It uses the F-test to measure relevance and Pearson correlation to measure redundancy, and returns the features ordered by importance.
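The greedy procedure can be sketched in plain Python as follows. This is a minimal illustration, assuming the quotient form of the criterion (F-statistic divided by the mean absolute Pearson correlation with the features already selected); the notebooks may use a different variant or implementation. The feature names and values are toy data, and `mrmr_rank` is a hypothetical helper.

```python
from statistics import mean

def f_statistic(values, labels):
    """One-way ANOVA F statistic of one feature across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand, k, n = mean(values), len(groups), len(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / within if within > 0 else float("inf")

def pearson(a, b):
    """Pearson correlation between two equally long value lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0

def mrmr_rank(features, labels):
    """Greedy mRMR ordering of a {name: values} mapping (quotient scheme)."""
    relevance = {name: f_statistic(vals, labels) for name, vals in features.items()}
    remaining, ranked = set(features), []
    while remaining:
        def score(name):
            if not ranked:                      # first pick: relevance only
                return relevance[name]
            redundancy = mean(abs(pearson(features[name], features[s])) for s in ranked)
            return relevance[name] / max(redundancy, 1e-3)
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy data: three classes, three samples each (invented values).
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
features = {
    "miR_a":      [1.0, 1.2, 0.9, 3.0, 3.1, 2.9, 3.0, 3.2, 2.8],
    "miR_a_copy": [1.3, 0.8, 1.0, 2.7, 3.3, 3.0, 3.2, 2.8, 3.0],  # noisy copy of miR_a
    "miR_b":      [5.0, 5.5, 4.5, 5.3, 4.7, 5.0, 8.0, 8.5, 7.5],
}
ranking = mrmr_rank(features, labels)
print(ranking)
```

In this toy data the noisy copy has a higher F-statistic than `miR_b`, but it is ranked after it because it is almost perfectly correlated with the feature already selected.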

Monte Carlo Feature Selection (MCFS)

We then ranked the features using Monte Carlo Feature Selection (MCFS), a Monte Carlo ensemble approach.

When comparing mRMR with MCFS, there was little agreement between the two methods, as shown in the scatterplot.

[Figure: scatterplot comparing mRMR and MCFS feature ranks]
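One way to put a number on the agreement between two rankings is the Spearman rank correlation. The sketch below is an illustration only: the two rankings are invented, `spearman_rho` is a hypothetical helper, and the formula assumes both lists contain the same features with no ties.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation of two orderings of the same features (no ties)."""
    pos_a = {feature: i for i, feature in enumerate(rank_a)}
    pos_b = {feature: i for i, feature in enumerate(rank_b)}
    n = len(rank_a)
    d2 = sum((pos_a[f] - pos_b[f]) ** 2 for f in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented rankings, best feature first.
mrmr_order = ["miR_1", "miR_2", "miR_3", "miR_4", "miR_5"]
mcfs_order = ["miR_5", "miR_4", "miR_3", "miR_2", "miR_1"]

print(spearman_rho(mrmr_order, mrmr_order))   # identical rankings
print(spearman_rho(mrmr_order, mcfs_order))   # fully reversed rankings
```

Values near 1 indicate that the two methods order the features similarly, values near 0 indicate little agreement, and values near -1 indicate opposite orderings.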

Incremental Feature Selection and Classification

After analysing the feature rankings, we conducted Incremental Feature Selection (IFS). We used Random Forest and Decision Tree classifiers and identified the best feature subset using MCC as the evaluation metric. The paper [1] used PART from Weka; we substituted a Decision Tree as the closest interpretable model available in scikit-learn.
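The IFS loop itself is simple: evaluate the top 1, top 2, ..., top k ranked features and keep the subset with the best MCC. The sketch below shows that loop under stated assumptions. A leave-one-out 1-nearest-neighbour classifier is swapped in for Random Forest and Decision Tree so the example needs only the standard library, the data and feature names are invented, and the function names are hypothetical.

```python
from collections import Counter

def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient computed from label lists."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    numerator = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = n * n - sum(c * c for c in true_counts.values())
    cov_pp = n * n - sum(c * c for c in pred_counts.values())
    denom = (cov_tt * cov_pp) ** 0.5
    return numerator / denom if denom else 0.0

def loo_1nn_predict(rows, labels):
    """Leave-one-out 1-nearest-neighbour predictions (stand-in classifier)."""
    preds = []
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])),
        )
        preds.append(labels[nearest])
    return preds

def incremental_feature_selection(X, y, ranked_features):
    """Score the top-k ranked features for every k and return the best k by MCC."""
    curve = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        rows = list(zip(*(X[f] for f in subset)))   # one tuple per sample
        curve.append((k, mcc(y, loo_1nn_predict(rows, y))))
    best_k, best_mcc = max(curve, key=lambda item: item[1])
    return best_k, best_mcc, curve

# Toy data: the third feature is noise on a large scale and hurts the classifier.
y = ["AD", "AD", "AD", "NC", "NC", "NC"]
X = {
    "miR_1": [0.0, 0.1, 1.0, 1.1, 2.0, 2.1],
    "miR_2": [0.0, 0.0, 0.0, 5.0, 5.0, 5.0],
    "miR_3": [0.0, 10.0, 20.0, 0.0, 10.0, 20.0],
}
best_k, best_mcc, curve = incremental_feature_selection(X, y, ["miR_1", "miR_2", "miR_3"])
print(best_k, best_mcc, curve)
```

Ties are resolved in favour of the smaller subset because `max` returns the first maximum it encounters.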

Incremental Feature Selection using Random Forest

[Figure: Incremental Feature Selection results using Random Forest]

Incremental Feature Selection using Decision Tree

[Figure: Incremental Feature Selection results using Decision Tree]

Table: Classification performance of Random Forest (RF) and Decision Tree (DT) models with mRMR and MCFS feature selection, with and without SMOTE

| Model | Accuracy | F1-score (weighted) | MCC |
|---|---|---|---|
| mRMR + RF + SMOTE | 0.6377 | 0.6122 | 0.2813 |
| mRMR + DT + SMOTE | 0.3092 | 0.3691 | 0.1325 |
| MCFS + RF + SMOTE | 0.6446 | 0.6220 | 0.2989 |
| MCFS + DT + SMOTE | 0.2798 | 0.3289 | 0.1266 |
| mRMR + RF (no SMOTE) | 0.6952 | 0.6121 | 0.3294 |
| mRMR + DT (no SMOTE) | 0.6421 | 0.5666 | 0.1891 |
| MCFS + RF (no SMOTE) | 0.6908 | 0.6042 | 0.3170 |
| MCFS + DT (no SMOTE) | 0.6452 | 0.5735 | 0.2006 |

From classification results, we can see that:

  • Random Forest models performed better than Decision Tree models
  • mRMR + RF without SMOTE achieves the highest MCC in this set of experiments
  • MCFS + RF + SMOTE achieves the highest weighted F1-score among the SMOTE models
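When reading accuracy and weighted F1 on imbalanced data, a majority-class baseline is a useful reference point. The sketch below computes it from the class sizes in the dataset table, assuming the evaluation data have the same class proportions as the full dataset (an assumption; this README does not state the evaluation split). The always-predict-AD classifier is hypothetical and is not one of the models above.

```python
from collections import Counter

# Class sizes copied from the dataset table in this README.
class_sizes = {"AD": 1021, "VaD": 91, "DLB": 169, "MCI": 32, "NC": 288}

y_true = [label for label, n in class_sizes.items() for _ in range(n)]
y_pred = ["AD"] * len(y_true)   # hypothetical baseline: always predict the majority class

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n in support.items():
        denom = n + predicted[label]            # equals 2*TP + FP + FN
        f1 = 2 * hits[label] / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_f1 = weighted_f1(y_true, y_pred)
print(round(baseline_accuracy, 4), round(baseline_f1, 4))
```

Under that assumption the baseline reaches an accuracy of about 0.64 and a weighted F1 of about 0.50, while its MCC is 0 because a constant prediction carries no information about the labels. This is one reason MCC is a reasonable selection metric here.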

Classwise Performance for mRMR features using RandomForest

[Figure: class-wise performance for mRMR features using Random Forest]

Classwise Performance for MCFS features using RandomForest

[Figure: class-wise performance for MCFS features using Random Forest]

Comparison of Precision, Recall and F1-score of different models

[Figure: precision, recall and F1-score of the different models]

Overlap in features selected by mRMR method and MCFS methods

The Venn diagram shows the overlap between the top miRNA features selected by the mRMR and MCFS methods.

  • 29 miRNA features are identified by both methods.

[Figure: Venn diagram of the top miRNA features selected by mRMR and MCFS]
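The overlap behind such a Venn diagram reduces to a set intersection of the two top-k lists. A minimal sketch, using invented rankings and a hypothetical `top_k_overlap` helper, that also reports the Jaccard index as a size-independent summary:

```python
def top_k_overlap(rank_a, rank_b, k):
    """Return the shared top-k features and their Jaccard index."""
    top_a, top_b = set(rank_a[:k]), set(rank_b[:k])
    shared = top_a & top_b
    return sorted(shared), len(shared) / len(top_a | top_b)

# Invented rankings, best feature first.
mrmr_order = ["miR_1", "miR_2", "miR_3", "miR_4"]
mcfs_order = ["miR_3", "miR_5", "miR_1", "miR_2"]

shared, jaccard = top_k_overlap(mrmr_order, mcfs_order, k=3)
print(shared, jaccard)
```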

Feature Importance

Feature importance ranking of the miRNA features selected using mRMR and MCFS methods as obtained from Random Forest and Decision Trees.

[Figure: feature importance ranking from Random Forest and Decision Tree]

File information:

The repository consists of the following notebooks:

  1. 01A_Understanding_Raw_Data.ipynb: understanding the raw data
  2. 01B_Data_Exploration.ipynb: series matrix data exploration
  3. 02A_BorutaFeatureRanking.ipynb: feature filtering using BorutaPy
  4. 02B_FeatureRanking.ipynb: mRMR and MCFS based feature ranking
  5. 03A_IncrementalFeatureSelection.ipynb: incremental feature selection and classification
  6. 04A_ClassificationResults.ipynb: classification using Random Forest and Decision Trees

Install Requirements

Requirements for this project are listed in requirements.txt and can be installed with pip install -r requirements.txt.

Acknowledgements

Most of the experiments in this repository are based on the methods described by Li et al. [1]. Although these experiments may not follow the original methodology exactly, we gratefully acknowledge the detailed and reproducible description provided in their work.

References

[1] Li, Z., Guo, W., Ding, S., Chen, L., Feng, K., Huang, T. and Cai, Y.D., 2022. Identifying key MicroRNA signatures for neurodegenerative diseases with machine learning methods. Frontiers in Genetics, 13, p.880997.

[2] GEO Accession viewer

[3] Kursa M., Rudnicki W., "Feature Selection with the Boruta Package" Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010

[4] Risk prediction models for dementia constructed by supervised principal component analysis using miRNA expression data

[5] A Guide to NCBI: Gene Expression
