Skip to content
This repository was archived by the owner on Sep 15, 2019. It is now read-only.

Latest commit

 

History

History
56 lines (42 loc) · 2.69 KB

File metadata and controls

56 lines (42 loc) · 2.69 KB

Feature Selection

Run feature selection module

Assume that we have already run normalization module and selected best matrix processing method based on the UCA score, we can run feature selection module using the following command:

exseek.py feature_selection -d ${dataset}

Output files

Outuput directory

Feature selection results using one combination of parameters are saved in a separate directory:

${output_dir}/cross_validation/${preprocess_method}.${count_method}/${compare_group}/${classifier}.${n_select}.${selector}.${fold_change_filter_direction}

Variables in file patterns

Variable Descrpition
output_dir Output directory for the dataset, e.g. output/dataset
preprocess_method Combination of matrix processing methods
count_method Type of feature counts, e.g. domains_combined, domains_long, transcript, featurecounts
compare_group Name of the negative-positive class pair defined in compare_groups.yaml
classifier Classifier defined in the configuration file
n_select Maximum number of features to select
selector Feature selection method, e.g. robust, rfe
fold_change_filter_direction Direction of fold change for filtering features. Three possible values: up, down and any

Files in output directory

File name pattern Descrpition
features.txt Selected features. Plain text with one column: feature names
feature_importances.txt Plain text with two columns: feature name, feature importance
samples.txt Sample IDs in input matrix selected for feature selection
classes.txt Sample class labels selected for feature selection
final_model.pkl Final model fitted on all samples in Python pickle format
metrics.train.txt Evaluation metrics on training data. First row is metric names. First column is index of each train-test split
metrics.test.txt Same format with metrics.train.txt on test data.
cross_validation.h5 Cross-validation details in HDF5 format.

Cross validation details (cross_validation.h5)

Dataset name Dimension Description
feature_selection (n_splits, n_features) Binary matrix indicating features selected in each cross-validation split
labels (n_samples,) True class labels
predicted_labels (n_splits, n_samples) Predicted class labels on all samples
predictions (n_splits, n_samples) Predicted probabilities of the positive class (or decision function for SVM)
train_index (n_splits, n_samples) Binary matrix indicating training samples in each cross-validation split