Assuming that we have already run the normalization module and selected the best matrix processing method based on the UCA score, we can run the feature selection module with the following command:

exseek.py feature_selection -d ${dataset}

Feature selection results for each combination of parameters are saved in a separate directory:
${output_dir}/cross_validation/${preprocess_method}.${count_method}/${compare_group}/${classifier}.${n_select}.${selector}.${fold_change_filter_direction}
Variables in file patterns
| Variable | Description |
|---|---|
| output_dir | Output directory for the dataset, e.g. output/dataset |
| preprocess_method | Combination of matrix processing methods |
| count_method | Type of feature counts, e.g. domains_combined, domains_long, transcript, featurecounts |
| compare_group | Name of the negative-positive class pair defined in compare_groups.yaml |
| classifier | Classifier defined in the configuration file |
| n_select | Maximum number of features to select |
| selector | Feature selection method, e.g. robust, rfe |
| fold_change_filter_direction | Direction of fold change for filtering features. Three possible values: up, down and any |
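
The directory pattern and the variables above can be assembled programmatically. The following is a minimal sketch, not part of exSeek itself: the helper name `result_dir` is hypothetical, and every value in the example call (group name, classifier name, preprocessing name) is a placeholder chosen for illustration.

```python
from pathlib import Path

def result_dir(output_dir, preprocess_method, count_method, compare_group,
               classifier, n_select, selector, fold_change_filter_direction):
    """Build the result directory for one combination of parameters.

    Hypothetical helper that mirrors the documented pattern:
    ${output_dir}/cross_validation/${preprocess_method}.${count_method}/
    ${compare_group}/${classifier}.${n_select}.${selector}.${fold_change_filter_direction}
    """
    # The documentation lists exactly three allowed directions.
    if fold_change_filter_direction not in ("up", "down", "any"):
        raise ValueError("fold_change_filter_direction must be up, down or any")
    leaf = f"{classifier}.{n_select}.{selector}.{fold_change_filter_direction}"
    return (Path(output_dir) / "cross_validation"
            / f"{preprocess_method}.{count_method}" / compare_group / leaf)

# All argument values below are illustrative placeholders.
example = result_dir(
    output_dir="output/dataset",
    preprocess_method="some_preprocess_method",
    count_method="domains_long",
    compare_group="Normal-Cancer",
    classifier="some_classifier",
    n_select=10,
    selector="robust",
    fold_change_filter_direction="any",
)
print(example.as_posix())
```
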
| File name pattern | Description |
|---|---|
| features.txt | Selected features. Plain text with one column: feature names |
| feature_importances.txt | Plain text with two columns: feature name, feature importance |
| samples.txt | Sample IDs in input matrix selected for feature selection |
| classes.txt | Sample class labels selected for feature selection |
| final_model.pkl | Final model fitted on all samples in Python pickle format |
| metrics.train.txt | Evaluation metrics on training data. The first row contains metric names; the first column is the index of each train-test split |
| metrics.test.txt | Same format as metrics.train.txt, computed on test data |
| cross_validation.h5 | Cross-validation details in HDF5 format |
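
The plain-text outputs listed above can be read with the Python standard library. The sketch below rests on assumptions the documentation does not state: that columns are tab-delimited and that the metrics header row has one cell above the split-index column. The function names, the metric names `roc_auc` and `accuracy`, and the demo file contents are all made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

def read_features(result_dir):
    """Return selected feature names from features.txt (one name per line)."""
    with open(Path(result_dir) / "features.txt") as f:
        return [line.strip() for line in f if line.strip()]

def read_feature_importances(result_dir):
    """Return {feature name: importance} from feature_importances.txt.

    Assumes tab-delimited columns. Rows whose second column is not numeric
    (for example a possible header row) are skipped.
    """
    importances = {}
    with open(Path(result_dir) / "feature_importances.txt", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2:
                continue
            try:
                importances[row[0]] = float(row[1])
            except ValueError:
                continue
    return importances

def read_metrics(path):
    """Return {split index: {metric name: value}} from a metrics file.

    The first row holds metric names, the first column the split index.
    """
    with open(path, newline="") as f:
        rows = [r for r in csv.reader(f, delimiter="\t") if r]
    metric_names = rows[0][1:]  # skip the header cell above the index column
    return {r[0]: dict(zip(metric_names, map(float, r[1:]))) for r in rows[1:]}

# Demo on made-up files; real files come from the result directory.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "features.txt").write_text("feature_a\nfeature_b\n")
    (d / "feature_importances.txt").write_text("feature_a\t0.7\nfeature_b\t0.3\n")
    (d / "metrics.test.txt").write_text(
        "split\troc_auc\taccuracy\n0\t0.9\t0.8\n1\t0.7\t0.6\n")
    features = read_features(d)
    importances = read_feature_importances(d)
    metrics = read_metrics(d / "metrics.test.txt")

# Average one metric over all train-test splits.
mean_auc = sum(m["roc_auc"] for m in metrics.values()) / len(metrics)
```

If the real files use a different delimiter, only the `delimiter` argument needs to change.
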
Cross-validation details (cross_validation.h5)
| Dataset name | Dimension | Description |
|---|---|---|
| feature_selection | (n_splits, n_features) | Binary matrix indicating features selected in each cross-validation split |
| labels | (n_samples,) | True class labels |
| predicted_labels | (n_splits, n_samples) | Predicted class labels on all samples |
| predictions | (n_splits, n_samples) | Predicted probabilities of the positive class (or decision function for SVM) |
| train_index | (n_splits, n_samples) | Binary matrix indicating training samples in each cross-validation split |
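
As an illustration of how the datasets in the table above relate to each other, the sketch below computes how often each feature is selected across splits and the accuracy on held-out samples per split. It assumes that samples with a zero in `train_index` are the test samples of that split, which the table suggests but does not state. The helper names are hypothetical, the loader needs the third-party `h5py` package, and the demo arrays are made up.

```python
DATASET_NAMES = ("feature_selection", "labels", "predicted_labels",
                 "predictions", "train_index")

def load_cross_validation(path):
    """Read the documented datasets of cross_validation.h5 into nested lists."""
    import h5py  # third-party; imported lazily so the helpers below run without it
    with h5py.File(path, "r") as f:
        return {name: f[name][()].tolist() for name in DATASET_NAMES}

def selection_frequency(feature_selection):
    """Fraction of splits in which each feature was selected."""
    n_splits = len(feature_selection)
    # zip(*matrix) iterates over columns, i.e. over features.
    return [sum(column) / n_splits for column in zip(*feature_selection)]

def held_out_accuracy(labels, predicted_labels, train_index):
    """Accuracy per split, using only samples not marked as training samples."""
    accuracies = []
    for predicted, is_train in zip(predicted_labels, train_index):
        test_ids = [i for i, t in enumerate(is_train) if not t]
        if not test_ids:
            accuracies.append(float("nan"))  # split has no held-out samples
            continue
        correct = sum(predicted[i] == labels[i] for i in test_ids)
        accuracies.append(correct / len(test_ids))
    return accuracies

# Made-up example: 2 splits, 3 features, 4 samples.
feature_selection = [[1, 0, 1],
                     [1, 1, 0]]
labels = [0, 1, 0, 1]
train_index = [[1, 1, 0, 0],
               [0, 0, 1, 1]]
predicted_labels = [[0, 1, 0, 1],
                    [0, 0, 0, 1]]

frequencies = selection_frequency(feature_selection)
accuracies = held_out_accuracy(labels, predicted_labels, train_index)
```

With a real result directory, `load_cross_validation` would supply these arrays in place of the demo values.
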