xiangtai-pixel/DDLS_final_project
ProteoBRCA: A Proteomics-Based Breast Cancer Subtype Prediction Pipeline

This project provides a command-line pipeline to preprocess proteomic data, train an advanced FT-Transformer model, and evaluate its performance in predicting breast cancer stages.

The entire workflow has been encapsulated into a single, modular toolset: brca_toolset.py.

Installation

Before running the pipeline, please install the necessary Python libraries:

pip install pandas scikit-learn imbalanced-learn torch joblib rtdl

Usage

The pipeline is divided into four sequential commands that must be run in order. Open your terminal in the project directory and execute each one in turn.

Step 1: Preprocess Data

This command loads the raw .txt data files, performs cleaning, scaling, and SMOTE oversampling, and finally applies PCA for dimensionality reduction. All processed data and fitted preprocessors (scaler, PCA, label encoder) are saved to the brca_pipeline_artifacts/ directory.

python brca_toolset.py preprocess
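The stages above can be sketched as follows, using synthetic data in place of the raw .txt files. This is an illustrative outline, not the toolset's actual code; variable names and the number of PCA components are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))            # stand-in for the proteomic matrix
y = rng.choice(["early", "late"], size=120)

# 1. Encode string labels to integers.
le = LabelEncoder()
y_enc = le.fit_transform(y)

# 2. Scale each feature to zero mean / unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. (In the real pipeline, SMOTE oversampling from imbalanced-learn
#    would balance the classes here, before PCA.)

# 4. Reduce dimensionality with PCA.
pca = PCA(n_components=32)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)  # (120, 32)
```

In the real run, the fitted `scaler`, `pca`, and `le` objects are what gets persisted to brca_pipeline_artifacts/ so the test set can be transformed identically later.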

Step 2: Train Model

This command loads the preprocessed data and runs a hyperparameter search for the FT-Transformer model to find the best learning rate and weight decay. It then saves the best-performing model's weights and hyperparameters to the brca_pipeline_artifacts/ directory.

Note: This step is computationally intensive and will take a significant amount of time to complete.

python brca_toolset.py train
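The shape of that search can be sketched as a small grid over learning rate and weight decay, with the best configuration selected by validation macro F1. An sklearn MLPClassifier stands in here for the FT-Transformer (which the real toolset builds with the rtdl package), and `alpha` plays the role of weight decay; the grid values and data are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # synthetic, learnable labels
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_score, best_params = -1.0, None
for lr in (1e-3, 1e-2):                      # candidate learning rates
    for wd in (1e-5, 1e-3):                  # candidate weight decays
        model = MLPClassifier(hidden_layer_sizes=(16,), learning_rate_init=lr,
                              alpha=wd, max_iter=300, random_state=0)
        model.fit(X_tr, y_tr)
        score = f1_score(y_val, model.predict(X_val), average="macro")
        if score > best_score:
            best_score, best_params = score, {"lr": lr, "weight_decay": wd}

print(best_params, round(best_score, 3))
```

In the real step, the winning configuration is written to best_hyperparams.json and the corresponding model weights to best_ft_transformer.pt.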

Step 3: Run Inference

This command loads the best trained model from the previous step and runs predictions on the preprocessed test set. The predictions are saved in the artifacts directory.

python brca_toolset.py predict
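Conceptually, the prediction step reduces to taking the argmax over the model's per-class scores and saving the resulting class indices as a .npy file. The score matrix below is synthetic and the output path is illustrative.

```python
import os
import tempfile
import numpy as np

scores = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.2,  0.3],
                   [-0.5, 0.1, 2.2]])   # (n_samples, n_classes) model outputs
preds = scores.argmax(axis=1)           # predicted class index per sample

out = os.path.join(tempfile.gettempdir(), "predictions.npy")
np.save(out, preds)
print(np.load(out))  # [1 0 2]
```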

Step 4: Evaluate Performance

This final command loads the saved predictions and the true test labels, then prints a complete classification report and the final Macro F1-Score to the console.

python brca_toolset.py evaluate
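The evaluation amounts to comparing the saved predictions against the true test labels with sklearn's standard metrics. The label arrays below are synthetic examples of that comparison.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

y_true = np.array([0, 0, 1, 1, 2, 2])   # true test labels
y_pred = np.array([0, 1, 1, 1, 2, 0])   # loaded predictions

# Per-class precision/recall/F1, plus averages.
print(classification_report(y_true, y_pred))

# The single headline number the pipeline reports.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1-Score: {macro_f1:.3f}")
```

Macro F1 averages the per-class F1 scores with equal weight, so minority classes count as much as majority ones; that is why it is a sensible headline metric after SMOTE balancing.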

Pipeline Artifacts

All intermediate files generated by the pipeline are stored in the brca_pipeline_artifacts/ directory. This includes:

  • scaler.joblib, pca.joblib, label_encoder.joblib: Fitted pre-processors.
  • *.npy: Processed data splits (train, validation, test).
  • best_ft_transformer.pt: The trained weights of the best FT-Transformer model.
  • best_hyperparams.json: The best hyperparameters found during the search.
  • predictions.npy: The final predictions on the test set.
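The .joblib artifacts exist so that fitted preprocessors can be reloaded later and applied to new samples exactly as they were during training. A minimal round-trip sketch, using a StandardScaler and a temporary path as stand-ins:

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]])
scaler = StandardScaler().fit(X_train)

path = os.path.join(tempfile.gettempdir(), "scaler.joblib")
joblib.dump(scaler, path)               # how the pipeline persists an artifact

reloaded = joblib.load(path)            # later: reload and reuse it
X_new = np.array([[2.0, 14.0]])
print(reloaded.transform(X_new))        # the mean row maps to zeros
```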

About

Uses MCP to generate analysis tools for CPTAC BRCA data.
