xiangtai-pixel/DDLS_final_project
ProteoBRCA: A Proteomics-Based Breast Cancer Subtype Prediction Pipeline

This project provides a command-line pipeline to preprocess proteomic data, train an advanced FT-Transformer model, and evaluate its performance in predicting breast cancer stages.

The entire workflow has been encapsulated into a single, modular toolset: brca_toolset.py.

Installation

Before running the pipeline, please install the necessary Python libraries:

pip install pandas scikit-learn imbalanced-learn torch joblib rtdl

Usage

The pipeline is divided into four sequential commands that must be run in order. Open your terminal in the project directory and execute each one in turn.

Step 1: Preprocess Data

This command loads the raw .txt data files, performs cleaning, scaling, and SMOTE oversampling, and finally applies PCA for dimensionality reduction. All processed data and fitted preprocessors (scaler, PCA, label encoder) are saved to the brca_pipeline_artifacts/ directory.

python brca_toolset.py preprocess
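The stages above can be sketched as follows, using synthetic data in place of the raw .txt files. This is an illustrative outline, not the toolset's actual code; variable names and the number of PCA components are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))            # stand-in for the proteomic matrix
y = rng.choice(["early", "late"], size=120)

# 1. Encode string labels to integers.
le = LabelEncoder()
y_enc = le.fit_transform(y)

# 2. Scale each feature to zero mean / unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. (In the real pipeline, SMOTE oversampling from imbalanced-learn
#    would balance the classes here, before PCA.)

# 4. Reduce dimensionality with PCA.
pca = PCA(n_components=32)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)  # (120, 32)
```

In the real run, the fitted `scaler`, `pca`, and `le` objects are what gets persisted to brca_pipeline_artifacts/ so the test set can be transformed identically later.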

Step 2: Train Model

This command loads the preprocessed data and runs a hyperparameter search for the FT-Transformer model to find the best learning rate and weight decay. It then saves the best-performing model's weights and hyperparameters to the brca_pipeline_artifacts/ directory.

Note: This step is computationally intensive and will take a significant amount of time to complete.

python brca_toolset.py train
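The shape of that search can be sketched as a small grid over learning rate and weight decay, with the best configuration selected by validation macro F1. An sklearn MLPClassifier stands in here for the FT-Transformer (which the real toolset builds with the rtdl package), and `alpha` plays the role of weight decay; the grid values and data are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # synthetic, learnable labels
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_score, best_params = -1.0, None
for lr in (1e-3, 1e-2):                      # candidate learning rates
    for wd in (1e-5, 1e-3):                  # candidate weight decays
        model = MLPClassifier(hidden_layer_sizes=(16,), learning_rate_init=lr,
                              alpha=wd, max_iter=300, random_state=0)
        model.fit(X_tr, y_tr)
        score = f1_score(y_val, model.predict(X_val), average="macro")
        if score > best_score:
            best_score, best_params = score, {"lr": lr, "weight_decay": wd}

print(best_params, round(best_score, 3))
```

In the real step, the winning configuration is written to best_hyperparams.json and the corresponding model weights to best_ft_transformer.pt.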

Step 3: Run Inference

This command loads the best trained model from the previous step and runs predictions on the preprocessed test set. The predictions are saved in the artifacts directory.

python brca_toolset.py predict
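Conceptually, the prediction step reduces to taking the argmax over the model's per-class scores and saving the resulting class indices as a .npy file. The score matrix below is synthetic and the output path is illustrative.

```python
import os
import tempfile
import numpy as np

scores = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.2,  0.3],
                   [-0.5, 0.1, 2.2]])   # (n_samples, n_classes) model outputs
preds = scores.argmax(axis=1)           # predicted class index per sample

out = os.path.join(tempfile.gettempdir(), "predictions.npy")
np.save(out, preds)
print(np.load(out))  # [1 0 2]
```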

Step 4: Evaluate Performance

This final command loads the saved predictions and the true test labels, then prints a complete classification report and the final Macro F1-Score to the console.

python brca_toolset.py evaluate
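The evaluation amounts to comparing the saved predictions against the true test labels with sklearn's standard metrics. The label arrays below are synthetic examples of that comparison.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

y_true = np.array([0, 0, 1, 1, 2, 2])   # true test labels
y_pred = np.array([0, 1, 1, 1, 2, 0])   # loaded predictions

# Per-class precision/recall/F1, plus averages.
print(classification_report(y_true, y_pred))

# The single headline number the pipeline reports.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1-Score: {macro_f1:.3f}")
```

Macro F1 averages the per-class F1 scores with equal weight, so minority classes count as much as majority ones; that is why it is a sensible headline metric after SMOTE balancing.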

Pipeline Artifacts

All intermediate files generated by the pipeline are stored in the brca_pipeline_artifacts/ directory. This includes:

  • scaler.joblib, pca.joblib, label_encoder.joblib: Fitted pre-processors.
  • *.npy: Processed data splits (train, validation, test).
  • best_ft_transformer.pt: The trained weights of the best FT-Transformer model.
  • best_hyperparams.json: The best hyperparameters found during the search.
  • predictions.npy: The final predictions on the test set.
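The .joblib artifacts exist so that fitted preprocessors can be reloaded later and applied to new samples exactly as they were during training. A minimal round-trip sketch, using a StandardScaler and a temporary path as stand-ins:

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]])
scaler = StandardScaler().fit(X_train)

path = os.path.join(tempfile.gettempdir(), "scaler.joblib")
joblib.dump(scaler, path)               # how the pipeline persists an artifact

reloaded = joblib.load(path)            # later: reload and reuse it
X_new = np.array([[2.0, 14.0]])
print(reloaded.transform(X_new))        # the mean row maps to zeros
```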

About

Uses MCP to generate analysis tools for CPTAC BRCA data.
