This project provides a command-line pipeline to preprocess proteomic data, train an advanced FT-Transformer model, and evaluate its performance in predicting breast cancer stages.
The entire workflow has been encapsulated into a single, modular toolset: brca_toolset.py.
Before running the pipeline, please install the necessary Python libraries:
    pip install pandas scikit-learn imblearn torch joblib rtdl

The pipeline is divided into four sequential commands. You must run them in the following order. Open your terminal in the project directory and execute each command one by one.
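If you would rather not type the four commands by hand, the ordering can be sketched as a small driver script. This is only an illustration, not part of the toolset: the helper names `build_commands` and `run_pipeline` are hypothetical, and the sketch assumes that brca_toolset.py sits in the current directory and exits with a non-zero status when a step fails.

```python
import subprocess
import sys

# Subcommand names are taken from this README, in the order they must run.
STEPS = ["preprocess", "train", "predict", "evaluate"]


def build_commands(script="brca_toolset.py", python=sys.executable):
    """Return one argv list per pipeline step, in the required order."""
    return [[python, script, step] for step in STEPS]


def run_pipeline(script="brca_toolset.py"):
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands(script):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps are not run after an earlier step has failed.
        # Assumption: the toolset signals failure through its exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the project directory would then execute all four steps back to back. Keep in mind that the train step alone may take a long time.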
This command loads the raw .txt data files, performs cleaning, scaling, SMOTE oversampling, and finally applies PCA for dimensionality reduction. All processed data and fitted processors (scaler, pca, etc.) are saved to the brca_pipeline_artifacts/ directory.
    python brca_toolset.py preprocess

This command loads the preprocessed data and runs a hyperparameter search for the FT-Transformer model to find the best learning rate and weight decay. It then saves the best-performing model's weights and hyperparameters to the brca_pipeline_artifacts/ directory.
Note: This step is computationally intensive and will take a significant amount of time to complete.
    python brca_toolset.py train

This command loads the best trained model from the previous step and runs predictions on the preprocessed test set. The predictions are saved in the artifacts directory.
    python brca_toolset.py predict

This final command loads the saved predictions and the true test labels, then prints a complete classification report and the final Macro F1-Score to the console.
    python brca_toolset.py evaluate

All intermediate files generated by the pipeline are stored in the brca_pipeline_artifacts/ directory. This includes:
- scaler.joblib, pca.joblib, label_encoder.joblib: Fitted pre-processors.
- *.npy: Processed data splits (train, validation, test).
- best_ft_transformer.pt: The trained weights of the best FT-Transformer model.
- best_hyperparams.json: The best hyperparameters found during the search.
- predictions.npy: The final predictions on the test set.
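To confirm that a full run produced everything listed above, a quick check along the following lines may help. It is a minimal sketch: `missing_artifacts` is a hypothetical helper, not a function of brca_toolset.py. Because the file names of the processed data splits are not given here, only the explicitly named files are checked.

```python
from pathlib import Path

# File names as listed in this README. The names of the *.npy data splits
# are not specified, so predictions.npy is the only .npy file checked.
EXPECTED = [
    "scaler.joblib",
    "pca.joblib",
    "label_encoder.joblib",
    "best_ft_transformer.pt",
    "best_hyperparams.json",
    "predictions.npy",
]


def missing_artifacts(artifact_dir="brca_pipeline_artifacts"):
    """Return the expected file names that are not present in artifact_dir."""
    root = Path(artifact_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]
```

An empty list from `missing_artifacts()` means all named files are present. A missing predictions.npy, for example, suggests that the predict step has not been run yet.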