Predict protein function from amino acid sequences using ESM2 protein language models and XGBoost classifiers.
CAFA 6 Protein Function Prediction on Kaggle
- Embeddings: ESM2 protein language model
- Classification: Multi-label XGBoost
- Evaluation: CAFA-specific metrics (F1, Precision, Recall)
# Clone repository
git clone https://github.com/menna890/CAFA6-Protein-Function-Prediction.git
cd CAFA6-Protein-Function-Prediction
# Install dependencies
pip install -r requirements.txt
# Using Kaggle API
kaggle competitions download -c cafa-6-protein-function-prediction
unzip cafa-6-protein-function-prediction.zip -d data/raw/
# Extract embeddings
python scripts/extract_embeddings.py
# Train model
python scripts/train.py
# Make predictions
python scripts/predict.py
├── data/ # Data files (not tracked)
├── src/ # Source code
│ ├── data/ # Data loading
│ ├── features/ # Feature extraction
│ ├── models/ # Model definitions
│ └── utils/ # Utilities
├── notebooks/ # Jupyter notebooks
├── scripts/ # Executable scripts
├── models/ # Trained models (not tracked)
└── outputs/ # Results and reports
- Data Loading: Parse FASTA and TSV files
- Embeddings: Extract ESM2 embeddings (480-dim vectors)
- Classification: Multi-label binary relevance (one binary XGBoost classifier per GO term)
- Evaluation: CAFA metrics with cross-validation
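The data loading step can be sketched with the standard library alone. This is an illustrative sketch, not the code in `src/data/`: the function names (`read_fasta`, `read_terms`, `build_label_matrix`) are hypothetical, and it assumes the annotation TSV has the protein ID in the first column and the GO term in the second, with no header row. It shows how parsed annotations could become the binary label matrix that a binary relevance classifier trains on, one column per term.

```python
from collections import defaultdict
from io import StringIO


def read_fasta(handle):
    """Yield (protein_id, sequence) pairs from a FASTA-formatted text stream."""
    protein_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if protein_id is not None:
                yield protein_id, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            protein_id, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if protein_id is not None:
        yield protein_id, "".join(chunks)


def read_terms(handle):
    """Map each protein ID to its set of terms (assumed layout: id<TAB>term)."""
    terms = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed rows
        terms[fields[0]].add(fields[1])
    return dict(terms)


def build_label_matrix(protein_ids, terms):
    """Return (vocab, matrix) where matrix[i][j] is 1 if protein i has term j."""
    vocab = sorted({t for ts in terms.values() for t in ts})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in protein_ids]
    for i, pid in enumerate(protein_ids):
        for t in terms.get(pid, ()):
            matrix[i][index[t]] = 1
    return vocab, matrix


# Tiny in-memory example standing in for the real files under data/raw/.
fasta = StringIO(">P1 example protein\nMKT\nAYI\n>P2\nGAVL\n")
tsv = StringIO("P1\tGO:0003674\nP1\tGO:0005515\nP2\tGO:0003674\n")

sequences = dict(read_fasta(fasta))
terms = read_terms(tsv)
vocab, Y = build_label_matrix(list(sequences), terms)
```

Each column of `Y` is the target for one per-term binary classifier, and the rows follow the same protein order as the embedding matrix would need to.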
| Model | F1-Macro | F1-Micro | Precision | Recall |
|---|---|---|---|---|
| Baseline | TBD | TBD | TBD | TBD |
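For reference, the columns in the table above can be computed from binary label matrices as in the sketch below. The function names are hypothetical, precision and recall are assumed to be micro-averaged, and undefined ratios are set to 0.0. The official competition scoring may differ; this only illustrates the difference between macro and micro averaging.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def multilabel_scores(y_true, y_pred):
    """Compute macro/micro F1 and micro precision/recall for 0/1 matrices."""
    n_labels = len(y_true[0])
    # counts[j] = [true positives, false positives, false negatives] for label j
    counts = [[0, 0, 0] for _ in range(n_labels)]
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1

    # Macro: average the per-label F1 scores, so every label counts equally.
    per_label_f1 = [precision_recall_f1(*c)[2] for c in counts]
    f1_macro = sum(per_label_f1) / n_labels

    # Micro: pool the counts over all labels, then compute a single score.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision, recall, f1_micro = precision_recall_f1(tp, fp, fn)

    return {
        "f1_macro": f1_macro,
        "f1_micro": f1_micro,
        "precision": precision,
        "recall": recall,
    }


# Three proteins, three labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
scores = multilabel_scores(y_true, y_pred)
```

Macro averaging gives rare terms the same weight as frequent ones, so the two F1 values can diverge sharply when the label distribution is skewed.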
- Python 3.9+
- PyTorch
- Transformers (HuggingFace)
- XGBoost
- scikit-learn
- BioPython
Menna - @menna890
MIT License
- CAFA competition organizers
- Meta AI for ESM2
- Kaggle community