Commit 1c9f8c9

Merge pull request #2 from endixk/viral-phylo
reformat markdown for viral phylogenetics
2 parents 82e69a7 + 54aefa4 commit 1c9f8c9

1 file changed

Lines changed: 23 additions & 32 deletions

src/content/projects/viral-phylogenetics.md

@@ -12,74 +12,67 @@ tags:
1212
github: "https://github.com/NIAID-BRC-Codeathons/viral-phylogenetics"
1313
---
1414

15-
\*\*Project Theme
15+
# Viral Structural Phylogenetics
1616

17-
**Other** - Viral Structural Phylogenetics
17+
## Team Information
1818

19-
\*\*Team Information
19+
### Team Name: _virAllSpark_
2020

21-
**Team Name:** virAllSpark
21+
### Team Leads
2222

23-
**Team Lead(s):**
23+
- David Moi, University of Lausanne [✉️](mailto:david.moi@unil.ch)
24+
- Dongwook Kim, University of Lausanne [✉️](mailto:dongwook.kim@unil.ch)
2425

25-
- Name: David Moi
26-
- Affiliation: University of Lausanne
27-
- Email: david.moi@unil.ch
26+
### Team Members
2827

29-
- Name: Dongwook Kim
30-
- Affiliation: University of Lausanne
31-
- Email: dongwook.kim@unil.ch
32-
33-
**Team Members (4-6 members recommended):**
34-
35-
| Name | Affiliation | Role / Expertise |
28+
| Name | Affiliation | Expected Role / Expertise |
3629
| ------------ | ---------------------- | --------------------------------------------------------------------------------------------------------------------------- |
3730
| David Moi | University of Lausanne | Project leader (model development) / AI, ML, snakemake, HPC, docker, structural biology, phylogenetics |
3831
| Dongwook Kim | University of Lausanne | Project leader (validation and integration) / Phylogenetics, sequence analysis, protein structures, Protein Language Models |
3932
| TBD | - | Model development part / AI expert(s) on transformer models, biological language models, sequence embeddings |
4033
| TBD | - | Validation part / Biology and/or Bioinformatics expert(s) on virology, viral taxonomy, sequence analysis, sequence database |
4134

42-
\*\*Project Summary
35+
## Project Summary
4336

4437
In this project, we will zero in on producing a fast and accurate language model for viral structure prediction. Current state-of-the-art models, such as ProstT5, are robust for species with a wealth of sequences but perform poorly on under-represented groups such as viruses. To address this, we will fine-tune the ProstT5 model weights with LoRA (low-rank adaptation), exploiting the recent expansion of viral protein structure databases. We will validate the model by reconstructing viral taxonomy from structurally conserved core genes. Integrating this model into the Foldseek framework will allow us to predict structures for massive, possibly all publicly available, viral sequence databases.
4538

46-
\*\*Goals and Objectives
39+
## Goals and Objectives
4740

4841
1. **Develop a protein language model for viral structural token prediction**
4942
2. **Validate the model with ground-truth viral structures and taxonomy**
5043
3. **Integrate the model into Foldseek and apply it to massive databases**
5144

52-
\*\*Approach
45+
## Approach
5346

54-
**Methods and AI/ML Approaches:**
47+
### Methods and AI/ML Approaches
5548

5649
- Fine-tuning ProstT5 model with LoRA optimization
5750
- Hugging Face transformers and tokenizers for biological language models
5851
- Structural phylogenetics using 3Di token predictions
5952
- Foldseek framework integration for large-scale structure prediction
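
As a rough illustration of the LoRA idea named in the list above: the pretrained weight matrix `W` stays frozen and only a low-rank update is trained, so the effective weight is `W + (alpha / r) * B @ A`. The sketch below uses plain Python lists and hypothetical helper names (`lora_param_counts`, `lora_effective_weight`); the dimensions in the usage note are illustrative assumptions, not values taken from the ProstT5 configuration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    full: parameters updated when fine-tuning the whole d_out x d_in matrix.
    lora: parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Compute W + (alpha / r) * B @ A, where r is the number of rows of A.

    w: frozen pretrained weights, d_out x d_in
    b: trainable factor, d_out x r
    a: trainable factor, r x d_in
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

For a hypothetical 1024 x 1024 projection with rank 8, `lora_param_counts(1024, 1024, 8)` gives 1,048,576 parameters for full fine-tuning against 16,384 for the LoRA factors, which is why the approach suits a time-boxed codeathon on shared GPUs. In practice the adapters would be attached through a library rather than written by hand.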
6053

61-
**Implementation Steps:**
54+
### Implementation Steps
6255

63-
1. **Milestone 1: Fine-tuning the ProstT5 model with viral sequence/structure pairs**
56+
**Milestone 1: Fine-tuning the ProstT5 model with viral sequence/structure pairs**
6457
- BFVD dataset converted to 3Di and ready for training
6558
- Prepare large corpus of structural tokens to train virus-specific ProstT5 model
6659
- Tokenization and training using Hugging Face tokenizer with fill-in and translation tasks
6760
- LoRA optimization of ProstT5 weights using the Hugging Face interface
6861
- If necessary, split into clade-specific models
6962
- Inference: Transform UniProt proteomic data to 3Di
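
A minimal sketch of how sequence/structure pairs might be prepared for the translation task in Milestone 1. It follows the input convention described in the public ProstT5 usage notes (residues separated by spaces, rare amino acids `U`, `Z`, `O`, `B` mapped to `X`, a `<AA2fold>` or `<fold2AA>` prefix giving the direction, and lowercase 3Di states); this should be checked against the model card before use. The function names are hypothetical and not part of any library.

```python
import re


def format_prostt5_input(sequence, direction="AA2fold"):
    """Format one sequence as a ProstT5-style input string.

    direction="AA2fold": amino acids in, 3Di out (uppercase residues).
    direction="fold2AA": 3Di in, amino acids out (lowercase 3Di states).
    """
    if direction == "AA2fold":
        # Rare or ambiguous residues are replaced by X.
        cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    elif direction == "fold2AA":
        cleaned = sequence.lower()
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    # Tokens are single characters separated by spaces, after the prefix.
    return f"<{direction}> " + " ".join(cleaned)


def make_translation_pair(aa_sequence, three_di):
    """Build a (source, target) training pair for AA -> 3Di translation.

    3Di assigns one state per residue, so both strings must have equal length.
    """
    if len(aa_sequence) != len(three_di):
        raise ValueError("amino-acid and 3Di strings differ in length")
    source = format_prostt5_input(aa_sequence, "AA2fold")
    target = " ".join(three_di.lower())
    return source, target
```

The length check is a cheap guard against misaligned entries when converting a structure database such as BFVD into training pairs.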
7063

71-
2. **Milestone 2: Validation of the model with viral taxonomy reconstruction**
64+
**Milestone 2: Validation of the model with viral taxonomy reconstruction**
7265
- Prepare ground-truth dataset of viruses with well-studied taxonomic relationships
7366
- Run fine-tuned model to predict 3Di strings and compare with known structures
7467
- Define structural core genes to conduct structural phylogeny of virus species
7568
- Estimate model quality by comparing results with pre-established taxonomy
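
Two of the checks in Milestone 2 can be sketched with the standard library alone: per-position agreement between a predicted 3Di string and the 3Di string derived from a known structure, and whether a taxonomic group forms a single clade in a reconstructed tree. The rooted tree is assumed to be a nested tuple of leaf names, and both function names are hypothetical; a real pipeline would likely use a phylogenetics library and a proper tree-distance metric.

```python
def three_di_identity(predicted, reference):
    """Fraction of positions where predicted and reference 3Di states agree.

    Case is ignored because 3Di strings may be written in either case.
    """
    if len(predicted) != len(reference):
        raise ValueError("3Di strings differ in length")
    if not reference:
        raise ValueError("empty 3Di string")
    matches = sum(p == r for p, r in zip(predicted.upper(), reference.upper()))
    return matches / len(reference)


def leaves(tree):
    """Collect the leaf names of a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    names = set()
    for child in tree:
        names |= leaves(child)
    return names


def is_monophyletic(tree, group):
    """Return True if some subtree contains exactly the members of `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)
```

For example, with `tree = (("A1", "A2"), ("B1", ("B2", "C1")))`, the group `{"A1", "A2"}` is recovered as a clade while `{"B1", "B2"}` is not. The share of established taxa recovered as clades is one simple way to summarize agreement with the pre-established taxonomy.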
7669

77-
3. **Milestone 3: Foldseek framework integration and applications**
70+
**Milestone 3: Foldseek framework integration and applications**
7871
- Integrate updated weights into Foldseek-ProstT5 framework
7972
- Run model on viral subset of NCBI/UniProt protein databases
8073
- Convert to 3Di strings as a valuable resource for structural analysis
8174

82-
\*\*Data and Resources Required
75+
## Data and Resources Required
8376

8477
| Resource Type | Source / Link | Description / Purpose |
8578
| --------------------- | ----------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
@@ -88,7 +81,7 @@ In this project, we will zero in on producing a fast and accurate language model
8881
| **Compute / Storage** | Argonne HPC – GPU resources | Required for model training, benchmarking, and post-application of the model on large databases |
8982
| **Compute / Storage** | Argonne HPC – Storage | Required to store prerequisite, intermediate, and resulting data produced during development |
9083

91-
\*\*Expected Outcomes / Deliverables
84+
## Expected Outcomes / Deliverables
9285

9386
By the end of the Codeathon, we expect to deliver:
9487

@@ -99,30 +92,28 @@ By the end of the Codeathon, we expect to deliver:
9992
- **Documentation:** Model usage guides and API documentation
10093
- **Presentation:** Demo showing model performance and validation results
10194

102-
\*\*Potential Impact and Next Steps
95+
## Potential Impact and Next Steps
10396

104-
**Impact on:**
97+
### Impact
10598

10699
- **Infectious disease research or surveillance:** Development of a robust model will enable viral protein structural prediction at unprecedented scale. The resulting knowledgebase will unlock large-scale, deep comparative genomics across viral species with their protein structures, which has been limited by the scarcity of viral structure data.
107100
- **AI/ML automation and interpretability:** Demonstrates advanced fine-tuning techniques for biological language models and structural prediction at scale
108101
- **Public health preparedness or education:** Will catalyze downstream advances, including structure-based reclassification of viruses, structure-guided viral identification methods, or metagenomic expansion
109102

110-
**Next Steps After Codeathon:**
103+
### Next Steps After Codeathon
111104

112105
- Expand model to cover broader viral families and clades
113106
- Integrate with existing viral databases and phylogenetic tools
114107
- Develop web interface for community access
115108
- Publish methodology and make model weights publicly available
116109
- Apply to outbreak surveillance and viral evolution studies
117110

118-
\*\*Technical Support Needed
111+
## Technical Support Needed
119112

120113
- Datasets preloaded - Pre-downloaded BFVD datasets to save time
121114
- GPU / LLM access - Access to the ANL GPU resources (for fine-tuning and database-scale prediction)
122115
API keys
123-
Mentor support
124-
Other: [Specify]
125116

126-
\*\*Additional Comments
117+
## Additional Comments
127118

128119
This project focuses on addressing the under-representation of viral proteins in current state-of-the-art structural prediction models. By fine-tuning ProstT5 specifically for viral sequences, we aim to create a specialized tool that can handle the unique characteristics of viral proteins and enable large-scale structural analysis across viral species.
