| David Moi | University of Lausanne | Project leader (model development) / AI, ML, snakemake, HPC, docker, structural biology, phylogenetics |
| Dongwook Kim | University of Lausanne | Project leader (validation and integration) / Phylogenetics, sequence analysis, protein structures, Protein Language Models |
| TBD | - | Model development part / AI expert(s) on transformer models, biological language models, sequence embeddings |
In this project, we will zero in on producing a fast and accurate language model for viral structure prediction. Current state-of-the-art models, such as ProstT5, are robust for species with abundant sequence data but perform poorly on under-represented groups such as viruses. To address this, we will fine-tune the ProstT5 model weights with LoRA, exploiting the recent expansion of viral protein structure databases. We will validate the model by reconstructing viral taxonomy from structurally conserved core genes. Integrating the model into the Foldseek framework will then allow us to predict structures for massive, possibly all publicly available, viral sequence databases.

-\*\*Goals and Objectives
+## Goals and Objectives

1. **Develop a protein language model for viral structural token prediction**
2. **Validate the model with ground-truth viral structures and taxonomy**
3. **Integrate the model into Foldseek and apply it to massive databases**

-\*\*Approach
+## Approach

-**Methods and AI/ML Approaches:**
+### Methods and AI/ML Approaches

- Fine-tuning the ProstT5 model with LoRA optimization
- Hugging Face transformers and tokenizers for biological language models
- Structural phylogenetics using 3Di token predictions
- Foldseek framework integration for large-scale structure prediction
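
As a rough illustration of the LoRA idea listed above, the sketch below applies a low-rank update `W + (alpha / r) * B @ A` to a frozen weight matrix using plain Python lists. The matrix sizes, the `alpha` value, and all function names are illustrative assumptions rather than the project's training code; an actual run would presumably go through a fine-tuning library on top of the Hugging Face model.

```python
# Minimal LoRA sketch in pure Python (illustrative only, not the project's code).
# A frozen weight W (d_out x d_in) is adapted as W + (alpha / r) * B @ A,
# where B is d_out x r and A is r x d_in; only A and B would be trained.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A without modifying W."""
    r = len(a)  # LoRA rank = number of rows in A
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_fraction(d_out, d_in, r):
    """Adapter parameters r * (d_out + d_in) relative to the full d_out * d_in."""
    return r * (d_out + d_in) / (d_out * d_in)

# Toy example: a 2x3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x d_in
B = [[0.0], [0.5]]      # d_out x r (LoRA initialises B to zero; non-zero here for show)
W_eff = lora_effective_weight(W, A, B, alpha=2.0)
```

For a hypothetical 1024 x 1024 projection and rank 8, `trainable_fraction(1024, 1024, 8)` shows that the adapter holds under 2% of the parameters of the full matrix, which is the main reason LoRA is attractive for a time-boxed Codeathon.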

-**Implementation Steps:**
+### Implementation Steps

-1. **Milestone 1: Fine-tuning the ProstT5 model with viral sequence/structure pairs**
+**Milestone 1: Fine-tuning the ProstT5 model with viral sequence/structure pairs**
- BFVD dataset converted to 3Di and ready for training
- Prepare a large corpus of structural tokens to train a virus-specific ProstT5 model
- Tokenization and training using the Hugging Face tokenizer with fill-in and translation tasks
- LoRA optimization of ProstT5 weights using the Hugging Face interface
- If necessary, split into clade-specific models
- Inference: transform UniProt proteomic data to 3Di
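
The data-preparation step for the translation task might look roughly like the sketch below, which turns one amino-acid sequence and its 3Di string into a (source, target) text pair. It assumes the input convention described for the public ProstT5 checkpoint (space-separated residues, rare amino acids mapped to `X`, lowercase 3Di, and an `<AA2fold>` direction prefix); these conventions and all function names here are assumptions that should be checked against the model card before training.

```python
import re

AA2FOLD = "<AA2fold>"  # assumed direction prefix: amino acids -> 3Di
FOLD2AA = "<fold2AA>"  # assumed direction prefix: 3Di -> amino acids

def format_amino_acids(seq):
    """Upper-case, map rare residues (U, Z, O, B) to X, and space-separate."""
    cleaned = re.sub(r"[UZOB]", "X", seq.strip().upper())
    return " ".join(cleaned)

def format_3di(tokens):
    """Lower-case and space-separate a 3Di string."""
    return " ".join(tokens.strip().lower())

def make_translation_pair(aa_seq, three_di):
    """Build one (source, target) pair for the sequence -> 3Di direction."""
    # 3Di assigns one structural letter per residue, so lengths must match.
    if len(aa_seq.strip()) != len(three_di.strip()):
        raise ValueError("sequence and 3Di string must have equal length")
    source = f"{AA2FOLD} {format_amino_acids(aa_seq)}"
    target = format_3di(three_di)
    return source, target

# Toy example with a made-up four-residue protein.
src, tgt = make_translation_pair("MKUL", "DVVP")
```

Pairs built this way could then be passed to a Hugging Face tokenizer; the reverse direction would swap source and target and use the `FOLD2AA` prefix instead.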

-2. **Milestone 2: Validation of the model with viral taxonomy reconstruction**
+**Milestone 2: Validation of the model with viral taxonomy reconstruction**
- Prepare a ground-truth dataset of viruses with well-studied taxonomic relationships
- Run the fine-tuned model to predict 3Di strings and compare them with known structures
- Define structural core genes to conduct structural phylogeny of virus species
- Estimate model quality by comparing results with pre-established taxonomy
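
One simple way to compare predicted 3Di strings with those derived from known structures is per-position identity, sketched below. This is only one possible metric and an assumption on our part; the function names are hypothetical, and a real evaluation would likely add alignment-based or substitution-matrix-aware scores.

```python
def three_di_identity(predicted, reference):
    """Fraction of positions where predicted and reference 3Di letters agree."""
    if len(predicted) != len(reference):
        raise ValueError("strings must have equal length (one 3Di letter per residue)")
    if not reference:
        raise ValueError("empty strings cannot be compared")
    # Compare case-insensitively, since 3Di may be stored in either case.
    matches = sum(p == r for p, r in zip(predicted.upper(), reference.upper()))
    return matches / len(reference)

def mean_identity(pairs):
    """Average identity over an iterable of (predicted, reference) pairs."""
    scores = [three_di_identity(p, r) for p, r in pairs]
    if not scores:
        raise ValueError("no pairs given")
    return sum(scores) / len(scores)

# Toy example with made-up 3Di strings.
score = three_di_identity("DVVPQ", "DVLPQ")
average = mean_identity([("DVVP", "DVVP"), ("dvvp", "DVLP")])
```

Reporting this score per protein, alongside the taxonomy comparison, would make it easier to see whether errors concentrate in particular viral clades.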

-3. **Milestone 3: Foldseek framework integration and applications**
+**Milestone 3: Foldseek framework integration and applications**
- Integrate the updated weights into the Foldseek-ProstT5 framework
- Run the model on the viral subset of the NCBI/UniProt protein databases
- Convert the sequences to 3Di strings as a resource for structural analysis

-\*\*Data and Resources Required
+## Data and Resources Required

| Resource Type | Source / Link | Description / Purpose |
@@ -88,7 +81,7 @@ In this project, we will zero in on producing a fast and accurate language model
| **Compute / Storage** | Argonne HPC – GPU resources | Required for model training, benchmarking, and post-application of the model on large databases |
| **Compute / Storage** | Argonne HPC – Storage | Required to store prerequisite, intermediate, and resulting data produced during development |

-\*\*Expected Outcomes / Deliverables
+## Expected Outcomes / Deliverables

By the end of the Codeathon, we expect to deliver:
@@ -99,30 +92,28 @@ By the end of the Codeathon, we expect to deliver:
- **Documentation:** Model usage guides and API documentation
- **Presentation:** Demo showing model performance and validation results

-\*\*Potential Impact and Next Steps
+## Potential Impact and Next Steps

-**Impact on:**
+### Impact

- **Infectious disease research or surveillance:** A robust model will enable viral protein structure prediction at unprecedented scale. The resulting knowledge base will unlock large-scale, deep comparative genomics across viral species using their protein structures, which has so far been limited by the scarcity of viral structure data.
- **AI/ML automation and interpretability:** Demonstrates advanced fine-tuning techniques for biological language models and structural prediction at scale
- **Public health preparedness or education:** Will catalyze downstream advances, including structure-based reclassification of viruses, structure-guided viral identification methods, and metagenomic expansion

-**Next Steps After Codeathon:**
+### Next Steps After Codeathon

- Expand the model to cover broader viral families and clades
- Integrate with existing viral databases and phylogenetic tools
- Develop a web interface for community access
- Publish the methodology and make model weights publicly available
- Apply to outbreak surveillance and viral evolution studies

-\*\*Technical Support Needed
+## Technical Support Needed

- Datasets preloaded - Pre-downloaded BFVD datasets to save time
- GPU / LLM access - Access to the ANL GPU resources (for fine-tuning and database-scale prediction)
API keys
-Mentor support
-Other: [Specify]

-\*\*Additional Comments
+## Additional Comments

This project focuses on addressing the under-representation of viral proteins in current state-of-the-art structural prediction models. By fine-tuning ProstT5 specifically for viral sequences, we aim to create a specialized tool that can handle the unique characteristics of viral proteins and enable large-scale structural analysis across viral species.