A curated list of protein foundation models and generative models for protein sequence, structure, and multimodal biological modeling.
The field has evolved from family-specific statistical models to large self-supervised foundation models trained on evolutionary-scale protein datasets. The taxonomy and model selection follow the review "Foundation models of protein sequences: a brief overview", including its distributional view of models as p(x), p(x, s), p(x | s), and p(x, m).
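As a reading aid, the four distributions can be written out as below. The symbol meanings are our reading of the section descriptions in this list (x for the amino acid sequence, s for the structure, m for additional modalities such as text or ontology annotations) and should be checked against the review. The factorization in the second line is simply the product rule and shows how joint and structure-conditioned models relate.

```latex
\begin{aligned}
&p(x)        && \text{sequence-only models} \\
&p(x, s) = p(x \mid s)\, p(s) && \text{joint sequence-structure models} \\
&p(x \mid s) && \text{inverse folding (sequence given backbone)} \\
&p(x, m)     && \text{multimodal models}
\end{aligned}
```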
The repository focuses on models capable of:
- representation learning
- generative protein design
- zero-shot variant effect prediction
- multimodal protein modeling
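Zero-shot variant effect prediction is commonly framed as comparing the likelihood a model assigns to the mutant residue with the likelihood it assigns to the wild-type residue. The following is a minimal sketch of that log-ratio score, assuming a hand-written, site-independent toy profile rather than any model from this list. `TOY_PROFILE`, `score_substitution`, and all probabilities are hypothetical and for illustration only.

```python
import math

# Hypothetical per-position amino acid probabilities for a 3-residue toy protein.
# In practice these would come from a trained model (a profile, a masked
# language model, ...); the numbers here are made up for illustration.
TOY_PROFILE = [
    {"M": 0.90, "L": 0.05, "V": 0.05},
    {"K": 0.60, "R": 0.30, "E": 0.10},
    {"G": 0.80, "A": 0.15, "P": 0.05},
]


def score_substitution(profile, wild_type, position, mutant):
    """Return log p(mutant) - log p(wild type) at a 0-based position."""
    if not 0 <= position < len(profile):
        raise IndexError("position outside the profile")
    site = profile[position]
    wt_residue = wild_type[position]
    if wt_residue not in site or mutant not in site:
        raise KeyError("residue missing from the toy profile at this site")
    return math.log(site[mutant]) - math.log(site[wt_residue])


wild_type = "MKG"
conservative = score_substitution(TOY_PROFILE, wild_type, 1, "R")  # K -> R
disruptive = score_substitution(TOY_PROFILE, wild_type, 2, "P")    # G -> P
print(f"K2R: {conservative:.3f}  G3P: {disruptive:.3f}")
```

A score near zero means the toy profile tolerates the substitution about as well as the wild type; more negative scores mean the mutant residue is less likely at that site. No labeled fitness data is used, which is what makes the setting zero-shot.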
Citations are auto-updated in place using live OpenAlex/shields.io badges. Note that these counts are usually lower than those reported by Google Scholar.
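How the badges in this repository are generated is not stated here. As one possible approach, the sketch below builds a shields.io dynamic JSON badge URL that reads the `cited_by_count` field from the OpenAlex works endpoint. The helper name `citation_badge_url` and the label are hypothetical, and the choice of endpoint and JSONPath query is an assumption rather than the repository's actual setup.

```python
from urllib.parse import urlencode

# Public endpoints; their use here is an assumption about how such a badge
# could be built, not a description of this repository's tooling.
OPENALEX_WORKS = "https://api.openalex.org/works/"
SHIELDS_DYNAMIC_JSON = "https://img.shields.io/badge/dynamic/json"


def citation_badge_url(doi, label="citations"):
    """Build a badge URL that displays OpenAlex's cited_by_count for a DOI."""
    # Accept either a bare DOI or a full https://doi.org/ URL.
    bare = doi.removeprefix("https://doi.org/")
    openalex_url = f"{OPENALEX_WORKS}https://doi.org/{bare}"
    params = {
        "url": openalex_url,          # JSON document shields.io should fetch
        "query": "$.cited_by_count",  # JSONPath to the citation count
        "label": label,
    }
    return f"{SHIELDS_DYNAMIC_JSON}?{urlencode(params)}"


badge = citation_badge_url("10.1016/j.sbi.2025.103004")
print(f"![citations]({badge})")
```

The printed Markdown image link can be pasted into a table cell. The count is fetched when the badge is rendered, so no stored number needs updating.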
If you find this repository useful, please consider starring it for later!
2025 snapshot from our review paper: Foundation models of protein sequences: a brief overview.
Early probabilistic and family-level generative models.
| Model | Paper | Venue | Year | Citations | Notes |
|---|---|---|---|---|---|
| HMMs | Hidden Markov Models in Computational Biology | Journal of Molecular Biology | 1994 | | Markovian sequence models for protein families. |
| PSSM / PSI-BLAST | Gapped BLAST and PSI-BLAST: a new generation of protein database search programs | Nucleic Acids Research | 1997 | | Site-independent profile model over aligned families. |
| Potts / DCA | Sequence co-evolution gives 3D contacts and structures of protein complexes | eLife | 2014 | | Pairwise co-evolution modeling for contact and fitness signals. |
| ProtVec | Continuous distributed representation of biological sequences for deep proteomics and genomics | PLOS ONE | 2015 | | k-mer embedding pretraining for protein sequence representations. |
| DeepSequence | Deep generative models of genetic variation capture the effects of mutations | Nature Methods | 2018 | | Latent variable (VAE) modeling over homologous alignments. |
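The notes above contrast site-independent profiles with pairwise Potts models. In the usual textbook notation (the symbols are ours, not the repository's), for an aligned sequence x = (x_1, ..., x_L) the two model families take the following forms, where h_i are per-site fields, J_ij are pairwise couplings, and Z is the normalizing constant:

```latex
\begin{aligned}
p_{\text{profile}}(x) &= \prod_{i=1}^{L} p_i(x_i) \\
p_{\text{Potts}}(x) &= \frac{1}{Z} \exp\Big( \sum_{i=1}^{L} h_i(x_i)
  + \sum_{1 \le i < j \le L} J_{ij}(x_i, x_j) \Big)
\end{aligned}
```

Setting all couplings J_ij to zero recovers a site-independent model, which is why the pairwise terms are what carry the co-evolution signal used for contact prediction.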
Models that leverage multiple sequence alignments or retrieval over homologs.
| Model | Paper | Venue | Year | Citations | Notes |
|---|---|---|---|---|---|
| EVE | Disease variant prediction with deep generative models of evolutionary data | Nature | 2021 | | Evolutionary VAE for variant effect prediction in families. |
| MSA Transformer | MSA Transformer | ICML | 2021 | | Transformer over full multiple sequence alignments. |
| Tranception | Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval | ICML | 2022 | | Autoregressive LM with retrieval-time alignment context. |
| PoET | PoET: A generative model of protein families as sequences-of-sequences | NeurIPS | 2024 | | Sequences-of-sequences family modeling with retrieval. |
Protein foundation models on the amino acid sequence space.
| Model | Paper | Venue | Year | Citations | Notes |
|---|---|---|---|---|---|
| UniRep | Unified rational protein engineering with sequence-based deep representation learning | Nature Methods | 2019 | | RNN pretraining for protein representations. |
| ProtTrans | ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning | IEEE TPAMI | 2021 | | Large transformer pretraining on UniProt/BFD-scale corpora. |
| ESM-1b | Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences | PNAS | 2021 | | Scaled transformer pLM with emergent structure/function signals. |
| ProteinBERT | ProteinBERT: a universal deep-learning model of protein sequence and function | Bioinformatics | 2022 | | Masked language modeling with global-local architecture. |
| ESM-2 | Evolutionary-scale prediction of atomic-level protein structure with a language model | Science | 2023 | | High-capacity pLM with strong zero-shot and structure signals. |
| ProGen2 | ProGen2: Exploring the boundaries of protein language models | Cell Systems | 2023 | | Autoregressive generative sequence foundation model. |
| Ankh | Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling | Preprint / Workshop | 2023 | | Compute-efficient pLM with strong downstream transfer. |
| xTrimoPGLM | xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein | Preprint / Workshop | 2023 | | 100B-scale protein language model. |
| ProtHyena | ProtHyena: A fast and efficient foundation protein language model at single amino acid resolution | Preprint / Workshop | 2024 | | Long-context Hyena architecture for efficient protein language modeling. |
| DPLM | Diffusion Language Models Are Versatile Protein Learners | ICML | 2024 | | Diffusion language modeling for sequence generation. |
| PTM-Mamba | PTM-Mamba: a PTM-aware protein language model with bidirectional gated Mamba blocks | Nature Methods | 2025 | | PTM-aware state-space modeling for scalable sequence learning. |
Generative models over both sequence and structure.
| Model | Paper | Venue | Year | Citations | Notes |
|---|---|---|---|---|---|
| LM-GVP | LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction | Scientific Reports | 2022 | | Joint sequence-structure representation learning. |
| Chroma | Illuminating protein space with a programmable generative model | Nature | 2023 | | Programmable generative modeling over protein space. |
| RFdiffusion | De novo design of protein structure and function with RFdiffusion | Nature | 2023 | | Diffusion-based backbone generation with design conditioning. |
| SaProt | SaProt: Protein Language Modeling with Structure-aware Vocabulary | ICLR | 2024 | | Structure-tokenized sequence modeling in a unified vocabulary. |
| ProteinGenerator | Multistate and functional protein design using RoseTTAFold sequence space diffusion | Nature Biotechnology | 2024 | | Co-design via sequence diffusion with structure refinement. |
| Protpardelle | An all-atom protein generative model | PNAS | 2024 | | All-atom generative protein modeling. |
| Multiflow | Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design | ICML | 2024 | | Discrete flow matching for multimodal co-design. |
Sequence generation conditioned on backbone structure.
| Model | Paper | Venue | Year | Citations | Notes |
|---|---|---|---|---|---|
| Structured Transformer | Generative models for graph-based protein design | NeurIPS | 2019 | | Graph-based inverse folding from structure to sequence. |
| GVP-GNN | Learning from Protein Structure with Geometric Vector Perceptrons | ICLR | 2021 | | Geometric message passing for structure-conditioned sequence tasks. |
| ProteinMPNN | Robust deep learning-based protein sequence design using ProteinMPNN | Science | 2022 | | State-of-the-art sequence design from backbone context. |
| ESM-IF1 | Learning inverse folding from millions of predicted structures | ICML | 2022 | | Inverse folding at scale from predicted structures. |
| MIF-ST | Masked inverse folding with sequence transfer for protein representation learning | Protein Engineering Design and Selection | 2022 | | Masked inverse folding with sequence transfer learning. |
| Knowledge-Design (KW-Design) | Knowledge-Design: Pushing the Limit of Protein Design via Knowledge Refinement | ICLR | 2024 | | Knowledge refinement for structure-guided sequence design. |
| CarbonDesign | Accurate and robust protein sequence design with CarbonDesign | Nature Machine Intelligence | 2024 | | Robust sequence design with iterative refinement. |
Models combining sequence/structure with text, ontology, or other modalities.
| Model | Paper | Venue | Year | Citations | Notes |
|---|---|---|---|---|---|
| OntoProtein | OntoProtein: Protein Pretraining With Gene Ontology Embedding | ICLR | 2022 | | Protein modeling with gene ontology context. |
| ProtST | ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts | ICML | 2023 | | Contrastive alignment of proteins and biomedical text. |
| ZymCTRL | Conditional language models enable the efficient design of proficient enzymes | Preprint / Workshop | 2024 | | EC-conditioned controllable generation of enzyme sequences. |
| ProTrek | ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning | Preprint / Workshop | 2024 | | Tri-modal contrastive navigation in protein space. |
| Prot2Text | Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers | AAAI | 2024 | | Protein-to-text multimodal function generation. |
| ESM-3 | Simulating 500 million years of evolution with a language model | Science | 2025 | | Unified multimodal track modeling (sequence, structure, and function). |
| PAIR | Boosting the predictive power of protein representations with a corpus of text annotations | Nature Machine Intelligence | 2025 | | Text-annotation corpora to boost protein representations. |
| ProteinDT | A text-guided protein design framework | Nature Machine Intelligence | 2025 | | Text-guided protein generation framework. |
Protein language modeling and design. ProteinGym • FLIP • TAPE • ATOM3D
Community challenges. CAFA • CASP • CAMEO
Sequence corpora. UniProtKB • UniRef • UniParc • BFD • MGnify
Structure corpora. RCSB PDB • AlphaFold DB • CATH • SCOPe
Functional and domain annotations. Pfam • InterPro • Gene Ontology • Swiss-Prot
Model frameworks. ESM • Hugging Face Transformers • ProteinMPNN • OpenFold • AlphaFold
Search and structural tooling. MMseqs2 • Foldseek • ColabFold
Contributions are welcome. Please open a pull request if you would like to add:
- new protein foundation models
- important new utilities for proteins
For a brief tour through these papers and the current direction of research, you may be interested in our review, /paper.pdf.
@article{bjerregaard2025foundation,
title = {Foundation models of protein sequences: A brief overview},
author = {Andreas Bjerregaard and Peter Mørch Groth and Søren Hauberg and Anders Krogh and Wouter Boomsma},
url = {https://www.sciencedirect.com/science/article/pii/S0959440X25000223},
doi = {10.1016/j.sbi.2025.103004},
journal = {Current Opinion in Structural Biology},
volume = {91},
pages = {103004},
year = {2025},
issn = {0959-440X},
}