Awesome Protein Foundation Models

A curated list of protein foundation models and generative models for protein sequence, structure, and multimodal biological modeling.

The field has evolved from family-specific statistical models to large self-supervised foundation models trained on evolutionary-scale protein datasets. Taxonomy and model selection follow the review Foundation models of protein sequences: a brief overview, which groups models by the distribution they target: p(x), p(x, s), p(x | s), and p(x, m), where x is the amino acid sequence, s the structure, and m an additional modality such as text or ontology annotations.
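As a reading aid for the taxonomy, the distributions are related by the standard rules of probability. The identities below are a minimal sketch, assuming x denotes the sequence and s the structure (as in the section headings further down) and treating s as continuous; for discretized structure tokens the integral becomes a sum.

```latex
\begin{align}
p(x, s) &= p(x \mid s)\, p(s) = p(s \mid x)\, p(x), \\
p(x)    &= \int p(x, s)\, \mathrm{d}s .
\end{align}
```

Read this way, an inverse folding model p(x | s) is one factor of a joint sequence-structure model p(x, s), and a sequence-only model p(x) is the joint with structure marginalized out.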

The repository focuses on models capable of:

  • representation learning
  • generative protein design
  • zero-shot variant effect prediction
  • multimodal protein modeling

Citation counts update automatically through live OpenAlex/shields.io badges. Note that these counts are usually lower than those reported by Google Scholar.
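For readers curious how such a badge can be wired up, here is a minimal sketch. It assumes the OpenAlex works endpoint (which reports a `cited_by_count` field) and the shields.io dynamic JSON badge. The helper names are hypothetical, and this is not necessarily how the badges in this repository are built.

```python
import json
from urllib.parse import urlencode

# Assumed endpoints: OpenAlex works API and the shields.io dynamic JSON badge.
OPENALEX_WORKS = "https://api.openalex.org/works/"
SHIELDS_DYNAMIC_JSON = "https://img.shields.io/badge/dynamic/json"


def openalex_work_url(doi: str) -> str:
    """Return the OpenAlex API URL for a work identified by its bare DOI."""
    return f"{OPENALEX_WORKS}doi:{doi}"


def citation_badge_url(doi: str, label: str = "cites") -> str:
    """Build a badge URL that displays the work's cited_by_count."""
    params = {
        "url": openalex_work_url(doi),   # JSON document the badge reads
        "query": "$.cited_by_count",     # JSONPath of the value to display
        "label": label,                  # text on the left side of the badge
    }
    return f"{SHIELDS_DYNAMIC_JSON}?{urlencode(params)}"


def cited_by_count(payload: str) -> int:
    """Extract the citation count from an OpenAlex work JSON response body."""
    return int(json.loads(payload)["cited_by_count"])


if __name__ == "__main__":
    # DOI of the review cited at the end of this README.
    print(citation_badge_url("10.1016/j.sbi.2025.103004"))
```

Because the badge is rendered on request, the count stays current without any commit to the repository.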

If you find this repository useful, please consider starring it for later!

Overview

2025 snapshot from our review paper: Foundation models of protein sequences: a brief overview.

Historical models

Early probabilistic and family-level generative models.

| Model | Paper | Venue | Year | Citations | Notes |
| --- | --- | --- | --- | --- | --- |
| HMMs | Hidden Markov Models in Computational Biology | Journal of Molecular Biology | 1994 | cites | Markovian sequence models for protein families. |
| PSSM / PSI-BLAST | Gapped BLAST and PSI-BLAST: a new generation of protein database search programs | Nucleic Acids Research | 1997 | cites | Site-independent profile model over aligned families. |
| Potts / DCA | Sequence co-evolution gives 3D contacts and structures of protein complexes | eLife | 2014 | cites | Pairwise co-evolution modeling for contact and fitness signals. |
| ProtVec | Continuous distributed representation of biological sequences for deep proteomics and genomics | PLOS ONE | 2015 | cites | k-mer embedding pretraining for protein sequence representations. |
| DeepSequence | Deep generative models of genetic variation capture the effects of mutations | Nature Methods | 2018 | cites | Latent variable (VAE) modeling over homologous alignments. |
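To make "site-independent profile model" concrete, here is a minimal sketch of a toy profile fitted to a tiny ungapped alignment. It then scores a variant by its log-likelihood ratio against the wild type, which is the basic idea behind zero-shot variant effect prediction. The alignment, function names, and pseudocount are illustrative assumptions. This is not the PSI-BLAST algorithm, which also handles gaps, sequence weighting, and background frequencies.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def fit_profile(alignment, pseudocount=1.0):
    """Estimate per-column amino acid frequencies from an ungapped alignment.

    Assumes every sequence has the same length and uses only ALPHABET letters.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        # Additive smoothing keeps unseen residues at non-zero probability.
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile


def log_likelihood(profile, seq):
    """log p(x) under the site-independent model: a sum over positions."""
    return sum(math.log(profile[i][a]) for i, a in enumerate(seq))


def variant_score(profile, wild_type, mutant):
    """Log-likelihood ratio; negative values mean the mutant is less likely."""
    return log_likelihood(profile, mutant) - log_likelihood(profile, wild_type)


if __name__ == "__main__":
    toy_alignment = ["ACDK", "ACDR", "ACEK", "GCDK"]  # hypothetical family
    profile = fit_profile(toy_alignment)
    print(variant_score(profile, "ACDK", "AWDK"))  # mutates a conserved C
```

Because each position is modeled independently, the score for a single substitution depends only on the mutated column. Potts models, VAEs, and language models relax that independence assumption.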

Alignment-based models

Models that leverage multiple sequence alignments or retrieval over homologs.

| Model | Paper | Venue | Year | Citations | Notes |
| --- | --- | --- | --- | --- | --- |
| EVE | Disease variant prediction with deep generative models of evolutionary data | Nature | 2021 | cites | Evolutionary VAE for variant effect prediction in families. |
| MSA Transformer | MSA Transformer | ICML | 2021 | cites | Transformer over full multiple sequence alignments. |
| Tranception | Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval | ICML | 2022 | cites | Autoregressive LM with retrieval-time alignment context. |
| PoET | PoET: A generative model of protein families as sequences-of-sequences | NeurIPS | 2024 | cites | Sequences-of-sequences family modeling with retrieval. |

Sequence models p(x)

Protein foundation models on the amino acid sequence space.

| Model | Paper | Venue | Year | Citations | Notes |
| --- | --- | --- | --- | --- | --- |
| UniRep | Unified rational protein engineering with sequence-based deep representation learning | Nature Methods | 2019 | cites | RNN pretraining for protein representations. |
| ProtTrans | ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning | IEEE | 2021 | cites | Large transformer pretraining on UniProt/BFD-scale corpora. |
| ESM-1b | Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences | PNAS | 2021 | cites | Scaled transformer pLM with emergent structure/function signals. |
| ProteinBERT | ProteinBERT: a universal deep-learning model of protein sequence and function | Bioinformatics | 2022 | cites | Masked language modeling with global-local architecture. |
| ESM-2 | Evolutionary-scale prediction of atomic-level protein structure with a language model | Science | 2023 | cites | High-capacity pLM with strong zero-shot and structure signals. |
| ProGen2 | ProGen2: Exploring the boundaries of protein language models | Cell Systems | 2023 | cites | Autoregressive generative sequence foundation model. |
| Ankh | Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling | Preprint / Workshop | 2023 | cites | Compute-efficient pLM with strong downstream transfer. |
| xTrimoPGLM | xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein | Preprint / Workshop | 2023 | cites | 100B-scale protein language model. |
| ProtHyena | ProtHyena: A fast and efficient foundation protein language model at single amino acid resolution | Preprint / Workshop | 2024 | cites | Long-context Hyena architecture for efficient protein language modeling. |
| DPLM | Diffusion Language Models Are Versatile Protein Learners | ICML | 2024 | cites | Diffusion language modeling for sequence generation. |
| PTM-Mamba | PTM-Mamba: a PTM-aware protein language model with bidirectional gated Mamba blocks | Nature Methods | 2025 | cites | PTM-aware state-space modeling for scalable sequence learning. |

Sequence and structure modeling p(x,s)

Generative models over both sequence and structure.

| Model | Paper | Venue | Year | Citations | Notes |
| --- | --- | --- | --- | --- | --- |
| LM-GVP | LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction | Scientific Reports | 2022 | cites | Joint sequence-structure representation learning. |
| Chroma | Illuminating protein space with a programmable generative model | Nature | 2023 | cites | Programmable generative modeling over protein space. |
| RFdiffusion | De novo design of protein structure and function with RFdiffusion | Nature | 2023 | cites | Diffusion-based backbone generation with design conditioning. |
| SaProt | SaProt: Protein Language Modeling with Structure-aware Vocabulary | ICLR | 2024 | cites | Structure-tokenized sequence modeling in a unified vocabulary. |
| ProteinGenerator | Multistate and functional protein design using RoseTTAFold sequence space diffusion | Nature Biotechnology | 2024 | cites | Co-design via sequence diffusion with structure refinement. |
| Protpardelle | An all-atom protein generative model | PNAS | 2024 | cites | All-atom generative protein modeling. |
| Multiflow | Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design | ICML | 2024 | cites | Discrete flow matching for multimodal co-design. |

Inverse folding p(x|s)

Sequence generation conditioned on backbone structure.

| Model | Paper | Venue | Year | Citations | Notes |
| --- | --- | --- | --- | --- | --- |
| Structured Transformer | Generative models for graph-based protein design | NeurIPS | 2019 | cites | Graph-based inverse folding from structure to sequence. |
| GVP-GNN | Learning from Protein Structure with Geometric Vector Perceptrons | ICLR | 2021 | cites | Geometric message passing for structure-conditioned sequence tasks. |
| ProteinMPNN | Robust deep learning-based protein sequence design using ProteinMPNN | Science | 2022 | cites | State-of-the-art sequence design from backbone context. |
| ESM-IF1 | Learning inverse folding from millions of predicted structures | ICML | 2022 | cites | Inverse folding at scale from predicted structures. |
| MIF-ST | Masked inverse folding with sequence transfer for protein representation learning | Protein Engineering Design and Selection | 2022 | cites | Masked inverse folding with sequence transfer learning. |
| Knowledge-Design (KW-Design) | Knowledge-Design: Pushing the Limit of Protein Design via Knowledge Refinement | ICLR | 2024 | cites | Knowledge refinement for structure-guided sequence design. |
| CarbonDesign | Accurate and robust protein sequence design with CarbonDesign | Nature Machine Intelligence | 2024 | cites | Robust sequence design with iterative refinement. |

Multi-modal models p(x,m)

Models combining sequence/structure with text, ontology, or other modalities.

| Model | Paper | Venue | Year | Citations | Notes |
| --- | --- | --- | --- | --- | --- |
| OntoProtein | OntoProtein: Protein Pretraining With Gene Ontology Embedding | ICLR | 2022 | cites | Protein modeling with gene ontology context. |
| ProtST | ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts | ICML | 2023 | cites | Contrastive alignment of proteins and biomedical text. |
| ZymCTRL | Conditional language models enable the efficient design of proficient enzymes | Preprint / Workshop | 2024 | cites | EC-conditioned controllable generation of enzyme sequences. |
| ProTrek | ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning | Preprint / Workshop | 2024 | cites | Tri-modal contrastive navigation in protein space. |
| Prot2Text | Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers | AAAI | 2024 | cites | Protein-to-text multimodal function generation. |
| ESM-3 | Simulating 500 million years of evolution with a language model | Science | 2025 | cites | Unified multimodal track modeling (sequence, structure, and function). |
| PAIR | Boosting the predictive power of protein representations with a corpus of text annotations | Nature Machine Intelligence | 2025 | cites | Text-annotation corpora to boost protein representations. |
| ProteinDT | A text-guided protein design framework | Nature Machine Intelligence | 2025 | cites | Text-guided protein generation framework. |

Benchmarks

Protein language modeling and design. ProteinGym, FLIP, TAPE, ATOM3D

Community challenges. CAFA, CASP, CAMEO

Datasets

Sequence corpora. UniProtKB, UniRef, UniParc, BFD, MGnify

Structure corpora. RCSB PDB, AlphaFold DB, CATH, SCOPe

Functional and domain annotations. Pfam, InterPro, Gene Ontology, Swiss-Prot

Libraries

Model frameworks. ESM, Hugging Face Transformers, ProteinMPNN, OpenFold, AlphaFold

Search and structural tooling. MMseqs2, Foldseek, ColabFold

Contributing

Contributions are welcome. Please open a pull request if you would like to add:

  • new protein foundation models
  • important new utilities for proteins

Further reading

For a brief tour through these papers and the current direction of research, you may be interested in our review, /paper.pdf.

@article{bjerregaard2025foundation,
  title = {Foundation models of protein sequences: A brief overview},
  author = {Andreas Bjerregaard and Peter Mørch Groth and Søren Hauberg and Anders Krogh and Wouter Boomsma},
  url = {https://www.sciencedirect.com/science/article/pii/S0959440X25000223},
  doi = {10.1016/j.sbi.2025.103004},
  journal = {Current Opinion in Structural Biology},
  volume = {91},
  pages = {103004},
  year = {2025},
  issn = {0959-440X},
}