Bilingual Language Model for Protein Sequence and Structure

Heinzinger, Michael; Weissenow, Konstantin; Sanchez, Joaquin Gomez; Henkel, Adrian; Mirdita, Milot; Steinegger, Martin; Rost, Burkhard

doi:10.1101/2023.07.23.550085

Cited by 50 publications

(28 citation statements)

References 92 publications

(264 reference statements)


“…Moreover, the recent advancements in protein folding models [2,8,29] have provided structure-based models with access to extensive datasets of protein structures. This has led to a growing interest in developing pre-training models that leverage protein structure information [7,5,27].…”

Section: Structure-based Models

mentioning

confidence: 99%

“…Similarly, the recent ProtSSN [32] model leverages ESM-2 [8] embeddings as input for the EGNN [33] model, resulting in notable advancements. Both ESM-IF1 [34] and MIF-ST [35] target inverse folding, utilizing the structure to predict corresponding protein residues, whereas ProstT5 [7] focuses on the transformation between residue sequences and their structure token sequences [6] as a pre-training objective. SaProt [5] constructs a structure-aware vocabulary using structure tokens generated by foldseek [6].…”

Section: Hybrid Structure-sequence Models

mentioning

confidence: 99%

“…This approach not only decreases the risk of overfitting but also facilitates the incorporation of structure data into Transformer architectures. Such models include ProtsT5 [7] and SaProt [5], which build upon existing sequence models. For instance, SaProt extends the ESM model [8], while ProtsT5 evolves from the ProtT5 model [9].…”

Section: Introduction

mentioning

confidence: 99%


ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention

Li, Tan, et al. 2024

Preprint


Protein language models have exhibited remarkable representational capabilities in various downstream tasks, notably in the prediction of protein functions. Despite their success, these models traditionally grapple with a critical shortcoming: the absence of explicit protein structure information, which is pivotal for elucidating the relationship between protein sequences and their functionality. Addressing this gap, we introduce DeProt, a Transformer-based protein language model designed to incorporate protein sequences and structures. It was pre-trained on millions of protein structures from diverse natural protein clusters. DeProt first serializes protein structures into residue-level local-structure sequences and uses a graph-neural-network-based auto-encoder to vectorize the local structures. These vectors are then quantized into discrete structure tokens by a pre-trained codebook. Meanwhile, DeProt utilizes disentangled attention mechanisms to effectively integrate residue sequences with structure token sequences. Despite having fewer parameters and less training data, DeProt significantly outperforms other state-of-the-art (SOTA) protein language models, including those that are structure-aware and evolution-based, particularly in the task of zero-shot mutant effect prediction across 217 deep mutational scanning assays. Furthermore, DeProt exhibits robust representational capabilities across a spectrum of supervised-learning downstream tasks. Our comprehensive benchmarks underscore the innovative nature of DeProt's framework and its superior performance, suggesting its wide applicability in the realm of protein deep learning.




“…More complex architectures have also been explored [33]. Continued unsupervised training can focus models on specific protein families [34] or enrich embeddings with structural information essentially creating a bi-lingual pLM [35]. Training specialist models from scratch on smaller, specific proteins, e.g., antibodies [25,36], seems an alternative to "continued training" (train pLM on large generic data and refine on specific proteins).…”

Section: Introduction

mentioning

confidence: 99%

Fine-tuning protein language models boosts predictions across diverse tasks

Schmirler, Heinzinger, Rost 2023

Preprint

Self Cite


Prediction methods inputting embeddings from protein Language Models (pLMs) have reached or even surpassed state-of-the-art (SOTA) performance on many protein prediction tasks. In natural language processing (NLP), fine-tuning Language Models has become the de facto standard. In contrast, most pLM-based protein predictions do not back-propagate to the pLM. Here, we compared the fine-tuning of three SOTA pLMs (ESM2, ProtT5, Ankh) on eight different tasks. Two results stood out. Firstly, task-specific supervised fine-tuning almost always improved downstream predictions. Secondly, parameter-efficient fine-tuning could reach similar improvements consuming substantially fewer resources. Put simply: always fine-tune pLMs and you will mostly gain. To help you, we provided easy-to-use notebooks for parameter-efficient fine-tuning of ProtT5 for per-protein (pooling) and per-residue prediction tasks at https://github.com/agemagician/ProtTrans/tree/master/Fine-Tuning.


“…But on the user side, the computational demands of de novo structure prediction for query sequences can still limit the number of searches possible. Methods for converting amino acid sequences directly to Foldseek-compatible encodings, without full atomic coordinate prediction, can greatly expand the search space and scale accessible to the users [21,22].…”

mentioning

confidence: 99%

Domainator, a flexible software suite for domain-based annotation and neighborhood analysis, identifies proteins involved in antiviral systems

Johnson, Weigele, Fomenkov et al. 2024

Preprint


The availability of large databases of biological sequences presents an opportunity for in-depth exploration of gene diversity and function. Bacterial defense systems are a rich source of diverse but difficult-to-annotate genes with biotechnological applications. In this work, we present Domainator, a flexible and modular software suite for domain-based gene neighborhood and protein search, extraction, and clustering. We demonstrate the utility of Domainator through three examples related to bacterial defense systems. First, we cluster CRISPR-associated Rossmann fold (CARF) containing proteins with difficult-to-annotate effector domains, classifying most of them as likely transcriptional regulators and a subset as likely RNases. Second, we extract and cluster P4-like phage satellite defense hotspots and identify an abundant system related to Lamassu phage defense systems. Third, we integrate a protein language model into Domainator and use it to identify restriction enzymes with low homology to known reference sequences, validating the activity of one example in vitro. Domainator is made available as an open-source package with detailed documentation and usage examples.

