Preprint, 2023 · DOI: 10.1101/2023.07.23.550085
Bilingual Language Model for Protein Sequence and Structure

Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez et al.

Abstract: Advanced Artificial Intelligence (AI) enabled large language models (LLMs) to revolutionize Natural Language Processing (NLP). Their adaptation to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. For the first time, we can now systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve in linear strings of one-dimensional (1D) s…

Cited by 50 publications (28 citation statements) · References 92 publications (264 reference statements)
“…Moreover, the recent advancements in protein folding models [2,8,29] have provided structure-based models with access to extensive datasets of protein structures. This has led to a growing interest in developing pre-training models that leverage protein structure information [7,5,27].…”
Section: Structure-based Models (mentioning)
confidence: 99%
“…Similarly, the recent ProtSSN [32] model leverages ESM-2 [8] embeddings as input for the EGNN [33] model, resulting in notable advancements. Both ESM-IF1 [34] and MIF-ST [35] target inverse folding, utilizing the structure to predict corresponding protein residues, whereas ProstT5 [7] focuses on the transformation between residue sequences and their structure token sequences [6] as a pre-training objective. SaProt [5] constructs a structure-aware vocabulary using structure tokens generated by foldseek [6].…”
Section: Hybrid Structure-sequence Models (mentioning)
confidence: 99%
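
The citation statement above describes the preprint's core idea: translating between amino-acid sequences and Foldseek structure-token (3Di) sequences. As an illustration only, the following is a minimal, hedged Python sketch of such a translation with a sequence-to-sequence pLM via HuggingFace transformers. The model id (Rostlab/ProstT5), the <AA2fold> direction prefix, and the residue pre-processing are assumptions based on common usage, not details taken from this report; consult the official model card for the exact conventions.

# Hedged sketch: amino-acid sequence -> 3Di structure tokens with a
# sequence-to-sequence protein language model (assumed to be ProstT5).
# Model id, direction prefix, and pre-processing are assumptions; verify
# them against the model card before use.
import re
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "Rostlab/ProstT5"  # assumed HuggingFace checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).eval()

raw_seq = "MSEEKVLA"  # toy amino-acid sequence
# Assumed input convention: rare residues mapped to X, residues
# space-separated, and a prefix token selecting the translation direction.
spaced = " ".join(re.sub(r"[UZOB]", "X", raw_seq))
inputs = tokenizer("<AA2fold> " + spaced, return_tensors="pt")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=len(raw_seq) + 2)

# The decoded string should be a 3Di token sequence of roughly the same
# length as the input protein (one structure state per residue).
print(tokenizer.decode(out[0], skip_special_tokens=True))
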
“…More complex architectures have also been explored 33. Continued unsupervised training can focus models on specific protein families 34 or enrich embeddings with structural information, essentially creating a bi-lingual pLM 35. Training specialist models from scratch on smaller, specific proteins, e.g., antibodies 25,36, seems an alternative to "continued training" (train pLM on large generic data and refine on specific proteins).…”
Section: Introduction (mentioning)
confidence: 99%
“…But on the user side, the computational demands of de novo structure prediction for query sequences can still limit the number of searches possible. Methods for converting amino acid sequences directly to Foldseek-compatible encodings, without full atomic coordinate prediction, can greatly expand the search space and scale accessible to the users 21,22.…”
(mentioning)
confidence: 99%
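
To make the scaling argument in the last statement concrete, here is a small, hedged Python sketch of the downstream step: collecting model-predicted 3Di strings for many query sequences and writing them to a FASTA file that a structure-token search tool such as Foldseek could index, without per-query 3D coordinate prediction. The translate_to_3di() helper is a hypothetical placeholder for whichever sequence-to-structure-token model is used (e.g. the sketch above); it is not part of the cited work.

# Hedged sketch: bulk conversion of query sequences into a FASTA of 3Di
# tokens for structure-token-based search. translate_to_3di() is a
# hypothetical stand-in; replace it with a real sequence -> 3Di model.
from pathlib import Path


def translate_to_3di(aa_seq: str) -> str:
    # Placeholder: a real model would return one 3Di state per residue.
    return "d" * len(aa_seq)


queries = {  # toy query sequences
    "query_1": "MSEEKVLA",
    "query_2": "GSHMKTAYIAKQR",
}

with Path("queries_3di.fasta").open("w") as handle:
    for name, seq in queries.items():
        handle.write(f">{name}\n{translate_to_3di(seq)}\n")

# The resulting FASTA of structure tokens could then be indexed and searched
# with a Foldseek-style tool, sidestepping full structure prediction per query.
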