Preprint (2023) · DOI: 10.1101/2023.02.03.526917

Structure-informed Language Models Are Protein Designers

Abstract: This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), that have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows it wit…
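The core architectural move described here, implanting a lightweight structural adapter into an otherwise frozen pLM, can be sketched roughly as below. This is a minimal illustration under assumed dimensions and a residual cross-attention fusion, not the paper's reference implementation; `StructuralAdapter` and its arguments are hypothetical names.

```python
# Hedged sketch: a lightweight structural adapter attached to a frozen pLM.
# Dimensions, the fusion scheme, and all names are illustrative assumptions,
# not the LM-Design reference implementation.
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    """Cross-attend pLM token states to per-residue backbone features."""
    def __init__(self, d_lm: int = 1280, d_struct: int = 128, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_struct, d_lm)   # lift structure features to pLM width
        self.attn = nn.MultiheadAttention(d_lm, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_lm)

    def forward(self, lm_states: torch.Tensor, struct_feats: torch.Tensor) -> torch.Tensor:
        # lm_states:    [B, L, d_lm]     hidden states from the (frozen) pLM
        # struct_feats: [B, L, d_struct] per-residue backbone structure features
        kv = self.proj(struct_feats)
        fused, _ = self.attn(query=lm_states, key=kv, value=kv)
        return self.norm(lm_states + fused)     # residual "implant" into the pLM stack
```

In a setup like this only the adapter (plus, typically, a small output head) would be trained, which is what keeps the "structural surgery" lightweight relative to retraining the pLM itself.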

Cited by 29 publications (45 citation statements) · References 75 publications
“…We also evaluate InstructPLM on TS50 and TS500 datasets, which consist of 50 and 470 proteins and are often employed as additional benchmarks to further test generalization capability [21,33,34] beyond CATH dataset. The detailed results are shown in Table 7 in Appendix, where InstructPLM demonstrates consistent and robust performance.…”
Section: InstructPLM Designs Sequences With High Recovery (mentioning)
Confidence: 99%
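For context, the recovery reported on these benchmarks is the fraction of designed positions that reproduce the native residue. A minimal illustration (the helper below is hypothetical, written only to make the metric concrete):

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue matches the native one."""
    assert len(designed) == len(native), "sequences must be aligned and equal-length"
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Example: one mismatch out of seven positions.
# sequence_recovery("MKTAYIA", "MKTVYIA") -> 6/7 ≈ 0.857
```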
“…Specifically, pLMs have demonstrated the capability to generate functional protein sequences according to certain conditions. For example, GPT-based pLMs such as ProGen and ProtGPT can generate proteins following homologous samples or control tags specifying protein properties; ESM-based pLMs [21][22][23] design desired protein sequences by applying or sampling from the pre-trained masked language model. However, unlike general language models which exhibit zero-shot generalization and the ability to understand user intent on a wide range of tasks through methods like instruction fine-tuning [24,25] or reinforcement learning [26,27], it still remains an open area of inquiry how pLMs can generate protein sequences following fine-grained and complex biological instructions and even simulate the evolution of life.…”
Section: Introduction (mentioning)
Confidence: 99%
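The masked-language-model design strategy attributed here to ESM-based pLMs can be illustrated with the public fair-esm package. The checkpoint, the masked positions, and the greedy argmax decoding below are assumptions for illustration, not the exact procedure of any cited method:

```python
# Hedged sketch: proposing residues at masked positions with a pre-trained
# ESM-2 masked language model (fair-esm). Checkpoint choice, positions, and
# greedy decoding are illustrative, not a cited method's exact recipe.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
_, _, tokens = batch_converter([("query", seq)])

mask_positions = [5, 6, 7]                    # 0-based positions within `seq`
for p in mask_positions:
    tokens[0, p + 1] = alphabet.mask_idx      # +1 skips the prepended BOS token

with torch.no_grad():
    logits = model(tokens)["logits"]          # [1, L + 2, vocab]

for p in mask_positions:
    aa = alphabet.get_tok(logits[0, p + 1].argmax().item())
    print(f"position {p}: proposed residue {aa}")
```

One common extension is to iterate this mask-predict loop, re-masking low-confidence positions and re-predicting, rather than filling each position once.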
“…Interestingly, such an approach was shown to outperform the supervised fine-tuning of the probability density model pretrained in an unsupervised manner [193]. Another approach [194] combines self-supervised large protein language models with a supervised structure-to-sequence predictor in a new and more general framework called LM-design that is claimed to advance the state of the art in predicting a protein sequence corresponding to a starting backbone structure, sometimes called "inverse folding". While inverse folding does not explicitly search the mutational landscape, it can be used to identify promising mutations by inputting an existing protein structure and a partially masked sequence and using the inverse folding tool to propose amino acids for the masked parts.…”
Section: Supervised Learning To Predict the Effects of Mutations (mentioning)
Confidence: 99%
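The mutation-proposal workflow described here (fix the backbone, mask part of the sequence, and let an inverse-folding model suggest residues for the masked positions) might look roughly like the sketch below. The `model.score` interface is a hypothetical placeholder, since real tools such as ProteinMPNN or ESM-IF expose their own APIs:

```python
# Hedged sketch of the mutation-proposal loop described above.
# `model.score` is a hypothetical placeholder interface, not any
# specific inverse-folding tool's API.
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutations(model, backbone_coords, native_seq: str,
                      masked_positions: List[int], top_k: int = 3
                      ) -> List[Tuple[int, str, List[str]]]:
    """Mask selected positions and let an inverse-folding model suggest residues."""
    masked_seq = list(native_seq)
    for p in masked_positions:
        masked_seq[p] = "X"                                # mask token for this sketch

    # Hypothetical call: per-position amino-acid probabilities given the backbone.
    probs = model.score(backbone_coords, "".join(masked_seq))   # shape [L, 20]

    proposals = []
    for p in masked_positions:
        ranked = sorted(zip(AMINO_ACIDS, probs[p]), key=lambda x: x[1], reverse=True)
        candidates = [aa for aa, _ in ranked[:top_k]]
        proposals.append((p, native_seq[p], candidates))   # (position, wild-type, suggestions)
    return proposals
```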
“…The CDR design protocol in IgDesign is based on the approach of combining a structure encoder and sequence decoder as proposed in LM-Design [3]. We first execute a forward pass through IgMPNN, as described above.…”
Citation type: mentioning
Confidence: 99%
“…We sample the maximum likelihood estimate of those logits in order to obtain a single tokenized sequence. We provide this sequence as input to the ESM2-3B protein language model [12] and extract the embeddings before the final projection head. We then apply a BottleNeck Adapter layer [17], in which cross-attention is computed by using the final node embeddings from IgMPNN as keys and the embeddings from ESM2-3B as queries and values.…”
Citation type: mentioning
Confidence: 99%
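The adapter wiring described across these two excerpts (a structure encoder feeding a pLM in the LM-Design style, with queries and values taken from the ESM2-3B embeddings and keys from IgMPNN's final node embeddings inside a bottleneck adapter) can be sketched as follows. The dimensions and the down/up projection layout are assumptions for illustration, not IgDesign's actual code:

```python
# Hedged sketch of a bottleneck cross-attention adapter as described above:
# queries and values from the pLM (ESM2-3B) embeddings, keys from the IgMPNN
# node embeddings. All dimensions and the layout are illustrative assumptions.
import torch
import torch.nn as nn

class BottleneckCrossAttnAdapter(nn.Module):
    def __init__(self, d_plm: int = 2560, d_struct: int = 128,
                 d_bottleneck: int = 256, n_heads: int = 8):
        super().__init__()
        self.down = nn.Linear(d_plm, d_bottleneck)         # bottleneck down-projection
        self.key_proj = nn.Linear(d_struct, d_bottleneck)  # lift node embeddings to bottleneck width
        self.attn = nn.MultiheadAttention(d_bottleneck, n_heads, batch_first=True)
        self.up = nn.Linear(d_bottleneck, d_plm)           # bottleneck up-projection
        self.norm = nn.LayerNorm(d_plm)

    def forward(self, plm_emb: torch.Tensor, node_emb: torch.Tensor) -> torch.Tensor:
        # plm_emb:  [B, L, d_plm]    ESM2-3B embeddings (queries and values)
        # node_emb: [B, L, d_struct] IgMPNN final node embeddings (keys)
        q = v = self.down(plm_emb)
        k = self.key_proj(node_emb)
        fused, _ = self.attn(query=q, key=k, value=v)
        return self.norm(plm_emb + self.up(fused))          # residual adapter output
```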