Preprint (2023) · DOI: 10.1101/2023.02.03.526917

Structure-informed Language Models Are Protein Designers

Abstract: This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), that have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows it wit…
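The core architectural move described here, implanting a lightweight structural adapter into an otherwise frozen pLM, can be sketched roughly as below. This is a minimal illustration under assumed dimensions and a residual cross-attention fusion, not the paper's reference implementation; `StructuralAdapter` and its arguments are hypothetical names.

```python
# Hedged sketch: a lightweight structural adapter attached to a frozen pLM.
# Dimensions, the fusion scheme, and all names are illustrative assumptions,
# not the LM-Design reference implementation.
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    """Cross-attend pLM token states to per-residue backbone features."""
    def __init__(self, d_lm: int = 1280, d_struct: int = 128, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_struct, d_lm)   # lift structure features to pLM width
        self.attn = nn.MultiheadAttention(d_lm, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_lm)

    def forward(self, lm_states: torch.Tensor, struct_feats: torch.Tensor) -> torch.Tensor:
        # lm_states:    [B, L, d_lm]     hidden states from the (frozen) pLM
        # struct_feats: [B, L, d_struct] per-residue backbone structure features
        kv = self.proj(struct_feats)
        fused, _ = self.attn(query=lm_states, key=kv, value=kv)
        return self.norm(lm_states + fused)     # residual "implant" into the pLM stack
```

In a setup like this only the adapter (plus, typically, a small output head) would be trained, which is what keeps the "structural surgery" lightweight relative to retraining the pLM itself.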

Cited by 29 publications (45 citation statements) · References 75 publications
“…We also evaluate InstructPLM on TS50 and TS500 datasets, which consist of 50 and 470 proteins and are often employed as additional benchmarks to further test generalization capability [21,33,34] beyond CATH dataset. The detailed results are shown in Table 7 in Appendix, where InstructPLM demonstrates consistent and robust performance.…”
Section: InstructPLM Designs Sequences With High Recovery (mentioning)
Confidence: 99%
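For context, the recovery reported on these benchmarks is the fraction of designed positions that reproduce the native residue. A minimal illustration (the helper below is hypothetical, written only to make the metric concrete):

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue matches the native one."""
    assert len(designed) == len(native), "sequences must be aligned and equal-length"
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# Example: one mismatch out of seven positions.
# sequence_recovery("MKTAYIA", "MKTVYIA") -> 6/7 ≈ 0.857
```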
“…Specifically, pLMs have demonstrated the capability to generate functional protein sequences according to certain conditions. For example, GPT-based pLMs such as ProGen and ProtGPT can generate proteins following homologous samples or control tags specifying protein properties; ESM-based pLMs [21][22][23] design desired protein sequences by applying or sampling from the pre-trained masked language model. However, unlike general language models which exhibit zero-shot generalization and the ability to understand user intent on a wide range of tasks through methods like instruction fine-tuning [24,25] or reinforcement learning [26,27], it still remains an open area of inquiry how pLMs can generate protein sequences following fine-grained and complex biological instructions and even simulate the evolution of life.…”
Section: Introduction (mentioning)
Confidence: 99%
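The masked-language-model design strategy attributed here to ESM-based pLMs can be illustrated with the public fair-esm package. The checkpoint, the masked positions, and the greedy argmax decoding below are assumptions for illustration, not the exact procedure of any cited method:

```python
# Hedged sketch: proposing residues at masked positions with a pre-trained
# ESM-2 masked language model (fair-esm). Checkpoint choice, positions, and
# greedy decoding are illustrative, not a cited method's exact recipe.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
_, _, tokens = batch_converter([("query", seq)])

mask_positions = [5, 6, 7]                    # 0-based positions within `seq`
for p in mask_positions:
    tokens[0, p + 1] = alphabet.mask_idx      # +1 skips the prepended BOS token

with torch.no_grad():
    logits = model(tokens)["logits"]          # [1, L + 2, vocab]

for p in mask_positions:
    aa = alphabet.get_tok(logits[0, p + 1].argmax().item())
    print(f"position {p}: proposed residue {aa}")
```

One common extension is to iterate this mask-predict loop, re-masking low-confidence positions and re-predicting, rather than filling each position once.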
“…Interestingly, such an approach was shown to outperform the supervised fine-tuning of the probability density model pretrained in an unsupervised manner [193]. Another approach [194] combines self-supervised large protein language models with a supervised structure-to-sequence predictor in a new and more general framework called LM-design that is claimed to advance the state of the art in predicting a protein sequence corresponding to a starting backbone structure, sometimes called "inverse folding". While inverse folding does not explicitly search the mutational landscape, it can be used to identify promising mutations by inputting an existing protein structure and a partially masked sequence and using the inverse folding tool to propose amino acids for the masked parts.…”
Section: Supervised Learning To Predict the Effects of Mutations (mentioning)
Confidence: 99%
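The mutation-proposal workflow described here (fix the backbone, mask part of the sequence, and let an inverse-folding model suggest residues for the masked positions) might look roughly like the sketch below. The `model.score` interface is a hypothetical placeholder, since real tools such as ProteinMPNN or ESM-IF expose their own APIs:

```python
# Hedged sketch of the mutation-proposal loop described above.
# `model.score` is a hypothetical placeholder interface, not any
# specific inverse-folding tool's API.
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutations(model, backbone_coords, native_seq: str,
                      masked_positions: List[int], top_k: int = 3
                      ) -> List[Tuple[int, str, List[str]]]:
    """Mask selected positions and let an inverse-folding model suggest residues."""
    masked_seq = list(native_seq)
    for p in masked_positions:
        masked_seq[p] = "X"                                # mask token for this sketch

    # Hypothetical call: per-position amino-acid probabilities given the backbone.
    probs = model.score(backbone_coords, "".join(masked_seq))   # shape [L, 20]

    proposals = []
    for p in masked_positions:
        ranked = sorted(zip(AMINO_ACIDS, probs[p]), key=lambda x: x[1], reverse=True)
        candidates = [aa for aa, _ in ranked[:top_k]]
        proposals.append((p, native_seq[p], candidates))   # (position, wild-type, suggestions)
    return proposals
```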
“…The CDR design protocol in IgDesign is based on the approach of combining a structure encoder and sequence decoder as proposed in LM-Design [3]. We first execute a forward pass through IgMPNN, as described above.…”
Citation type: mentioning
Confidence: 99%
“…We sample the maximum likelihood estimate of those logits in order to obtain a single tokenized sequence. We provide this sequence as input to the ESM2-3B protein language model [12] and extract the embeddings before the final projection head. We then apply a BottleNeck Adapter layer [17], in which cross-attention is computed by using the final node embeddings from IgMPNN as keys and the embeddings from ESM2-3B as queries and values.…”
Citation type: mentioning
Confidence: 99%
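The adapter wiring described across these two excerpts (a structure encoder feeding a pLM in the LM-Design style, with queries and values taken from the ESM2-3B embeddings and keys from IgMPNN's final node embeddings inside a bottleneck adapter) can be sketched as follows. The dimensions and the down/up projection layout are assumptions for illustration, not IgDesign's actual code:

```python
# Hedged sketch of a bottleneck cross-attention adapter as described above:
# queries and values from the pLM (ESM2-3B) embeddings, keys from the IgMPNN
# node embeddings. All dimensions and the layout are illustrative assumptions.
import torch
import torch.nn as nn

class BottleneckCrossAttnAdapter(nn.Module):
    def __init__(self, d_plm: int = 2560, d_struct: int = 128,
                 d_bottleneck: int = 256, n_heads: int = 8):
        super().__init__()
        self.down = nn.Linear(d_plm, d_bottleneck)         # bottleneck down-projection
        self.key_proj = nn.Linear(d_struct, d_bottleneck)  # lift node embeddings to bottleneck width
        self.attn = nn.MultiheadAttention(d_bottleneck, n_heads, batch_first=True)
        self.up = nn.Linear(d_bottleneck, d_plm)           # bottleneck up-projection
        self.norm = nn.LayerNorm(d_plm)

    def forward(self, plm_emb: torch.Tensor, node_emb: torch.Tensor) -> torch.Tensor:
        # plm_emb:  [B, L, d_plm]    ESM2-3B embeddings (queries and values)
        # node_emb: [B, L, d_struct] IgMPNN final node embeddings (keys)
        q = v = self.down(plm_emb)
        k = self.key_proj(node_emb)
        fused, _ = self.attn(query=q, key=k, value=v)
        return self.norm(plm_emb + self.up(fused))          # residual adapter output
```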