2020
DOI: 10.1101/2020.01.16.908509
Preprint

Evolutionary context-integrated deep sequence modeling for protein engineering

Abstract: Protein engineering seeks to design proteins with improved or novel functions. Compared to rational design and directed evolution approaches, machine learning-guided approaches traverse the fitness landscape more effectively and hold the promise for accelerating engineering and reducing the experimental cost and effort. A critical challenge here is whether we are capable of predicting the function or fitness of unseen protein variants. By learning from the sequence and large-scale screening data of characteriz…
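
To make the abstract's premise concrete, here is a minimal sketch of supervised fitness prediction from sequence: train a regressor on screening data and score unseen variants. This is not the paper's model; the one-hot encoding, ridge regressor, sequences, and fitness values are illustrative placeholders.

```python
# Minimal sketch (not the paper's method): predict variant fitness from
# sequence by fitting a supervised model to screening data.
import numpy as np
from sklearn.linear_model import Ridge

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {a: i for i, a in enumerate(AA)}

def one_hot(seq: str) -> np.ndarray:
    """Flattened one-hot encoding of a fixed-length protein sequence."""
    x = np.zeros((len(seq), len(AA)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

# Hypothetical screening data: variant sequences with measured fitness.
train_seqs = ["MKTAY", "MKTAW", "MRTAY", "MKSAY"]
train_fitness = [1.00, 0.42, 1.35, 0.88]

X = np.stack([one_hot(s) for s in train_seqs])
model = Ridge(alpha=1.0).fit(X, train_fitness)

# Score unseen variants to prioritize them for the next screening round.
for variant in ["MRTAW", "MKSAW"]:
    print(variant, model.predict(one_hot(variant)[None, :])[0])
```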

Cited by 11 publications (13 citation statements; 0 supporting, 13 mentioning, 0 contrasting). References 64 publications.
“…We found that the pre-trained embeddings contain general yet biologically relevant information regarding the proteins and fine-tuning pushes the embeddings to have more specific information at the cost of generality. While there are notable differences between protein sequences and natural language corpuses [23,33], leveraging the architecture from BERT tuning can capture this biologically relevant information. Altering the architecture to include prior knowledge unique to biological sequences could further improve the embedding space.…”
Section: Discussion (mentioning)
confidence: 99%
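
The quoted statement contrasts frozen pretrained embeddings (general) with fine-tuned ones (task-specific). Below is a minimal PyTorch sketch of that distinction; the tiny BERT-style encoder, vocabulary, and sequence are illustrative stand-ins, not the cited pretrained models.

```python
# Sketch: a transformer encoder yields residue-level embeddings that are
# mean-pooled into a protein embedding, used frozen or fine-tuned.
import torch
import torch.nn as nn

VOCAB = "ACDEFGHIKLMNPQRSTVWY"
TOK = {a: i for i, a in enumerate(VOCAB)}

class TinyProteinEncoder(nn.Module):
    def __init__(self, d_model: int = 64, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(tokens))  # (batch, length, d_model)
        return h.mean(dim=1)                  # mean-pool to one vector

def encode(seq: str) -> torch.Tensor:
    return torch.tensor([[TOK[a] for a in seq]])

model = TinyProteinEncoder()

# Frozen embeddings: general-purpose features for a downstream model.
model.requires_grad_(False)
features = model(encode("MKTAYIAK"))
print(features.shape)  # torch.Size([1, 64])

# Fine-tuning: unfreeze and train end-to-end on a specific task, which
# specializes the embedding space at some cost to generality.
model.requires_grad_(True)
```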
“…Bepler and Berger (70) pretrained LSTMs on protein sequences, adding supervision from contacts to produce embeddings. Subsequent to our preprint, related works have built on its exploration of protein sequence modeling, exploring generative models (71, 72), internal representations of Transformers (73), and applications of representation learning and generative modeling such as classification (74, 75), mutational effect prediction (80), and design of sequences (76–78).…”
Section: Related Work (mentioning)
confidence: 99%
“…Other learning methods leverage multiple sequence alignments and databases of annotated genetic variants to make qualitative predictions about a mutation's effect on organismal fitness or disease, rather than making quantitative predictions of molecular phenotype [7–9]. Some recent attempts to address these limitations are difficult to implement and use because they lack available code [10, 11]. There is a current need for general, easy to use supervised learning methods that can leverage large sequence-function datasets to predict specific molecular phenotypes with the high accuracy required for protein design.…”
Section: Introduction (mentioning)
confidence: 99%