Preprint (2024)
DOI: 10.1101/2024.01.12.575432
TransMEP: Transfer learning on large protein language models to predict mutation effects of proteins from a small known dataset

Tilman Hoffbauer, Birgit Strodel

Abstract: Machine learning-guided optimization has become a driving force for recent improvements in protein engineering. In addition, new protein language models are learning the grammar of evolutionarily occurring sequences at large scales. This work combines both approaches to make predictions about mutational effects that support protein engineering. To this end, an easy-to-use software tool called TransMEP is developed using transfer learning by feature extraction with Gaussian process regression. A large collectio…
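The abstract names the core recipe: freeze a pretrained protein language model, use it only as a feature extractor, and fit a Gaussian process regressor on the resulting sequence embeddings so that a small mutant dataset suffices. Below is a minimal sketch of that recipe, assuming the fair-esm package for ESM-2 embeddings and scikit-learn's GaussianProcessRegressor; the backbone size, mean-pooling, kernel choice, and the toy sequences are illustrative assumptions, not TransMEP's actual configuration.

```python
# Sketch: transfer learning by feature extraction + Gaussian process regression.
# Assumes the fair-esm package (pip install fair-esm) and scikit-learn.
import numpy as np
import torch
import esm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Small ESM-2 model (6 layers) keeps the example light; TransMEP's actual
# backbone and representation layer may differ.
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed(sequences, layer=6):
    """Mean-pooled per-sequence embeddings from the frozen protein LM."""
    data = [(f"seq{i}", s) for i, s in enumerate(sequences)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        reps = model(tokens, repr_layers=[layer])["representations"][layer]
    # Average over residue positions, skipping BOS (pos 0) and EOS/padding.
    return np.stack([reps[i, 1:len(s) + 1].mean(0).numpy()
                     for i, s in enumerate(sequences)])

# Hypothetical low-N dataset: mutant sequences with measured fitness values.
train_seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIGKQR"]
train_y = np.array([0.8, 1.1, 0.3])

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
gp.fit(embed(train_seqs), train_y)

# The GP returns both a predicted effect and an uncertainty for a new mutant,
# which is what makes it useful for guiding the next round of engineering.
mean, std = gp.predict(embed(["MKTAYIAKQW"]), return_std=True)
print(mean, std)
```

Because the language model is never fine-tuned, only the GP (a handful of hyperparameters) is fit to the small labeled dataset, which is the point of the feature-extraction approach in the low-N regime.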

Cited by 2 publications (1 citation statement)
References 27 publications (58 reference statements)
“…PLMs are an important component in many existing methods for low-N protein engineering. They have been used to extract protein sequence representations [3, 74–76], for finetuning on the low-N function data [76–78], and to generate auxiliary training data in more complex models [78–80]. Other computational strategies for addressing the low-N problem include Gaussian processes [75, 81, 82], augmenting regression models with sequence-based [15, 83] or structure-based [84] scores, custom protein representations that can produce pretraining data [85], representations of proteins' 3D shape [86], meta learning [87], and contrastive finetuning [88].…”
Section: Discussion
Confidence: 99%