2022
DOI: 10.1038/s41467-022-32007-7

ProtGPT2 is a deep unsupervised language model for protein design

Abstract: Protein design aims to build novel proteins customized for specific purposes, thereby holding the potential to tackle many environmental and biomedical problems. Recent progress in Transformer-based architectures has enabled the implementation of language models capable of generating text with human-like capabilities. Here, motivated by this success, we describe ProtGPT2, a language model trained on the protein space that generates de novo protein sequences following the principles of natural ones. The generat…

Cited by 287 publications (257 citation statements)
References 72 publications
“…DARK3 [18] is a decoder-only model with 100M parameters trained on synthetic sequences. Following the principles of DARK3, ProtGPT2 leveraged a GPT2-like model [85] and was trained on the Uniref50 dataset [83], leading to a model able to generate proteins in unexplored regions of the natural protein space while presenting natural-like properties [16]. RITA [86] included a study on the scalability of generative Transformer models with several model-specific (e.g.…”
Section: The Deep Learning Era Of Protein Sequence and Structure Generation
confidence: 99%
“…A step towards this goal would be DL models that, when prompted with a set of biological mechanisms and industry-relevant properties (e.g., desired thermostability, aggregate viscosity, sequence length, subcellular localization, catalytic capabilities, or binding partners), output a sequence or structure satisfying the selected criteria with high precision and in a timely fashion. Whilst this may not yet be possible, we present an offline attempt (i.e., not an end-to-end solution) by combining a generative DL model producing sequences [16] with an oracle discriminator DL model that helps query generated sequences for desired properties [22].…”
Section: Introduction
confidence: 99%
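The generate-then-discriminate workflow described in that citation can be sketched in miniature. This is a toy stand-in, not the cited authors' pipeline: `generate_sequence` substitutes a uniform random sampler for a trained generative model such as ProtGPT2, and `oracle_score` substitutes a crude hydrophobicity fraction for a learned property predictor; both names and the threshold are illustrative assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def generate_sequence(length, rng):
    """Stand-in for a generative pLM: samples residues uniformly at random."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def oracle_score(seq):
    """Hypothetical discriminator: fraction of hydrophobic residues,
    a toy proxy for a trained property-prediction model."""
    hydrophobic = set("AVILMFWY")
    return sum(aa in hydrophobic for aa in seq) / len(seq)

def design(n_candidates=200, length=50, threshold=0.45, seed=0):
    """Offline generate-then-filter loop: sample candidates, then keep
    only those the oracle scores at or above the threshold."""
    rng = random.Random(seed)
    candidates = [generate_sequence(length, rng) for _ in range(n_candidates)]
    return [s for s in candidates if oracle_score(s) >= threshold]

hits = design()
```

The two models stay decoupled (hence "offline"): the discriminator only queries finished sequences rather than steering generation end to end.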
“…Recent advances in protein representations and machine learning have the potential to help accelerate orphan enzyme annotation beyond EC Class predictions. Particularly exciting are protein language models, a family of machine learning techniques adapted from the field of natural language processing (NLP) and tailored for protein sequence analysis [19][20][21] . These models take amino-acid sequences as inputs and output high-dimensional vector representations or embeddings.…”
Section: Introduction
confidence: 99%
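The sequences-in, vectors-out interface that citation describes can be illustrated with a minimal stand-in. This is a toy composition-based encoding (mean of one-hot vectors), not a learned model; a real protein language model produces high-dimensional contextual embeddings from a trained Transformer, and the function name here is an illustrative assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def embed(sequence):
    """Toy fixed-length 'embedding': the mean of one-hot vectors over the
    20-letter amino-acid alphabet (i.e., residue frequencies). A real pLM
    would instead return a contextual vector from a trained network."""
    counts = [0.0] * len(AMINO_ACIDS)
    for aa in sequence:
        counts[AMINO_ACIDS.index(aa)] += 1.0
    n = len(sequence)
    return [c / n for c in counts]

vec = embed("MKTAYIAKQR")  # any sequence maps to a fixed-length vector
```

Because every sequence maps to a vector of the same dimensionality, downstream tasks such as the orphan-enzyme annotation mentioned above can compare or classify proteins of different lengths in a common space.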