2023
DOI: 10.1101/2023.01.23.525232
Preprint

ProT-VAE: Protein Transformer Variational AutoEncoder for Functional Protein Design

Abstract: The data-driven design of protein sequences with desired function is challenged by the absence of good theoretical models for the sequence-function mapping and the vast size of protein sequence space. Deep generative models have demonstrated success in learning the sequence to function relationship over natural training data and sampling from this distribution to design synthetic sequences with engineered functionality. We introduce a deep generative model termed the Protein Transformer Variational AutoEncoder…

Cited by 14 publications
(17 citation statements)
references
References 60 publications
“…However, previous research has shown inconsistent levels of success with this type of model (Repecka et al., 2021; Russ et al., 2020), mostly due to its inability to learn higher-order relationships that exist in natural protein families. As an alternative, modern deep learning models, including Generative Adversarial Networks (GANs) (Goodfellow et al., 2020; Repecka et al., 2021), VAEs (Hawkins-Hooker et al., 2021; Riesselman et al., 2018; Sevgen et al., 2023; Sinai et al., 2017), and large generative protein language models (Ferruz et al., 2022; Madani et al., 2023; Nijkamp et al., 2022), have been implemented to learn the complex constraints in biological sequence design. However, previous methods have mostly focused on shorter protein sequences with a large number of members from the same family.…”
Section: Discussion
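
For readers unfamiliar with the model classes named in that statement, the sketch below shows the core of a sequence VAE: encode a fixed-length one-hot protein sequence to a latent distribution, sample with the reparameterization trick, and decode back to per-residue logits. This is a minimal illustration with made-up layer sizes, not the architecture of any model cited above.

```python
# Minimal sequence-VAE sketch (PyTorch). All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SequenceVAE(nn.Module):
    def __init__(self, seq_len=100, n_tokens=21, d_latent=16, d_hidden=256):
        super().__init__()
        self.seq_len, self.n_tokens = seq_len, n_tokens
        self.encoder = nn.Sequential(
            nn.Flatten(),                                  # (B, L, A) -> (B, L*A)
            nn.Linear(seq_len * n_tokens, d_hidden), nn.ReLU(),
        )
        self.to_mu = nn.Linear(d_hidden, d_latent)
        self.to_logvar = nn.Linear(d_hidden, d_latent)
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, seq_len * n_tokens),
        )

    def forward(self, x):                                  # x: (B, L, A) one-hot
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        logits = self.decoder(z).view(-1, self.seq_len, self.n_tokens)
        recon = nn.functional.cross_entropy(
            logits.transpose(1, 2), x.argmax(-1), reduction="mean")
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl                                  # negative ELBO to minimize
```
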
“…Concurrent works, ProT-VAE (Sevgen et al., 2023) and ReLSO (Castro et al., 2022), both involved an autoencoder and the use of a language model, but 1) neither presented results on designing proteins in the same length range as hexons, 2) both models were trained for a different objective of exploring the fitness landscape and generating functionally improved sequences, and 3) both used larger labeled datasets (ProT-VAE: 6,447 and 20,000 sequences; ReLSO: 10^10, 20^4, and 51,175 sequences). Briefly, ProT-VAE incorporated a generic CNN network for compressing and decompressing pre-trained language model (PLM) hidden states, and the protein-family-specific VAE part was trained to further reduce the protein-level representation to a single vector and reconstruct the hidden states given that vector.…”
Section: Discussion
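
The data flow described in that passage can be sketched as follows. The shapes, kernel sizes, and the mean-pooling step are assumptions made for illustration only; the actual ProT-VAE layers are specified in Sevgen et al. (2023).

```python
# Sketch of the described pipeline: frozen PLM hidden states -> shared CNN
# compression -> family-specific VAE (single-vector bottleneck) -> CNN
# decompression back to PLM-state shape. Sizes are hypothetical.
import torch
import torch.nn as nn

d_plm, d_comp, d_latent = 1024, 64, 32     # PLM width, compressed channels, latent size

compress = nn.Conv1d(d_plm, d_comp, kernel_size=5, padding=2)    # generic CNN,
decompress = nn.Conv1d(d_comp, d_plm, kernel_size=5, padding=2)  # shared across families

class FamilyVAE(nn.Module):
    """Family-specific part: pools compressed per-residue states to one protein-level
    vector, maps it to a latent code, and reconstructs the pooled representation."""
    def __init__(self, d_comp, d_latent):
        super().__init__()
        self.to_mu = nn.Linear(d_comp, d_latent)
        self.to_logvar = nn.Linear(d_comp, d_latent)
        self.decode = nn.Linear(d_latent, d_comp)

    def forward(self, h):                   # h: (B, d_comp, L)
        pooled = h.mean(dim=-1)             # (B, d_comp): single protein-level vector
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decode(z)               # reconstruction of the pooled vector

# Forward pass: states from a frozen PLM encoder in, reconstructed states out.
# (Broadcasting the pooled reconstruction over positions is a simplification.)
plm_states = torch.randn(8, 512, d_plm)                  # (B, L, d_plm)
h = compress(plm_states.transpose(1, 2))                 # (B, d_comp, L)
recon_pooled = FamilyVAE(d_comp, d_latent)(h)            # (B, d_comp)
recon_states = decompress(recon_pooled.unsqueeze(-1).expand_as(h))  # (B, d_plm, L)
```
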
“…[2-5] Two primary DGM paradigms have demonstrated substantial success in protein engineering: autoregressive (AR) language models [6-11] and variational autoencoders (VAEs) [12-20]. AR models can operate on variable-length sequences, meaning that they do not require the construction of multiple sequence alignments and can be used to learn and generate novel sequences with high variability and diverse lengths [7,8]. Since protein sequences from non-homologous families, or within homologous families with high variability, present challenges in constructing alignments [7], AR generative models are well suited for alignment-free training, prediction, and design.…”
Section: Introduction
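
The alignment-free property follows directly from the sampling loop: tokens are emitted one at a time until an end-of-sequence symbol, so no fixed length or MSA is ever required. A minimal sketch, assuming a hypothetical `model` callable that maps a token prefix to next-token logits (a stand-in, not the API of any cited model):

```python
# Variable-length autoregressive sampling from a generic AR protein LM.
import torch

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
EOS = len(AMINO_ACIDS)                     # index of the end-of-sequence token

@torch.no_grad()
def sample(model, max_len=1024, temperature=1.0):
    tokens = []
    for _ in range(max_len):
        # model: prefix of token ids -> logits over the 21-symbol vocabulary
        logits = model(torch.tensor(tokens, dtype=torch.long))
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        if nxt == EOS:                     # variable length: stop whenever EOS is drawn
            break
        tokens.append(nxt)
    return "".join(AMINO_ACIDS[t] for t in tokens)
```
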
“…[29] The ProtWave-VAE shares similarities with, but is differentiated from, a number of related approaches in the literature. The ProT-VAE model of Sevgen et al. [19] uses a VAE architecture employing a large-scale pre-trained ProtT5 encoder and decoder and has shown substantial promise for alignment-free protein design. ProtWave-VAE is distinguished by its incorporation of latent conditioning and autoregressive sampling along the decoder path, and its lightweight architecture comprising ~10^6–10^7 trainable parameters relative to ~10^9 for ProT-VAE.…”
Section: Introduction
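
A minimal sketch of the latent-conditioned autoregressive decoding idea that distinguishes ProtWave-VAE: a global latent vector is injected at every position of a causal convolution over previously generated tokens. A single causal layer and all sizes here are illustrative assumptions; the published decoder is a deeper WaveNet-style stack, as its name suggests.

```python
# Latent-conditioned causal decoder sketch (PyTorch). Sizes are illustrative.
import torch
import torch.nn as nn

class LatentConditionedARDecoder(nn.Module):
    def __init__(self, n_tokens=21, d_latent=16, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, d_model)
        self.cond = nn.Linear(d_latent, d_model)        # broadcast z over positions
        # kernel_size=2 with a left pad of 1 makes the convolution causal:
        # the output at position i sees only tokens <= i.
        self.causal = nn.Conv1d(d_model, d_model, kernel_size=2)
        self.out = nn.Linear(d_model, n_tokens)

    def forward(self, prev_tokens, z):                  # prev_tokens: (B, L), z: (B, d_latent)
        h = self.embed(prev_tokens).transpose(1, 2)     # (B, d_model, L)
        h = nn.functional.pad(h, (1, 0))                # left-pad: no peeking ahead
        h = torch.relu(self.causal(h))                  # (B, d_model, L)
        h = h + self.cond(z).unsqueeze(-1)              # latent conditioning at every step
        return self.out(h.transpose(1, 2))              # (B, L, n_tokens) next-token logits
```
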