2021
DOI: 10.1101/2021.03.04.433959
Preprint

Efficient generative modeling of protein sequences using simple autoregressive models

Abstract: Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally extremely efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower…
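For orientation, the autoregressive models the preprint refers to factorize the probability of a full sequence $(a_1, \ldots, a_L)$ position by position, with each conditional typically parameterized (in the arDCA-style form, sketched here rather than quoted from the paper) by a field and couplings to the preceding positions:

\[
P(a_1, \ldots, a_L) = \prod_{i=1}^{L} P(a_i \mid a_{i-1}, \ldots, a_1),
\qquad
P(a_i \mid a_{<i}) = \frac{\exp\!\left( h_i(a_i) + \sum_{j<i} J_{ij}(a_i, a_j) \right)}{\sum_{a} \exp\!\left( h_i(a) + \sum_{j<i} J_{ij}(a, a_j) \right)}.
\]

Each conditional is a simple softmax (multinomial logistic regression), which is what makes training and sampling cheap compared with Boltzmann-machine models, whose normalization runs over all $q^L$ possible sequences.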

Cited by 12 publications (27 citation statements). References 37 publications.
“…We observe that the performance of many different models on tasks like the prediction of mutational effects is often similar even when using very different architectures and, in addition, is close to what simple, pairwise models achieve (see e.g. [9]). It appears natural to ask then how much of the predictive performance of the more complex models like variational autoencoders is due to higher-order interactions which are inaccessible to more simple models.…”
Section: Introduction (supporting)
confidence: 57%
“…In the three right plots, the upper parts show the contacts for models extracted with the uniform distribution, the lower parts show the same for models extracted with the original model distribution. The left-most plot shows the contact predictions for ArDCA from the original method in [9].…”
Section: Vae (U/m) (mentioning)
confidence: 99%
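The contact predictions mentioned in this statement are, in DCA-type pipelines, obtained by scoring each pair of positions from the fitted couplings. A minimal sketch of the standard recipe (Frobenius norm of each coupling block followed by the average-product correction), assuming a hypothetical coupling tensor `J` of shape (L, L, q, q) as input:

```python
import numpy as np

def contact_scores(J):
    """Turn a coupling tensor J[i, j, a, b] into an (L, L) matrix of contact scores.

    J is assumed to come from a fitted pairwise / autoregressive DCA-type model;
    larger scores indicate residue pairs more likely to be in structural contact.
    """
    # Frobenius norm of every q x q coupling block
    F = np.sqrt((J ** 2).sum(axis=(2, 3)))
    np.fill_diagonal(F, 0.0)
    # Average-product correction (APC) subtracts the background coupling strength
    apc = F.mean(axis=1, keepdims=True) * F.mean(axis=0, keepdims=True) / F.mean()
    return F - apc
```

If the couplings come from an autoregressive parameterization they are only defined for positions j < i, so in practice one would first symmetrize or zero-pad the tensor; the sketch above glosses over that detail.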
“…In parallel to the evolution of CASP, the past few years have seen a significant improvement in the field of mutational outcome prediction. By leveraging the large amounts of available sequence data, several recent methods have achieved much higher accuracy than established popular approaches relying on a variety of sequence and structure-based features [173,174,175,176,177,178,179,180,181]. These approaches make the estimation of the impact of every possible substitution at every position in a protein-coding genome computationally feasible [182].…”
Section: Protein Mutations and Design (mentioning)
confidence: 99%
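The mutational-effect estimates described here are commonly computed as a log-likelihood ratio between mutant and wild-type sequences under the trained generative model. A minimal sketch, assuming a hypothetical `log_prob` callable exposed by whichever model has been fitted (the function name and signature are illustrative, not taken from any of the cited packages):

```python
def mutation_score(log_prob, wild_type, position, new_residue):
    """Log-likelihood ratio of a single substitution relative to the wild type.

    log_prob: callable returning log P(sequence) under a trained generative model.
    Negative scores are usually read as deleterious, positive as tolerated.
    """
    mutant = wild_type[:position] + new_residue + wild_type[position + 1:]
    return log_prob(mutant) - log_prob(wild_type)


def full_mutational_scan(log_prob, wild_type, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Score every possible single substitution at every position."""
    return {
        (i, aa): mutation_score(log_prob, wild_type, i, aa)
        for i in range(len(wild_type))
        for aa in alphabet
        if aa != wild_type[i]
    }
```

Because each score only requires two likelihood evaluations under a cheap model, scanning every substitution at every position of a protein-coding genome becomes computationally feasible, as the statement notes.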
“…The success of these methods lies in their ability to capture dependencies between protein residues either by explicitly estimating inter-residue (pairwise) couplings [178,179] or by implicitly accounting for global sequence contexts [174,176]. In essence, the concepts at play are no different from those implemented for protein contact prediction, suggesting that mutational outcome prediction, protein structure prediction and protein design can be unified in a common theoretical framework extracting information from protein sequences [173,172,176]. Along this line, recent works have shown that NLP models pre-trained on millions of unlabelled protein sequences can be effectively finetuned with small amount of labelled data toward accurately predicting mutational effects as well as 3D contacts [185,184,186].…”
Section: Protein Mutations and Design (mentioning)
confidence: 99%
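The "explicit inter-residue (pairwise) couplings" mentioned here are, in DCA-type models, the $J_{ij}$ terms of a Potts-model energy over the whole sequence (written schematically; sign and gauge conventions vary between papers):

\[
E(a_1, \ldots, a_L) = -\sum_{i} h_i(a_i) - \sum_{i<j} J_{ij}(a_i, a_j),
\qquad
P(a_1, \ldots, a_L) = \frac{e^{-E(a_1, \ldots, a_L)}}{Z}.
\]

Models that "implicitly account for global sequence contexts", such as variational autoencoders or protein language models, replace this fixed pairwise form with a learned, potentially higher-order dependence on the full sequence context.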
“…Design tasks tackled with deep learning include fixed backbone design (O'Connell et al., 2018; Ingraham et al., 2019; Qi and Zhang, 2020; Norn et al., 2021), antibody design (Wang et al., 2018; Saka et al., 2021; Shin et al., 2021), de novo design (Moffat and Jones, 2021), and the prediction of whether a sequence has a stable structure from sequence alone (Singer et al., 2021). A variety of neural network architectures have been used including variational autoencoders (Greener et al., 2018; Hawkins-Hooker et al., 2021), deep exploration networks (Linder et al., 2020), graph neural networks (Strokach et al., 2020), recurrent neural networks (Alley et al., 2019) and autoregressive models (Shin et al., 2021; Trinquier et al., 2021). Ultimately the hope is that faster and more accurate protein design with deep learning will lead to the design of functional proteins (Tischer et al., 2020; Caceres-Delpiano et al., 2020).…”
Section: Introduction (mentioning)
confidence: 99%