2021
DOI: 10.1038/s41467-021-25756-4

Efficient generative modeling of protein sequences using simple autoregressive models

Abstract: Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computat…
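The abstract's "simple autoregressive models" factorize the sequence probability as P(a_1, …, a_L) = ∏_i P(a_i | a_1, …, a_{i-1}), so a full sequence can be generated position by position. A minimal sketch of that factorization, with randomly initialized (not trained) parameters, the variable names `h` and `J` being illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the alignment gap
q, L = len(ALPHABET), 8             # q = 21 states, toy sequence length

# Hypothetical shallow autoregressive parameters: position i is predicted
# from the one-hot encoding of positions 0..i-1 via fields h[i] and
# couplings J[i] (random here; a real model would fit these to an MSA).
h = rng.normal(0.0, 0.1, size=(L, q))
J = [rng.normal(0.0, 0.1, size=(q, i * q)) for i in range(L)]

def sample_sequence():
    """Draw one sequence from P(a) = prod_i P(a_i | a_<i)."""
    seq = []
    for i in range(L):
        context = np.zeros(i * q)
        for j, a in enumerate(seq):           # one-hot encode the prefix
            context[j * q + a] = 1.0
        logits = h[i] + (J[i] @ context if i else 0.0)
        p = np.exp(logits - logits.max())
        p /= p.sum()                           # softmax over the q states
        seq.append(int(rng.choice(q, p=p)))
    return "".join(ALPHABET[a] for a in seq)

print(sample_sequence())
```

Because each conditional is a simple (log-linear) function of the prefix, both sampling and likelihood evaluation are exact and cheap, which is the computational advantage the abstract contrasts with Boltzmann machines.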

Cited by 69 publications (108 citation statements)
References 64 publications (125 reference statements)
“…Two reasons are possible: 1) In limited datasets of functional sequences, undersampled neutral variants may appear deleterious and 2) mutations without effect on expression may be deleterious for other phenotypes contributing to protein fitness. This observation agrees with what has been observed across other protein families (30–32) where phenotypes better describing fitness also correlate better to sequence-based predictions.…”
Section: Results (supporting)
confidence: 90%
“…Pearson correlation coefficients are 0.994 and 0.995 for panels A and B, respectively, and linear fits yield intercepts of 0.01 and 0 and slopes of 0.97 and 1 for panels A and B, respectively. The two-body frequencies shown for data set 1 include a reweighting of close sequences with Hamming distances under 0.2 [6,7], since bmDCA and arDCA aim to match these reweighted frequencies [63,64]. The fraction of correctly predicted partner pairs is shown versus the number of sequence pairs AB in the training set for data generated under the minimal model, and for data generated using models inferred from this generated data, either by bmDCA [63] (panel A) or by arDCA [64] (panel B).…”
Section: Discussion (mentioning)
confidence: 99%
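The excerpt above mentions "a reweighting of close sequences with Hamming distances under 0.2", the standard DCA preprocessing step in which each sequence's weight is the inverse of its number of neighbors within that normalized distance. A minimal sketch of that reweighting (the function name and toy MSA are illustrative, not from the paper):

```python
import numpy as np

def sequence_weights(msa, theta=0.2):
    """DCA-style reweighting: each sequence gets weight
    1 / (number of sequences within normalized Hamming distance
    theta of it, counting itself)."""
    msa = np.asarray(msa)
    # pairwise normalized Hamming distances, shape (n, n)
    dist = (msa[:, None, :] != msa[None, :, :]).mean(axis=2)
    neighbors = (dist < theta).sum(axis=1)   # diagonal counts itself, so >= 1
    return 1.0 / neighbors

# toy MSA encoded as integers (rows = sequences, columns = positions)
msa = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 10],  # one mismatch: distance 0.1 to row 0
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],   # far from both other rows
]
print(sequence_weights(msa))
```

The reweighted one- and two-body frequencies are then computed with these weights, which is what bmDCA and arDCA aim to reproduce according to the excerpt. The all-pairs distance matrix here is O(n²L) and is only practical for small alignments.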
“…There are q = 21 states, namely the 20 natural amino acids and the alignment gap. We use state-of-the-art methods that have good generative properties, namely bmDCA [16,63] and arDCA [64]. In practice, we employ bmDCA with its default parameters for q = 21, and with default parameters except t_wait,0 = 1000 and Δt_0 = 100 for q = 2 (motivated by the faster equilibration observed for q = 2).…”
Section: Modeling Structural Constraints With Potts Models (mentioning)
confidence: 99%
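The Potts models this excerpt refers to assign each sequence an energy E(a) = −Σ_i h_i(a_i) − Σ_{i<j} J_ij(a_i, a_j) over q = 21 states per position; bmDCA infers the fields h and couplings J from an alignment. A minimal sketch of the energy function with random (untrained) parameters, purely to illustrate the model's form:

```python
import numpy as np

rng = np.random.default_rng(1)
q, L = 21, 6   # 20 amino acids + gap, toy sequence length

# Hypothetical Potts parameters (random here; bmDCA fits them to an MSA)
h = rng.normal(0.0, 0.1, size=(L, q))
J = rng.normal(0.0, 0.05, size=(L, L, q, q))
J = (J + J.transpose(1, 0, 3, 2)) / 2.0   # symmetrize the couplings

def energy(seq):
    """Potts energy E(a) = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j)."""
    e = -sum(h[i, a] for i, a in enumerate(seq))
    for i in range(L):
        for j in range(i + 1, L):
            e -= J[i, j, seq[i], seq[j]]
    return float(e)

seq = rng.integers(0, q, size=L)
print(energy(seq))
```

Unlike the autoregressive factorization, sampling from this distribution requires MCMC, which is why bmDCA exposes equilibration parameters such as the t_wait,0 and Δt_0 settings mentioned in the excerpt.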