2023
DOI: 10.1101/2023.05.15.540857
Preprint
Stability Oracle: A Structure-Based Graph-Transformer for Identifying Stabilizing Mutations

Abstract: Stabilizing proteins is a fundamental challenge in protein engineering and is almost always a prerequisite for the development of industrial and pharmaceutical biotechnologies. Here we present Stability Oracle: a structure-based graph-transformer framework that achieves state-of-the-art performance on predicting the effect of a point mutation on a protein's thermodynamic stability (∆∆G). A strength of our model is its ability to identify stabilizing mutations, which often make up a small fraction of a protein's…

Cited by 7 publications (11 citation statements)
References 75 publications
“…First, the dynamic range of the proteolysis assay is limited to ~5 kcal/mol 19, while experimental stability datasets such as our Fireprot dataset may include mutations with up to ±10 kcal/mol ΔΔG°. This means models trained on Megascale have limited capability to predict large changes in stability, a property that we also observe in other recently published models utilizing the Megascale dataset 16,26. Second, we found that surface mutations to cysteine were often observed to be highly stabilizing in the Megascale dataset, such that ThermoMPNN would heavily favor surface cysteine mutations unless omitted from the permitted residue options (Supplementary Fig.…”
Section: Discussion
confidence: 54%
“…As expected, Megascale-trained models outperformed those trained on the sparse, unbalanced Fireprot dataset, and that advantage was more dependent on the total number of mutations included (sparsity) than on the number of unique proteins in the training dataset. Another recent structure-based transfer learning method, Stability Oracle, observed similar performance boosts from both pre-training for sequence recovery and transfer learning using the larger, more robust Megascale dataset 26.…”
Section: Discussion
confidence: 82%
“…Therefore, few methods can consistently perform well in different test sets (Pucci et al, 2022; Benevenuta et al, 2023). Moreover, since most mutations lead to decreased fitness, datasets are dominated by harmful mutations, causing prediction models to overfit on predicting harmful mutations (Montanucci et al, 2019; Benevenuta et al, 2023; Diaz et al, 2023). Presently, prediction models are primarily evaluated using Pearson correlation coefficients, classification accuracy, and error, with high accuracy often stemming from the prediction of the relatively high proportion of harmful mutations in the test set (Diaz et al, 2023).…”
Section: Discussion
confidence: 99%
“…Despite some previous efforts to address these issues, including extending relevant datasets and achieving certain results (Diaz et al, 2023), the severe lack of real data obtained from biological experiments continues to plague the community. In this context, the utilization of AI models with large-scale pretraining techniques could be one effective solution to this problem.…”
Section: Discussion
confidence: 99%
“…Alternatively, we can avoid the need for labels by masking a part of the input, e.g., a residue in a protein sequence or structure, and training a model that will predict the masked part. In other words, the original data (e.g.…”
Section: Principles Of Machine Learning
confidence: 99%
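The masked-input idea quoted above is the core of masked-language-model-style self-supervised pretraining. The sketch below is a minimal, hypothetical PyTorch version for protein sequences (the toy transformer, 15% masking rate, and vocabulary are illustrative assumptions, not the setup of any model cited here): random positions are replaced with a mask token and the model is trained to recover the original residues, so the sequences themselves supply the labels.

```python
# Minimal sketch of masked-residue pretraining (the self-supervised idea
# described in the quote above). Assumes PyTorch; all hyperparameters are
# illustrative, not any cited model's actual configuration.
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = len(AA) + 1          # 20 amino acids + 1 mask token
MASK_ID = VOCAB - 1

class MaskedResidueModel(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, VOCAB)  # per-position residue logits

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def mask_batch(tokens, p=0.15):
    """Replace a random 15% of residues with the mask token; only those
    positions contribute to the loss (all others are ignored)."""
    masked = tokens.clone()
    is_masked = torch.rand_like(tokens, dtype=torch.float) < p
    masked[is_masked] = MASK_ID
    labels = tokens.clone()
    labels[~is_masked] = -100            # ignored by the cross-entropy below
    return masked, labels

model = MaskedResidueModel()
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
tokens = torch.randint(0, VOCAB - 1, (8, 50))  # 8 random toy "sequences"
masked, labels = mask_batch(tokens)
logits = model(masked)
loss = loss_fn(logits.reshape(-1, VOCAB), labels.reshape(-1))
loss.backward()   # one pretraining step: learn to recover the masked residues
```

After pretraining on unlabeled sequences in this way, the encoder can be fine-tuned on the much smaller labeled ∆∆G datasets, which is the transfer-learning pattern the statements above attribute to models such as Stability Oracle and ThermoMPNN.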