2023
DOI: 10.1101/2023.11.08.566287
Preprint

Neural network extrapolation to distant regions of the protein fitness landscape

Sarah A Fahlberg,
Chase R Freschlin,
Pete Heinzelman
et al.

Abstract: Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks’ capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immun…

Cited by 7 publications (7 citation statements)
References 37 publications (64 reference statements)
“…We began with fitness scores of double mutants (considered at a codon level) as the target parameter to predict. Direct fitness prediction is a common task in protein engineering by ML (13, 32–34); however, in our input data we observed no strong correlation between double mutant fitness (F1,2) and respective single mutant fitness values (F1 & F2) or their sum (F1+F2) (Figure S8), indicating significant levels of epistasis. The task of direct F1,2 fitness prediction for variants with mutation sites spanning the entire myoglobin cDNA sequence was apparently too complex given our sparse double mutant training data, and models trained for direct F1,2 prediction performed poorly.…”
Section: Machine Learning Models To Predict Epistasis
confidence: 71%
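The additivity check described in the excerpt above (comparing observed double-mutant fitness F1,2 against the additive expectation F1 + F2) can be sketched as follows. The variant names and fitness values are invented for illustration; they are not data from the cited work.

```python
# Epistasis as deviation of the double mutant from additivity.
# Log-scale fitness (e.g. log enrichment scores) is assumed, so
# additive effects correspond to F12 ≈ F1 + F2.

def epistasis(f1: float, f2: float, f12: float) -> float:
    """Deviation of observed double-mutant fitness from F1 + F2."""
    return f12 - (f1 + f2)

# Toy measurements (all numbers invented).
variants = [
    ("A24G/L51P", -0.3, -0.5, -0.7),  # mild positive epistasis
    ("D40N/V54M", -0.2, -0.1, -1.9),  # strong negative epistasis
]

for name, f1, f2, f12 in variants:
    print(f"{name}: expected {f1 + f2:+.2f}, observed {f12:+.2f}, "
          f"epistasis {epistasis(f1, f2, f12):+.2f}")
```

A landscape with little epistasis would show these deviations clustered near zero; the excerpt reports no such correlation, hence the poor performance of direct F1,2 prediction.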
“…Prior work in the field has combined DMS and next-generation sequencing (NGS) and trained models to predict fitness of unseen in silico sequences (13,14). For example, NGS and DMS were combined to study epistasis using GB1 variant libraries containing single and double mutations (15).…”
confidence: 99%
“…Over 25 generations of in silico evolution, LASE-based evolution converges at the predicted fitness peak within 2 minutes, whereas the equivalent ESM-1b-based evolution converges at an equivalent predicted fitness peak after 2 hours on identical hardware (Fig 3b). Recent work on protein G and immunoglobulin G has indicated that neural networks can accurately extrapolate beyond the sequence space of the training data (63), suggesting that this in silico evolution approach may sample PTE fitness peaks not yet discovered by directed evolution.…”
Section: Results
confidence: 99%
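The in silico evolution loop described in this excerpt can be illustrated with a minimal, runnable sketch. The fitness oracle below is a toy stand-in (match count against an invented target sequence), not the LASE or ESM-1b models from the cited work, and the sequence length and mutation scheme are assumptions for illustration.

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Toy oracle: fitness = positions matching a hidden target sequence.
# A real pipeline would call a trained neural network here.
TARGET = "MKVLAT"

def fitness(seq: str) -> int:
    return sum(a == b for a, b in zip(seq, TARGET))

def evolve(seq: str, generations: int = 25,
           rng: random.Random = random.Random(0)) -> str:
    """Greedy single-mutation walk: propose one point mutation per
    generation and accept it if predicted fitness does not drop."""
    for _ in range(generations):
        pos = rng.randrange(len(seq))
        mut = seq[:pos] + rng.choice(AA) + seq[pos + 1:]
        if fitness(mut) >= fitness(seq):
            seq = mut
    return seq

best = evolve("AAAAAA")
print(best, fitness(best))
```

Because mutations are only accepted when fitness does not decrease, predicted fitness is monotonically non-decreasing over generations, which is why such walks converge at a predicted peak; the speed difference reported above comes from the cost of each oracle call, not the loop itself.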
“…It is still an open question whether supervised models can extrapolate beyond their training data to predict novel proteins (234,235). More expressive deep learning methods, such as deep kernels (236,237), could be explored as an alternative to Gaussian processes for uncertainty quantification in BO. Overall, there is significant potential to improve ML-based protein fitness prediction to help guide the search toward proteins with ideal fitness.…”
Section: Expanding the Power of ML Methods To Optimize Protein Fitness
“…At the same time, new classes of ML models should be developed for protein fitness prediction to take advantage of uncertainty and introduce helpful inductive biases for the domain. There exist methods that take advantage of inductive biases and prior information about proteins, such as the assumption that most mutation effects are additive, or the incorporation of biophysical knowledge into models as priors. Another method biases the search toward variants with fewer mutations, which are more likely to be stable and functional. Domain-specific self-supervision has been explored by training models on codons rather than amino acid sequences. There are also efforts to utilize calibrated uncertainty about the predicted fitnesses of proteins that lie outside the domain of previously screened proteins in the training set, but there is a need to expand and further test these methods in real settings. It is still an open question whether supervised models can extrapolate beyond their training data to predict novel proteins. More expressive deep learning methods, such as deep kernels, could be explored as an alternative to Gaussian processes for uncertainty quantification in BO. Overall, there is significant potential to improve ML-based protein fitness prediction to help guide the search toward proteins with ideal fitness.…”
Section: Navigating Protein Fitness Landscapes Using Machine Learning
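The Gaussian-process uncertainty quantification mentioned in these excerpts can be sketched in plain NumPy. The RBF kernel, length scale, and one-dimensional toy data below are illustrative assumptions; a deep-kernel variant would replace the fixed kernel with a learned feature map, but the posterior algebra is the same.

```python
import numpy as np

def rbf(x1, x2, length=1.0):
    """Squared-exponential (RBF) kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP via Cholesky."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(x_test, x_test)) - np.sum(v ** 2, axis=0)
    return mean, var

# Toy data: three observations of sin(x).
x = np.array([0.0, 1.0, 2.0])
y = np.sin(x)
xs = np.array([0.5, 3.0])  # one interpolation point, one extrapolation point
mu, var = gp_posterior(x, y, xs)
# Posterior variance grows away from the training data,
# which is what a BO acquisition function exploits.
```

This behavior, low variance near screened variants and high variance far from them, is exactly the calibrated uncertainty the excerpt says is needed when proposing proteins outside the training domain.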