2022
DOI: 10.1073/pnas.2204569119

Conformal prediction under feedback covariate shift for biomolecular design

Abstract: Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typi…

Cited by 13 publications (16 citation statements)
References 47 publications
“…An alternative technique for uncertainty quantification that could be worth exploring is conformal prediction. In particular, recent work in this area presents a solution to the bias that arises during optimization when proposals from a regressor are iteratively added to the available training data by an acquisition function (38).…”
Section: Discussion
confidence: 99%
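The bias correction referenced in the excerpt above is what the cited PNAS paper formalizes as conformal prediction under feedback covariate shift. As a rough illustration of the underlying mechanics only, the sketch below implements ordinary weighted split conformal prediction under (non-feedback) covariate shift, where each calibration score is reweighted by a likelihood ratio between the design-time and training input distributions; the feedback setting of the paper requires weights that also account for how the calibration data shape the fitted model, which this sketch does not attempt. All names (`weighted_split_conformal`, `cal_scores`, etc.) are illustrative and not taken from the paper's code.

```python
import numpy as np

def weighted_split_conformal(cal_scores, cal_weights, test_weight, alpha=0.1):
    """Return the nonconformity-score threshold q for a 1 - alpha prediction set.

    cal_scores  : nonconformity scores on held-out calibration points,
                  e.g. |y_i - mu(x_i)| for a fitted regressor mu.
    cal_weights : likelihood ratios w(x_i) = p_test(x_i) / p_train(x_i).
    test_weight : the same likelihood ratio evaluated at the test input.
    """
    # Treat the (unknown) test score as +inf and normalize the weights
    # over the calibration points plus the test point.
    scores = np.append(np.asarray(cal_scores, dtype=float), np.inf)
    weights = np.append(np.asarray(cal_weights, dtype=float), float(test_weight))
    probs = weights / weights.sum()

    # Weighted (1 - alpha)-quantile: smallest score whose cumulative
    # normalized weight reaches 1 - alpha.
    order = np.argsort(scores)
    cum = np.cumsum(probs[order])
    idx = min(np.searchsorted(cum, 1.0 - alpha), len(scores) - 1)
    return scores[order][idx]

# Usage: the interval mu(x_test) +/- q then covers y_test with probability
# >= 1 - alpha, provided the likelihood ratios are exact.
```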
“…90,231,232 There are also efforts to utilize calibrated uncertainty about predicted fitnesses of proteins that lie out of the domain of previously screened proteins from the training set, but there is a need to expand and further test these methods in real settings. 208,233 It is still an open question whether supervised models can extrapolate beyond their training data to predict novel proteins. 234,235 More expressive deep learning methods, such as deep kernels, 236,237 could be explored as an alternative to Gaussian processes for uncertainty quantification in BO.…”
Section: Expanding the Power of ML Methods to Optimize Protein Fitness
confidence: 99%
“…At the same time, new classes of ML models should be developed for protein fitness prediction to take advantage of uncertainty and introduce helpful inductive biases for the domain. There exist methods that take advantage of inductive biases and prior information about proteins, such as the assumption that most mutation effects are additive or incorporation of biophysical knowledge into models as priors. Another method biases the search toward variants with fewer mutations, which are more likely to be stable and functional. Domain-specific self-supervision has been explored by training models on codons rather than amino acid sequences. There are also efforts to utilize calibrated uncertainty about predicted fitnesses of proteins that lie out of the domain of previously screened proteins from the training set, but there is a need to expand and further test these methods in real settings. It is still an open question whether supervised models can extrapolate beyond their training data to predict novel proteins. More expressive deep learning methods, such as deep kernels, could be explored as an alternative to Gaussian processes for uncertainty quantification in BO. Overall, there is significant potential to improve ML-based protein fitness prediction to help guide the search toward proteins with ideal fitness.…”
Section: Navigating Protein Fitness Landscapes Using Machine Learning
confidence: 99%
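For the Gaussian-process route to uncertainty quantification mentioned in the excerpt above, a minimal Bayesian-optimization-style acquisition over fixed-length protein variants might look like the sketch below, using scikit-learn's GaussianProcessRegressor and an upper-confidence-bound score. The one-hot encoding, kernel choice, and function names are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """One-hot encode a fixed-length protein sequence into a flat feature vector."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def rank_by_ucb(train_seqs, train_fitness, candidate_seqs, beta=2.0):
    """Fit a GP on measured variants and rank candidates by mean + beta * std."""
    X = np.stack([one_hot(s) for s in train_seqs])
    y = np.asarray(train_fitness, dtype=float)
    gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
    gp.fit(X, y)

    Xc = np.stack([one_hot(s) for s in candidate_seqs])
    mean, std = gp.predict(Xc, return_std=True)
    ucb = mean + beta * std  # exploration bonus from the predictive std
    return [candidate_seqs[i] for i in np.argsort(-ucb)]
```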
“…Many studies have assessed the predictive performance of ML models on existing protein sequence-function datasets 11–16, but there is little work that rigorously benchmarks performance in real-world protein design scenarios with experimental validation 5,7,17. ML-guided protein design is inherently an extrapolation task that requires making predictions far beyond the training data, and evaluating models in this task is challenging due to the massive number of sequence configurations that must be searched and tested 18. In this paper, we evaluate different neural network architectures' ability to extrapolate beyond training data for protein design. We develop a general protein design framework that uses an ML model to guide an in silico search over the sequence-function landscape.…”
Section: Introduction
confidence: 99%
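The "general protein design framework" described in the excerpt above is not specified on this page; as a loose illustration of what a model-guided in silico search can look like, the sketch below performs greedy hill climbing over point mutants, scoring each candidate with a user-supplied `predict_fitness` model. The function and its interface are hypothetical and stand in for whatever trained regressor the citing work actually uses.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def greedy_design(wild_type, predict_fitness, n_rounds=10, n_mutants=50, seed=0):
    """Greedy model-guided search over single-site substitutions.

    predict_fitness : any callable mapping a sequence string to a predicted
                      fitness score (e.g. a trained regression model).
    """
    rng = random.Random(seed)
    best_seq = wild_type
    best_score = predict_fitness(wild_type)

    for _ in range(n_rounds):
        # Propose random point mutants of the current best sequence.
        proposals = []
        for _ in range(n_mutants):
            pos = rng.randrange(len(best_seq))
            mutant = best_seq[:pos] + rng.choice(AMINO_ACIDS) + best_seq[pos + 1:]
            proposals.append(mutant)

        # Keep the proposal the model scores highest, if it beats the incumbent.
        top_score, top_seq = max((predict_fitness(s), s) for s in proposals)
        if top_score > best_score:
            best_seq, best_score = top_seq, top_score

    return best_seq, best_score
```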