2023
DOI: 10.3390/pharmaceutics15051337
|View full text |Cite
|
Sign up to set email alerts
|

Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Abstract: Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over se… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1

Citation Types

0
1
0

Year Published

2023
2023
2024
2024

Publication Types

Select...
5
2

Relationship

1
6

Authors

Journals

citations
Cited by 7 publications
(3 citation statements)
references
References 76 publications
0
1
0
Order By: Relevance
“…Thus, a DE campaign can get stuck at a local optimum, even when high fitness sequences are nearby (Figure A). To address this limitation, protein fitness prediction methods using supervised ML models have emerged to learn a mapping between protein sequences and their associated fitness values to approximate protein fitness landscapes. These models can then predict the fitnesses of previously unseen protein variants, increasing screening efficiency by evaluating proteins in silico and expanding exploration to a greater scope of sequences compared to conventional DE approaches. , At the same time, zero-shot (ZS) predictorssuch as implicit fitness constraints learned from naturally occurring protein sequences (evolutionary conservation)can also guide the prediction of protein fitness. …”
Section: Navigating Protein Fitness Landscapes Using Machine Learningmentioning
confidence: 99%
“…Thus, a DE campaign can get stuck at a local optimum, even when high fitness sequences are nearby (Figure A). To address this limitation, protein fitness prediction methods using supervised ML models have emerged to learn a mapping between protein sequences and their associated fitness values to approximate protein fitness landscapes. These models can then predict the fitnesses of previously unseen protein variants, increasing screening efficiency by evaluating proteins in silico and expanding exploration to a greater scope of sequences compared to conventional DE approaches. , At the same time, zero-shot (ZS) predictorssuch as implicit fitness constraints learned from naturally occurring protein sequences (evolutionary conservation)can also guide the prediction of protein fitness. …”
Section: Navigating Protein Fitness Landscapes Using Machine Learningmentioning
confidence: 99%
“…Biases, data imbalance (underrepresented domains, motifs, and functions), lack of relevant features, architectural limitations (e.g., incapacity of capturing complex patterns, inability to integrate diverse data sources, lack of regularization, etc. ), and poor uncertainty calibration could potentially cause models to "force" common labels from the training data when they cannot make an informed prediction for a given protein [107][108][109][110] .…”
Section: To Date Machine Learning Models Largely Fail At Discovering ...mentioning
confidence: 99%
“…Indexing amino acid features based on the amino acid index list has become a common method in bioinformatics [30]. Unified representation (UniRep) is a method that transforms any protein sequence into a fixed-length vector representation, addressing the scarcity of protein informatics data by leveraging full utilization of the original sequence [31,32]. A key feature of UniRep is the numerical encoding of oligonucleotides, allowing for comparison and analysis of all oligonucleotide pairs occurring in the sequence (including overlaps).…”
Section: Introductionmentioning
confidence: 99%