2021
DOI: 10.1016/j.cels.2020.10.007

Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning

Abstract: Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. It's challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high dimensional and correlated sequence variables, experimental sampling error and bias, and the presence of missing data. Importantly, most DMS data do not contain e…

citations
Cited by 42 publications
(35 citation statements)
references
References 55 publications
(68 reference statements)
“…In metabolic and protein engineering, computational models informed by high-throughput experimental measurements have been used to design optimized protein sequences with high protein stability 28 or identify promoter combinations driving enzymes in a metabolic pathway to maximize valuable molecule production 29,30. In the microbiome field, empirical and top-down approaches are frequently used to design microbial communities as opposed to data-driven approaches 14.…”
mentioning
confidence: 99%
“…When the space becomes too large, a layer of optimization must be added because a comprehensive screen is no longer possible [16][17][18][19]. In both cases, a key component of the ML-guided protein engineering approach is a reliance on an accurate ML-based fitness model, one that predicts protein property from protein sequence [20][21][22][23][24][25].…”
mentioning
confidence: 99%
“…The reads from the Illumina FASTQ files were mapped to the caspase reference gene using Bowtie2 22, and translated to amino acid sequences. The fitness effect of each observed amino acid substitution was estimated using a positive-unlabeled learning framework that compares sequences from the presorted population with the sorted population 23,24.…”
Section: Methods
mentioning
confidence: 99%