2022 | DOI: 10.1371/journal.pcbi.1009853

Machine learning modeling of family wide enzyme-substrate specificity screens

Abstract: Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well-posed for this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather…

Cited by 60 publications (100 citation statements)
References 56 publications
“…The ESM-1b model was trained in a self-supervised fashion, i.e., 10-15% of the amino acids in a sequence were masked at random, and the model was trained to predict the identity of the masked amino acids. It has been shown that the resulting representations contain rich information about the structure and the function of the proteins [17, 30, 31]. Using the pre-trained ESM-1b model [17], we calculated these 1280-dimensional representations for all enzymes in our dataset, in the following referred to as ESM-1b vectors.…”
Section: Results (mentioning)
confidence: 99%
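
The masked-language-model training and embedding extraction described in this excerpt can be sketched with the publicly available fair-esm Python package. The snippet below is a minimal illustration, not the citing paper's actual pipeline: the example sequences are placeholders, and mean-pooling over residues is one common (assumed) way to collapse per-residue representations into a single 1280-dimensional ESM-1b vector.

```python
import torch
import esm  # fair-esm package: pip install fair-esm

# Load the pre-trained ESM-1b model (33 layers, 1280-dim representations).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Placeholder enzyme sequences; real inputs would come from the dataset.
data = [
    ("enzyme_1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),
    ("enzyme_2", "MKQLEDKVEELLSKNYHLENEVARLKKLVGER"),
]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    # Per-residue representations from the final (33rd) transformer layer.
    results = model(tokens, repr_layers=[33])
token_reprs = results["representations"][33]

# Mean-pool over residues (positions 1..len, skipping the BOS token)
# to obtain one 1280-dimensional vector per enzyme.
esm1b_vectors = [
    token_reprs[i, 1 : len(seq) + 1].mean(dim=0) for i, (_, seq) in enumerate(data)
]
print(esm1b_vectors[0].shape)  # torch.Size([1280])
```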
“…High-performing DTI prediction methods should be able to generalize broadly to unseen types of drugs and targets, while also discriminating between highly similar molecules with different binding properties. Previous work demonstrated the utility of PLMs to improve the generalizability of DTI prediction methods [6, 20]; we now add a contrastive learning approach which improves specificity. The contrastive approach taken by ConPLex is directly enabled by the architecture of the base PLM-enabled lexicographic model: to compute the triplet distance loss, the protein and drugs must be co-embedded, and the distance between them must be meaningful and simply computed.…”
Section: Discussion (mentioning)
confidence: 99%
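
The triplet distance loss mentioned in this excerpt can be illustrated compactly. The sketch below assumes proteins and drugs have already been co-embedded into one shared space; the function name, embedding dimension, and random inputs are illustrative, not ConPLex's actual implementation.

```python
import torch
import torch.nn.functional as F

def triplet_distance_loss(protein, drug_pos, drug_neg, margin=1.0):
    """Contrastive triplet loss over co-embedded proteins and drugs.

    Pushes each protein at least `margin` closer to a drug it binds
    (positive) than to a decoy (negative). This only makes sense because
    all three inputs live in the same embedding space, as the quoted
    passage emphasizes.
    """
    d_pos = F.pairwise_distance(protein, drug_pos)
    d_neg = F.pairwise_distance(protein, drug_neg)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# Illustrative usage with random co-embeddings (batch of 8, dim 256).
protein = torch.randn(8, 256)
drug_pos = torch.randn(8, 256)  # drugs known to bind each protein
drug_neg = torch.randn(8, 256)  # decoy drugs
loss = triplet_distance_loss(protein, drug_pos, drug_neg)
```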
“…Trained over hundreds of millions of protein sequences, PLMs apply the distributional hypothesis [2, 3, 18, 4] and learn a rich implicit featurization of proteins that has proven useful in a variety of tasks [12, 19, 21]. Sledzieski et al. [20], and later Goldman et al. [6] independently, have demonstrated the power of PLMs for DTI prediction. The use of pre-trained PLMs unlocks the full richness and diversity of data across the protein universe, whereas models trained solely on DTI data can leverage only the very limited percentage of protein space that has been tested experimentally for interactions.…”
Section: Introduction (mentioning)
confidence: 99%
“…Therefore, methods to directly identify substrate promiscuity from sequence alone would greatly increase the efficiency of bioprospecting for new esterase candidates, a task ideal for machine learning algorithms (see, for example, other recently developed methods for activity predictions [12]). Several studies have already predicted enzyme substrate promiscuity using molecular descriptors [13] or machine learning [14, 15, 16] approaches, although for other enzyme families. In addition, there are several differences in our approach.…”
Section: Introduction (mentioning)
confidence: 99%
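
To make the enzyme-substrate activity prediction task concrete, here is a hedged sketch of one plausible setup: concatenate a protein embedding with a molecular fingerprint of the substrate and train an off-the-shelf classifier. The helper name, random stand-in data, and classifier choice are assumptions for illustration, not the method of any paper cited above.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def substrate_fingerprint(smiles, n_bits=2048):
    """Morgan (ECFP-like) bit fingerprint for a substrate SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Random stand-ins so the sketch runs end to end: in practice, the enzyme
# embeddings (e.g., 1280-dim ESM-1b vectors) and measured activities would
# come from a screening dataset.
rng = np.random.default_rng(0)
enzyme_vecs = rng.normal(size=(100, 1280)).astype(np.float32)
smiles_list = ["CCO", "CC(=O)O"] * 50   # placeholder substrates
labels = rng.integers(0, 2, size=100)   # 1 = active, 0 = inactive

# One feature row per enzyme-substrate pair: [enzyme embedding | fingerprint].
X = np.hstack(
    [enzyme_vecs, np.stack([substrate_fingerprint(s) for s in smiles_list])]
)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
```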