2021
DOI: 10.1021/acs.jcim.0c01304

A Machine Learning Bioinformatics Method to Predict Biological Activity from Biosynthetic Gene Clusters

Abstract: Research in natural products, the genetically encoded small molecules produced by organisms in an idiosyncratic fashion, deals with molecular structure, biosynthesis, and biological activity. Bioinformatics analyses of microbial genomes can successfully reveal the genetic instructions, biosynthetic gene clusters, that produce many natural products. Genes to molecule predictions made on biosynthetic gene clusters have revealed many important new structures. There is no comparable method for genes to biological …

Cited by 46 publications (61 citation statements)
References 58 publications
“…We intend to add to this multi-omics approach: 1) BGC bioactivity (20); 2) MS/MS bioactivity, by creating a new machine learning tool to predict bioactivity straight from the MS/MS spectra, and; 3) MS/MS substructure predictions, by integrating tools like MS2LDA (37), CSI: FingerID /SIRIUS 4 (38) and MassQL (not published yet). NPClassScore (33) demonstrated that biosynthetic class can be predicted with a combination of CANOPUS and MolNetEnhancer, hence, the NPOmix users can already run the KNN version with similarity and biosynthetic class as features, a version that yielded a precision of 92.9% in the validation set.…”
Section: Discussion (mentioning)
confidence: 99%
“…We stress that a larger training dataset with more complete genomes is likely to increase the size of the validation set by adding more valid BGCs into the analysis. We were also able to combine NPOmix with in silico metabolomics tools like Dereplicator+ (20) to make new links between MS/MS spectra, BGCs, and molecular structures, as we will exemplify by brasilicardin A. This was accomplished by annotating cryptic MS/MS spectra (without a GNPS library hit and therefore not present in the GNPS database) to known BGCs (found in the MIBiG database).…”
Section: Validation and Multi-omics Dereplication: Linking Known Meta... (mentioning)
confidence: 99%
“…String information, such as DNA nucleotides and/or protein amino acid sequences, can be used to build a ML model for a particular purpose. Walker et al [65] developed a machine learning bioinformatics method for predicting a natural product's antibiotic activity directly from the sequence of its biosynthetic gene cluster. In this study, they assembled a training dataset from the MIBiG (Minimum Information about a Biosynthetic Gene Cluster) database, comprising sequences of biosynthetic gene clusters (BGCs).…”
Section: How a Machine Learns From Data And Creates A Model For A Tas... (mentioning)
confidence: 99%
“…Machine learning models reveal that most BGCs may encode antagonistic SMs. Given the abundance and prevalence of LAB BGCs in the human microbiome, we next want to study the potential bioactivities of BGC-encoding SMs. The bioactivity of SMs encoded by BGCs was recently predicted using machine learning strategies based on chemical fingerprints of predicted compound structure, protein family (PFAM) domains, and other genetic features [32][33][34]. Here, we adapted four common machine learning classifiers (logistic regression, elastic net regression, random forest, and support vector machines) to predict the bioactivities of LAB-derived SMs.…”
Section: Introduction (mentioning)
confidence: 99%
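
The classifiers mentioned in the statement above take each BGC as a fixed-length feature vector. Below is a minimal sketch of one possible encoding, assuming binary presence/absence of PFAM domains per BGC; the cited papers may instead use domain counts or additional genetic features. The BGC identifiers, the domain lists, and the `build_feature_matrix` helper are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy input: BGC identifiers mapped to the PFAM accessions
# annotated in each cluster. These values are placeholders, not real data.
bgc_domains: Dict[str, List[str]] = {
    "bgc_a": ["PF00109", "PF02801", "PF00698"],
    "bgc_b": ["PF00501", "PF00668"],
    "bgc_c": ["PF00109", "PF00501"],
}


def build_feature_matrix(
    domains_by_bgc: Dict[str, List[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return the sorted domain vocabulary and one binary row per BGC.

    Assumption: a domain is encoded as 1 if it occurs at least once in
    the BGC and 0 otherwise (copy number is ignored).
    """
    vocabulary = sorted({d for doms in domains_by_bgc.values() for d in doms})
    column = {domain: i for i, domain in enumerate(vocabulary)}
    matrix: Dict[str, List[int]] = {}
    for bgc_id, doms in domains_by_bgc.items():
        row = [0] * len(vocabulary)
        for domain in doms:
            row[column[domain]] = 1
        matrix[bgc_id] = row
    return vocabulary, matrix


vocabulary, features = build_feature_matrix(bgc_domains)
for bgc_id, row in features.items():
    print(bgc_id, row)
```

Rows built this way, paired with known activity labels, could then be passed to any of the classifier families named above, such as logistic regression or a random forest.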