A Machine Learning Bioinformatics Method to Predict Biological Activity from Biosynthetic Gene Clusters

Walker, Allison S.; Clardy, Jon

doi:10.1021/acs.jcim.0c01304

Cited by 46 publications

(61 citation statements)

References 58 publications

“…We intend to add to this multi-omics approach: 1) BGC bioactivity (20); 2) MS/MS bioactivity, by creating a new machine learning tool to predict bioactivity straight from the MS/MS spectra; and 3) MS/MS substructure predictions, by integrating tools like MS2LDA (37), CSI:FingerID/SIRIUS 4 (38) and MassQL (not published yet). NPClassScore (33) demonstrated that biosynthetic class can be predicted with a combination of CANOPUS and MolNetEnhancer; hence, NPOmix users can already run the KNN version with similarity and biosynthetic class as features, a version that yielded a precision of 92.9% in the validation set.…”

Section: Discussion (mentioning)

confidence: 99%

“…We stress that a larger training dataset with more complete genomes is likely to increase the size of the validation set by adding more valid BGCs into the analysis. We were also able to combine NPOmix with in silico metabolomics tools like Dereplicator+ (20) to make new links between MS/MS spectra, BGCs, and molecular structures, as we will exemplify by brasilicardin A. This was accomplished by annotating cryptic MS/MS spectra (without a GNPS library hit and therefore not present in the GNPS database) to known BGCs (found in the MIBiG database).…”

Section: Validation and Multi-omics Dereplication: Linking Known Meta... (mentioning)

confidence: 99%

“…In silico predictions using raw DNA sequence and unlabeled mass fragmentation data can be used to predict: bioactivity, either by gene content or docking experiments; full planar structures, by dereplication, homology, or de novo; partial stereochemistry; and novelty, based on the abundance from all bacteria sampled from nature so far. For example, some random forest classifiers by the Clardy lab (20) at Harvard medical center use only the BGC sequences to predict with about 80% precision if a BGC will produce an anticancer, antifungal, or antibacterial metabolite. Coelichelin was isolated using an in silico structure prediction that indicated the peptide to be a siderophore, which was confirmed by culturing the producer in an iron-deficient medium and isolating the induced metabolite (overexpressed when compared to the control).…”

Section: Introduction (mentioning)

confidence: 99%

NPOmix: a machine learning classifier to connect mass spectrometry fragmentation data to biosynthetic gene clusters

Leão, Wang, Silva, et al. 2021

Preprint

Microbial natural products, in particular secondary or specialized metabolites, are an important source and inspiration for many pharmaceutical and biotechnological products. However, bioactivity-guided methods widely employed in natural product discovery programs do not explore the full biosynthetic potential of microorganisms, and they usually miss metabolites that are produced at low titer. As a complementary method, the use of genome-based mining in natural products research has facilitated the charting of many novel natural products in the form of predicted biosynthetic gene clusters that encode for their production. Linking the biosynthetic potential inferred from genomics to the specialized metabolome measured by metabolomics would accelerate natural product discovery programs. Here, we applied a supervised machine learning approach, the K-Nearest Neighbor (KNN) classifier, for systematically connecting metabolite mass spectrometry data to their biosynthetic gene clusters. This pipeline offers a method for annotating the biosynthetic genes for known, analogous to known and cryptic metabolites that are detected via mass spectrometry. We demonstrate this approach by automated linking of six different natural product mass spectra, and their analogs, to their corresponding biosynthetic genes. Our approach can be applied to bacterial, fungal, algal and plant systems where genomes are paired with corresponding MS/MS spectra. Additionally, an approach that connects known metabolites to their biosynthetic genes potentially allows for bulk production via heterologous expression and it is especially useful for cases where the metabolites are produced at low amounts in the original producer.
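
The abstract above names the K-Nearest Neighbor (KNN) classifier as the supervised method used. As a rough illustration of the general technique only, and not the NPOmix pipeline itself, a minimal pure-Python sketch could look like the following. The feature vectors, the `GCF_A`/`GCF_B` labels, the Euclidean distance, and the function names are all hypothetical choices made for this example; none of them come from the paper.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train: list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each vector stands in for a similarity profile,
# and each label for the gene cluster group it should be linked to.
train = [
    ([1.0, 0.9, 0.0], "GCF_A"),
    ([0.9, 1.0, 0.1], "GCF_A"),
    ([0.8, 0.8, 0.0], "GCF_A"),
    ([0.0, 0.1, 1.0], "GCF_B"),
    ([0.1, 0.0, 0.9], "GCF_B"),
    ([0.0, 0.2, 0.8], "GCF_B"),
]

query = [0.9, 0.8, 0.1]  # hypothetical profile for an unlabeled spectrum
predicted = knn_predict(train, query, k=3)
print(predicted)
```

With an odd `k` and two labels the vote cannot tie; a real application would need to choose `k`, the distance measure, and the features with care.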

“…String information, such as DNA nucleotides and/or protein amino acid sequences, can be used to build a ML model for a particular purpose. Walker et al. [65] developed a machine learning bioinformatics method for predicting a natural product's antibiotic activity directly from the sequence of its biosynthetic gene cluster. In this study, they assembled a training dataset from the MIBiG (Minimum Information about Biosynthetic Gene Cluster) database, comprising sequences of biosynthetic gene clusters (BGCs).…”

Section: How a Machine Learns From Data And Creates A Model For A Tas... (mentioning)

confidence: 99%

A Brief Review of Machine Learning-Based Bioactive Compound Research

et al. 2022

Bioactive compounds are often used as initial substances for many therapeutic agents. In recent years, both theoretical and practical innovations in hardware-assisted and fast-evolving machine learning (ML) have made it possible to identify desired bioactive compounds in chemical spaces, such as those in natural products (NPs). This review introduces how machine learning approaches can be used for the identification and evaluation of bioactive compounds. It also provides an overview of recent research trends in machine learning-based prediction and the evaluation of bioactive compounds by listing real-world examples along with various input data. In addition, several ML-based approaches to identify specific bioactive compounds for cardiovascular and metabolic diseases are described. Overall, these approaches are important for the discovery of novel bioactive compounds and provide new insights into the machine learning basis for various traditional applications of bioactive compound-related research.

“…Machine learning models reveal that most BGCs may encode antagonistic SMs. Given the abundance and prevalence of LAB BGCs in the human microbiome, we next want to study the potential bioactivities of BGC-encoding SMs. The bioactivity of SMs encoded by BGCs was recently predicted using machine learning strategies based on chemical fingerprints of predicted compound structure, protein family (PFAM) domains, and other genetic features [32][33][34]. Here, we adapted four common machine learning classifiers (logistic regression, elastic net regression, random forest, and support vector machines) to predict the bioactivities of LAB-derived SMs.…”

Section: Introduction (mentioning)

confidence: 99%

A systematically biosynthetic investigation of lactic acid bacteria reveals diverse antagonistic bacteriocins that potentially shape the human microbiome

Zhang, Kalimuthu, et al. 2022

Preprint

Background: Lactic acid bacteria (LAB) produce various bioactive secondary metabolites (SMs), which endow LAB with a protective role for the host. However, the biosynthetic potentials of LAB-derived SMs remain elusive, particularly in their diversity, abundance, and distribution in the human microbiome. Thus, it is still unknown to what extent LAB-derived SMs are involved in microbiome homeostasis. Results: Here, we systematically investigate the biosynthetic potential of LAB from 31,977 LAB genomes, identifying 130,051 BGCs of 2,849 gene cluster families (GCFs). Most of these GCFs are species-specific or even strain-specific and are not yet characterized. Analyzing 748 human-associated metagenomes, we gain an insight into the profile of LAB BGCs, which are highly diverse and niche-specific in the human microbiome. We discover that most LAB BGCs may encode bacteriocins with pervasive antagonistic activities predicted by machine learning models, potentially playing protective roles in the human microbiome. Class II bacteriocins, one of the most abundant and diverse LAB SMs, are particularly enriched and predominant in the vaginal microbiomes. Together with experimental validation, our metagenomic and metatranscriptomic analyses show that antagonistic class II bacteriocins potentially regulate microbial communities in the vagina, thereby contributing to microbiome homeostasis. Conclusions: Our study systematically investigates LAB biosynthetic potential and their profile in the human microbiome, linking them to the antagonistic contributions to microbiome homeostasis via omics analysis. These discoveries of the diverse and prevalent antagonistic SMs are expected to stimulate the mechanism study of LAB’s protective roles for the microbiome and host, highlighting the potential of LAB and their bacteriocins as therapeutic alternatives.
