A deep learning genome-mining strategy for biosynthetic gene cluster prediction

Hannigan, Geoffrey D.; Prihoda, David; Palička, Andrej; Soukup, Jindřich; Klempíř, Ondřej; Rampula, Lena; Durcak, Jindrich; Wurst, Michael; Kotowski, Jakub; Chang, Dan; Wang, Rurun; Piizzi, Grazia; Temesi, Gergely; Hazuda, Daria J.; Woelk, Christopher H.; Bitton, Danny A.

doi:10.1093/nar/gkz654

Cited by 187 publications

(215 citation statements)

References 39 publications

Supporting

Mentioning

205

Contrasting


“…Supervised learning was shown to perform well at BGC discovery in previous work that focused on handling bacteria data [5], [6]. Given that annotated data are needed to perform a supervised learning approach, we propose here fungal BGC datasets to support the development of this approach for fungi.…”

Section: A Proposed Datasets (mentioning)

confidence: 99%

“…To generate classification models based on a supervised learning method, we extracted Pfam [22] IDs from the positive and negative instances. All datasets were converted into pfamtsv format [6], which is required as input in the supervised learning approach applied in this work. For each dataset, 80% were randomly selected for the training phase, while 20% were held out for the validation phase, as shown in Table I.…”

Section: A Proposed Datasets (mentioning)

confidence: 99%
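
The 80%/20% random split described in the statement above can be illustrated with a minimal sketch. This is not the citing authors' code: the function name, the fixed seed, and the sample identifiers are all hypothetical and chosen only for illustration.

```python
import random


def split_train_validation(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and split it into training and validation parts.

    `train_fraction` and `seed` are illustrative defaults (assumptions),
    not values taken from the cited work beyond the 80/20 ratio.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample identifiers, for illustration only.
samples = [f"sample_{i}" for i in range(10)]
train, validation = split_train_validation(samples)
print(len(train), len(validation))
```

Seeding the shuffle keeps the split reproducible between runs, which matters when models trained on different datasets are compared.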

“…In this section we describe the methods applied to analyse the performance of a supervised learning approach using the fungal BGC datasets presented in Section III-A and the test data presented in Section III-B. To generate classification models with our fungal BGC datasets, we utilized the DeepBGC system [6]. DeepBGC executable, source code and other resources are openly available.…”

Section: Classification Models (mentioning)

confidence: 99%

“…Among these resources, there are pre-built BGC classification models and word2vec-based embeddings built using Pfam IDs, referred to as pfam2vec embeddings. In [6] the authors explained that pfam2vec embeddings were trained based on a skip-gram architecture with 100 dimensions and over 15,686 unique Pfam IDs. DeepBGC classification is based on a Bidirectional Long Short Term Memory (BiLSTM) neural network, for which the inputs are pfam2vec embeddings.…”

Section: Classification Models (mentioning)

confidence: 99%
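
As a rough illustration of the skip-gram idea mentioned in the statement above, the sketch below builds (target, context) training pairs from an ordered list of domain identifiers. It shows only how such pairs are formed, not how pfam2vec or DeepBGC is implemented. The function name, the window size, and the placeholder identifiers are assumptions; the source does not state the window size.

```python
def skipgram_pairs(sequence, window=2):
    """Return (target, context) pairs for every item in `sequence`.

    Each item is paired with its neighbours up to `window` positions away.
    The window size is a hypothetical choice, not taken from the cited paper.
    """
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip pairing an item with itself
                pairs.append((target, sequence[j]))
    return pairs


# Placeholder identifiers standing in for Pfam IDs along a genome.
domains = ["PF_A", "PF_B", "PF_C"]
pairs = skipgram_pairs(domains, window=1)
print(pairs)
```

A skip-gram model would then learn one vector per identifier (100 dimensions in the cited description) by predicting the context item from the target item.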

“…Supervised learning has been previously used to predict bacterial BGCs [5], [6] and shown to perform well. Supervised learning methods however are developed primarily based on annotated datasets, for which all instances are labeled as belonging to a specific class.…”

Section: Introduction (mentioning)

confidence: 99%


Supporting supervised learning in fungal Biosynthetic Gene Cluster discovery: new benchmark datasets

Almeida; Tsang; Diallo (2019)

2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)


Fungal Biosynthetic Gene Clusters (BGCs) of secondary metabolites are clusters of genes capable of producing natural products, compounds that play an important role in the production of a wide variety of bioactive compounds, including antibiotics and pharmaceuticals. Identifying BGCs can lead to the discovery of novel natural products to benefit human health. Previous work has been focused on developing automatic tools to support BGC discovery in plants, fungi, and bacteria. Data-driven methods, as well as probabilistic and supervised learning methods have been explored in identifying BGCs. Most methods applied to identify fungal BGCs were data-driven and presented limited scope. Supervised learning methods have been shown to perform well at identifying BGCs in bacteria, and could be well suited to perform the same task in fungi. But labeled data instances are needed to perform supervised learning. Openly accessible BGC databases contain only a very small portion of previously curated fungal BGCs. Making new fungal BGC datasets available could motivate the development of supervised learning methods for fungal BGCs and potentially improve prediction performance compared to data-driven methods. In this work we propose new publicly available fungal BGC datasets to support the BGC discovery task using supervised learning. These datasets are prepared to perform binary classification and predict candidate BGC regions in fungal genomes. In addition we analyse the performance of a well supported supervised learning tool developed to predict BGCs.
