2018
DOI: 10.1186/s12859-018-2340-x

Machine Learning for detection of viral sequences in human metagenomic datasets

Abstract: Background: Detection of highly divergent or yet unknown viruses from metagenomics sequencing datasets is a major bioinformatics challenge. When human samples are sequenced, a large proportion of assembled contigs are classified as “unknown”, as conventional methods find no similarity to known sequences. We wished to explore whether machine learning algorithms using Relative Synonymous Codon Usage frequency (RSCU) could improve the detection of viral sequences in metagenomic sequencing data. Results: We trained Ran…
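For readers unfamiliar with the feature named in the abstract, the following is a minimal sketch of the standard RSCU definition: a codon's observed count divided by the mean count of all codons encoding the same amino acid. It assumes the standard genetic code and an in-frame coding sequence; the function name `rscu` is illustrative and is not taken from the paper, whose exact preprocessing is not shown here.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the standard code applies to the sequence being analysed).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def rscu(seq):
    """Return {codon: RSCU} for an in-frame DNA coding sequence.

    RSCU = observed codon count / mean count over its synonymous codons.
    Codons containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    values = {}
    for aa in set(AMINO):
        synonymous = [c for c in CODONS if CODON_TABLE[c] == aa]
        total = sum(counts[c] for c in synonymous)
        for codon in synonymous:
            # If the amino acid never occurs, report 0.0 for its codons.
            values[codon] = counts[codon] * len(synonymous) / total if total else 0.0
    return values


if __name__ == "__main__":
    example = rscu("GCTGCTGCCGCA")  # four alanine codons
    print(example["GCT"], example["GCC"], example["GCG"])
```

The resulting 64 values (or a subset excluding stop codons) can serve as a fixed-length feature vector for a classifier, which is one reason codon usage is convenient for contigs of varying length.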

Citations: Cited by 45 publications (44 citation statements)
References: References 40 publications
“…The performance of the classifiers at this point far exceeded the outcomes of other classifiers built for the same task in previous studies [11,16,17,21]. Given that the focus of these previous classifier were genomic DNA data, in order to test whether this performance amplification was entirely attributable to the greater homogeneity RNA sequences and their shorter lengths compared to genomic DNA sequences, we studied the performance of the classifier on…”
Section: Ability To Generalize Across Species
mentioning
confidence: 92%
“…In such a case, while losing information about the maximal activation (best match), we gain information about frequency - the average cannot be high if only a few good matches were found. In previous work the authors of DVF and the authors of the current article have shown that methods based on pattern frequency (k-mer counts, relative synonymous codon usage) are effective in separating viral samples from non-viral ones [24,28]. Using convolution + average is a natural extension to these pattern counting-based models.…”
mentioning
confidence: 89%
“…To compare against the performance of such methods, we also extracted k-mers from the investigated dataset and trained Random Forest (RF) classifiers on the extracted values, while keeping the same data partitioning as above. RF is a competitive machine learning algorithm for non-linearly separable classes and it has already been used on this type of datasets [23,28]. The best test performance with RF models was achieved with 6-mers and it produced test AUROC 0.875 (Fig 4).…”
Section: K-mer Models
mentioning
confidence: 99%
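The k-mer extraction step mentioned in this citation statement can be illustrated with a short sketch. This is an assumed, simplified version that counts overlapping k-mers and normalizes them to frequencies; the function name `kmer_frequencies` and the choice to skip windows containing ambiguous bases are illustrative assumptions, not details reported by the citing authors.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_frequencies(seq, k=6):
    """Return {kmer: relative frequency} over all overlapping k-mers.

    Windows containing characters other than A, C, G, T are skipped
    (an assumption made for this sketch).
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Small k keeps the printed output readable; the quoted study
    # reports its best Random Forest result with 6-mers.
    print(kmer_frequencies("ACGTAC", k=2))
```

Each sequence's frequency dictionary would then be mapped onto a fixed ordering of all 4^k possible k-mers before being passed to a classifier such as a Random Forest.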
“…In contrast, machine learning based tools have the advantage of extracting required features and encapsulating the necessary information for sequence classification in a computationally efficient model. This approach has been successfully used in the past for sequence classification problems (20). For example, DeepMicrobes (21)…”
Section: Introduction
mentioning
confidence: 99%