Comprehensive benchmarking and ensemble approaches for metagenomic classifiers

McIntyre, Alexa B. R.; Ounit, Rachid; Afshinnekoo, Ebrahim; Prill, Robert J.; Hénaff, Elizabeth; Alexander, Noah; Minot, Samuel S.; Danko, David; Foox, Jonathan; Ahsanuddin, Sofia; Tighe, Scott; Hasan, Nur A.; Subramanian, Poorani; Moffat, Kelly; Levy, Shawn; Lonardi, Stefano; Greenfield, Nick; Colwell, Rita R.; Rosen, Gail; Mason, Christopher E.

doi:10.1186/s13059-017-1299-7

Cited by 283 publications

(265 citation statements)

References 73 publications

Supporting

Mentioning

249

Contrasting


“…Well characterized reference standards and controls are needed to ensure mNGS assay quality and stability over time. Most available metagenomic reference materials are highly customized to specific applications (for example, ZymoBIOMICS Microbial Community Standard for microbiome analyses and bacterial and fungal metagenomics 105) and/or focused on a more limited spectrum of organisms (for example, the National Institute of Standards and Technology (NIST) reference materials for mixed microbial DNA detection, which contain only bacteria 106). Thus, these materials may not be applicable to untargeted mNGS analyses.…”

Section: Reference Standards

mentioning

confidence: 99%


Clinical metagenomics

2019


Clinical metagenomic next-generation sequencing (mNGS), the comprehensive analysis of microbial and host genetic material (DNA and RNA) in samples from patients, is rapidly moving from research to clinical laboratories. This emerging approach is changing how physicians diagnose and treat infectious disease, with applications spanning a wide range of areas, including antimicrobial resistance, the microbiome, human host gene expression (transcriptomics) and oncology. Here, we focus on the challenges of implementing mNGS in the clinical laboratory and address potential solutions for maximizing its impact on patient care and public health.



“…Customized data sets can be prepared to mimic input sequence data and expand the range of microorganisms detected through in silico analysis 37. The use of standardized reference materials and NGS data sets is also helpful in comparative evaluation of different bioinformatics pipelines 105.…”

Section: Bioinformatics Challenges

mentioning

confidence: 99%

Clinical metagenomics

2019



“…Taxonomic classification is usually one of the first steps in a metagenomic pipeline [11]. Once these organisms are identified, they are then used in downstream analyses, such as alpha/beta diversity measures, ordination, feature selection, phenotype classification, etc. However, while many methods have been proposed for taxonomic classification [12,13], the accuracy of these methods using different training databases has not been fully tested. This is an important issue, because as new genome data are generated, training data sets, such as the commonly used the NCBI Reference Sequence Database (RefSeq) will change over time.…”

mentioning

confidence: 99%

Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life

Zhao, Cristian, Rosen

2019

Preprint


Current metagenomic taxonomic classifiers cannot computationally keep up with the pace of training data generated from genome sequencing projects, such as the exponentially-growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data. We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4th of the non-incremental time with no accuracy loss. In conclusion, it is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge.

Figure 1. Number of updates in the NCBI Bacteria Genome database. A: Accumulative number of genomes updates per year; B: compared with last year, the number of new updates per year.

…reads in metagenome sequencing data - uses aligners, read mappers, classifiers and other "base techniques" to solve this problem [6][7][8][9][10]. Taxonomic classification is usually one of the first steps in a metagenomic pipeline [11]. Once these organisms are identified, they are then used in downstream analyses, such as alpha/beta diversity measures, ordination, feature selection, phenotype classification, etc. However, while many methods have been proposed for taxonomic classification [12,13], the accuracy of these methods using different training databases has not been fully tested. This is an important issue, because as new genome data are generated, training data sets, such as the commonly used the NCBI Reference Sequence Database (RefSeq) will change over time. Nasko et al. [14] recently demonstrated that more reads are classified (as opposed to be assigned to an unclassified/unknown class) by the Kraken classifier with newer database versions. Nasko's analysis also suggested that changes in RefSeq over time may influence but not necessarily reduce the misclassification rate, with genus/species false positives shown to be 1% and 8% respectively. These metrics were, however, calculated only on a selection of 10 genomes. Accordingly, given that there is an ongoing rapid expansion of genomic data of microbial diversity, it is imperative to update taxonomic classifiers as new genomes/genes are discovered. Simply failing to update the model will result in lower accuracy due to incomplete knowledge. In addition, the way that most researchers tra...
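The incremental-training idea summarized in the abstract above can be illustrated with a small sketch. This is not the preprint's implementation: the class name `IncrementalKmerNB`, the choice of k = 3, Laplace smoothing, a uniform class prior, and the toy sequences are all assumptions made for illustration. It only shows why a count-based naive Bayes model can absorb a new database snapshot without revisiting earlier genomes.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class IncrementalKmerNB:
    """Toy multinomial naive Bayes over k-mer counts (hypothetical class)."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha                   # Laplace smoothing (assumed)
        self.counts = defaultdict(Counter)   # label -> k-mer counts
        self.totals = Counter()              # label -> total k-mers seen

    def partial_fit(self, genomes):
        """Add (label, sequence) pairs without revisiting earlier data."""
        for label, seq in genomes:
            km = kmers(seq, self.k)
            self.counts[label].update(km)
            self.totals[label] += len(km)

    def predict(self, read):
        """Return the label with the highest log-likelihood (uniform prior)."""
        vocab = 4 ** self.k                  # number of possible DNA k-mers
        best_label, best_score = None, -math.inf
        for label, label_counts in self.counts.items():
            denom = self.totals[label] + self.alpha * vocab
            score = sum(
                math.log((label_counts[km] + self.alpha) / denom)
                for km in kmers(read, self.k)
            )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Two made-up "database snapshots"; labels and sequences are illustrative only.
snapshot_old = [("genus_A", "ACGT" * 10)]
snapshot_new = [("genus_B", "GGCC" * 10)]

model = IncrementalKmerNB(k=3)
model.partial_fit(snapshot_old)
before = model.predict("GGCCGGCC")   # only genus_A is known at this point
model.partial_fit(snapshot_new)      # update using the new genomes only
after = model.predict("GGCCGGCC")
```

Because the model stores only additive counts, updating with a new snapshot yields the same counts as retraining on all data at once, which is one simple way a classifier of this kind can avoid a full rerun.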


“…estimation implemented in the amplicon analysis framework QIIME (Caporaso et al, 2010) which uses the three best blast hits of a read for classification of amplicon sequences. Despite its reliance on well curated target databases, LCA algorithms have been shown to be highly accurate (McIntyre et al, 2017) and, depending on the comparison tool used, computationally efficient. Current LCA implementations, however, are restricted to NGS reads or short sequences and lack the ability to classify sequences such as predicted aminoacid sequences, assembled contigs from genome and metagenome projects or the increasingly common long-read sequences produced by 3GS technologies.…”

mentioning

confidence: 99%

BASTA – Taxonomic classification of sequences and sequence bins using last common ancestor estimations

Kahlke, Ralph

2018

Methods Ecol Evol


Identification of the taxonomic origin of a DNA sequence is crucial for many sequencing projects, e.g. metagenomics studies, identification of contaminations in whole genome sequencing projects and filtering of organisms of interest in marker‐gene based community analyses. Last common ancestor algorithms are powerful approaches to estimate the taxonomy of a given sequence and have been widely used for classification of next‐generation sequencing (NGS) reads, also known as 2nd generation sequencing reads. Here, we present BASTA (https://github.com/timkahlke/BASTA), a basic sequence taxonomy annotator, which extends last common ancestor estimations from sequencing reads to any kind of nucleotide or amino acid sequence utilizing NCBI taxonomies of user‐defined best hits. BASTA can be configured to use the output of many common sequence comparison tools, e.g. BLAST and Diamond, in conjunction with either provided or user‐defined target sequence databases.
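A last common ancestor estimate over a set of best hits can be sketched in a few lines. This is a minimal illustration, not BASTA's code or interface: the function name `last_common_ancestor` is hypothetical, the lineages are made-up placeholders rather than NCBI taxonomy records, and each lineage is assumed to be a root-to-leaf list using the same ranks in the same order.

```python
def last_common_ancestor(lineages):
    """Return the longest shared root-to-leaf prefix of the given lineages.

    Each lineage is a list of taxon names ordered from the root downwards.
    An empty input, or lineages that differ at the root, give an empty list.
    """
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so ragged inputs are handled safely.
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            shared.append(taxa_at_rank[0])
        else:
            break
    return shared


# Illustrative lineages for three hypothetical best hits of one query sequence.
hits = [
    ["domain_X", "phylum_P", "family_F", "genus_G1", "species_a"],
    ["domain_X", "phylum_P", "family_F", "genus_G1", "species_b"],
    ["domain_X", "phylum_P", "family_F", "genus_G2", "species_c"],
]

lca = last_common_ancestor(hits)
```

In this toy case the hits disagree at genus level, so the query would be assigned at family level; real tools typically add further options, such as how many hits to consider, which are not modelled here.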

