2022
DOI: 10.1101/2022.04.27.489753
Preprint

From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools

Abstract: In metagenomic analyses of microbiomes, one of the first steps is usually the taxonomic classification of reads by comparison to a database of previously taxonomically classified genomes. While different studies comparing metagenomic taxonomic classification methods have determined that different tools are "best", there are two tools that have been used the most to date: Kraken (k-mer based classification against a user-constructed database) and MetaPhlAn (classification by alignment to clade-specific marker g…

Cited by 12 publications (22 citation statements)
References 67 publications
“…As viruses have relatively low abundance in a typical metagenomic sample, we chose to use whole genome classification, which is typically more sensitive in the low-coverage regime than methods relying on clade-specific marker genes [22,39]. Specifically, Phanta classifies reads to the lowest possible taxonomic rank by Kraken2 [22,24], a k-mer-based method that has been shown to be both fast and accurate given the correct database and optimized parameters [39]. Second, Phanta reduces false positive species by filtering out species based on a calculated proxy for genome coverage (see Methods), a known issue in taxonomic classification [40].…”
Section: Results (mentioning)
confidence: 99%
“…Boxplots display the percentage distribution across the set of metagenomes. Database abbreviations: STD = standard Kraken2 [44], UHGG = Unified Human Gastrointestinal Genome Collection [18], RefSeq = RefSeq Complete v205 [39], HumGut = HumGut [19], Phanta = Phanta’s default database. The insert shows the same information as the boxplots for STD and Phanta.…”
Section: Results (mentioning)
confidence: 99%
“…Viral reads are often unaligned for similar reasons: viral reference genomes and reference transcriptomes cannot capture the complexity of viral quasispecies (19) or the vast extent of viral polymorphism and splicing (20), and new viral assemblies are constantly being discovered and added to reference databases (21, 22). It is impossible to imagine pre-specifying a set of reference genomes or transcriptomes due to the rapid genomic changes that define the microbial world and have significant clinical impact (23) and where the use of databases limits inference (24). In plant genomics and non-model organism work, it is common to lack a reference genome entirely, making inference on differential isoform expression through alignment impossible.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the microbial world, pre-specifying a set of reference genomes is infeasible due to its inherent rapid genomic changes. References also cannot capture insertional diversity of mobile elements, which have significant phenotypic and clinical impact (11) and are only partially cataloged in references (12).…”
Section: Introduction (mentioning)
confidence: 99%