Although Kraken's k-mer-based approach provides fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.
Assigning taxonomic labels to sequencing reads is an important part of many computational genomics pipelines for metagenomics projects. Recent years have seen several approaches to accomplish this task in a time-efficient manner [1-3]. Kraken [4] used a memory-intensive algorithm that associates short genomic substrings (k-mers) with lowest common ancestor (LCA) taxa. Kraken and related tools like KrakenUniq [5] have proven highly efficient and accurate in other tool comparisons [6,7]. But Kraken's high memory requirements force many researchers to either use a reduced-sensitivity MiniKraken database [8,9], or to build and use many indexes over subsets of the reference sequences [10,11]. Its memory requirements can easily exceed 100 GB [7], especially when the reference data includes large eukaryotic genomes [12,13]. Here we introduce Kraken 2, which provides a major reduction in memory usage as well as faster classification, a spaced-seed searching scheme, a translated search mode for matching in amino acid space, and continued compatibility with the Bracken [14] species-level quantification algorithm.

Kraken 2 addresses the issue of large memory requirements through two changes to Kraken 1's data structures and algorithms. While Kraken 1 used a sorted list of k-mer/LCA pairs indexed by minimizers [15], Kraken 2 introduces a probabilistic, compact hash table to map minimizers to LCAs. This table uses one-third of the memory of a standard hash table, at the cost of some specificity and accuracy.
Additionally, Kraken 2 only stores minimizers (of length ℓ, ℓ ≤ k) from the reference sequence library in the data structure, whereas Kraken 1 stored all k-mers. Kraken 2's index for a reference database consisting of 9.1 Gbp of genomic sequence uses 10.6 gigabytes of memory at classification time. Kraken 1's index for the same reference set uses 72.4 gigabytes of memory for classification (Figure 1a, Supplementary Table S1). In general, a Kraken 2 database is about 15% as large as a Kraken 1 database over the same references (Supplementary Figure S1).

Kraken 2's approach is faster than Kraken 1's because only the distinct minimizers from the query (read) trigger accesses to the hash table. A similar minimizer-based approach has proven useful in accelerating read alignment [16]. Kraken 2 additionally provides a hash-based filtering approach that subsamples the set of minimizer/LCA pairs included in the table, allowing the user to specify a target hash table size; smaller hash tables yield a lower memory footprint and higher classification throughput at the expens...
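The idea of storing only minimizers and mapping each one to an LCA taxon can be illustrated with a small Python sketch. This is a toy under stated assumptions, not Kraken 2's implementation: the function names (`minimizers`, `lca`, `build_index`) are hypothetical, the minimizer of a k-mer is taken to be its lexicographically smallest ℓ-mer, a plain `dict` stands in for the compact hash table, and the taxonomy is a hand-written child-to-parent map.

```python
def minimizers(seq, k, l):
    """Return the distinct minimizers of seq.

    Assumption: the minimizer of a k-mer is its lexicographically
    smallest l-mer (l <= k). Real tools may order l-mers differently.
    """
    found = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        found.add(min(kmer[j:j + l] for j in range(k - l + 1)))
    return found


def lca(a, b, parent):
    """Lowest common ancestor of taxa a and b in a child -> parent map."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)  # the root has no parent entry
    while b not in ancestors:
        b = parent[b]
    return b


def build_index(references, k, l, parent):
    """Map each minimizer to the LCA of all taxa whose sequence contains it.

    A dict is used here for clarity; it is only a stand-in for a
    compact, probabilistic hash table.
    """
    index = {}
    for taxon, seq in references.items():
        for m in minimizers(seq, k, l):
            index[m] = lca(index[m], taxon, parent) if m in index else taxon
    return index
```

With a hypothetical taxonomy `{"sp1": "genus", "sp2": "genus", "genus": "root"}` and two short reference strings that share a prefix, minimizers found in both references map to `genus`, while a minimizer found only in one maps to that species. Because consecutive k-mers often share a minimizer, the index holds fewer keys than there are distinct k-mers, which is the source of the memory saving described above.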
Metagenomic experiments attempt to characterize microbial communities using high-throughput DNA sequencing. Identification of the microorganisms in a sample provides information about the genetic profile, population structure, and role of microorganisms within an environment. Until recently, most metagenomics studies focused on high-level characterization at the level of phyla, or alternatively sequenced the 16S ribosomal RNA gene that is present in bacterial species. As the cost of sequencing has fallen, though, metagenomics experiments have increasingly used unbiased shotgun sequencing to capture all the organisms in a sample. This approach requires a method for estimating abundance directly from the raw read data. Here we describe a fast, accurate new method that computes the abundance at the species level using the reads collected in a metagenomics experiment. Bracken (Bayesian Reestimation of Abundance after Classification with KrakEN) uses the taxonomic assignments made by Kraken, a very fast read-level classifier, along with information about the genomes themselves to estimate abundance at the species level, the genus level, or above. We demonstrate that Bracken can produce accurate species- and genus-level abundance estimates even when a sample contains multiple near-identical species.
We describe a new, highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample. Bracken (Bayesian Reestimation of Abundance after Classification with KrakEN) uses the taxonomy labels assigned by Kraken, a highly accurate metagenomics classification algorithm, to estimate the number of reads originating from each species present in a sample. Kraken classifies reads to the best matching location in the taxonomic tree, but does not estimate abundances of species. We use the Kraken database itself to derive probabilities that describe how much sequence from each genome is shared with other genomes in the database, and combine this information with the assignments for a particular sample to estimate abundance at the species level, the genus level, or above. Combined with the Kraken classifier, Bracken produces accurate species- and genus-level abundance estimates even when a sample contains multiple near-identical species.
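The re-estimation step described above can be sketched as a simple application of Bayes' rule. The following Python is a simplified illustration, not Bracken's published formulas: the function name `reestimate` and its parameters are hypothetical, the prior for each species is assumed proportional to the reads assigned directly to it, and the likelihood is an assumed per-species fraction of sequence that is shared within the genus (and therefore classified at the genus node).

```python
def reestimate(species_reads, genus_reads, shared_fraction):
    """Redistribute genus-level reads down to species.

    species_reads:   reads the classifier assigned directly to each species
    genus_reads:     reads assigned to the shared genus node
    shared_fraction: assumed fraction of each species' sequence that is
                     shared within the genus (hypothetical input, which a
                     real tool would derive from the reference database)

    Posterior weight for species s is proportional to
    prior(s) * P(genus-level assignment | s), with prior(s) taken
    here to be proportional to species_reads[s].
    """
    weights = {
        s: species_reads[s] * shared_fraction[s] for s in species_reads
    }
    total = sum(weights.values())
    if total == 0:
        # Nothing to base a redistribution on; leave the counts unchanged.
        return dict(species_reads)
    return {
        s: species_reads[s] + genus_reads * weights[s] / total
        for s in species_reads
    }
```

For example, with 60 and 20 reads assigned directly to two species, 40 reads at the genus node, and an equal shared fraction of 0.5 for both, the genus-level reads split 30 to 10, giving estimates of 90 and 30 reads. The total read count is preserved, and only the reads that were ambiguous at the species level are moved.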
Microbiome research has grown rapidly over the past decade, with a proliferation of new methods that seek to make sense of large, complex data sets. Here, we survey two of the primary types of methods for analyzing microbiome data: read classification and metagenomic assembly, and we review some of the challenges facing these methods. All of the methods rely on public genome databases, and we also discuss the content of these databases and how their quality has a direct impact on our ability to interpret a microbiome sample.