REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

Marchet, Camille; Iqbal, Zamin; Gautheret, Daniel; Salson, Mikaël; Chikhi, Rayan

doi:10.1093/bioinformatics/btaa487

Cited by 44 publications

(55 citation statements)

References 29 publications

“…We recall that in this work, we are interested in SPSS that represents a set of k-mers and will refer to them, and will not take into account multi-sets. Unitigs are one SPSS, super-k-mers of unitigs are another [14]. Two other equivalent SPSSs schemes, UST [21] and simplitigs [32], longer than unitigs, were recently independently proposed.…”

Section: Results (mentioning)

confidence: 99%

“…The challenge of indexing colored de Bruijn graphs [34] (or more generally to answer large sequence search problems as defined in [10]) have caught the interest of a community and could be a direct application of this work. For example, BLight is successfully integrated as an indexing structure in REINDEER [14], a k-mer data structure that enables the quantification of query sequences in thousands of raw read samples.…”

Section: Discussion (mentioning)

confidence: 99%

“…Based on hash tables [6] and/or filters [5], they allow to store (k-mer, value) pairs. This way, k-mers can be associated with pieces of information of any nature, for instance, with their original dataset(s) [4], or counts [14]. The presented work pertains to this latter category.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient exact associative structure for sequencing data

Marchet, Kerbiriou, Limasset (2019). Preprint. Self-cite.

Motivation: Many methods and applications in high-throughput sequence analysis share the fundamental need to associate information with words. Indexing billions of k-mers quickly becomes a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of the properties of k-mer sets to address this challenge. They exploit the overlaps shared among k-mers by using a de Bruijn graph as a compact k-mer set to provide lightweight structures.

Results: We present Blight, a static and exact index structure that associates unique identifiers with indexed k-mers, rejects alien k-mers, and scales to the largest k-mer sets with a low memory cost. The proposed index combines an extremely compact representation with very high throughput. Its construction from the de Bruijn graph sequences is efficient and does not need supplementary memory. The implementation indexes the k-mers from the human genome with 8 GB within 10 minutes and scales up to the large axolotl genome with 63 GB within 76 minutes. Furthermore, while being memory efficient, the index allows above a million queries per second on a single CPU in our experiments, and the use of multiple cores raises its throughput. Finally, we also show how the index can practically represent metagenomic and transcriptomic sequencing data to highlight its wide range of applications.

Availability: The index is implemented as a C++ library, is open source under the AGPL3 license, and is available at github.com/Malfoy/Blight. It is designed as a user-friendly library and comes with sample usage code.
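The contract described in this abstract (each indexed k-mer maps to a unique identifier, and a k-mer that was never indexed is rejected) can be illustrated with a minimal Python sketch. This is not Blight's API or its compact layout: the class `KmerIndex` and its `lookup` method are hypothetical names, the index is backed by an ordinary dictionary, and treating a k-mer and its reverse complement as the same key is an assumption made for the illustration.

```python
from typing import Dict, Iterable, Optional

# Translation table for DNA complement (assumes an uppercase ACGT alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


class KmerIndex:
    """Toy static, exact k-mer index (hypothetical; a dict, not a compact structure)."""

    def __init__(self, sequences: Iterable[str], k: int) -> None:
        self.k = k
        self._ids: Dict[str, int] = {}
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                # A new canonical k-mer receives the next free identifier.
                self._ids.setdefault(canonical(seq[i:i + k]), len(self._ids))

    def lookup(self, kmer: str) -> Optional[int]:
        """Return the identifier of an indexed k-mer, or None for an alien k-mer."""
        if len(kmer) != self.k:
            return None
        return self._ids.get(canonical(kmer))


index = KmerIndex(["ACGTT"], k=3)
print(index.lookup("ACG"))  # 0
print(index.lookup("CGT"))  # 0, reverse complement of ACG
print(index.lookup("GGG"))  # None, alien k-mer
```

The identifiers returned by such an index can then serve as offsets into a separate array of values (for example counts), which is the usual reason for wanting unique identifiers rather than stored payloads.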

“…For instance, data structures for membership queries [78] relying on unitigs [38, 40–43] could be redesigned to use simplitigs instead. In many applications, including some of the traditional alignment-free methods [13, 14], it is desirable to consider k-mers with counts, which leads to so-called weighted de Bruijn graphs [79]; a recent manuscript [80] introduced monotigs which are a form of short simplitigs to encode this information. Furthermore, multiple de Bruijn graphs are often considered simultaneously; the resulting structure is usually referred to as a colored de Bruijn graph [15] and the associated data structures have been also widely studied [41, 43, 51, 81–89].…”

Section: Discussion (mentioning)

confidence: 99%

Simplitigs as an efficient and scalable representation of de Bruijn graphs

2021

Abstract: de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
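To make the idea of simplitigs concrete, the following Python sketch greedily covers a k-mer set with strings in which every k-mer appears exactly once. It is a simplified illustration and not ProphAsm itself: it ignores reverse complements (a uni-directional model, assumed here for brevity), picks the smallest remaining k-mer as the seed so that the result is deterministic, and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def kmers_of(seq: str, k: int) -> Set[str]:
    """Return the set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_simplitigs(kmers: Iterable[str]) -> List[str]:
    """Greedily spell strings that together contain each input k-mer exactly once."""
    remaining = set(kmers)
    result = []
    while remaining:
        # Seed a new simplitig with an unused k-mer (smallest one, for determinism).
        s = min(remaining)
        remaining.remove(s)
        k = len(s)
        # Extend to the right while some unused k-mer overlaps the last k-1 bases.
        extended = True
        while extended:
            extended = False
            suffix = s[len(s) - k + 1:]
            for base in "ACGT":
                if suffix + base in remaining:
                    remaining.remove(suffix + base)
                    s += base
                    extended = True
                    break
        # Extend to the left in the same way, using the first k-1 bases.
        extended = True
        while extended:
            extended = False
            prefix = s[:k - 1]
            for base in "ACGT":
                if base + prefix in remaining:
                    remaining.remove(base + prefix)
                    s = base + s
                    extended = True
                    break
        result.append(s)
    return result


kmers = kmers_of("ACGTAC", 3)
print(greedy_simplitigs(kmers))  # ['ACGTAC']
```

Unlike unitigs, this greedy extension does not stop at branching nodes of the de Bruijn graph, which is why the resulting strings can be longer and fewer in number.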

“…The vast majority of these large-scale k-mer indexing tools are based on common building blocks, three of them being: 1) k-mer counting, which summarizes sequencing data into a set of k-mers along with their abundances, 2), k-mer matrix construction, which aggregates lists of k-mer counts over a collection of samples (e.g. as in Marchet et al (2020); Muggli et al (2019)) in the form of a k-mer/sample matrix with abundances as values, and 3) Bloom filters construction, where the k-mer presence/absence information for each sample is converted into a Bloom filter to save space and allow fast queries. Note that these building blocks are not specific to k-mer indexing tools, e.g.…”

Section: Introduction (mentioning)

confidence: 99%

kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections

Lemane, Medvedev, Chikhi, et al. (2021). Preprint. Self-cite.

When indexing large collections of sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI, ...) is to construct a collection of Bloom filters, one per sample. Each Bloom filter represents a set of k-mers that approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data: erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundance ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (1) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting hashes instead of k-mers; and (2) a novel technique that takes advantage of joint counting to preserve low-abundance k-mers present in several samples, improving the recovery of non-erroneous k-mers. In addition, our experimental results highlight that the usual yet crude filtering of low-abundance k-mers is inappropriate for complex data such as metagenomes.
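The idea of using joint counts to preserve low-abundance k-mers can be sketched as follows. This is a simplified illustration and not kmtricks's implementation: the rescue rule, the parameter names `solid_min`, `rescue_min`, and `min_supporting_samples`, and both function names are assumptions chosen for the example. A k-mer is kept in a sample if its count reaches `solid_min`, or if its count reaches the lower `rescue_min` and the k-mer reaches `solid_min` in enough other samples.

```python
from collections import Counter
from typing import Dict, List, Set


def build_matrix(samples: Dict[str, List[str]], k: int) -> Dict[str, List[int]]:
    """Jointly count k-mers: map each k-mer to its count in every sample, in sample order."""
    names = list(samples)
    matrix: Dict[str, List[int]] = {}
    for j, name in enumerate(names):
        counts = Counter(
            read[i:i + k] for read in samples[name] for i in range(len(read) - k + 1)
        )
        for kmer, c in counts.items():
            matrix.setdefault(kmer, [0] * len(names))[j] = c
    return matrix


def filter_with_rescue(
    samples: Dict[str, List[str]],
    k: int,
    solid_min: int,
    rescue_min: int,
    min_supporting_samples: int,
) -> Dict[str, Set[str]]:
    """Return, per sample, the k-mers kept after abundance filtering with rescue."""
    names = list(samples)
    kept: Dict[str, Set[str]] = {name: set() for name in names}
    for kmer, counts in build_matrix(samples, k).items():
        # Number of samples in which this k-mer passes the hard threshold.
        n_solid = sum(c >= solid_min for c in counts)
        for name, c in zip(names, counts):
            if c >= solid_min:
                kept[name].add(kmer)
            elif c >= rescue_min and n_solid >= min_supporting_samples:
                # Low-abundance here, but supported by other samples: rescue it.
                kept[name].add(kmer)
    return kept


samples = {"s1": ["ACGT", "ACGT", "ACGT"], "s2": ["ACGT"], "s3": ["TTTT"]}
kept = filter_with_rescue(samples, k=4, solid_min=2, rescue_min=1, min_supporting_samples=1)
print(kept)  # {'s1': {'ACGT'}, 's2': {'ACGT'}, 's3': set()}
```

In this toy input, ACGT occurs only once in s2 and would be dropped by a per-sample threshold of 2, but it is kept because it is solid in s1. TTTT occurs once in s3, has no support elsewhere, and is discarded.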
