Lossless Indexing with Counting de Bruijn Graphs

Karasikov, Mikhail; Mustafa, Harun; Rätsch, Gunnar; Kahles, André

doi:10.1007/978-3-031-04749-7_34

Cited by 6 publications

(21 citation statements)

References 6 publications


“…Lastly in this section, we report that other works [17, 14] considered the multi-document version of the problem studied here, that is, how to retrieve a vector of weights for a query k-mer, where each component of the vector represents the weight of the k-mer in a distinct document. Also such count vectors are usually very “regular” (or can be made so by introducing some approximation) [17] and present runs of equal symbols that can be compressed effectively with run-length encoding (RLE).…”

Section: Related Work (mentioning)

confidence: 99%

On Weighted K-Mer Dictionaries

Pibiri

2022

Preprint


We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient, greedy, algorithm to reduce the number of runs and show empirically that it performs well, i.e., very similarly to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
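As a rough illustration of the run-based idea described in this abstract, the Python sketch below run-length encodes a list of weights and looks up a weight by position. The function names and the toy weight list are hypothetical, and the linear-scan lookup is chosen for brevity; this is not SSHash's actual encoding, which the abstract does not detail.

```python
from itertools import groupby


def rle_encode(weights):
    """Collapse consecutive equal weights into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(weights)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original weight list."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def weight_at(runs, i):
    """Return the weight at position i by scanning the runs (linear time)."""
    if i < 0:
        raise IndexError("position out of range")
    for value, length in runs:
        if i < length:
            return value
        i -= length
    raise IndexError("position out of range")


# Hypothetical weights of k-mers, listed in the order the dictionary stores them.
weights = [1, 1, 1, 2, 2, 5]
runs = rle_encode(weights)
```

The fewer runs the weight sequence has, the fewer pairs need to be stored, which is why reordering to reduce the number of runs can improve compression.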


“…At constant memory usage, adding the abundance information would yield an extremely high false-positive rate. As such, methods storing abundances mostly rely on compression by clustering abundance with neighbouring k-mers or across datasets, as Reindeer [10] or Counting de Bruijn graphs [6]. These methods do not rely on counting AMQ, but rather on exact data structures.…”

Section: Introduction (mentioning)

confidence: 99%

fimpera: drastic improvement of Approximate Membership Query data-structures with counts

Robidou

Peterlongo

2022

Preprint


Motivations: Approximate membership query data structures (AMQ) such as Cuckoo filters or Bloom filters are widely used for representing and indexing large sets of elements. AMQ can be generalized for additionally counting indexed elements, they are then called “counting AMQ”. This is for instance the case of the “counting Bloom filters”. However, counting AMQs suffer from false positive and overestimated calls.

Results: In this work we propose a novel computation method, called fimpera, consisting of a simple strategy for reducing the false-positive rate of any AMQ indexing all k-mers (words of length k) from a set of sequences, along with their abundance information. This method decreases the false-positive rate of a counting Bloom filter by an order of magnitude while reducing the number of overestimated calls, as well as lowering the average difference between the overestimated calls and the ground truth. In addition, it slightly decreases the query run time. fimpera does not require any modification of the original counting Bloom filter, it does not generate false-negative calls, and it causes no memory overhead. The unique drawback is that fimpera yields a new kind of false positives and overestimated calls. However their amount is negligible. fimpera requires a unique parameter, and its results are only little impacted when using this parameter within recommended values. As a side note, for the algorithmic needs of the method, we also propose a novel generic algorithm for finding minimal values of a sliding window over a vector of x integers in O(x) time with zero memory allocation.

Availability: https://github.com/lrobidou/fimpera
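For readers unfamiliar with the sliding-window minimum problem mentioned at the end of this abstract, here is a minimal Python sketch of the textbook monotonic-deque approach, which also runs in O(x) time. It is not the fimpera algorithm: it allocates a deque, whereas the abstract claims zero memory allocation. The function name and the example values are hypothetical.

```python
from collections import deque


def sliding_window_min(values, w):
    """Return the minimum of every length-w window of values, left to right."""
    if w <= 0:
        raise ValueError("window size must be positive")
    mins = []
    candidates = deque()  # indices whose values are strictly increasing
    for i, v in enumerate(values):
        # Drop candidates that can never be a minimum again.
        while candidates and values[candidates[-1]] >= v:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate once it falls out of the window.
        if candidates[0] <= i - w:
            candidates.popleft()
        # The first full window ends at index w - 1.
        if i >= w - 1:
            mins.append(values[candidates[0]])
    return mins


example = sliding_window_min([4, 2, 12, 3, 1, 7], 3)
```

Each index is pushed and popped at most once, which gives the linear running time.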


“…While a sequence graph by itself can be used to check for the presence or absence of a query sequence within a set, it cannot classify or profile the query without additional metadata, called graph annotations. Graph annotations are a key-value store associating each graph node with a number of annotations, where annotations can include the labels of the indexed samples [32,49,28,26], node abundances [33,44], genomic coordinates [33,1,20], geographic coordinates [32], etc. [59].…”

Section: Introduction (mentioning)

confidence: 99%

“…For jointly indexing unassembled read sets, annotated De Bruijn graphs typically scale better than variation graphs in representation size due to the collapse of shared k-mers between samples onto single graph nodes [32]. Before joint graph construction, many indexing tools [32,33,49,58,55,28,59,28] perform error correction (cleaning) on each sample to remove uncertain k-mers [62]. Thus, these joint graphs can represent the samples’ respective assembly graphs [32].…”

Section: Introduction (mentioning)

confidence: 99%

“…One of the inherent limitations of De Bruijn graphs is the intimate connection between the choice of k and the graph topology. Setting k too small leads to a graph topology where the majority of walks will spell sequences that are not present in the underlying genome and do not represent biologically-viable recombination [57] (called spurious walks [33]), while setting a large value will reduce the potential for shared substrings to collapse onto shared nodes, and hence, lead to lower coverage and more sequence content being removed during cleaning. Variable-order De Bruijn graphs [7] of maximum order k overcome this limitation by representing all observed k′-mers, where 1 < k′ ≤ k, including additional edges for transitioning between nodes of different k′ values as long as they share a common suffix.…”

Section: Introduction (mentioning)

confidence: 99%


Label-guided seed-chain-extend alignment on annotated De Bruijn graphs

Mustafa

Karasikov

Rätsch

et al. 2022

Preprint

Self Cite


The amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies' processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary graph annotation data structures to associate graph nodes with various metadata. Examples of metadata can include a node's source samples (called labels), genomic coordinates, expression levels, etc. An important property of these graphs is that the set of sequences spelled by graph walks is a superset of the set of input sequences. Thus, when aligning to graphs indexing samples derived from low-coverage sequencing sets, sequence information present in many target samples can compensate for missing information resulting from a lack of sequence coverage. Aligning a query against an entire sequence graph (as in traditional sequence-to-graph alignment) using state-of-the-art algorithms can be computationally intractable for graphs constructed from thousands of samples, potentially searching through many non-biological combinations of samples before converging on the best alignment. To address this problem, we propose a novel alignment strategy called multi-label alignment (MLA) and an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework, called MetaGraph-MLA. MLA extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content. To overcome disconnects in the graph that result from a lack of sequencing coverage, we further extend our graph index to utilize a variable-order De Bruijn graph and introduce node length change as an operation. 
In this model, traversal between nodes that share a suffix of length < k-1 acts as a proxy for inserting nodes into the graph. MetaGraph-MLA constructs an MLA of a query by chaining single-label alignments using sparse dynamic programming. We evaluate MetaGraph-MLA on simulated data against state-of-the-art sequence-to-graph aligners. We demonstrate increases in alignment lengths for simulated viral Illumina-type (by 6.5%), PacBio CLR-type (by 6.2%), and PacBio CCS-type (by 6.7%) sequencing reads, respectively, and show that the graph walks incorporated into our MLAs originate predominantly from samples of the same strain as the reads' ground-truth samples. We envision MetaGraph-MLA as a step towards establishing sequence graph tools for sequence search against a wide variety of target sequence types.

