2020
DOI: 10.1101/2020.01.07.896928
Preprint

Representation of k-mer sets using spectrum-preserving string sets

Abstract: Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn grap…
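To make the SPSS idea concrete, here is a minimal sketch in Python. It assumes a simplified setting: k-mers are plain strings over ACGT, reverse complements are ignored, and an SPSS is taken to be a set of strings of length ≥ k whose k-mers are exactly the input set, each occurring once. The function names (`spectrum`, `is_spss`, `greedy_spss`) are hypothetical, and the greedy extension shown is only an illustration of merging k-mers along (k-1)-overlaps; it is not the UST or simplitig implementation and makes no claim of producing a smallest SPSS.

```python
from typing import Iterable, List, Set


def spectrum(strings: Iterable[str], k: int) -> Set[str]:
    """Return the set of all k-mers occurring in the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def is_spss(strings: List[str], kmers: Set[str], k: int) -> bool:
    """Check that `strings` contains exactly `kmers`, each k-mer once."""
    if any(len(s) < k for s in strings):
        return False
    occurrences = sum(len(s) - k + 1 for s in strings)
    return spectrum(strings, k) == kmers and occurrences == len(kmers)


def greedy_spss(kmers: Set[str], k: int) -> List[str]:
    """Greedily merge k-mers along (k-1)-overlaps (illustrative only)."""
    remaining = set(kmers)
    result = []
    while remaining:
        s = min(remaining)  # deterministic choice of seed k-mer
        remaining.remove(s)
        # Extend to the right while an unused overlapping k-mer exists.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                nxt = s[len(s) - k + 1:] + c
                if nxt in remaining:
                    remaining.remove(nxt)
                    s += c
                    extended = True
                    break
        # Extend to the left in the same way.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                prev = c + s[:k - 1]
                if prev in remaining:
                    remaining.remove(prev)
                    s = c + s
                    extended = True
                    break
        result.append(s)
    return result


kmers = {"ACG", "CGT", "GTA", "TTT"}
strings = greedy_spss(kmers, 3)
print(strings, is_spss(strings, kmers, 3))
```

In this toy input, storing the four 3-mers separately takes 12 nucleotides, while the merged strings take 8; the same check also accepts the k-mer set itself as a trivial SPSS.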

Citations: cited by 6 publications (21 citation statements)
References: 51 publications (73 reference statements)
“…Lastly we show that our index can also be used on other SPSS than the unitigs of the compacted de Bruijn graph. We compared the index performances on raw de Bruijn graph and using UST [21] showing a significant gain in performance using this approach. All experiments were performed on a single cluster node running with Intel(R) Xeon(R) CPU E5-2420 @ 1.90GHz with 192GB of RAM and Ubuntu 16.04.…”
Section: Results (mentioning)
confidence: 99%
“…Indexing k-mers is closely tangled to the notion of the representation of a k-mer set. Recently, spectrum-preserving string sets (SPSS) [21] were defined as an exact representation of a multiset of k-mers coming from a set of strings of length ≥ k. In the literature, SPSS indeed narrow this definition to the representation of a k-mer set and forget the k-mer multiplicities. According to this definition and as noticed in Pufferfish, de Bruijn graphs are relevant SPSS since they collapse redundancy in their vertices that represent the k-mers set.…”
Section: Introduction (mentioning)
confidence: 99%
“…So far, we have reviewed the following SPSSs for a set of k-mers X: X itself, the unitigs of X, any set of super-k-mers that together contains all k-mers of X (such as the super-k-mers of the sequencing reads where X originated from), and the super-k-mers of the unitigs of X. To this list we can add the recently-introduced (and equivalent) concepts of UST and simplitigs [16,22]. They are SPSSs that aim to minimize their total number of nucleotides.…”
Section: Spectrum-preserving String Sets In Relation To K-mer Indexing (mentioning)
confidence: 99%
“…KMC [19]) Note: unlike all others, this SPSS represents the multiset of k-mers (with duplicates) from the reads, not a set of distinct k-mers
super-k-mers of unitigs: same as above, except substrings of unitigs instead of substrings of reads (e.g. BLight [17])
UST [16]: set of sequences obtained by greedily concatenating unitigs in order to minimize the total number of nucleotides in the SPSS
simplitigs [22]: similar to [16]
monotigs: a set of paths that covers the (uncompacted) de Bruijn graph such that all k-mers have an identical count-vector and minimizer
Table 1: Categories of spectrum-preserving string set schemes known from previous literature (and monotigs, introduced in this article). See also Fig.…”
Section: SPSS Scheme (mentioning)
confidence: 99%