2016
DOI: 10.1016/j.is.2016.04.002

Universal indexes for highly repetitive document collections

Abstract: Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on r…

Cited by 30 publications (26 citation statements)
References 64 publications
“…We compared r-index with the state-of-the-art index for each compressibility measure: lzi [73,24] (z), slp [25,24] (g), rlcsa [79,80] (r), and cdawg [9] (e). We also included hyb [30,31], which combines a Lempel-Ziv index with an FM-index, with parameter M = 8, which is optimal for our experiment.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…Numerous variants of LZ77 have been developed and several widely used implementations are available (such as gzip [1]). Recently, LZ77 has been shown to be particularly effective at handling highly-repetitive data sets [31,33,28,10,4] and LZ77 compression is always at least as powerful as any grammar representation [38,9].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…All serious short read aligners (see e.g., [19,20,6]) implement a kind of compressed suffix array [23] called the FM-index [5], usually in n log σ bits of space (2n bits on DNA data), or even near nH_k bits, where H_k is the kth-order empirical entropy of the underlying data. However, as genome sequencing becomes cheaper, and the number of genome sequences in genomic databases grows, indexes of linear size (even those taking around nH_k bits) quickly become too large.…”
Section: Introduction
Citation type: mentioning
confidence: 99%