Highly-repetitive collections of strings are increasingly being amassed by genome sequencing and genetic variation experiments, as well as by storing all versions of human-generated files, like webpages and source code. Existing indexes for locating all the exact occurrences of a pattern in a highly-repetitive string take advantage of a single measure of repetition. However, multiple, distinct measures of repetition all grow sublinearly in the length of a highly-repetitive string. In this paper we explore the practical advantages of combining data structures whose size depends on distinct measures of repetition. The main ingredient of our structures is the run-length encoded BWT (RLBWT), which takes space proportional to the number of runs in the Burrows-Wheeler transform of a string. We describe a range of practical variants that combine RLBWT with the set of boundaries of the Lempel-Ziv 77 factors of a string, which take space proportional to the number of factors. Such variants use, respectively, the RLBWT of a string and the RLBWT of its reverse, or just one RLBWT inside a bidirectional index, or just one RLBWT with support for unidirectional extraction. We also study the practical advantages of combining RLBWT with the compact directed acyclic word graph of a string, a data structure that takes space proportional to the number of one-character extensions of maximal repeats. Our approaches are easy to implement, and provide competitive tradeoffs on significant datasets. feucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34(3):578-595, 1987. 5 Timothy M Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the twenty-seventh annual symposium on computational geometry, pages 1-10. ACM, 2011. 6 Maxime Crochemore and Christophe Hancart. Automata for matching patterns. In Handbook of formal languages, pages 399-462. Springer, 1997. 7 Maxime Crochemore and Renaud Vérin. Direct construction of compact directed acyclic word graphs. A faster grammar-based self-index. In
Sequence representations supporting queries access, select and rank are at the core of many data structures. There is a considerable gap between the various upper bounds and the few lower bounds known for such representations, and how they relate to the space used. In this article we prove a strong lower bound for rank, which holds for rather permissive assumptions on the space used, and give matching upper bounds that require only a compressed representation of the sequence. Within this compressed space, operations access and select can be solved in constant or almost-constant time, which is optimal for large alphabets. Our new upper bounds dominate all of the previous work in the time/space map.
We show that the compressed suffix array and the compressed suffix tree for a string of length n over an integer alphabet of size σ ≤ n can both be built in O(n) (randomized) time using only O(n log σ) bits of working space. The previously fastest construction algorithms that used O(n log σ) bits of space took times O(n log log σ) and O(n log ǫ n) respectively (where ǫ is any positive constant smaller than 1). In the passing, we show that the Burrows-Wheeler transform of a string of length n over an alphabet of size σ can be built in deterministic O(n) time and space O(n log σ). We also show that within the same time and space, we can carry many sequence analysis tasks and construct some variants of the compressed suffix array and compressed suffix tree.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.