We consider the indexable dictionary problem, which consists of storing a set S ⊆ {0, ..., m − 1} for some integer m while supporting the operations of rank(x), which returns the number of elements in S that are less than x if x ∈ S, and −1 otherwise; and select(i), which returns the ith smallest element in S. We give a data structure that supports both operations in O(1) time on the RAM model and requires B(n, m) + o(n) + O(lg lg m) bits to store a set of size n, where $B(n, m) = \lceil \lg \binom{m}{n} \rceil$ is the minimum number of bits required to store any n-element subset from a universe of size m. Previous dictionaries taking this space only supported (yes/no) membership queries in O(1) time. In the cell probe model we can remove the O(lg lg m) additive term in the space bound, answering a question raised by Fich and Miltersen [1995] and Pagh [2001].

We present extensions and applications of our indexable dictionary data structure, including:
- an information-theoretically optimal representation of a k-ary cardinal tree that supports standard operations in constant time;
- a representation of a multiset of size n from {0, ..., m − 1} in B(n, m + n) + o(n) bits that supports (appropriate generalizations of) rank and select operations in constant time; and
- a representation of a sequence of n nonnegative integers summing up to m in B(n, m + n) + o(n) bits that supports prefix sum queries in constant time.

ACM Reference Format: Raman, R., Raman, V., and Rao, S. S. 2007. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans.
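To make the interface above concrete, the following is a minimal Python sketch of the rank/select semantics together with the bound B(n, m). It is only a reference model: it stores a plain sorted list and answers queries by binary search, so it is neither succinct nor constant-time, and it is not the data structure of the paper. The names `NaiveIndexableDictionary` and `min_bits` are hypothetical, and the 1-based indexing of `select` is an assumption about the convention.

```python
from bisect import bisect_left
from math import comb


def min_bits(n, m):
    """B(n, m) = ceil(lg C(m, n)), computed exactly with integer arithmetic."""
    # For an integer c >= 1, (c - 1).bit_length() equals ceil(log2(c)).
    return (comb(m, n) - 1).bit_length()


class NaiveIndexableDictionary:
    """Reference semantics only: a sorted list, not a succinct structure."""

    def __init__(self, elements, m):
        self.m = m
        self.items = sorted(set(elements))
        if self.items and not (0 <= self.items[0] and self.items[-1] < m):
            raise ValueError("elements must lie in {0, ..., m - 1}")

    def rank(self, x):
        """Number of elements smaller than x if x is in S, and -1 otherwise."""
        i = bisect_left(self.items, x)
        if i < len(self.items) and self.items[i] == x:
            return i
        return -1

    def select(self, i):
        """The i-th smallest element of S (1-based here, as an assumption)."""
        if not 1 <= i <= len(self.items):
            raise IndexError("select index out of range")
        return self.items[i - 1]
```

For S = {1, 3, 7} in a universe of size 8, `rank(3)` counts the single smaller element, `rank(4)` reports a non-member, and `min_bits(3, 8)` gives the number of bits needed to distinguish the 56 possible 3-element subsets.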
We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures that ideally occupy asymptotically less space than the information-theoretic lower bound on the space required to encode the given data, and support an extended set of operations using the basic operators defined in the ADT. The main advantage of succinct indexes as opposed to succinct (integrated data/index) encodings is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data. In this article, we present succinct indexes for various data types, namely strings, binary relations and multilabeled trees. Given the support for the interface of the ADTs of these data types, we can support various useful operations efficiently by constructing succinct indexes for them. When the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data.

Using our techniques, we design a succinct encoding that represents a string of length n over an alphabet of size σ using $nH_k(S) + \lg\sigma \cdot o(n) + O(n \lg\sigma / \lg\lg\lg\sigma)$ bits to support access/rank/select operations in $o((\lg\lg\sigma)^{1+\epsilon})$ time, for any fixed constant ε > 0. We also design a succinct text index using $nH_0(S) + O(n \lg\sigma / \lg\lg\sigma)$ bits that supports finding all the occ occurrences of a given pattern of length m in $O(m \lg\lg\sigma + occ \lg n / \lg^{\epsilon}\sigma)$ time, for any fixed constant 0 < ε < 1. Previous results on these two problems either have a lg σ factor instead of lg lg σ in the running time, or are not compressed. Finally, we present succinct encodings of binary relations and multilabeled trees that are more compact than previous structures.
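The separation between the data and its index can be illustrated with a small Python sketch. The index below touches the string only through an `access(i)` operator and a length, so the string itself may be stored in any encoding. It uses sampled per-character counts plus a short scan, a deliberately simple technique chosen for illustration and not the construction of the paper; the class name `SampledRankIndex` and the `block` parameter are hypothetical.

```python
from collections import Counter


class SampledRankIndex:
    """Answers rank queries using only an access(i) operator on the data."""

    def __init__(self, access, length, block=4):
        self.access = access    # ADT operator: position -> character
        self.length = length
        self.block = block
        # samples[j] holds character counts for the prefix of length j * block.
        self.samples = [Counter()]
        running = Counter()
        for i in range(length):
            running[access(i)] += 1
            if (i + 1) % block == 0:
                self.samples.append(running.copy())

    def rank(self, c, i):
        """Number of occurrences of c among positions 0, ..., i - 1."""
        if not 0 <= i <= self.length:
            raise IndexError("position out of range")
        j = i // self.block
        count = self.samples[j][c]
        # Scan the at most block - 1 positions after the last sample.
        for p in range(j * self.block, i):
            if self.access(p) == c:
                count += 1
        return count
```

Because the index never inspects how the string is stored, the same `SampledRankIndex` works unchanged whether `access` reads from a Python string, a compressed encoding, or a file.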
We consider space-efficient solutions to two dynamic data structuring problems. We first give a representation of a set S ⊆ U = {0, ..., m − 1}, |S| = n, that supports membership queries in O(1) worst-case time and insertions into/deletions from S in O(1) expected amortised time. The representation uses B + o(B) bits, where $B = \lceil \lg \binom{m}{n} \rceil$ is the information-theoretic minimum space to represent S. This improves upon the O(B)-bit solutions of Brodnik and Munro [2] and Pagh [16], and uses up to a log-factor less space than search trees or hash tables. The representation can also associate satellite data with elements of S.

We also show that a binary tree on n nodes, where each node has b = O(lg n)-bit data stored at it, can be maintained under node insertions while supporting navigation in O(1) time and updates in $O((\lg\lg n)^{1+\epsilon})$ amortised time, for any constant ε > 0. The space used is within o(n) bits of the information-theoretic minimum. This improves upon the equally space-efficient structure of Munro et al. [15], in which updates take $O(\lg^{c} n)$ time, for some c ≥ 1.
Abstract. Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures (sometimes with slight reduction in efficiency) many of the popular compression schemes, including the Lempel-Ziv family, Run-Length Encoding, Byte-Pair Encoding, Sequitur, and Re-Pair. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string.

Let S be a string of length N compressed into a context-free grammar $\mathcal{S}$ of size n. We present two representations of $\mathcal{S}$ achieving O(log N) random access time, and either $O(n \cdot \alpha_k(n))$ construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the kth row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time $O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)$, where occ is the number of occurrences of P in S. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees.

All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.

Key words. grammar-based compression, straight-line program, approximate string matching, tree compression

AMS subject classifications. 68P05, 68P30

1. Introduction. Modern textual or semi-structured databases, e.g., for biological and WWW data, are huge, and are typically stored in compressed form. A query to such databases will typically retrieve only a small portion of the data. This presents several challenges: how to query the compressed data directly and efficiently, without the need for additional data structures (which can be many times larger than the compressed data), and how to retrieve the answers to the queries. In many practical cases, the naive approach of first decompressing the entire data and then processing it is completely unacceptable: for instance, XML data compresses by an order of magnitude on disk [25] but expands by an order of magnitude when represented in-memory [22]; as we will shortly see, this approach is very problematic from an
Let S be a string of length N compressed into a context-free grammar $\mathcal{S}$ of size n. We present two representations of $\mathcal{S}$ achieving O(log N) random access time, and either $O(n \cdot \alpha_k(n))$ construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the kth row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time $O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)$, where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees.

All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
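As a baseline for what random access without decompression means, the Python sketch below stores the expansion length of every nonterminal and walks down from the start symbol, going left or right by comparing the position with the length of the left child. This naive descent takes time proportional to the height of the grammar, which can be far larger than log N; it is not the O(log N) representation described above. The grammar format (each rule maps a nonterminal to either a single character or a pair of nonterminals) and the function names are assumptions made for illustration.

```python
def expansion_lengths(rules):
    """Length of the string generated by each nonterminal (memoized)."""
    lengths = {}

    def length(sym):
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):      # terminal rule: X -> 'c'
                lengths[sym] = 1
            else:                         # binary rule: X -> (Y, Z)
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths


def access(rules, lengths, start, i):
    """Character at position i of the string generated by `start`."""
    if not 0 <= i < lengths[start]:
        raise IndexError("position out of range")
    sym = start
    while not isinstance(rules[sym], str):
        left, right = rules[sym]
        if i < lengths[left]:
            sym = left                    # position falls in the left part
        else:
            i -= lengths[left]            # skip over the left part
            sym = right
    return rules[sym]


# A small grammar generating the Fibonacci string "abaababa".
rules = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
    "X5": ("X4", "X3"),
    "X6": ("X5", "X4"),
}
lengths = expansion_lengths(rules)
```

Each call to `access` reads one root-to-leaf path of the grammar and never materializes the full string, which is the property the faster representations preserve while bounding the path cost.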