Gad M. Landau scite author profile

Abstract. Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures (sometimes with slight reduction in efficiency) many of the popular compression schemes, including the Lempel-Ziv family, Run-Length Encoding, Byte-Pair Encoding, Sequitur, and Re-Pair. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string.

Let S be a string of length N compressed into a context-free grammar 𝒮 of size n. We present two representations of 𝒮 achieving O(log N) random access time, and either O(n·α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees.

All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.

Key words. grammar-based compression, straight-line program, approximate string matching, tree compression

AMS subject classifications. 68P05, 68P30

1. Introduction. Modern textual or semi-structured databases, e.g. for biological and WWW data, are huge, and are typically stored in compressed form. A query to such databases will typically retrieve only a small portion of the data. This presents several challenges: how to query the compressed data directly and efficiently, without the need for additional data structures (which can be many times larger than the compressed data), and how to retrieve the answers to the queries. In many practical cases, the naive approach of first decompressing the entire data and then processing it is completely unacceptable. For instance, XML data compresses by an order of magnitude on disk [25] but expands by an order of magnitude when represented in-memory [22]; as we will shortly see, this approach is very problematic from an

A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices

Crochemore, Landau, Ziv-Ukelson. 2003. SIAM J. Comput.

135

117

Given two strings of size n over a constant alphabet, the classical algorithm for computing the similarity between two sequences [D. Sankoff and J. B. Kruskal, eds., Time Warps, String Edits, and Macromolecules; Addison-Wesley, Reading, MA, 1983; T. F. Smith and M. S. Waterman, J. Molec. Biol., 147 (1981), pp. 195-197] uses a dynamic programming matrix and compares the two strings in O(n²) time. We address the challenge of computing the similarity of two strings in subquadratic time for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations. The speed-up is achieved by dividing the dynamic programming matrix into variable-sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n² / log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn² / log n), where h ≤ 1 is the entropy of the text. We also present an algorithm for comparing two run-length encoded strings of length m and n, compressed into m' and n' runs, respectively, in O(m'n + n'm) complexity. This result extends to all distance or similarity scoring schemes that use an additive gap penalty.
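
For orientation, the quadratic baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the classical dynamic programming recurrence for global similarity with an arbitrary scoring matrix and an additive gap penalty, not the subquadratic algorithm of the paper. The function names and the dictionary format of the scoring matrix are assumptions made for the example.

```python
def global_similarity(a, b, score, gap):
    """Classical O(len(a) * len(b)) global similarity of strings a and b.

    score: dict mapping character pairs (x, y) to a weight
           (hypothetical format chosen for this sketch).
    gap:   additive penalty added for each inserted or deleted character.
    """
    # prev[j] holds the best score of aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                prev[j - 1] + score[(a[i - 1], b[j - 1])],  # align two characters
                prev[j] + gap,                              # delete a[i-1]
                curr[j - 1] + gap,                          # insert b[j-1]
            )
        prev = curr
    return prev[len(b)]


def unit_scores(alphabet, match=1, mismatch=-1):
    """Build a simple scoring matrix (illustrative helper, not from the paper)."""
    return {(x, y): (match if x == y else mismatch)
            for x in alphabet for y in alphabet}
```

Any table of weights can be passed as `score`, which is the "unrestricted scoring matrix" setting; the sketch keeps only two rows of the matrix in memory but still fills every cell.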

Fast parallel and serial approximate string matching

Landau, Vishkin. 1989. Journal of Algorithms

263

100

Text Indexing and Dictionary Matching with One Error

Amir, Keselman, Landau, et al. 2000. Journal of Algorithms

Random Access to Grammar-Compressed Strings

Bille, Landau, Raman, et al. 2011

Let S be a string of length N compressed into a context-free grammar 𝒮 of size n. We present two representations of 𝒮 achieving O(log N) random access time, and either O(n·α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we generalize our results to navigation and other operations on grammar-compressed trees.

All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
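
To make the random access problem concrete, here is a minimal sketch of the straightforward approach: store the expansion length of every nonterminal and walk down from the start symbol. Its query time is proportional to the height of the grammar, which can be far larger than log N, so it is only a baseline and not the O(log N) representation described in the abstract. The rule format (a dict whose values are either a single character or a pair of nonterminal names) and the function names are assumptions made for this example.

```python
def expansion_lengths(rules):
    """Return the length of the string generated by each nonterminal.

    rules: dict mapping a nonterminal name to either a single character
           (terminal rule) or a (left, right) pair of nonterminal names.
           Assumed to be listed so that children appear before parents.
    """
    length = {}
    for name, rhs in rules.items():
        if isinstance(rhs, str):
            length[name] = 1
        else:
            left, right = rhs
            length[name] = length[left] + length[right]
    return length


def access(rules, length, start, i):
    """Return character i (0-based) of the string generated by `start`
    without expanding the grammar. Runs in time O(height of the grammar)."""
    if not 0 <= i < length[start]:
        raise IndexError(i)
    node = start
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if i < length[left]:
            node = left            # position falls in the left child
        else:
            i -= length[left]      # skip the left child's expansion
            node = right
    return rules[node]


# Small illustrative grammar generating "abaab".
rules = {"X1": "b", "X2": "a", "X3": ("X2", "X1"),
         "X4": ("X3", "X2"), "X5": ("X4", "X3")}
length = expansion_lengths(rules)
```

With this grammar, `access(rules, length, "X5", i)` for i from 0 to 4 reads off the characters of the generated string one at a time.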


Gad M. Landau

Random Access to Grammar-Compressed Strings and Trees
