Compressed automata for dictionary matching

Tomohiro, I; Nishimoto, Takaaki; Inenaga, Shunsuke; Bannai, Hideo; Takeda, Masayuki

doi:10.1016/j.tcs.2015.01.019

Cited by 7 publications

(4 citation statements)

References 14 publications


“…We estimated the effectiveness of the compression using the size of the generated grammars instead of the length of the output bits. Reducing the grammar size has important implications since the majority of the existing text algorithms applied to grammar-compressed texts, including grammar-based self indexes [21,22], edit distance computation [23], q-gram mining [24,25], and pattern matching [26][27][28], have time/space complexities that are dependent on the input grammar size. For instance, the compressed indexes proposed by Claude and Navarro [21,22] can be directly built on MR-RePair grammar-compressed texts.…”

Section: Discussion (mentioning)

confidence: 99%

“…Our experiments show that MR-RePair constructs smaller grammars compared to RePair. We emphasize that generating a grammar of small size is of great importance since most, if not all, existing algorithms/data structures that work on grammar-compressed texts have running time dependent on the grammar sizes (see e.g., [21][22][23][24][25][26][27][28] and the references therein) and not directly on the encoded sizes.…”

Section: Introduction (mentioning)

confidence: 99%


Practical Grammar Compression Based on Maximal Repeats

et al. 2020

Self Cite


This study presents an analysis of RePair, which is a grammar compression algorithm known for its simple scheme, while also being practically effective. First, we show that the main process of RePair, that is, the step by step substitution of the most frequent symbol pairs, works within the corresponding most frequent maximal repeats. Then, we reveal the relation between maximal repeats and grammars constructed by RePair. On the basis of this analysis, we further propose a novel variant of RePair, called MR-RePair, which considers the one-time substitution of the most frequent maximal repeats instead of the consecutive substitution of the most frequent pairs. The results of the experiments comparing the size of constructed grammars and execution time of RePair and MR-RePair on several text corpora demonstrate that MR-RePair constructs more compact grammars than RePair does, especially for highly repetitive texts.
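The pair-substitution step that the abstract attributes to RePair can be illustrated with a minimal sketch. This is a naive quadratic-time version written for readability, not the linear-time algorithm and not MR-RePair; the function names (`count_pairs`, `replace_pair`, `repair`, `expand`) and the choice of integers as nonterminal names are assumptions made for this example.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last = {}  # position at which each pair was last counted
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last.get(pair) == i - 1:
            continue  # overlaps the occurrence counted at i - 1 (e.g. "aaa")
        counts[pair] += 1
        last[pair] = i
    return counts


def replace_pair(seq, pair, symbol):
    """Replace occurrences of `pair` with `symbol`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Naive RePair sketch: terminals are characters, nonterminals are ints."""
    seq, rules = list(text), {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break  # no pair repeats, so a new rule would not help
        symbol = len(rules)  # hypothetical naming scheme: 0, 1, 2, ...
        rules[symbol] = pair
        seq = replace_pair(seq, pair, symbol)
    return seq, rules


def expand(seq, rules):
    """Decompress: recursively expand nonterminals back into characters."""
    out = []
    for s in seq:
        out.append(expand(rules[s], rules) if isinstance(s, int) else s)
    return "".join(out)
```

For example, `repair("abracadabra")` returns a start sequence plus a rule table, and `expand` applied to that pair reproduces the input. The grammar size discussed in the citation statements above corresponds roughly to the number of rules plus the length of the start sequence.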


“…We can obtain a faster algorithm using Theorem 6. We can also solve the grammar compressed dictionary matching problem [17] with our data structures. We preprocess an input dictionary SLP (DSLP) ⟨S, m⟩ with n productions that represent m patterns.…”

Section: Applications (mentioning)

confidence: 99%

Dynamic index and LZ factorization in compressed space

Nishimoto, Tomohiro, Inenaga, et al. 2020

Discrete Applied Mathematics

Self Cite


In this paper, we propose a new dynamic compressed index of $O(w)$ space for a dynamic text $T$, where $w = O(\min(z \log N \log^* M, N))$ is the size of the signature encoding of $T$, $z$ is the size of the Lempel-Ziv77 (LZ77) factorization of $T$, $N$ is the length of $T$, and $M \ge 4N$ is an integer that can be handled in constant time under word RAM model. Our index supports searching for a pattern $P$ in $T$ in $O(|P| f_A + \log w \log |P| \log^* M (\log N + \log |P| \log^* M) + occ \log N)$ time and insertion/deletion of a substring of length $y$ in $O((y + \log N \log^* M) \log w \log N \log^* M)$ time, where $f_A = O(\min\{\frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}}\})$. Also, we propose a new space-efficient LZ77 factorization algorithm for a given text of length $N$, which runs in $O(N f_A + z \log w \log^3 N (\log^* N)^2)$ time with $O(w)$ working space.


“…Hon et al. [51] achieved an entropy compressed space while matching time remains optimal. Tomohiro et al. [52] designed a matching algorithm working on grammar‐based compressed AC automata. However, these studies were accomplished through theoretical discussions, and we are unaware of any actual implementation.…”

Section: Related Work (mentioning)

confidence: 99%

Engineering faster double‐array Aho–Corasick automata

Kanda, Akabe, Oda 2023

Softw Pract Exp


Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions or tokenization. This article studies efficient implementations of double-array Aho-Corasick automata (DAACs), data structures for quickly performing the multiple pattern matching. The practical performance of DAACs is improved by carefully designing the data structure, and many implementation techniques have been proposed thus far. A problem in DAACs is that comprehensive descriptions and experimental analyses on their ideas are not provided. Engineers face difficulties in implementing an efficient DAAC. In this article, we review implementation techniques for DAACs and provide a comprehensive description of them. We also propose several new techniques for further improvement. We conduct exhaustive experiments through real-world datasets and reveal the best combination of techniques to achieve a higher performance in DAACs. The best combination is different from those used in the most popular libraries of DAACs, which demonstrates that their performance can be further enhanced. On the basis of our experimental analysis, we developed a new Rust library for fast multiple pattern matching using DAACs, named Daachorse, as open-source software at https://github.com/daac-tools/daachorse. Experiments demonstrate that Daachorse outperforms other AC-automaton implementations, indicating its suitability as a fast alternative for multiple pattern matching in many applications.
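As background for the multiple pattern matching problem described in this abstract, the following is a minimal sketch of a plain Aho-Corasick automaton. It uses per-state dictionaries for transitions, so it is neither the double-array layout studied in the article nor the Daachorse API; the function names `build_automaton` and `find_all` are assumptions made for this example, and patterns are assumed to be non-empty.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for non-empty patterns."""
    goto, fail, out = [{}], [0], [[]]  # state 0 is the root
    # Phase 1: insert every pattern into a trie.
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)
    # Phase 2: breadth-first pass to set failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # fall back to the longest proper suffix
        state = goto[state].get(ch, 0)
        for p in out[state]:
            matches.append((i - len(p) + 1, p))
    return matches
```

Calling `find_all("ushers", ["he", "she", "his", "hers"])` scans the text once and reports each occurrence with its start position. Dictionary-based transitions keep the sketch short; a double-array representation stores the same transitions in flat arrays.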
