Max Rishøj Pedersen scite author profile

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in $S$), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns $P_1$ and $P_2$ and a gap range $[\alpha, \beta]$ we can quickly find the consecutive occurrences of $P_1$ and $P_2$ with distance in $[\alpha, \beta]$, i.e., pairs of occurrences immediately following each other and with distance within the range. We present data structures that use $O(n)$ space and query time $O(|P_1| + |P_2| + n^{2/3})$ for existence and counting and $O(|P_1| + |P_2| + n^{2/3}\,\mathrm{occ}^{1/3})$ for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using $O(n)$ space must use $\Omega(|P_1| + |P_2| + \sqrt{n})$ query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.
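To make the query in this abstract concrete, here is a minimal brute-force sketch of the reporting query. It is not the data structure from the paper; it rescans the string on every query and only illustrates what is being asked. The function names, the 0-based positions, the distance `j - i`, and the reading of "immediately following each other" as "no other occurrence of either pattern strictly between the two" are all assumptions made for illustration.

```python
from bisect import bisect_right


def occurrences(s, p):
    """Return all start positions of p in s (0-based, overlaps allowed)."""
    out = []
    i = s.find(p)
    while i != -1:
        out.append(i)
        i = s.find(p, i + 1)
    return out


def gapped_consecutive(s, p1, p2, alpha, beta):
    """Brute-force reporting query (illustrative baseline only).

    Returns pairs (i, j) with i < j where p1 occurs at i, p2 occurs at j,
    no occurrence of p1 or p2 starts strictly between i and j (assumed
    meaning of "consecutive"), and alpha <= j - i <= beta.
    """
    occ1 = occurrences(s, p1)
    occ2 = occurrences(s, p2)
    pairs = []
    for idx, i in enumerate(occ1):
        k = bisect_right(occ2, i)  # first occurrence of p2 after i
        if k == len(occ2):
            break  # no later occurrence of p2 exists
        j = occ2[k]
        # Skip if another occurrence of p1 starts strictly between i and j.
        if idx + 1 < len(occ1) and occ1[idx + 1] < j:
            continue
        if alpha <= j - i <= beta:
            pairs.append((i, j))
    return pairs
```

For example, `gapped_consecutive("abcabxab", "ab", "x", 1, 3)` considers only the occurrence of `"ab"` at position 3, since the one at position 0 is followed by another `"ab"` before any `"x"`. Existence and counting follow from the length of the returned list.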


Gapped Indexing for Consecutive Occurrences

Bille

Gørtz

Pedersen

et al. 2022

Algorithmica


String Indexing for Top-k Close Consecutive Occurrences

Bille

Gørtz

Pedersen

et al. 2020


String Indexing for Top-$k$ Close Consecutive Occurrences

Bille

Gørtz

Pedersen

et al. 2020

Preprint


The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair (i, j), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j, and their distance is defined as j − i. Given a pattern P and a parameter k, the goal is to report the top-k consecutive occurrences of P in S of minimal distance. The challenge is to compactly represent S while supporting queries in time close to the length of P and k. We give two new time-space trade-offs for the problem. Our first result achieves near-linear space and optimal query time, and our second result achieves linear space and near-optimal query time. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
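The definition of a consecutive occurrence above can be illustrated with a small brute-force sketch. It is only a baseline that scans S on every query, not either of the trade-offs described in the abstract. The function names, the 0-based positions, and the tie-breaking rule (equal distances are returned in left-to-right order) are assumptions made for illustration.

```python
import heapq


def occurrences(s, p):
    """Return all start positions of p in s (0-based, overlaps allowed)."""
    out = []
    i = s.find(p)
    while i != -1:
        out.append(i)
        i = s.find(p, i + 1)
    return out


def top_k_close_consecutive(s, p, k):
    """Return the k consecutive occurrences (i, j) of p with smallest j - i.

    Adjacent entries of the sorted occurrence list are exactly the pairs
    with no occurrence of p between them.
    """
    occ = occurrences(s, p)
    pairs = list(zip(occ, occ[1:]))
    # nsmallest is equivalent to sorted(pairs, key=...)[:k], so ties keep
    # their left-to-right order (an assumed tie-breaking rule).
    return heapq.nsmallest(k, pairs, key=lambda ij: ij[1] - ij[0])
```

In `"abaababaab"` the pattern `"ab"` occurs at positions 0, 3, 5, and 8, so the consecutive occurrences are (0, 3), (3, 5), and (5, 8), and the closest one is (3, 5).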


Sliding Window String Indexing in Streams

Bille

Fischer

Gørtz

et al. 2023

Preprint


Given a string S over an alphabet Σ, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P, report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e., the window) of a stream while supporting efficient pattern matching. Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and the pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios, this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log(w/δ)). In particular, for a delay of δ = w we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.
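The problem setup in this abstract can be sketched with a naive baseline that keeps the last w characters in a buffer and scans the whole window on each query. This is not the hierarchical suffix tree structure from the paper and has none of its time guarantees; it only shows the interface being discussed. The class name, the method names, and the choice to report absolute 0-based stream positions are assumptions made for illustration.

```python
from collections import deque


class NaiveSlidingWindowIndex:
    """Illustrative baseline: O(w) space, but queries scan the whole window."""

    def __init__(self, w):
        self.window = deque(maxlen=w)  # holds only the last w characters
        self.seen = 0  # total number of characters read from the stream

    def push(self, ch):
        """Process one character arriving from the stream S."""
        self.window.append(ch)
        self.seen += 1

    def query(self, pattern):
        """Return absolute stream positions where pattern occurs fully
        inside the current window."""
        text = "".join(self.window)
        offset = self.seen - len(text)  # stream position of window start
        out = []
        i = text.find(pattern)
        while i != -1:
            out.append(offset + i)
            i = text.find(pattern, i + 1)
        return out
```

After pushing the eight characters of `"abcabcab"` into an index with w = 5, the window holds `"abcab"` starting at stream position 3, so a query for `"ab"` reports positions 3 and 6.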

