We survey the current techniques to cope with the problem of string matching allowing errors. This is an increasingly relevant issue for fast-growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.
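The edit distance at the center of the survey can be computed with the classic dynamic-programming recurrence. The following is a minimal Python sketch of that textbook recurrence, not any particular algorithm from the survey:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic DP (Levenshtein) edit distance in O(|a|*|b|) time, O(|b|) space."""
    m, n = len(a), len(b)
    # prev[j] = distance between a[:i-1] and b[:j] (previous DP row)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[n]
```

Online approximate-search algorithms are essentially refinements of this matrix computation, restricted to the cells that can still lead to a match.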
The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of "intrinsic dimensionality." We also present a unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy that allows us to devise new algorithms from combinations of concepts not noticed before because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development.
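The common principle behind most of the surveyed metric-space indexes is that the triangle inequality turns a precomputed distance to a pivot into a lower bound on the query distance, allowing candidates to be discarded without computing their actual distance. A toy single-pivot sketch (the names `range_search` and `pivot_dists` are ours, not from the paper):

```python
def range_search(points, pivot_dists, pivot, dist, q, r):
    """Pivot-based range search: by the triangle inequality,
    |d(q,pivot) - d(pivot,x)| <= d(q,x), so any x with
    |d(q,pivot) - d(pivot,x)| > r can be pruned for free."""
    dqp = dist(q, pivot)
    result = []
    for x, dpx in zip(points, pivot_dists):
        if abs(dqp - dpx) > r:
            continue          # pruned without evaluating dist(q, x)
        if dist(q, x) <= r:   # survivors need an explicit distance check
            result.append(x)
    return result
```

Real indexes use many pivots or hierarchical partitions, but the pruning criterion is the same; how well it works depends on the distance distribution, i.e., the "intrinsic dimensionality" discussed in the abstract.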
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and surprising results, radically changing the status of this area in less than five years. The most successful indexes nowadays obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and the regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes to date, focusing on how they exploit text compressibility and how they efficiently solve various search problems. We aim at giving the theoretical background needed to understand and follow the developments in this area.
Given a sequence S = s_1 s_2 ... s_n of integers smaller than r = O(polylog(n)), we show how S can be represented using nH_0(S) + o(n) bits, so that we can retrieve any s_q, as well as answer rank and select queries on S, in constant time. H_0(S) is the zero-order empirical entropy of S, and nH_0(S) is an information-theoretic lower bound on the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences and improves previous results on general sequences, where those queries are answered in O(log r) time. For larger r, we can still represent S in nH_0(S) + o(n log r) bits and answer queries in O(log r / log log n) time. Another contribution of this paper is to show how to combine our compressed representation of integer sequences with an existing compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Namely, we design a variant of the FM-index that indexes a string T[1, n] within nH_k(T) + o(n) bits of storage, where H_k(T) is the k-th order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log_|Σ| n, constant 0 < α < 1, and |Σ| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log^{1+ε} n) time, for any constant 0 < ε < 1; and it reports a text substring of length ℓ in O(ℓ + log^{1+ε} n) time. Compared to all previous work, our index is the first that removes the alphabet-size dependence from all query times; in particular, counting time is linear in the pattern length. Still, our index uses essentially the space of the k-th order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size |Σ| = O(n^β), for any 0 < β < 1, by paying o(n log |Σ|) extra space and multiplying all query times by O(log |Σ| / log log n).
A repetitive sequence collection is a set of sequences that are small variations of each other. A prominent example is a set of genome sequences of individuals of the same or closely related species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is possible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longer-term repetitiveness: for example, r identical copies of an incompressible sequence are incompressible under this model. We develop new static and dynamic full-text indexes that are able to capture the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.
We propose new succinct representations of ordinal trees, which have been studied extensively. It is known that any n-node static tree can be represented in 2n + o(n) bits so that a number of operations on the tree are supported in constant time under the word-RAM model. However, the data structures are complicated and difficult to dynamize. We propose a simple and flexible data structure, called the range min-max tree, that reduces the large number of relevant tree operations considered in the literature to a few primitives that are carried out in constant time on sufficiently small trees. The result is extended to trees of arbitrary size, achieving 2n + O(n/polylog(n)) bits of space, which is optimal for some operations. The redundancy is significantly lower than in any previous proposal. For the dynamic case, where insertion/deletion of nodes is allowed, the existing data structures support very limited operations. Our data structure builds on the range min-max tree to achieve 2n + O(n/log n) bits of space and O(log n) time for all the operations. We also propose an improved data structure using 2n + O(n log log n / log n) bits and improving the time to the optimal O(log n / log log n) for most operations. We extend our support to forests, where whole subtrees can be attached to or detached from others, in time O(log^{1+ε} n) for any ε > 0. Our techniques are of independent interest. An immediate derivation yields an improved solution to range minimum/maximum queries where consecutive elements differ by ±1, achieving O(n + n/polylog(n)) bits of space. A second one stores an array of numbers supporting operations sum and search and limited updates, in optimal time O(log n / log log n). A third one allows representing dynamic bitmaps and sequences supporting rank/select and indels, within zero-order entropy bounds and optimal time O(log n / log log n) for all operations on bitmaps and polylog-sized alphabets, and O(log n log σ / (log log n)^2) on larger alphabet sizes σ.
This improves upon the best existing bounds for entropy-bounded storage of dynamic sequences, compressed full-text self-indexes, and compressed-space construction of the Burrows-Wheeler transform.
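The core idea of the range min-max tree is that a parenthesis-matching query on the balanced-parentheses encoding of a tree can be answered from per-block summaries (total excess, minimum and maximum prefix excess), skipping whole blocks whose minimum excess cannot reach the target. A simplified, single-level Python sketch of that principle (the real structure is multi-level and bit-packed):

```python
def block_summaries(bp, b):
    """Range min-max tree leaves: for each block of b parentheses store the
    total excess e and the min/max prefix excess (m, M) inside the block."""
    blocks = []
    for i in range(0, len(bp), b):
        e, m, M = 0, len(bp), -len(bp)
        for ch in bp[i:i + b]:
            e += 1 if ch == '(' else -1
            m, M = min(m, e), max(M, e)
        blocks.append((e, m, M))
    return blocks

def find_close(bp, i, blocks, b):
    """Matching ')' of the '(' at position i: first j > i where the relative
    excess returns to 0. Blocks whose minimum excess stays above the target
    are skipped in O(1) using their (e, m, M) summary."""
    d = 1  # excess contributed by bp[i] == '('
    for j in range(i + 1, min(len(bp), (i // b + 1) * b)):  # finish i's block
        d += 1 if bp[j] == '(' else -1
        if d == 0:
            return j
    k = i // b + 1
    while k < len(blocks):
        e, m, _ = blocks[k]
        if d + m <= 0:  # the answer lies inside block k: scan it
            for j in range(k * b, min(len(bp), (k + 1) * b)):
                d += 1 if bp[j] == '(' else -1
                if d == 0:
                    return j
        d += e
        k += 1
    return -1
```

Since many tree operations (parent, subtree size, depth, lowest common ancestor, ...) reduce to excess searches of this kind, one small set of primitives over the summaries suffices for the whole operation repertoire.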
A succinct full-text self-index is a data structure built on a text T = t_1 t_2 ... t_n, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p_1 p_2 ... p_m in T, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. Many of those take space proportional to nH_0 or nH_k bits, where H_k is the k-th order empirical entropy of T. The time to count how many times P occurs in T ranges from O(m) to O(m log n). In this paper we present a new self-index, called the RLFM index for "run-length FM-index", that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nH_k log σ + O(n) bits of space, for any k ≤ α log_σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH_0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ. In addition, we prove a close relationship between the k-th order entropy of the text and some regularities that show up in its suffix array and in the Burrows-Wheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes. Finally, we present some practical considerations for implementing the RLFM index, obtaining two implementations with different space-time tradeoffs. We empirically compare our indexes against the best existing implementations and show that they are practical and competitive.
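The Burrows-Wheeler transform on which the RLFM index builds can be sketched with the textbook sorted-rotations construction; the "run-length" part exploits the fact that the BWT of a regular text contains few maximal equal-letter runs. A minimal illustrative version (real indexes build the BWT from the suffix array, not by materializing rotations):

```python
def bwt(t: str) -> str:
    """Burrows-Wheeler transform via sorted rotations; '$' is a unique,
    lexicographically smallest end marker appended to t."""
    t += '$'
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return ''.join(rot[-1] for rot in rotations)

def runs(s: str) -> int:
    """Number of maximal equal-letter runs; run-length indexes like the
    RLFM store one entry per run rather than per symbol."""
    return 1 + sum(s[j] != s[j - 1] for j in range(1, len(s)))
```

The regularities mentioned in the abstract make `runs(bwt(T))` small for compressible texts, which is precisely what the run-length encoding of the RLFM index exploits.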