Sampling (evenly) the suffixes from the suffix array is an old idea trading the pattern search time for reduced index space. A few years ago Claude et al. showed an alphabet sampling scheme allowing for more efficient pattern searches compared to the sparse suffix array, for long enough patterns. A drawback of their approach is the requirement that sought patterns need to contain at least one character from the chosen subalphabet. In this work we propose an alternative suffix sampling approach with only a minimum pattern length as a requirement, which seems more convenient in practice. Experiments show that our algorithm achieves competitive time-space tradeoffs on most standard benchmark data.
Summary: Sampling (evenly) the suffixes from the suffix array is an old idea trading the pattern search time for reduced index space. A few years ago Claude et al. showed an alphabet sampling scheme allowing for more efficient pattern searches compared with the sparse suffix array, for long enough patterns. A drawback of their approach is the requirement that sought patterns need to contain at least one character from the chosen subalphabet. In this work, we propose an alternative suffix sampling approach with only a minimum pattern length as a requirement, which is more convenient in practice. Experiments show that our algorithm (in a few variants) achieves competitive time‐space tradeoffs on most standard benchmark data. Copyright © 2017 John Wiley & Sons, Ltd.
The FM-index is a celebrated compressed data structure for full-text pattern searching. After the first wave of interest in its theoretical developments, the last few years have seen a surge of interest in practical FM-index variants. These enhancements are often related to a bit-vector representation, augmented with an efficient rank-handling data structure. In this work, we propose a new, cache-friendly implementation of the rank primitive and advocate for a very simple architecture of the FM-index, which trades compression ratio for speed. Experimental results show that our variants are 2-3 times faster than the fastest known ones, at the price of typically using 1.5-5 times more space.
[Pseudocode listing: Count-Occs(T^bwt, n, P, m)]
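A counting routine with the signature Count-Occs(T^bwt, n, P, m) is commonly realized as backward search over the Burrows-Wheeler transform. The following is a minimal sketch of that textbook procedure, not the cache-friendly rank structure proposed in the abstract: `rank` is a naive linear scan, and the names `bwt_from_text` and `count_occs`, as well as the `"\0"` sentinel, are assumptions made for illustration.

```python
def bwt_from_text(text):
    """Build the BWT naively from sorted rotations (tiny inputs only)."""
    s = text + "\0"  # assumed sentinel, smaller than every text character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occs(bwt, pattern):
    """Count occurrences of pattern in the text underlying bwt."""
    # C[c] = number of characters in bwt that are strictly smaller than c.
    freq = {}
    for ch in bwt:
        freq[ch] = freq.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(freq):
        C[ch] = total
        total += freq[ch]

    def rank(c, i):
        # Occurrences of c in bwt[0:i]; a real index answers this in O(1)
        # or O(log sigma) time with an auxiliary rank structure.
        return bwt.count(c, 0, i)

    lo, hi = 0, len(bwt)  # half-open range of matching suffixes
    for c in reversed(pattern):  # process the pattern right to left
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt = bwt_from_text("abracadabra")
print(count_occs(bwt, "abra"))
```

Each pattern character costs two rank queries, which is why the speed of the rank primitive dominates the counting time in practice.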
1 By the latter we mean indexes with space bounded by O(nH_0) or even O(nH_k) bits, where n is the text length and H_0 (H_k) is the order-0 (order-k) entropy. The former term, compact full-text indexes, is less definite, and may fit any structure with less than n log_2 n bits of space, at least for "typical" texts.

* e-mail: sgrabow@kis.p.lodz.pl
Manuscript submitted 2016-11-01, revised 2016

One may define a full-text index over text T of length n as a data structure supporting at least two types of queries, both with respect to a pattern P of length m, where T and P share an integer alphabet of size σ. One query type is count: return the number occ ≥ 0 of occurrences of P in T. The other query type is locate: for each pattern occurrence report its position in T, that is, such j that [...] The suffix [...]

Related work

The full-text indexing history starts with the suffix tree (ST) [7], a trie whose string collection is the set of all the suffixes of a given text, with an additional requirement that all non-branching paths of edges are converted into single edges. Each ST path is terminated as soon as it points to a unique suffix, whose start position is kept in the corresponding leaf. As there are n leaves, up to n − 1 internal nodes (as each internal node must have at least two children), and edge labels are represented with pointers to the text, it is easy to see that the suffix tree takes O(n) words of space, i.e., O(n log n) bits.

Suffix trees can be built in linear time for integer alphabets [8]. Assuming constant-time access to any child of a given node, the search in the ST takes only O(m + occ) time in the worst case. In practice, this is cumbersome for a large alphabet, of size n^{ω(1)}, as it requires using perfect hashing, which also makes the construction time linear only in expectation. A small alphabet is easier to handle, which is one of the reasons for the wide use of suffix trees in bioinformatics.
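The count and locate queries defined above can be illustrated with a small sketch. For brevity it swaps in a plain suffix array with binary search instead of the suffix tree described in the text, so a query costs O(m log n) rather than O(m + occ). All function names are hypothetical, positions are assumed to be 0-based, and the construction is a naive sort suitable only for short texts.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_range(text, sa, pattern):
    """Return the half-open range of sa whose suffixes start with pattern."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        return text[sa[k]:sa[k] + m]

    # Lower bound: first suffix whose m-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(text, sa, pattern):
    """count query: the number occ of occurrences of pattern in text."""
    start, end = find_range(text, sa, pattern)
    return end - start


def locate(text, sa, pattern):
    """locate query: all start positions of pattern in text, ascending."""
    start, end = find_range(text, sa, pattern)
    return sorted(sa[start:end])


text = "abracadabra"
sa = build_suffix_array(text)
print(count(text, sa, "abra"), locate(text, sa, "abra"))
```

The count query needs only the width of the matching range, while locate must also read out every entry in it, which is where the additional occ term in the search time comes from.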