Marcin Raniszewski scite author profile

2015

Sampling (evenly) the suffixes from the suffix array is an old idea trading the pattern search time for reduced index space. A few years ago Claude et al. showed an alphabet sampling scheme allowing for more efficient pattern searches compared to the sparse suffix array, for long enough patterns. A drawback of their approach is the requirement that sought patterns need to contain at least one character from the chosen subalphabet. In this work we propose an alternative suffix sampling approach with only a minimum pattern length as a requirement, which seems more convenient in practice. Experiments show that our algorithm achieves competitive time-space tradeoffs on most standard benchmark data.

Sampled suffix array with minimizers

2017

Softw Pract Exp

Summary: Sampling (evenly) the suffixes from the suffix array is an old idea trading the pattern search time for reduced index space. A few years ago Claude et al. showed an alphabet sampling scheme allowing for more efficient pattern searches compared with the sparse suffix array, for long enough patterns. A drawback of their approach is the requirement that sought patterns need to contain at least one character from the chosen subalphabet. In this work, we propose an alternative suffix sampling approach with only a minimum pattern length as a requirement, which is more convenient in practice. Experiments show that our algorithm (in a few variants) achieves competitive time‐space tradeoffs on most standard benchmark data. Copyright © 2017 John Wiley & Sons, Ltd.
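The sampling idea in the abstract above can be sketched with window minimizers, as the title suggests. This is a minimal illustration and not necessarily the authors' exact scheme: the function names and the parameters q (q-gram length) and w (number of q-grams per window) are assumptions made for this example. Only suffixes starting at a minimizer are indexed, and any pattern of length at least w + q − 1 must contain at least one sampled position, which is where the minimum pattern length requirement comes from.

```python
def minimizer_positions(text, q, w):
    """Return sorted start positions of the q-grams chosen as window minimizers.

    Each window covers w consecutive q-grams (w + q - 1 characters). The
    lexicographically smallest q-gram wins; ties go to the leftmost one.
    """
    n_qgrams = len(text) - q + 1
    if n_qgrams < w:
        return []  # text too short to hold a single full window
    sampled = set()
    for start in range(n_qgrams - w + 1):
        best = min(range(start, start + w), key=lambda i: (text[i:i + q], i))
        sampled.add(best)
    return sorted(sampled)


def sampled_suffix_array(text, q, w):
    """Sort only the suffixes that start at a minimizer position (naive sort)."""
    return sorted(minimizer_positions(text, q, w), key=lambda i: text[i:])


if __name__ == "__main__":
    print(minimizer_positions("abracadabra", 2, 3))   # [0, 3, 5, 7]
    print(sampled_suffix_array("abracadabra", 2, 3))  # [7, 0, 3, 5]
```

In this toy run only 4 of the 11 suffixes are kept. A search would compute the minimizer of the pattern's first window, look up the pattern suffix starting there in the sampled array, and verify the preceding characters against the text.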

FM-index for Dummies

Deorowicz

2017

The FM-index is a celebrated compressed data structure for full-text pattern searching. After the first wave of interest in its theoretical developments, we can observe a surge of interest in practical FM-index variants in the last few years. These enhancements are often related to a bit-vector representation, augmented with an efficient rank-handling data structure. In this work, we propose a new, cache-friendly, implementation of the rank primitive and advocate for a very simple architecture of the FM-index, which trades compression ratio for speed. Experimental results show that our variants are 2-3 times faster than the fastest known ones, for the price of using typically 1.5-5 times more space.
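To make the role of the rank primitive concrete, here is a minimal sketch of counting pattern occurrences with backward search over the Burrows-Wheeler transform. It is a textbook illustration, not the cache-friendly implementation proposed in the paper: rank is a plain linear scan, the BWT is built by sorting rotations, and all function names and the sentinel choice are assumptions made for this example.

```python
from collections import Counter

SENTINEL = "\0"  # assumed absent from the text and smaller than every text symbol


def build_bwt(text):
    """Build the BWT naively by sorting all rotations of text + sentinel."""
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def build_c_table(bwt):
    """C[c] = number of symbols in the text that are strictly smaller than c."""
    counts = Counter(bwt)
    c_table, total = {}, 0
    for symbol in sorted(counts):
        c_table[symbol] = total
        total += counts[symbol]
    return c_table


def rank(bwt, symbol, i):
    """Occurrences of symbol in bwt[:i]; a scan stands in for a real rank structure."""
    return bwt.count(symbol, 0, i)


def count_occurrences(bwt, c_table, pattern):
    """Backward search: process the pattern right to left, narrowing a row range."""
    lo, hi = 0, len(bwt)  # half-open range of sorted rotations matching the suffix read so far
    for symbol in reversed(pattern):
        if symbol not in c_table:
            return 0
        lo = c_table[symbol] + rank(bwt, symbol, lo)
        hi = c_table[symbol] + rank(bwt, symbol, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt = build_bwt("abracadabra")
    c_table = build_c_table(bwt)
    print(count_occurrences(bwt, c_table, "abra"))  # 2
```

Each pattern symbol costs two rank queries, so the speed of rank dominates the search time. That is why a faster rank implementation translates directly into a faster index.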

Sequential Reduction Algorithm for Nearest Neighbor Rule

2010

Compact and hash based variants of the suffix array

2017

1 By the latter we mean indexes with space bounded by O(nH_0) or even O(nH_k) bits, where n is the text length and H_0 (H_k) is the order-0 (order-k) entropy. The former term, compact full-text indexes, is less definite, and may fit any structure with less than n log_2 n bits of space, at least for "typical" texts.

* e-mail: sgrabow@kis.p.lodz.pl

Manuscript submitted 2016-11-01, revised 2016

One may define a full-text index over text T of length n as a data structure supporting at least two types of queries, both with respect to a pattern P of length m, where T and P share an integer alphabet of size σ. One query type is count: return the number occ ≥ 0 of occurrences of P in T. The other query type is locate: for each pattern occurrence report its position in T, that is, such j that T[j .. j + m − 1] = P.

Related work

The history of full-text indexing starts with the suffix tree (ST) [7], a trie whose string collection is the set of all the suffixes of a given text, with an additional requirement that all non-branching paths of edges are converted into single edges. Each ST path is terminated as soon as it points to a unique suffix, whose start position is kept in the corresponding leaf. As there are n leaves, up to n − 1 internal nodes (as each internal node must have at least two children), and edge labels are represented with pointers to the text, it is easy to see that the suffix tree takes O(n) words of space, i.e., O(n log n) bits.

Suffix trees can be built in linear time for integer alphabets [8]. Assuming constant-time access to any child of a given node, the search in the ST takes only O(m + occ) time in the worst case. In practice, this is cumbersome for a large alphabet, of size n^ω(1), as it requires using perfect hashing, which also makes the construction time linear only in expectation. A small alphabet is easier to handle, which is one of the reasons for the wide use of suffix trees in bioinformatics.
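The count and locate queries defined above can be illustrated with a plain, uncompressed suffix array and binary search. This is a minimal sketch rather than any of the compact or hash-based variants the paper studies: the function names are assumptions, positions are 0-based, and the array is built by naive sorting for brevity.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _match_range(text, sa, pattern):
    """Half-open range of suffix array rows whose suffixes start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first row whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:  # first row whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


def count(text, sa, pattern):
    """Number occ >= 0 of occurrences of pattern in text."""
    left, right = _match_range(text, sa, pattern)
    return right - left


def locate(text, sa, pattern):
    """All positions j with text[j:j + m] == pattern, in increasing order."""
    left, right = _match_range(text, sa, pattern)
    return sorted(sa[left:right])


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(count(text, sa, "abra"))   # 2
    print(locate(text, sa, "abra"))  # [0, 7]
```

Each binary search step compares up to m characters, so this search runs in O(m log n) time, compared with the O(m + occ) bound quoted for the suffix tree. The array itself stores n integers, that is, n log_2 n bits when each entry is packed into log_2 n bits.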