Jan Holub scite author profile

In a string x on an alphabet Σ, a position i is said to be indeterminate iff x[i] may be any one of a specified subset {λ 1 , λ 2 , . . . , λ j } of Σ, 2 j |Σ|. A string x containing indeterminate positions is therefore also said to be indeterminate. Indeterminate strings can arise in DNA and amino acid sequences as well as in cryptological applications and the analysis of musical texts. In this paper we describe fast algorithms for finding all occurrences of a pattern p = p[1..m] in a given text x = x[1..n], where either or both of p and x can be indeterminate. Our algorithms are based on the Sunday variant of the Boyer-Moore patternmatching algorithm, one of the fastest exact pattern-matching algorithms known. The methodology we describe applies more generally to all variants of Boyer-Moore (such as Horspool's, for example) that depend only on calculation of the δ ("rightmost shift") array: our method therefore assumes that Σ is indexed (essentially, an integer alphabet), a requirement normally satisfied in practice.

show abstract

Tuning BNDM withq-Grams

Ďurian¹,

Holub²,

Peltola³

et al. 2009

View full text Add to dashboard Cite

We develop bit-parallel algorithms for exact string matching. Our algorithms are variations of the BNDM and Shift-Or algorithms. At each alignment the algorithms read a q-gram before testing the state variable. In addition we apply reading a 2-gram in one instruction. Our experiments show that many of the new variations are substantially faster than any previous string matching algorithm on x86 processors for English and DNA data.

show abstract

Data Structures to Represent a Set of k -long DNA Sequences

Chikhi

Holub

Medvedev³

2021

ACM Comput. Surv.

View full text Add to dashboard Cite

The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k -mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k -mer set has emerged as a shared underlying component. A set of k -mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k -mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.

show abstract

On Parallel Implementations of Deterministic Finite Automata

Holub

Stekr

2009

View full text Add to dashboard Cite

Suffix Tree of Alignment: An Efficient Index for Similar Data

Park

Crochemore

et al. 2013

View full text Add to dashboard Cite

Abstract. We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings A and B is a compacted trie representing all suffixes in A and B. It has |A|+|B| leaves and can be constructed in O(|A|+|B|) time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of A and B. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of A and B has |A|+l d +l1 leaves where l d is the sum of the lengths of all parts of B different from A and l1 is the sum of the lengths of some common parts of A and B. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern P in O(|P | + occ) time where occ is the number of occurrences of P in A and B. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires O(|A| + l d + l1 + l2) time where l2 is the sum of the lengths of other common substrings of A and B. When the suffix tree of A is already given, it requires O(l d + l1 + l2) time.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Jan Holub

Fast pattern-matching on indeterminate strings

Tuning BNDM withq-Grams

Data Structures to Represent a Set of k -long DNA Sequences

On Parallel Implementations of Deterministic Finite Automata

Suffix Tree of Alignment: An Efficient Index for Similar Data

Contact Info

Product

Resources

About