2009
DOI: 10.1186/1748-7188-4-3

Lossless filter for multiple repeats with bounded edit distance

Abstract: Background: Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part of the analysis of biological sequences and of their phylogenetic relationship. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions is, however, known to be a computationally expensive task; consequently, exact methods usually cannot be applied in practice.

Cited by 16 publications (20 citation statements)
References 22 publications
“…For future work, we will explore the possibility of optimising our algorithms and the corresponding library implementation for the approximate case by using lossless filters for eliminating a possibly large fraction of the input that is guaranteed not to contain any approximate occurrence, such as [ 31 ] for the Hamming distance model or [ 32 ] for the edit distance model. In addition, we will try to improve our algorithms for the approximate case in order to achieve average-case optimality.…”
Section: Discussion
confidence: 99%
“…The literature of algorithmic approaches and software tools for finding motifs and repetitions is vast, as the variability of the problem formulations leads to a variability of algorithmic strategies, and often to combinations of them. For finding long repetitions [39], for example, a preprocessing with an efficient and effective filtering [24,23,28] turns out to be the only possible combinatorial approach. For short motifs there are several enumerative pattern-driven algorithms [19,30,31,34].…”
Section: Related Work
confidence: 99%
“…An exception to this are the q-gram filtering techniques [32] that have successfully been used for string matching under the edit distance model (e.g. [7,30,26]), as well as for multiple local alignments both under the Hamming [27] and edit [26] distance model.…”
Section: Introduction
confidence: 99%
“…We introduce the β-blockwise q-gram distance between two strings x and y, that is, a more powerful generalization of the q-gram distance introduced as a string distance measure in [32]. Intuitively, and similarly to [7,30,26], this generalization comprises partitioning x and y in β blocks each, as evenly as possible, computing the q-gram distance between the corresponding block pairs, and then summing up the distances computed blockwise. 2.…”
Section: Introduction
confidence: 99%
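The β-blockwise q-gram distance quoted above can be sketched directly from its description: partition both strings into β blocks as evenly as possible, compute the classic q-gram distance (the summed absolute difference of q-gram occurrence counts) between corresponding block pairs, and sum. This is an illustrative reconstruction of the quoted definition, not the citing authors' implementation; the function names are my own.

```python
from collections import Counter

def qgram_distance(x, y, q):
    """Classic q-gram distance: sum, over all q-grams, of the absolute
    difference of their occurrence counts in x and y."""
    cx = Counter(x[i:i + q] for i in range(len(x) - q + 1))
    cy = Counter(y[i:i + q] for i in range(len(y) - q + 1))
    return sum(abs(cx[g] - cy[g]) for g in cx.keys() | cy.keys())

def blockwise_qgram_distance(x, y, q, beta):
    """beta-blockwise q-gram distance as described in the quoted passage:
    split x and y into beta blocks each, as evenly as possible, take the
    q-gram distance between corresponding blocks, and sum blockwise."""
    def blocks(s):
        base, rem = divmod(len(s), beta)
        out, start = [], 0
        for i in range(beta):
            end = start + base + (1 if i < rem else 0)
            out.append(s[start:end])
            start = end
        return out
    return sum(qgram_distance(bx, by, q)
               for bx, by in zip(blocks(x), blocks(y)))
```

With β = 1 this reduces to the plain q-gram distance; larger β makes the measure more position-sensitive, since matching q-grams in non-corresponding blocks no longer cancel.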