Fast, Practical Algorithms for Computing All the Repeats in a String

Puglisi, Simon J.; Smyth, W. Franklin; Yusufu, Munina

doi:10.1007/s11786-010-0033-6

Cited by 13 publications

(5 citation statements)

References 16 publications

“…It turns out that RSF can be used to compute all the non-extendible repeating substrings in w. These data structures are important in bioinformatics applications; algorithms to compute them were described in [14,15,16] using suffix trees or suffix arrays. We introduce the inverse RSF array IRSF to compute all non-extendible repeating substrings in w.…”

Section: Computing Non-extendible Repeating Substrings In Strings Usi (mentioning)

confidence: 99%

Frequency Covers for Strings

Mhaskar

Smyth

2018

Self Cite

We study a central problem of string processing: the compact representation of a string by its frequently-occurring substrings. In this paper we propose an effective, easily-computed form of quasi-periodicity in strings, the frequency cover; that is, the longest of those repeating substrings u of w, |u| > 1, that occurs the maximum number of times in w. The advantage of this generalization is that it is not only applicable to all strings but also that it is the only generalized notion of cover yet proposed, which can be computed efficiently in linear time and space. We describe a simple data structure called the repeating substring frequency array (RSF array) for the string w, which we show can be constructed in O(n) time and O(n) space, where |w| = n. We then use RSF to compute all the frequency covers of w in linear time and space. Our research also allows us to give an alternate algorithm to compute all non-extendible repeating substrings in w, also in O(n) time and space.
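The definition of a frequency cover in the abstract above can be illustrated with a brute-force sketch. This is not the paper's linear-time RSF-array method; it simply counts every substring of length at least 2 (overlapping occurrences included, which is an assumption about how occurrences are counted) and applies the definition directly. The function name `frequency_covers` is hypothetical.

```python
from collections import Counter


def frequency_covers(w: str) -> list[str]:
    """Naive illustration of the frequency-cover definition.

    Enumerates all O(n^2) substrings, so it is only suitable for short
    strings. It is NOT the O(n) RSF-array algorithm from the paper.
    """
    n = len(w)
    # Count every occurrence of every substring u with |u| > 1.
    counts = Counter(w[i:j] for i in range(n) for j in range(i + 2, n + 1))
    # A repeating substring occurs at least twice.
    repeats = {u: c for u, c in counts.items() if c >= 2}
    if not repeats:
        return []
    # Keep the substrings that occur the maximum number of times ...
    max_freq = max(repeats.values())
    most_frequent = [u for u, c in repeats.items() if c == max_freq]
    # ... and, among those, the longest ones.
    max_len = max(len(u) for u in most_frequent)
    return sorted(u for u in most_frequent if len(u) == max_len)


print(frequency_covers("abababa"))
```

For `abababa`, the substrings `ab`, `ba`, and `aba` each occur three times, and `aba` is the longest of them, so it is the one reported by this sketch.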

“…Consequently, we get Theorems 5.4 and 5.5. Note that there are other linear time algorithms proposed to compute non-extendible repeating substrings in a string that are more space efficient [16]. We present this algorithm to show the usefulness of the RSF data structure.…”

Section: Proof (mentioning)

confidence: 99%

Frequency Covers for Strings

Mhaskar

Smyth

2018

Self Cite

“…We are using an algorithm by Puglisi et al. based on suffix arrays to identify all repeats with at least two tokens in a cloned fragment. On the basis of the repeats, we use the fractions of tokens of a fragment that are not covered by any repeat NR as a metric for repetitiveness.…”

Section: Our Approach (mentioning)

confidence: 99%
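The NR metric described in the statement above (the fraction of a fragment's tokens not covered by any repeat of at least two tokens) can be sketched as follows. The citing paper uses the suffix-array algorithm of Puglisi et al.; this sketch swaps in a simpler approach that relies on the observation that every repeat of two or more tokens is a union of repeated adjacent token pairs, so marking repeated pairs covers the same positions. The function name `non_repetitive_fraction`, the treatment of overlapping occurrences as repeats, and the value returned for an empty fragment are assumptions.

```python
from collections import defaultdict


def non_repetitive_fraction(tokens: list[str]) -> float:
    """Fraction of tokens not covered by any repeat of length >= 2.

    Illustrative stand-in for a suffix-array-based computation; it only
    reproduces the coverage, not the list of repeats itself.
    """
    n = len(tokens)
    if n == 0:
        return 0.0  # assumption: an empty fragment has nothing uncovered
    # Record the start positions of every adjacent token pair.
    starts_by_pair = defaultdict(list)
    for i in range(n - 1):
        starts_by_pair[(tokens[i], tokens[i + 1])].append(i)
    # A pair seen at two or more positions is a repeat; mark its tokens.
    covered = [False] * n
    for starts in starts_by_pair.values():
        if len(starts) >= 2:
            for i in starts:
                covered[i] = True
                covered[i + 1] = True
    return covered.count(False) / n


print(non_repetitive_fraction(["a", "b", "c", "a", "b", "d"]))
```

In the example, the pair `a b` occurs twice and covers four of the six tokens, leaving `c` and `d` uncovered.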

Large‐scale inter‐system clone detection using suffix trees and hashing

Koschke

2013

J Software Evolu Process

Detecting a similar code between two systems has various applications such as comparing two software variants or versions or finding potential license violations. Techniques detecting suspiciously similar code must scale in terms of resources needed to very large code corpora and need to have high precision because a human needs to inspect the results. This paper demonstrates how suffix trees can be used to obtain a scalable comparison. The evaluation is carried out for very large code corpora. Our evaluation shows that our approach is faster than index-based techniques when the analysis is run only once. If the analysis is to be conducted multiple times, creating an index pays off. We report how much code can be filtered out from the analysis using an index-based filter. In addition to that, this paper proposes a method to improve precision through user feedback. A user validates a sample of the found clone candidates. An automated data mining technique learns a decision tree on the basis of the user decisions and different code metrics. We investigate the relevance of several metrics and whether criteria learned from one application domain can be generalized to other domains.

All of the aforementioned variants of clone detection are facing challenges with respect to detection quality and scalability. Detection quality requires high recall and high precision in finding the relevant code. Relevance depends on the use case. In particular, inter-system and intra-system clone detections need to deal with re-occurring similar code that is similar from a lexical or syntactical point of view, but that is not interesting for the given task. Frequent examples of such irrelevant similar code are import statement lists, array initializers, setter/getter sequences, or sequences of pure declarations or simple assignments.

Another challenge is scalability. Whereas intra-system clone detection searches only within one system, inter-system clone search may face a much larger code base, often larger by orders of magnitude. Also, fragment search may face this problem, when the code is searched in very large software repositories [3,4].

Several researchers have recently proposed to use an index-based code search to address scalability for the search in very large code bases [13,3,4,17,18]. The index-based techniques first create an index against which code of a subject system is compared later. The purpose of the index is to identify the code that has a chance of being similar. The code filtered out by the index is not compared. The index is a first seed of a similar code fragment. This seed is then extended by merging with neighboring similar code fragments [13,3,4].

Creating the index can be expensive. The idea is to invest upfront in an index that is created only once but whose cost is amortized in multiple subsequent searches.

Contributions. Our conference paper introduced a way to extend traditional suffix-tree-based clone detection for inter-system clone search that scales for very large programs [19]. This approach avoids the nee...

“…There are well-known algorithms for computing maximal repeats in linear time, using a data structure from the suffix family (like suffix tree or suffix array) [12] and Gusfield [7] outlines an algorithm to compute largest-maximal repeat. For our experiments we used in all cases a linear implementation using the suffix array which processes roughly 500K- …”

Section: Of Course ωLmr(n) Is Upper-bounded By O(n^2) It Is However (mentioning)

confidence: 99%

The bag-of-repeats representation of documents

Gallé

2013

Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval

n-gram representations of documents may improve over a simple bag-of-word representation by relaxing the independence assumption of word and introducing context. However, this comes at a cost of adding features which are nondescriptive, and increasing the dimension of the vector space model exponentially. We present new representations that avoid both pitfalls. They are based on sound theoretical notions of stringology, and can be computed in optimal asymptotic time with algorithms using data structures from the suffix family. While maximal repeats have been used in the past for similar tasks, we show how another equivalence class of repeats (largest-maximal repeats) obtain similar or better results, with only a fraction of the features. This class acts as a minimal generative basis of all repeated substrings. We also report their use for topic modeling, showing easier to interpret models.
