2005
DOI: 10.1007/11575832_13

N-Gram Similarity and Distance

Abstract: In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, resp…
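
As an illustration of the two baseline measures named in the abstract, the following is a minimal Python sketch of edit distance and longest-common-subsequence length. It assumes the standard unit-cost Levenshtein formulation (insertion, deletion, and substitution each cost 1); the function names `edit_distance` and `lcs_length` are illustrative and are not taken from the paper.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost Levenshtein distance (assumed formulation)."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty string
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a symbol in x or y
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Both functions keep only two rows of the dynamic-programming table, so memory use grows with the length of the second string rather than with the product of the two lengths.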

Cited by 214 publications (109 citation statements)
References 7 publications
“…Damerau-Levenshtein [7], [8], Needleman-Wunsch [9], Longest Common Subsequence [10], Smith-Waterman [11], Jaro [12], Jaro-Winkler [13], and N-gram [14], [15]. Character-based measure is useful for recognizing typographical errors, but it is useless in recognition of the rearranged terms (e.g.…”
Section: Text Similarity Algorithms (mentioning)
confidence: 99%
“…EFL learners often fail to recognise such words as cognates to the effect that they remain as unpredictable as any other non-cognate word (Nagy et al., 1993). Orthographic similarity was checked when in doubt with the BI-SIM string comparison method (Kondrak, 2005) using the web interface designed by Bhargava at http://www.cs.toronto.edu/~aditya/strcmp2/. This method involves a comparison of all pairs of adjacent letters (bi-gram comparisons) in two orthographic strings (an English word and its Turkish equivalent in this case).…”
Section: Cognates (mentioning)
confidence: 99%
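
To make the idea of bi-gram comparison concrete, here is a minimal Python sketch. Note the assumption: it computes a Dice coefficient over character bigrams, which is a simpler, order-insensitive measure and is not the BI-SIM method cited in the statement above. The function names and the example word pair are hypothetical.

```python
from collections import Counter


def bigrams(word: str) -> list[str]:
    """Return all pairs of adjacent letters in word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice_bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams (not BI-SIM; see lead-in)."""
    a, b = a.lower(), b.lower()
    grams_a, grams_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Neither string has a bigram (length 0 or 1): fall back to equality.
        return 1.0 if a == b else 0.0
    # Multiset intersection counts each shared bigram at most as often
    # as it occurs in both strings.
    shared = sum((grams_a & grams_b).values())
    return 2 * shared / total


if __name__ == "__main__":
    # Hypothetical English/Turkish pair, chosen only for illustration.
    print(dice_bigram_similarity("telephone", "telefon"))
```

Unlike an alignment-based measure, this score ignores where in the word a shared bigram occurs, so it should be read only as a rough indication of orthographic overlap.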
“…There are several string-based techniques that could be applied for phonetic transcriptions similarity matching: Edit Distance - finds how dissimilar two strings are by counting the minimum number of operations required to transform one string into another; Jaro-Winkler measure (Winkler, 1999), N-gram similarity function (Kondrak, 2005), Soundex (Russell and Odell, 1918) - phonetic similarity measure, which principle of operation is based on the partition of consonants in the group with serial numbers from which then compiled the resulting value; Daitch-Mokotoff (Mokotoff, 1997) has much more complex conversion rules than in Soundex - now shaping the resulting code involved not only single characters, but also a sequence of several characters; Metaphone - transforms the original word with the rules of English language, using much more complex rules, and thus lost significantly less information as letters are not divided into groups (Euzenat and Shvaiko, 2013). In our solution we allow utilisation of several measuring functions with further weighted aggregation of the results (e.g., weighted product or weighted sum).…”
Section: Phonetic Similarity (mentioning)
confidence: 99%