A technique for computer detection and correction of spelling errors

Damerau, Fred J.

doi:10.1145/363958.363994

Cited by 1,129 publications

(674 citation statements)

References 3 publications

“…Some examples of this approach are Hamming distance [6], Levenshtein distance [7], Damerau-Levenshtein [7], [8], Needleman-Wunsch [9], Longest Common Subsequence [10], Smith-Waterman [11], Jaro [12], Jaro-Winkler [13], and N-gram [14], [15].…”

Section: Text Similarity Algorithms

mentioning

confidence: 99%

The performance of text similarity algorithms

Prasetya

Wibawa

Hirashima

2018

Int. J. Adv. Intell. Informatics

Text similarity measurement compares a text with available references to indicate the degree of similarity between them. Many studies of text similarity have produced a variety of approaches and algorithms. This paper investigates four major text similarity measurements: string-based, corpus-based, knowledge-based, and hybrid similarities. The results of the investigation showed that the semantic similarity approach is more rational in finding substantial relationships between texts.

“…To identify plausible misspellings, we rely on the Damerau-Levenshtein distance [2,6]: the minimum number of insertions, deletions, substitutions or transpositions required to transform one string into another. For example, faceboolk, facebok, faceboik, and faceboko each have a Damerau-Levenshtein distance of 1 from facebook.…”

Section: Identifying Typosquatting Domains

mentioning

confidence: 99%

Measuring the Perpetrators and Funders of Typosquatting

Moore

Edelman

2010

Financial Cryptography and Data Security

Abstract. We describe a method for identifying "typosquatting", the intentional registration of misspellings of popular website addresses. We estimate that at least 938 000 typosquatting domains target the top 3 264 .com sites, and we crawl more than 285 000 of these domains to analyze their revenue sources. We find that 80% are supported by pay-per-click ads, often advertising the correctly spelled domain and its competitors. Another 20% include static redirection to other sites. We present an automated technique that uncovered 75 otherwise legitimate websites which benefited from direct links from thousands of misspellings of competing websites. Using regression analysis, we find that websites in categories with higher pay-per-click ad prices face more typosquatting registrations, indicating that ad platforms such as Google AdWords exacerbate typosquatting. However, our investigations also confirm the feasibility of significantly reducing typosquatting. We find that typosquatting is highly concentrated: Of typo domains showing Google ads, 63% use one of five advertising IDs, and some large name servers host typosquatting domains as much as four times as often as the web as a whole.
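
The distance quoted in this entry can be checked with a short dynamic-programming sketch. The function name `osa_distance` is hypothetical, and the code is an assumption about one common way to compute the metric, not the method used by the citing authors: it implements the restricted variant (optimal string alignment), in which a transposed pair of characters is not edited again. That variant is sufficient for the distance-1 examples in the quote.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of two
    adjacent characters, each at unit cost.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    for typo in ("faceboolk", "facebok", "faceboik", "faceboko"):
        print(typo, osa_distance("facebook", typo))
```

Each of the four misspellings in the quote is one operation away from facebook: an insertion, a deletion, a substitution and a transposition, respectively. The table costs O(len(a) × len(b)) time and memory, which is acceptable for domain-length strings.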

“…The Damerau-Levenshtein metric, also known as edit distance, is a measure of string similarity defined as the minimal number of operations needed to transform one string into another (Damerau, 1964). The distance between the names s and t would be the number of edit operations that convert s into t. Assuming that most misspellings are single-character errors, as has been shown by different studies (Damerau, 1989; Petersen, 1986; Pollok & Zamora, 1983), the edit operations would consist of the insertion, deletion, or substitution of a single character, or the transposition of two characters, taking into account the cost of each operation.…”

Section: Similarity Relations

mentioning

confidence: 99%

Approximate personal name‐matching through finite‐state graphs

Gálvez

Anegón

2007

J. Am. Soc. Inf. Sci.

This article shows how finite-state methods can be employed in a new and different task: the conflation of personal name variants in standard forms. In bibliographic databases and citation index systems, variant forms create problems of inaccuracy that affect information retrieval, the quality of information from databases, and the citation statistics used for the evaluation of scientists' work. A number of approximate string matching techniques have been developed to validate variant forms, based on similarity and equivalence relations. We classify the personal name variants as nonvalid and valid forms. In establishing an equivalence relation between valid variants and the standard form of its equivalence class, we defend the application of finite-state transducers. The process of variant identification requires the elaboration of: (a) binary matrices and (b) finite-state graphs. This procedure was tested on samples of author names from bibliographic records, selected from the Library and Information Science Abstracts and Science Citation Index Expanded databases. The evaluation involved calculating the measures of precision and recall, based on completeness and accuracy. The results demonstrate the usefulness of this approach, although it should be complemented with methods based on similarity relations for the recognition of spelling variants and misspellings.
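
The single-error assumption in the quoted statement can be illustrated with a small classifier that reports which one of the four edit operations, if any, converts a name s into a name t. The function name `single_edit` and the sample names are hypothetical, every operation is assumed to have unit cost, and this is a sketch of the idea rather than the procedure of the cited article.

```python
from typing import Optional


def single_edit(s: str, t: str) -> Optional[str]:
    """Name the single edit operation that turns s into t.

    Returns "insertion", "deletion", "substitution" or "transposition",
    or None when s equals t or more than one operation is needed.
    """
    if s == t:
        return None

    # Skip the longest common prefix; the edit must start at index i.
    i = 0
    while i < len(s) and i < len(t) and s[i] == t[i]:
        i += 1

    if len(s) == len(t):
        # Same length: one changed character or one swapped adjacent pair.
        if s[i + 1:] == t[i + 1:]:
            return "substitution"
        if (i + 1 < len(s)
                and s[i] == t[i + 1]
                and s[i + 1] == t[i]
                and s[i + 2:] == t[i + 2:]):
            return "transposition"
        return None

    if len(t) == len(s) + 1 and s[i:] == t[i + 1:]:
        return "insertion"  # t has one extra character at index i
    if len(s) == len(t) + 1 and s[i + 1:] == t[i:]:
        return "deletion"   # s has one extra character at index i
    return None


if __name__ == "__main__":
    for variant in ("Dameraux", "Dameru", "Damerao", "Damearu", "Dmaeru"):
        print(variant, single_edit("Damerau", variant))
```

Because the check only compares the strings on either side of the first mismatch, it runs in linear time and avoids building a full distance table when only single-character errors are of interest.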
