Text induced spelling correction

Reynaert, Martin

doi:10.3115/1220355.1220475

Cited by 23 publications

(30 citation statements)

References 4 publications


“…This technique also provides ways of incorporating phonetic similarity, proximity to the keyword and data from the actual spelling mistakes made by users. Its greatest advantage, however, is the possibility of generating contextual information, which adds linguistically-motivated features (Hirst and Budanitsky, 2005; Reynaert, 2004) to the string distance module (Jiang and Conrath, 1997) and suggests that the difference in average precision in misspelled texts can be reduced to a few percentage points in comparison with properly-spelled ones (Ruch, 2002). More appropriate for dealing with real-word errors, its success depends as much on the wealth of knowledge accumulated as on the way in which this is acquired and then used.…”

Section: The Spelling Correction Approach (mentioning)

confidence: 99%

“…Focusing first on entire dictionary entries, spelling correction is a well known subject matter in NLP (Mitton, 2009; Reynaert, 2004; Savary, 2001; Vilares et al., 2004), often based on the notion of edit distance (Levenshtein, 1966). When dealing with misspelled queries, the aim is to replace the erroneous term or terms in the query with those considered to be the correct ones and whose edit distance with regard to the former is the smallest possible.…”

Section: The Spelling Correction Approach (mentioning)

confidence: 99%
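
The edit-distance idea in the statement above can be sketched as follows. This is a minimal illustration, not the method of any of the cited papers: the function names `levenshtein` and `closest_terms` and the toy lexicon are assumptions introduced here, and the distance is the plain Levenshtein variant with unit costs for insertion, deletion and substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb, or match
            ))
        previous = current
    return previous[-1]


def closest_terms(term: str, lexicon: list[str]) -> list[str]:
    """Return the lexicon entries at the smallest edit distance from term."""
    if not lexicon:
        return []
    distances = {word: levenshtein(term, word) for word in lexicon}
    best = min(distances.values())
    return sorted(word for word, d in distances.items() if d == best)


# Hypothetical toy lexicon, for illustration only.
print(closest_terms("speling", ["spelling", "spilling", "selling"]))
```

Several entries can tie at the smallest distance, which is why the sketch returns a list rather than a single correction; choosing among them is where context-dependent methods come in.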

“…• Context-dependent word correction (Otero et al., 2007; Reynaert, 2004), which is able to address the real-word error case and the correction of non-word errors that have more than one potential correction.…”

Section: Introduction (mentioning)

confidence: 99%


Managing misspelled queries in IR applications

Vilares

Otero

2011

Information Processing & Management


“…We propose an adaptation of the core correction algorithm we have described in depth in [18]. Anagram Hashing first uses a bad hashing function to identify all word strings in the corpus at hand that consist of the same subset of characters and assigns a large natural number to them, to be used as an index.…”

Section: Anagram Hashing (mentioning)

confidence: 99%
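
A minimal sketch of the indexing step described in the statement above: a deliberately collision-prone hash maps every string built from the same characters to one large natural number, which then serves as an index key. The specific function used here (the sum of each character's code point raised to the fifth power) and the names `anagram_key` and `build_anagram_index` are assumptions for illustration; the quoted passage does not specify the function.

```python
from collections import defaultdict


def anagram_key(word: str, power: int = 5) -> int:
    """Order-insensitive hash: anagrams always receive the same key.

    The exponent is an assumption of this sketch, not taken from the quote.
    """
    return sum(ord(ch) ** power for ch in word)


def build_anagram_index(words):
    """Group word strings by their anagram key."""
    index = defaultdict(set)
    for word in words:
        index[anagram_key(word)].add(word)
    return index


# Hypothetical toy corpus vocabulary.
index = build_anagram_index(["listen", "silent", "enlist", "spelling"])
print(sorted(index[anagram_key("listen")]))
```

Because the key is a sum over characters, adding, removing or swapping a character changes the key by a value that depends only on the characters involved, not on their position. That property is what makes such a hash usable for looking up near-variants numerically.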

Non-interactive OCR Post-correction for Giga-Scale Digitization Projects

Reynaert

Computational Linguistics and Intelligent Text Processing

Self Cite


Abstract. This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up or TICCL (pronounce 'tickle') focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within the predefined Levenshtein distance (henceforth: LD). Simple text-induced filtering techniques help to retain as many as possible of the true positives and to discard as many as possible of the false positives. TICCL has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR-quality is far lower and which is in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to properly evaluate our system, but also to draw effective conclusions towards the adaptation of the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to LD 2 mean that the bulk of undesirable OCR-induced typographical variation present can fully automatically be removed.
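
The variant-gathering step in this abstract can be illustrated with a brute-force sketch. It is not the system's actual mechanism: it compares every focus word against every other word type directly, and the frequency threshold `min_focus_freq`, the rule that a variant must be rarer than its focus word, and all function names are assumptions made for the example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each high-frequency focus word to rarer words within max_ld of it."""
    freq = Counter(tokens)
    focus_words = [w for w, n in freq.items() if n >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        variants[focus] = sorted(
            w for w in freq
            # Assumption: a variant is strictly less frequent than its focus word.
            if freq[w] < freq[focus] and 1 <= levenshtein(focus, w) <= max_ld
        )
    return variants


# Hypothetical OCR-like tokens: '1' misread for 'l'.
tokens = ["spelling"] * 5 + ["spel1ing", "spe11ing", "correction"]
print(gather_variants(tokens))
```

The pairwise comparison is quadratic in the number of word types, so it only suits small vocabularies; at corpus scale an index over the vocabulary would be needed instead.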


“…Hupkes [6] explored semi-supervised learning for tagging historical Dutch texts. Reynaert [17] developed TiCCl, a tool for normalizing Dutch texts by performing automatic spelling correction. The program Adelheid has specifically been developed for lemmatizing and tagging fourteenth-century Dutch [16].…”

Section: Related Work (mentioning)

confidence: 99%

Improving Part-of-Speech Tagging of Historical Text by First Translating to Modern Text

Sang

2016

IFIP Advances in Information and Communication Technology


Abstract. We explore the task of automatically assigning syntactic tags (known as part-of-speech tags) like Noun and Verb to words in seventeenth-century Dutch text. Tools exist for performing this task for modern texts but they perform poorly on historical texts because of language changes. We test several methods for translating the words in the historical text to modern equivalents before applying the tag assignment tools. We show that this additional translation step improves the quality of the automatic syntactic analysis. Further improvements are possible when the lexicons and text collections used for developing the translation process are extended in size.
