An improved error model for noisy channel spelling correction

Brill, Eric; Moore, Robert C.

doi:10.3115/1075218.1075255

Cited by 374 publications

(300 citation statements)

References 13 publications

Supporting

Mentioning

289

Contrasting

Unclassified

“…Through decoding, we want to find the target page T that most likely lead to the observed output S. The process is visualized in Figure 1. Therefore, like in the noisy channel model (Brill and Moore, 2000), to decode the input T, we estimate the probability of T given the output observation S, P(T|S). Following Bayes' rule, the problem is characterized by Equation 2:…”

Section: Language Model-based Approach (Ufal-2)

mentioning

confidence: 99%

Using Term Position Similarity and Language Modeling for Bilingual Document Alignment

Le, Vu, Oberländer et al. 2016

Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

The WMT Bilingual Document Alignment Task requires systems to assign source pages to their "translations", in a big space of possible pairs. We present four methods: The first one uses the term position similarity between candidate document pairs. The second method requires automatically translated versions of the target text, and matches them with the candidates. The third and fourth methods try to overcome some of the challenges presented by the nature of the corpus, by considering the string similarity of source URL and candidate URL, and combining the first two approaches.

“…Ahmed et al (2010) propose a spell checker that works by selecting the most promising candidates from a ranked list that is derived from n-gram statistics and lexical resources. Other approaches that correct spelling include rule-based techniques (Mangu and Brill, 1997), a noisy channel model (Brill and Moore, 2000; Toutanova and Moore, 2002) and a ternary tree search (Martins and Silva, 2004). As far as we know, little work has been made to date on the subject for Spanish, with the exception of Alonso (2010).…”

Section: Correction Candidate Selection

mentioning

confidence: 99%

Selection of correction candidates for the normalization of Spanish user-generated content

et al. 2014

We present research aiming to build tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires the revisiting of the initial steps of Natural Language Processing (NLP), since UGC (micro-blog, blog, and, generally, Web 2.0 user generated texts) presents a number of non-standard communicative and linguistic characteristics, often closer to oral and colloquial language than to edited text. We present a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews and blogs, and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging.

“…Going forward, noisy-channel models [14] specifically trained on the substitution and segmentation errors of a particular OCR shape classifier, such as e->c is more likely than o->x, would seem more appropriate than blind wildcarding. That should reduce the frequency of hallucinations of incorrect dictionary words, making the language model relatively more powerful.…”

Section: Conclusion: Speech Vs Ocr and Further Work

mentioning

confidence: 99%

Limits on the Application of Frequency-Based Language Models to OCR

Smith

2011

2011 International Conference on Document Analysis and Recognition

Although large language models are used in speech recognition and machine translation applications, OCR systems are "far behind" in their use of language models. The reason for this is not the laggardness of the OCR community, but the fact that, at high accuracies, a frequency-based language model can do more damage than good, unless carefully applied. This paper presents an analysis of this discrepancy with the help of the Google Books n-gram Corpus, and concludes that noisy-channel models that closely model the underlying classifier and segmentation errors are required.
