Error correction vs. query garbling for Arabic OCR document retrieval

Darwish, Kareem; Magdy, Walid

doi:10.1145/1292591.1292596

Cited by 7 publications

(6 citation statements)

References 18 publications

(27 reference statements)


“…Our formulation is similar to approaches taken in OCR document retrieval, using degradations of character sequences (Darwish and Magdy, 2007; Darwish, 2003). For vocabulary-independent spoken term detection, perhaps the most closely related formulation is provided by (Mamou and Ramabhadran, 2008).…”

Section: Incorporating Query Degradations (mentioning)

confidence: 99%

Phrase-based query degradation modeling for vocabulary-independent ranked utterance retrieval

Olsson

Oard

2009

Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Com


This paper introduces a new approach to ranking speech utterances by a system's confidence that they contain a spoken word. Multiple alternate pronunciations, or degradations, of a query word's phoneme sequence are hypothesized and incorporated into the ranking function. We consider two methods for hypothesizing these degradations, the best of which is constructed using factored phrase-based statistical machine translation. We show that this approach is able to significantly improve upon a state-of-the-art baseline technique in an evaluation on held-out speech. We evaluate our systems using three different methods for indexing the speech utterances (using phoneme, phoneme multigram, and word recognition), and find that degradation modeling shows particular promise for locating out-of-vocabulary words when the underlying indexing system is constructed with standard word-based speech recognition.


“…The main concept in [2] was creating a character level alignment from random words and then using a garbler to select a single edit operation, and accordingly a new character is inserted, deleted or substituted. In [2,3] the language models were used to obtain a better ranking of candidate words that corrects the OCR output. Our suggested improvements are based on: (a) adding more edit operations, (b) modeling correction rules and (c) improving the language models.…”

Section: Related Work (mentioning)

confidence: 99%
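The single-edit garbler mentioned in the quotation above can be illustrated with a minimal sketch. It assumes edit operations and characters are drawn uniformly at random, which is a simplification made here for illustration and not the error model of the cited work; the function name `garble_once` and its parameters are hypothetical.

```python
import random


def garble_once(word, alphabet, rng):
    """Apply exactly one random edit (insert, delete, or substitute) to word.

    Assumption: operations and characters are chosen uniformly; a real
    garbler would draw them from a model trained on aligned OCR errors.
    """
    if len(set(alphabet)) < 2:
        raise ValueError("alphabet needs at least two distinct symbols")
    op = rng.choice(["insert", "delete", "substitute"])
    if op == "insert" or not word:
        # Insert a random character at a random position (also used for "").
        pos = rng.randrange(len(word) + 1)
        return word[:pos] + rng.choice(alphabet) + word[pos:]
    pos = rng.randrange(len(word))
    if op == "delete":
        return word[:pos] + word[pos + 1:]
    # Substitute with a character that differs from the original one.
    choices = [c for c in alphabet if c != word[pos]]
    return word[:pos] + rng.choice(choices) + word[pos + 1:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print([garble_once("ktab", "abkt", rng) for _ in range(5)])
```

Each call returns a string that differs from the input by one character edit, so repeated calls yield a set of garbled variants of a query term.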

“…The criterion considered in [2,3] for alignment was the position of the erroneous characters in the word. From our point of view, this method needs to be improved so that an edit operation depends on other factors (e.g.…”

Section: Related Work (mentioning)

confidence: 99%

“…In this paper, we aim to tackle three aspects: the first is to model general correction rules that target classes of characters rather than a single specific one to improve the concept of error-n-gram [1] and alignment [2,3] that will be discussed in the next section. These rules will be used to correct not only the lexical errors, but also the semantic ones, as well as other determined types of errors.…”

Section: Introduction (mentioning)

confidence: 99%


Correcting Arabic OCR Errors Using Improved Topic-Based Language Models

Mamish

Cheriet

2009

Int. J. Comp. Proc. Lang.


The OCR output of scanned document images suffers from recognition errors especially when dealing with languages that are characterized by particularities and rich morphology such as the Arabic language, thus an effective error correction model is greatly needed. This paper focuses on three aspects of post-processing correction. First, improving the alignment and error n-gram models by adding correction rules based on character meta-classes rather than on specific characters, which is more suitable for the Arabic language. Second, using the language models to understand and correct the Arabic word fragment resulting from agglutinated affixes or isolated letters. The last will concern improving the language models by adding semantic information to the correction process, by using the bidirectional n-grams, stemming and removing stop words, which gives higher weights to n-grams sharing semantic meanings. In addition, we use a topic corpus, not a global one for a better probability distribution. The proposed model is effective in correcting the lexical errors and covered the semantic ones, that were not frequently reported by OCRs and are corrected after a manual proofreading. The proposed method shows an increase in the correction rate of almost 13% especially in meaningful terms.


“…The work of Darwish and Magdy (2007), for example, although distantly-related to ours, differs significantly since it is focused on monolingual retrieval of scanned documents containing OCR errors, instead of multilingual retrieval with misspelling errors present in the queries, as is our case.…”

Section: Introduction (mentioning)

confidence: 99%

Studying the effect and treatment of misspelled queries in Cross-Language Information Retrieval

Vilares

Alonso

Doval

et al. 2016

Information Processing & Management


In contrast with their monolingual counterparts, little attention has been paid to the effects that misspelled queries have on the performance of Cross-Language Information Retrieval (CLIR) systems. The present work makes a first attempt to fill this gap by extending our previous work on monolingual retrieval in order to study the impact that the progressive addition of misspellings to input queries has, this time, on the output of CLIR systems. Two approaches for dealing with this problem are analyzed in this paper. Firstly, the use of automatic spelling correction techniques for which, in turn, we consider two algorithms: the first one for the correction of isolated words and the second one for a correction based on the linguistic context of the misspelled word. The second approach to be studied is the use of character n-grams both as index terms and translation units, seeking to take advantage of their inherent robustness and language-independence. All these approaches have been tested on a from-Spanish-to-English CLIR system, that is, Spanish queries on English documents. Real, user-generated spelling errors have been used under a methodology that allows us to study the effectiveness of the different approaches to be tested and their behavior when confronted with different error rates. The results obtained show the great sensitiveness of classic word-based approaches to misspelled queries, although spelling correction techniques can mitigate such negative effects. On the other hand, the use of character n-grams provides great robustness against misspellings.
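The robustness of character n-grams to misspellings, as described in the abstract above, can be seen in a small sketch. It assumes overlapping trigrams over whitespace-separated words and a Jaccard overlap as the similarity measure; both choices and the helper names `char_ngrams` and `overlap` are illustrative assumptions, not the configuration of the cited study.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of each word in text."""
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)  # short words are kept whole
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def overlap(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings."""
    ga, gb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


if __name__ == "__main__":
    # A word-level index sees no match between these two terms,
    # while their trigram sets still share "ret", "etr", and "val".
    print(char_ngrams("retrieval"))
    print(overlap("retrieval", "retreival"))
```

A misspelled query term therefore still shares index terms with the correctly spelled document term, whereas exact word matching would score it as zero.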

