Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1220
Multi-Input Attention for Unsupervised OCR Correction

Abstract: We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either training to match noi…
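The abstract's "multi-input attention averaging" can be pictured as computing an attention context over each noisy input's encoder states and averaging the resulting context vectors at every decoding step. The following is a minimal NumPy sketch of that combination under simple dot-product attention; the function names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Dot-product attention over one input's encoder states.

    decoder_state:  (d,)   current decoder hidden state
    encoder_states: (T, d) encoder states for one noisy input
    returns:        (d,)   attention-weighted context vector
    """
    scores = encoder_states @ decoder_state   # (T,)
    weights = softmax(scores)                 # (T,) sums to 1
    return weights @ encoder_states           # (d,)

def multi_input_context(decoder_state, inputs):
    """Average the per-input attention contexts (the 'attention
    averaging' combination from the abstract, sketched here)."""
    contexts = [attention_context(decoder_state, enc) for enc in inputs]
    return np.mean(contexts, axis=0)          # (d,)
```

With a single input this reduces to ordinary attention; with several noisy witnesses of the same passage, the averaged context lets the decoder lean on whatever the inputs agree about.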

Cited by 28 publications (44 citation statements) | References 29 publications
“…• BASE: This system is the base sequence-to-sequence architecture described in Section 5.1. Both the single-source and multi-source variants of this system are used for English OCR post-correction in Dong and Smith (2018).…”
Section: Methods
confidence: 99%
“…OCR post-correction The goal of post-correction is to reduce recognition errors in the first-pass transcription, often caused by low-quality scanning, physical deterioration of the paper book, or diverse layouts and typefaces (Dong and Smith, 2018). The focus of our work is on using post-correction to counterbalance the lack of OCR training data in the target endangered languages.…”
Section: Formulation
confidence: 99%
“…Once all the candidates are obtained, a hypothesis lattice is created and a single word selected using a supervised discriminative machine learning tool [26]. Similarly, Dong and Smith [4] create an unsupervised framework for OCR error correction for both single-input and multi-input correction tasks. They focus specifically on the data with several OCR versions which they align to create parallel OCR data.…”
Section: Generation Of Correction Candidates 2 Decision Making To Ac…
confidence: 99%
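The citation above notes that Dong and Smith align several OCR versions of the same text to create parallel training data. A lightweight way to sketch that alignment step is character-level matching between two transcripts, here using Python's standard `difflib.SequenceMatcher` as a stand-in for whatever alignment method is actually used; the function name is hypothetical.

```python
import difflib

def align_ocr_versions(a, b):
    """Character-level alignment of two OCR transcripts of the same
    passage. Returns (op, span_a, span_b) triples: 'equal' spans agree,
    'replace'/'insert'/'delete' spans mark candidate noisy/target pairs.
    A sketch only; not the paper's alignment procedure."""
    sm = difflib.SequenceMatcher(None, a, b)
    return [(op, a[i1:i2], b[j1:j2]) for op, i1, i2, j1, j2 in sm.get_opcodes()]

# Two noisy witnesses of the same sentence (illustrative strings):
pairs = align_ocr_versions("Tha quick hrown fox", "The quick brown fox")
```

The `equal` spans give high-confidence text, while the mismatching spans supply the noisy-input/target pairs that unsupervised training can consume.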
“…Another approach is to train a supervised system from synthetic training data, using features such as proposed spelling corrections (Lund et al, 2011). Dong and Smith (2018) propose an alternative unsupervised training technique for OCR post-correction, which builds on character-level LSTMs. In their method, which they call seq2seq-noisy, they build an ensemble of post-processing systems.…”
Section: Related Work
confidence: 99%