Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1220
Multi-Input Attention for Unsupervised OCR Correction

Abstract: We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either training to match noi…
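The abstract's "multi-input attention averaging" can be pictured as computing an attention context over each noisy input's encoder states and averaging the resulting context vectors at every decoding step. The following is a minimal NumPy sketch of that combination under simple dot-product attention; the function names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Dot-product attention over one input's encoder states.

    decoder_state:  (d,)   current decoder hidden state
    encoder_states: (T, d) encoder states for one noisy input
    returns:        (d,)   attention-weighted context vector
    """
    scores = encoder_states @ decoder_state   # (T,)
    weights = softmax(scores)                 # (T,) sums to 1
    return weights @ encoder_states           # (d,)

def multi_input_context(decoder_state, inputs):
    """Average the per-input attention contexts (the 'attention
    averaging' combination from the abstract, sketched here)."""
    contexts = [attention_context(decoder_state, enc) for enc in inputs]
    return np.mean(contexts, axis=0)          # (d,)
```

With a single input this reduces to ordinary attention; with several noisy witnesses of the same passage, the averaged context lets the decoder lean on whatever the inputs agree about.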

Cited by 28 publications (44 citation statements) | References 29 publications
“…• BASE: This system is the base sequence-to-sequence architecture described in Section 5.1. Both the single-source and multi-source variants of this system are used for English OCR post-correction in Dong and Smith (2018).…”
Section: Methods
confidence: 99%
“…OCR post-correction The goal of post-correction is to reduce recognition errors in the first-pass transcription, often caused by low-quality scanning, physical deterioration of the paper book, or diverse layouts and typefaces (Dong and Smith, 2018). The focus of our work is on using post-correction to counterbalance the lack of OCR training data in the target endangered languages.…”
Section: Formulation
confidence: 99%
“…Once all the candidates are obtained, a hypothesis lattice is created and a single word selected using a supervised discriminative machine learning tool [26]. Similarly, Dong and Smith [4] create an unsupervised framework for OCR error correction for both single-input and multi-input correction tasks. They focus specifically on the data with several OCR versions which they align to create parallel OCR data.…”
Section: Generation Of Correction Candidates 2 Decision Making To Ac…
confidence: 99%
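The citation above notes that Dong and Smith align several OCR versions of the same text to create parallel training data. A lightweight way to sketch that alignment step is character-level matching between two transcripts, here using Python's standard `difflib.SequenceMatcher` as a stand-in for whatever alignment method is actually used; the function name is hypothetical.

```python
import difflib

def align_ocr_versions(a, b):
    """Character-level alignment of two OCR transcripts of the same
    passage. Returns (op, span_a, span_b) triples: 'equal' spans agree,
    'replace'/'insert'/'delete' spans mark candidate noisy/target pairs.
    A sketch only; not the paper's alignment procedure."""
    sm = difflib.SequenceMatcher(None, a, b)
    return [(op, a[i1:i2], b[j1:j2]) for op, i1, i2, j1, j2 in sm.get_opcodes()]

# Two noisy witnesses of the same sentence (illustrative strings):
pairs = align_ocr_versions("Tha quick hrown fox", "The quick brown fox")
```

The `equal` spans give high-confidence text, while the mismatching spans supply the noisy-input/target pairs that unsupervised training can consume.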
“…Another approach is to train a supervised system from synthetic training data, using features such as proposed spelling corrections (Lund et al, 2011). Dong and Smith (2018) propose an alternative unsupervised training technique for OCR post-correction, which builds on character-level LSTMs. In their method, which they call seq2seq-noisy, they build an ensemble of post-processing systems.…”
Section: Related Work
confidence: 99%