Combination of multiple aligned recognition outputs using WFST and LSTM

Azawi, Mayce Al; Liwicki, Marcus; Breuel, Thomas M.

doi:10.1109/icdar.2015.7333720

Cited by 10 publications

(8 citation statements)

References 10 publications


“…Instead of aligning OCR versions of the same scan, an approach of Wemhoener et al [163] enables to create a sequence alignment of OCR outputs with the scans of different copies of the same book, or its different editions. Al Azawi et al [4,8] apply Line-to-Page alignment that aligns each line of the 1st OCR with the whole page of the second OCR using Weighted Finite-State Transducers (WFST).…”

Section: Isolated-word Approaches (mentioning)

confidence: 99%

“…In the last step, several techniques are applied to choose the best sequence. Lopresti et al [91], Lin [87], Wemhoener et al [163], and Reul et al [129] utilize voting policy, Al Azawi et al [4,8] use Long Short-Term Memory (LSTM) [64] to decide the most relevant output. Different kinds of features (voting, number, dictionary, gazetteer, and lexical feature) are used in learning decision list, maximum entropy classification or conditional random fields (CRF) methods to choose the best possible correction by Lund et al [92, 94, 95, 96].…”

Section: Isolated-word Approaches (mentioning)

confidence: 99%


Survey of Post-OCR Processing Approaches

et al. 2021


Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies the current trend and outlines some research directions of this field.

“…Ensemble methods have been shown to be effective in OCR postcorrection by combining OCR output from multiple scans of the same document (Lopresti and Zhou, 1997;Klein and Kopel, 2002;Cecotti and Belaïd, 2005;Lund et al, 2013). Existing methods aim at generating consensus results by aligning multiple inputs, followed by supervised methods such as classification (Boschetti et al, 2009;Lund et al, 2011;Al Azawi et al, 2015), or unsupervised methods such as dictionary-based selection (Lund and Ringger, 2009) and voting (Wemhoener et al, 2013;Xu and Smith, 2017). While supervised ensemble methods require human annotation for training, unsupervised selection methods work only when the correct word or character exists in one of the inputs.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most of these ensemble methods, however, require aligning multiple OCR outputs (Lund and Ringger, 2009;Lund et al, 2011), which is intractable in general and might introduce noise into the later correction stage. Furthermore, voting-based ensemble methods (Lund and Ringger, 2009;Wemhoener et al, 2013;Xu and Smith, 2017) only work where the correct output exists in one of the inputs, while classification methods (Boschetti et al, 2009;Lund et al, 2011;Al Azawi et al, 2015) are also trained on human annotations.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Input Attention for Unsupervised OCR Correction

Dong and Smith 2018

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either training to match noisily observed textual variants or bootstrapping from a uniform error model. On two corpora of historical newspapers and books, we show that these unsupervised techniques cut the character and word error rates nearly in half on single inputs and, with the addition of multi-input decoding, can rival supervised methods.


“…Azawi et al [13] used weighted finite-state transducers based on edit rules to align the output of two different OCR engines. Neural LSTM networks trained on the aligned outputs are used to return a best voting.…”

Section: Related Work (mentioning)

confidence: 99%

Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting

Reul, Springmann, Wick, et al. 2018

2018 13th IAPR International Workshop on Document Analysis Systems (DAS)


In this paper we introduce a method that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books. The method uses a combination of cross fold training and confidence based voting. After allocating the available ground truth in different subsets several training processes are performed, each resulting in a specific OCR model. The OCR text generated by these models then gets voted to determine the final output by taking the recognized characters, their alternatives, and the confidence values assigned to each character into consideration. Experiments on seven early printed books show that the proposed method outperforms the standard approach considerably by reducing the amount of errors by up to 50% and more.
