Optimizing the Training of Models for Automated Post-Correction of Arbitrary OCR-ed Historical Texts

Englmeier, Tobias; Fink, Florian; Springmann, Uwe; Schulz, Klaus U.

doi:10.21248/jlcl.35.2022.232

JLCL

2022

Abstract: Systems for post-correction of OCR results for historical texts are based on statistical correction models obtained by supervised learning. Training requires suitable collections of ground truth materials. In this paper we investigate how the power of automated OCR post-correction depends on the form of the ground truth data and on the other training settings used to compute a post-correction model. The post-correction system A-PoCoTo considered here is based on a profiler service that computes a…

Cited by 1 publication (1 citation statement)

References 13 publications (23 reference statements)

“…Evaluation Metrics and Benchmarks [34]: Establishing appropriate evaluation metrics and benchmarks for multilingual OCR systems [35] is vital for assessing performance, identifying areas for improvement, and facilitating model comparison. These metrics should include character- and word-level recognition rates, language identification accuracy, and domain-specific evaluation measures.…”

mentioning (confidence: 99%)

Optimal Training Dataset Preparation for AI-Supported Multilanguage Real-Time OCRs Using Visual Methods

Biró, Szilágyi, Szilágyi

2023

Applied Sciences

In the realm of multilingual, AI-powered, real-time optical character recognition systems, this research explores the creation of an optimal, vocabulary-based training dataset. This comprehensive endeavor seeks to encompass a range of criteria: comprehensive language representation, high-quality and diverse data, balanced datasets, contextual understanding, domain-specific adaptation, robustness and noise tolerance, and scalability and extensibility. The approach aims to leverage techniques like convolutional neural networks, recurrent neural networks, convolutional recurrent neural networks, and single visual models for scene text recognition. While focusing on English, Hungarian, and Japanese as representative languages, the proposed methodology can be extended to any existing or even synthesized languages. The development of accurate, efficient, and versatile OCR systems is at the core of this research, offering societal benefits by bridging global communication gaps, ensuring reliability in diverse environments, and demonstrating the adaptability of AI to evolving needs. This work not only mirrors the state of the art in the field but also paves new paths for future innovation, accentuating the importance of sustained research in advancing AI’s potential to shape societal development.
