General Models for Handwritten Text Recognition: Feasibility and State-of-the Art. German Kurrent as an Example

Hodel, Tobias; Schoch, David; Schneider, Christa; Purcell, Jake

doi:10.5334/johd.46

Cited by 10 publications

(7 citation statements)

References 5 publications

“…This is probably what motivated Ströbel et al (2022) in using the perplexity of language models to detect the erroneous output in an unsupervised manner. On the other hand, although we consider that recognition errors are overall more for handwritten text compared to printed material, the quality of recognition can vary significantly for the former (Hodel et al, 2021), as is also shown in Fig. 1, and does not always come with a high error rate.…”

Section: Related Work (mentioning)

confidence: 92%

Detecting Erroneously Recognized Handwritten Byzantine Text

Pavlopoulos, Kougia, Platanou et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023

Handwritten text recognition (HTR) yields textual output that comprises errors, which are considerably more compared to that of recognised printed (OCRed) text. Post-correcting methods can eliminate such errors but may also introduce errors. In this study, we investigate the issues arising from this reality in Byzantine Greek. We investigate the properties of the texts that lead post-correction systems to this adversarial behaviour and we experiment with text classification systems that learn to detect incorrect recognition output. A large masked language model, pre-trained in modern and fine-tuned in Byzantine Greek, achieves an Average Precision score of 95%. The score improves to 97% when using a model that is pretrained in modern and then in ancient Greek, the two language forms Byzantine Greek combines elements from. A century-based analysis shows that the advantage of the classifier that is further-pre-trained in ancient Greek concerns texts of older centuries. The application of this classifier before a neural post-corrector on HTRed text reduced significantly the postcorrection mistakes.

“…However, recent developments in technology combined with new infrastructures and software have made these methods more and more accessible. Methods such as CRNN aim to reduce training data requirements and modern models can now achieve character error rates (CER) below 2% for manuscripts, indicating the effectiveness of these technologies (Hodel et al [2021]).…”

Section: A Brief History of ATR (mentioning)

confidence: 99%

Historical Documents and Automatic Text Recognition: Introduction

Pinche, Stokes 2024

Journal of Data Mining & Digital Humanities

With this special issue of the Journal of Data Mining and Digital Humanities (JDMDH), we bring together in one single volume several experiments, projects and reflections related to automatic text recognition applied to historical documents.

More and more research projects now include automatic text acquisition in their data processing chain, and this is true not only for projects focussed on Digital or Computational Humanities but increasingly also for those that are simply using existing digital tools as the means to an end. The increasing use of this technology has led to an automation of tasks that affects the role of the researcher in the textual production process. This new data-intensive practice makes it urgent to collect and harmonise the corpora necessary for the constitution of training sets, but also to make them available for exploitation. This special issue is therefore an opportunity to present articles combining philological and technical questions to make a scientific assessment of the use of automatic text recognition for ancient documents, its results, its contributions and the new practices induced by its use in the process of editing and exploring texts. We hope that practical aspects will be questioned on this occasion, while raising methodological challenges and its impact on research data.

The special issue on Automatic Text Recognition (ATR) is therefore dedicated to providing a comprehensive overview of the use of ATR in the humanities field, particularly concerning historical documents in the early 2020s. This issue presents a fusion of engineering and philological aspects, catering to both beginners and experienced users interested in launching projects with ATR. The collection encompasses a diverse array of approaches, covering topics such as data creation or collection for training generic models, reaching specific objectives, technical and HTR machine architecture, segmentation methods, and image processing.

“…However, given the substantial variation in writing styles and hands across the medieval period and the scarcity of domain-specific ground truth, a more comprehensive approach to handwriting classification is necessary. With sufficient training data the merging of distinct hands into a single family-script model is achievable (Hodel et al [2021]). In our case, we adopt the classification based on Latin script families, as proposed by the CLAMM corpus, which encompasses 12 book-script families spanning the period from the 9th to the 15th centuries (Kestemont et al [2017]).…”

Section: Related Work (mentioning)

confidence: 99%

Handwritten Text Recognition for Documentary Medieval Manuscripts

Torres Aguilar, Jolivet 2023

Journal of Data Mining & Digital Humanities

Handwritten Text Recognition (HTR) techniques aim to accurately recognize sequences of characters in input manuscript images by training artificial intelligence models to capture historical writing features. Efficient HTR models can transform digitized manuscript collections into indexed and quotable corpora, providing valuable research insight for various historical inquiries. However, several challenges must be addressed, including the scarcity of relevant training corpora, the consequential variability introduced by different scribal hands and writing scripts, and the complexity of page layouts. This paper presents two models and one cross-model approach for automatic transcription of Latin and French medieval documentary manuscripts, particularly charters and registers, written between the 12th and 15th centuries and classified into two major writing scripts: Textualis (from the late-11th to 13th century) and Cursiva (from the 13th to the 15th century). The architecture of the models is based on a Convolutional Recurrent Neural Network (CRNN) coupled with a Connectionist Temporal Classification (CTC) loss. The training and evaluation of the models, involving 120k lines of text and almost 1M tokens, were conducted using three available ground-truth corpora: the e-NDP corpus, the Alcar-HOME database and the Himanis project. This paper describes the training architecture and corpora used, while discussing the main training challenges, results, and potential applications of HTR techniques on medieval documentary manuscripts.
