Historical document image analysis using controlled data for pre-training

Rahal, Najoua; Vögtlin, Lars; Ingold, Rolf

doi:10.1007/s10032-023-00437-8

Cited by 1 publication

(1 citation statement)

References 30 publications


“…In related fields, several studies have been conducted to investigate the impact of ground truth quality on deep learning, for example in the context of object detection [2,13], text-line segmentation [3,22], and semantic segmentation [20,25] in natural images or historical document images. However, the problems encountered for HTR are specific and to the best of our knowledge, there are currently no comprehensive studies on the impact of ground-truth quality for deep learning-based HTR.…”

Section: Introduction (mentioning)

confidence: 99%

Impact of the ground truth quality for handwriting recognition

Jungo, Vögtlin, Fakhari et al. 2023

Proceedings of the 12th International Symposium on Information and Communication Technology

Self Cite


Handwriting recognition is a key technology for accessing the content of old manuscripts, helping to preserve cultural heritage. Deep learning shows an impressive performance in solving this task. However, to achieve its full potential, it requires a large amount of labeled data, which is difficult to obtain for ancient languages and scripts. Often, a trade-off has to be made between ground truth quantity and quality, as is the case for the recently introduced Bullinger database. It contains an impressive amount of over a hundred thousand labeled text line images of mostly premodern German and Latin texts that were obtained by automatically aligning existing page-level transcriptions with text line images. However, the alignment process introduces systematic errors, such as wrongly hyphenated words. In this paper, we investigate the impact of such errors on training and evaluation and suggest means to detect and correct typical alignment errors.
