2019
DOI: 10.1007/s10032-019-00332-1

A two-stage method for text line detection in historical documents

Abstract: This work presents a two-stage text line detection method for historical documents. Each detected text line is represented by its baseline. In the first stage, a deep neural network called ARU-Net labels each pixel as belonging to one of three classes: baseline, separator, or other. The separator class marks the beginning and end of each text line. The ARU-Net is trainable from scratch with manageably few manually annotated example images (fewer than 50). This is achieved by utilizing data augmentation strategies. The ne…
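As a rough illustration of the three-class output described in the abstract, the sketch below builds a toy label map and extracts baseline pixels from it. The class indices, array layout, and the simple scan are hypothetical conveniences for illustration; the paper's actual second stage is considerably more involved and this is not the authors' code.

```python
# Hypothetical 3-class label map (0 = other, 1 = baseline, 2 = separator),
# as a stand-in for the per-pixel prediction produced by the first stage.
OTHER, BASELINE, SEPARATOR = 0, 1, 2
H, W = 6, 12
pred = [[OTHER] * W for _ in range(H)]
for x in range(1, 11):
    pred[2][x] = BASELINE          # a single horizontal baseline
pred[2][0] = pred[2][11] = SEPARATOR   # separators mark line start and end

# Simplified second stage: collect baseline pixels; each detected text line
# is then represented by this list of baseline points.
baseline = [(x, y) for y in range(H) for x in range(W) if pred[y][x] == BASELINE]
print(len(baseline))   # baseline pixels found on row y = 2
```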

Cited by 109 publications (102 citation statements)
References 49 publications (79 reference statements)
“…Modern approaches have overcome these obstacles using deep learning, presenting solutions that are robust in the presence of noise, arbitrary document layouts, arbitrary orientation, and curved text lines. Grüning et al. [2] use an FCN for pixel labeling followed by post-processing to extract text lines from the pixel predictions. Wigington et al. [3] use an FCN to detect the beginning of text lines and have a network segment the line by stepping along it.…”
Section: A. Text Detection
confidence: 99%
“…The primary exception is comments, which are often oriented independently of the document. While work has been done to detect accurate bounding regions for skewed and even curved text [3], [2], we choose a simpler method that is robust to small amounts of skew and assumes straight lines.…”
Section: Detection
confidence: 99%
“…21 The technical partner for the development of the layout analysis, training, and recognition software is the CITlab 22 team at the University of Rostock, whose approach performed best on the subtask of detecting baselines, that is, the line supporting the main bodies of characters within a text line, in the competition on layout analysis for challenging medieval manuscripts at ICDAR 2017 [28]. Several related publications are available (see, for example, [29,30] for layout analysis and [31] for HTR), but to the best of our knowledge the exact state of the software actually incorporated in Transkribus is not publicly known. Therefore, the best source for results seems to be a recently published (May 2019) talk 23 which briefly sums up some evaluations: after training on close to 36,000 words corresponding to 182 pages, a CER of 3.1% and a WER of 13.1% were achieved on a dataset from the 18th century written by a single writer in German.…”
Section: Transkribus
confidence: 99%
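For context on the CER and WER figures quoted above: both are the edit (Levenshtein) distance between recognized and reference text, normalized by the reference length, computed over characters for CER and over words for WER. A minimal sketch (a generic illustration, not the evaluation code used in Transkribus):

```python
def edit_distance(ref, hyp):
    # Classic Levenshtein dynamic program over two sequences
    # (characters for CER, word lists for WER).
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(round(cer("text line", "test line"), 3))  # -> 0.111 (1 substitution / 9 chars)
```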
“…These models have been trained on a wide variety of books and typesets and, depending on the material used, can usually provide at least a valid starting point for manual GT production or even a satisfactory final result. OCR4all comes with four standard single models 30 which are automatically incorporated and made available when building the Docker image: antiqua_modern, antiqua_historical, fraktur_19th_century, and fraktur_historical. Since voting ensembles have proven to be very effective, we additionally provide a full set of model ensembles 31 consisting of five models for each of the four single-model areas mentioned above, which can be downloaded and added directly into OCR4all.…”
Section: Character Recognition
confidence: 99%
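The voting ensembles mentioned here combine the outputs of several recognition models. As a toy illustration of the idea, the sketch below takes a per-position majority vote over aligned model outputs; the inputs are hypothetical, real OCR voting (e.g. in Calamari, which OCR4all builds on) first aligns the confidence sequences rather than assuming equal-length strings:

```python
from collections import Counter

def vote(predictions):
    # predictions: one recognized string per ensemble model.
    # Simplifying assumption: outputs are already aligned character-by-character.
    assert len({len(p) for p in predictions}) == 1, "toy voter needs equal lengths"
    return "".join(Counter(chars).most_common(1)[0][0]
                   for chars in zip(*predictions))

# Five hypothetical model outputs for the same text line image.
models = ["fraktur", "fraktvr", "fraktur", "fcaktur", "fraktur"]
print(vote(models))  # -> "fraktur"
```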