Layout Detection and Table Recognition – Recent Challenges in Digitizing Historical Documents and Handwritten Tabular Data

Lehenmeier, Constantin; Burghardt, Manuel; Mischka, Bernadette

doi:10.1007/978-3-030-54956-5_17

Cited by 12 publications

(3 citation statements)

References 23 publications

“…When we started this project, Transkribus was not able to extract the structure of the source image. Similar challenges were reported for other projects transcribing historical sources with tabular structure (Lehenmeier et al, 2020). However, Transkribus offers a good GUI solution for the manual labelling and exporting of images that is useful for professional transcribers (Vézina et al, 2019).…”

Section: Comparison To Related Work (mentioning)

confidence: 61%

Lessons Learned Developing and Using a Machine Learning Model to Automatically Transcribe 2.3 Million Handwritten Occupation Codes

Pedersen, Holsbø, Andersen, et al. 2022. hlcs

Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. The correctness of the model results must also be verified. This paper describes our lessons learned developing, tuning and using the Occode end-to-end machine learning pipeline for transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our results matches the distribution found in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned may be useful for other transcription projects that plan to use machine learning in production. The source code is available at https://github.com/uit-hdl/rhd-codes.

“…A study for the layout analysis of historical newspapers has been conducted which achieved very good results e.g. [27]. One study explores visual trends in newspapers and models them as a multimodal construct consisting of text and images.…”

Section: Applications Of Image Processing In Digital Humanities (mentioning)

confidence: 99%

Deep learning for historical books: classification of printing technology for digitized images

Kim, Mandl 2021. Multimed Tools Appl

Printing technology has evolved through the past centuries due to technological progress. Within Digital Humanities, images are playing a more prominent role in research. For mass analysis of digitized historical images, bias can be introduced in various ways. One of them is the printing technology originally used. The classification of images to their printing technology e.g. woodcut, copper engraving, or lithography requires highly skilled experts. We have developed a deep learning classification system that achieves very good results. This paper explains the challenges of digitized collections for this task. To overcome them and to achieve good performance, shallow networks and appropriate sampling strategies needed to be combined. We also show how class activation maps (CAM) can be used to analyze the results.

“…The extraction of different layout elements of articles is an important component of scientific data curation, with the accuracy of extraction of the elements such as tables, figures and their captions increasing significantly over the past several years [4,15,25,51]. A large field of study within document layout analysis is the "mining" of PDFs as newer PDFs are generally in "vector" format – the document is rendered from a set of instructions instead of pixel-by-pixel as in a raster format, and, in theory, the set of instructions can be parsed to determine the locations of figures, captions and tables [3,9,23].…”

Section: Introduction (mentioning)

confidence: 99%

Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features

Naiman, Williams, Goodman 2022. Linking Theory and Practice of Digital Libraries

Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress to extract figures and their captions has been made, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, post-Optical Character Recognition (OCR), which uses both grayscale and OCR-features. When applied to the astrophysics literature holdings of the Astrophysics Data System (ADS), we find F1 scores of 90.9% (92.2%) for figures (figure captions) with the intersection-over-union (IOU) cut-off of 0.9 which is a significant improvement over other state-of-the-art methods.
