How to Do Lexical Quality Estimation of a Large OCRed Historical Finnish Newspaper Collection with Scarce Resources

Kettunen, Kimmo

doi:10.16995/dscn.315

Cited by 2 publications (2 citation statements)

References 20 publications (38 reference statements)


“…The original OCR for Uusi Suometar was performed using a line of ABBYY FineReader® products. The quality of the digitization of the whole collection of Finnish newspapers from 1771 to 1910 has been estimated by Kettunen and Pääkkönen (2016). They conclude that ca 70-75% of the words in the Finnish language 2.4-billion-word index database could be recognized by using automatic morphological analysers.…”

Section: Topic Creation For the Study (mentioning)

confidence: 99%

Optical character recognition quality affects subjective user perception of historical newspaper clippings

Kettunen, Keskustalo, Kumpulainen, et al. 2023

Self Cite


Purpose: This study aims to identify user perception of different qualities of optical character recognition (OCR) in texts. The purpose of this paper is to study the effect of different quality OCR on users' subjective perception through an interactive information retrieval task with a collection of one digitized historical Finnish newspaper.

Design/methodology/approach: This study is based on the simulated work task model used in interactive information retrieval. Thirty-two users made searches to an article collection of Finnish newspaper Uusi Suometar 1869–1918 which consists of ca. 1.45 million autosegmented articles. The article search database had two versions of each article with different quality OCR. Each user performed six pre-formulated and six self-formulated short queries and evaluated subjectively the top 10 results using a graded relevance scale of 0–3. Users were not informed about the OCR quality differences of the otherwise identical articles.

Findings: The main result of the study is that improved OCR quality affects subjective user perception of historical newspaper articles positively: higher relevance scores are given to better-quality texts.

Originality/value: To the best of the authors’ knowledge, this simulated interactive work task experiment is the first one showing empirically that users' subjective relevance assessments are affected by a change in the quality of an optically read text.


“…This has a direct consequence on a proper rendering of the paper in digital form. Consequently, this impacts on the adequate interpretation of the language in the text using Natural Language Processing techniques such as word and sentence tokenisation, part-of-speech tagging (Lopresti, 2009, Tanner et al, 2009, Kettunen and Pääkkönen, 2016) in correctly interpreting text. Lately, the use of Machine Learning techniques has been gaining traction in building models that can be trained to recognise information entities from textual sources.…”

Section: Related Work (mentioning)

confidence: 99%

Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science

Nundloll, Smail, Blair 2022, Heliyon


How to Do Lexical Quality Estimation of a Large OCRed Historical Finnish Newspaper Collection with Scarce Resources

Abstract: This is a peer-reviewed article in Digital Studies/Le champ numérique, a journal published by the Open Library of Humanities.
