2020
DOI: 10.16995/dscn.315

How to Do Lexical Quality Estimation of a Large OCRed Historical Finnish Newspaper Collection with Scarce Resources

Abstract: This is a peer-reviewed article in Digital Studies/Le champ numérique, a journal published by the Open Library of Humanities.

Cited by 2 publications (2 citation statements)
References 20 publications (38 reference statements)
“…The original OCR for Uusi Suometar was performed using a line of ABBYY FineReader® products. The quality of the digitization of the whole collection of Finnish newspapers from 1771 to 1910 has been estimated by Kettunen and Pääkkönen (2016). They conclude that ca 70-75% of the words in the Finnish language 2.4-billion-word index database could be recognized by using automatic morphological analysers.…”
Section: Topic Creation For the Study (mentioning)
confidence: 99%
“…This has a direct consequence on a proper rendering of the paper in digital form. Consequently, this impacts on the adequate interpretation of the language in the text using Natural Language Processing techniques such as word and sentence tokenisation, part-of-speech tagging (Lopresti, 2009, Tanner et al, 2009, Kettunen and Pääkkönen, 2016) in correctly interpreting text. Lately, the use of Machine Learning techniques has been gaining traction in building models that can be trained to recognise information entities from textual sources.…”
Section: Related Work (mentioning)
confidence: 99%