2019
DOI: 10.1093/llc/fqz024

Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Abstract: This article aims to quantify the impact optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of specific analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The arti…

Cited by 63 publications (29 citation statements)
References 24 publications (11 reference statements)
“…This is an instance of intrinsic OCR evaluation, where we only rely on the OCR model to assess it(self). Such assessments are unsatisfactory because they might not be comparable when the software/provider changes and provide no indication on how the OCR quality influences other tasks or is related to other, external data or systems (Hill and Hengchen, 2019). This is the broader scope of extrinsic OCR evaluations.…”
Section: Related Work (mentioning)
confidence: 99%
“…Studies have considered information access and retrieval (Traub et al, 2018), authorship attribution (Franzini et al, 2018), named entity recognition (Hamdi et al, 2019), and topic modelling (Nelson, 2020; Mutuvi et al, 2018). Recently (Hill and Hengchen, 2019) compared different tasks on a corpus in English: topic modelling, collocation analysis, authorship attribution and vector space modelling. From this study, a critical OCR quality threshold between 70 and 80% emerged, where most tasks perform very poorly below this threshold, good results are achieved above it, and varying results are achieved within, according to the task at hand.…”
Section: Related Work (mentioning)
confidence: 99%
“…If some NLP resource does not exist for a language, stop complaining about how low-resourced it is, get up and gather the data. Of course, there are always exceptions when gathering the data required for a large language might not be a walk in a park such as when dealing with historical data [11]. And it is true that even resources for non-endangered languages can be noisy [19].…”
Section: Introduction (mentioning)
confidence: 99%