Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

Hill, Mark J.; Hengchen, Simon

doi:10.1093/llc/fqz024

Cited by 63 publications

(29 citation statements)

References 24 publications

(11 reference statements)

“…This is an instance of intrinsic OCR evaluation, where we only rely on the OCR model to assess it(self). Such assessments are unsatisfactory because they might not be comparable when the software/provider changes and provide no indication on how the OCR quality influences other tasks or is related to other, external data or systems (Hill and Hengchen, 2019). This is the broader scope of extrinsic OCR evaluations.…”

Section: Related Work (mentioning)

confidence: 99%

“…Studies have considered information access and retrieval (Traub et al, 2018), authorship attribution (Franzini et al, 2018), named entity recognition (Hamdi et al, 2019), and topic modelling (Nelson, 2020; Mutuvi et al, 2018). Recently (Hill and Hengchen, 2019) compared different tasks on a corpus in English: topic modelling, collocation analysis, authorship attribution and vector space modelling. From this study, a critical OCR quality threshold between 70 and 80% emerged, where most tasks perform very poorly below this threshold, good results are achieved above it, and varying results are achieved within, according to the task at hand.…”

Section: Related Work (mentioning)

confidence: 99%

“…Another element to be explored is the impact of time, and consequently of the combined effects of linguistic change and OCR quality on the application of tools usually trained on contemporary languages. Lastly, the range of tasks which are considered in previous work is limited, with comparisons across tasks attempted in a single, seminal paper (Hill and Hengchen, 2019). In this work, we start addressing these research questions by considering a larger set of tasks and utilizing text drawn from a source which poses many challenges for OCR: historic newspapers (Pletschacher et al, 2014).…”

Section: Related Work (mentioning)

confidence: 99%

“…We find that, while quality bands 1 and, to a lesser degree 2, still maintain a good fidelity with their human-corrected counterparts, this is not the case for bands 3 and, particularly, 4. The issue is not as much that OCR topic models became meaningless but, more subtly, that they retain their interpretability (Hill and Hengchen, 2019) while becoming substantially different from what they would be using clean texts. $-\sum_{t \in V_i} p_i(t) \log[p_i(t)]$.…”

Section: Topic Modelling (mentioning)

confidence: 99%

Assessing the Impact of OCR Quality on Downstream NLP Tasks

Strien, Beelen, Ardanuy, et al. 2020

Proceedings of the 12th International Conference on Agents and Artificial Intelligence

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks (sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning) using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.

“…If some NLP resource does not exist for a language, stop complaining about how low-resourced it is, get up and gather the data. Of course, there are always exceptions when gathering the data required for a large language might not be a walk in a park such as when dealing with historical data [11]. And it is true that even resources for non-endangered languages can be noisy [19].…”

Section: Introduction (mentioning)

confidence: 99%

Endangered Languages are not Low-Resourced!

Hämäläinen 2021

Multilingual Facilitation

The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.

Design of Text Resources and Tools

McGillivray, Tóth 2020

Applying Language Technology in Humanities Research
