2022
DOI: 10.1093/llc/fqac014

Reading in the mist: high-quality optical character recognition based on freely available early modern digitized books

Abstract: In this paper, we present a workflow for reworking digitized versions of early modern books, freely available in the public domain, in such a way that they will be capable of yielding high-quality optical character recognition (OCR) results suitable for computational text mining. Testing our method, we observed that anything above 90% OCR accuracy is sufficient for semantic analysis. In addition, the overall homogeneity in the OCR accuracy across the corpus proved to be more important than having perhaps only …

Cited by 2 publications (2 citation statements)
References 15 publications

“…For example, van Strien et al (2020) and Hill and Hengchen (2019) show that for 85-90% correctly transcribed texts, good results can be arrived at more or less irrespective of the method applied. Our own research of the impact of the OCR inaccuracies on collocate extraction shows that, compared with a fully accurate transcription, an 80% and more highly accurate transcription provides close to exactly the same results (Sangiacomo et al 2022a). Indeed, for collocate extraction, it seems that a truly random distribution of errors would lead to significant problems only from 70% downwards.…”
Section: 53: Methods Of Digitization
mentioning
confidence: 77%
“…Concerning the differences between the final corpus and what was found in the dictionaries, see Sangiacomo et al 2021. 11 For further information on the digitization of the corpus, see Sangiacomo et al 2022a. …”
mentioning
confidence: 99%