Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries 2022
DOI: 10.1145/3529372.3533298

A prototype Gutenberg-HathiTrust sentence-level parallel corpus for OCR error analysis

Abstract: This exploratory study proposes a prototype sentence-level parallel corpus to support the study of optical character recognition (OCR) quality in curated digitized library collections. Existing data resources, such as ICDAR2019 [21] and GT4HistOCR [23], generally align content by characteristics of the published artifact, such as documents or lines, which limits their use for exploring OCR noise at natural-language granularities such as sentences and chapters. Building upon an existing volume-aligned corpus that collect…

Citations: cited by 2 publications (2 citation statements)
References: 18 publications (21 reference statements)
“…With the increasing popularity of applying NLP techniques to DL textual resources for macro-level computation research [24,4], concerns about the reliability of NLP techniques for processing digitized library collections have recently been on the rise [27,45,48]. Based on our literature review, one of the major issues that challenges NLP techniques' reliability on OCR'd texts is their potential inclusion of errors resulting from the OCR process [27,45,48].…”
Section: Impact of OCR Errors on Downstream NLP Tasks (mentioning)
confidence: 99%
“…Overall, existing work concentrating on the investigation of the impact of OCR errors on NLP tasks can be divided into two groups. One is based on quantitative analysis [27,45]. In this group, researchers usually measure and compare the performance differences of the same NLP tool applied on the clean versus the OCR'd version of texts.…”
Section: Impact of OCR Errors on Downstream NLP Tasks (mentioning)
confidence: 99%