2011 International Conference on Document Analysis and Recognition
DOI: 10.1109/icdar.2011.157

A Fast Alignment Scheme for Automatic OCR Evaluation of Books

Abstract: This paper aims to evaluate the accuracy of optical character recognition (OCR) systems on real scanned books. The ground truth e-texts are obtained from the Project Gutenberg website and aligned with their corresponding OCR output using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned with unique words in the OCR output. This process is recursively applied to each text segment in between matching unique words until the text segments become …
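The recursive idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `unique_anchors` and `align` are hypothetical, the anchors are kept order-consistent with a longest increasing subsequence (an assumption made here), and because the abstract's stopping condition is truncated, the sketch simply stops when a segment pair shares no unique words.

```python
from bisect import bisect_left
from collections import Counter


def unique_anchors(a, b):
    """Order-consistent (i, j) pairs of words occurring exactly once in both a and b."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {w: j for j, w in enumerate(b) if count_b[w] == 1}
    pairs = [(i, pos_b[w]) for i, w in enumerate(a)
             if count_a[w] == 1 and w in pos_b]
    # Keep the longest chain whose positions increase in both texts
    # (longest increasing subsequence over j; an assumption of this sketch).
    tails, tail_idx, prev = [], [], [None] * len(pairs)
    for k, (_, j) in enumerate(pairs):
        p = bisect_left(tails, j)
        if p == len(tails):
            tails.append(j)
            tail_idx.append(k)
        else:
            tails[p] = j
            tail_idx[p] = k
        prev[k] = tail_idx[p - 1] if p > 0 else None
    chain = []
    k = tail_idx[-1] if tail_idx else None
    while k is not None:
        chain.append(pairs[k])
        k = prev[k]
    return chain[::-1]


def align(a, b, off_a=0, off_b=0):
    """Recursively collect matched word positions as global (index_in_a, index_in_b) pairs."""
    anchors = unique_anchors(a, b)
    if not anchors:
        # Assumed stopping rule: no shared unique words left in this segment pair.
        return []
    matches = []
    prev_i, prev_j = 0, 0
    # The sentinel pair closes the final gap after the last anchor.
    for i, j in anchors + [(len(a), len(b))]:
        # Words that repeat in the whole text may be unique inside a gap.
        matches += align(a[prev_i:i], b[prev_j:j], off_a + prev_i, off_b + prev_j)
        if i < len(a):
            matches.append((off_a + i, off_b + j))
        prev_i, prev_j = i + 1, j + 1
    return matches
```

Calling `align(ground_truth_words, ocr_words)` on two token lists returns the index pairs of words matched through unique anchors at some recursion level; segments left unmatched would still need a finer-grained comparison, which this sketch does not attempt.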

Citations: cited by 51 publications (39 citation statements)
References: 5 publications
“…We have recently learned about a similar tool, RETAS [14]. This starts its alignment based on the neat idea that texts through their Zipf distribution typically have about 50% of their word types being hapaxes.…”
Section: The Problem Of Aligning 'Old' With 'Gold' (mentioning)
confidence: 99%
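The hapax observation in this statement can be checked on any tokenized text. The sketch below is only an illustration; the helper name `hapax_ratio` is hypothetical, and tokenization is assumed to have been done already.

```python
from collections import Counter


def hapax_ratio(tokens):
    """Fraction of word types that occur exactly once (hapaxes) in a token list."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)
```

A ratio near 0.5 on a book-length text would be consistent with the claim quoted above; short texts can deviate widely.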
“…For evaluation purposes, a noise-free version of the same text is downloaded from the Project Gutenberg's website. For labeling word bounding boxes, the OCR output and the ground truth text are aligned using a text alignment tool [15]. The estimated character accuracy for the whole book is 98.4%.…”
Section: A Datasets (mentioning)
confidence: 99%
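The statement does not say how character accuracy was estimated. One common convention, assumed here, is one minus the Levenshtein edit distance divided by the length of the ground truth. The function names below are hypothetical, and the quadratic-time loop is only practical for short strings or for pre-aligned segments.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # delete a character of ref
                           cur[j - 1] + 1,             # insert a character of hyp
                           prev[j - 1] + (rc != hc)))  # substitute or match
        prev = cur
    return prev[-1]


def character_accuracy(ref, hyp):
    """Assumed definition: 1 - edit_distance / len(ref), clipped at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1 - edit_distance(ref, hyp) / len(ref))
```

For example, `character_accuracy("dog", "d0g")` counts one substitution in three characters.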
“…Feng [1] and Yalniz [8] both present different methods of aligning ground truth with OCR that do not rely on the existence of a canonically correct electronic document. Instead, both assume their ground truth texts to be in the form of plain text, typically ASCII.…”
Section: Related Work (mentioning)
confidence: 99%