Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage 2017
DOI: 10.1145/3078081.3078096

Profiling of OCR'ed Historical Texts Revisited

Abstract: In the absence of ground truth it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet, for interactive postcorrection of OCR'ed historical printings it is extremely useful to have a statistical profile available that provides an estimate of error classes with associated frequencies, and that points to conjectured errors and suspicious tokens. The method introduced in [3] computes such a profile, combining lexica, pattern sets and advanced matching t…

Cited by 12 publications (9 citation statements)
References 4 publications
“…In the revisited version of this method, Fink et al [49] additionally refine profiles from user feedback of manual correction steps. They also enlarge their set of patterns with addition[al] ones found in documents of earlier periods.…”
Section: Isolated-word Approaches (mentioning)
confidence: 99%
“…Therefore, we aim to include a trainable pixel classifier in order to either provide a valid starting point for other segmentation approaches by classifying pixels and consequently connected contours as text, image, and noise or even perform a fine-grained semantic markup [46]. Of course, a more powerful segmentation approach must also comprise a more sophisticated method for the determination of the reading order which also has to be integrated into LAREX. To generate the reading order, the idea is to allow the user to comfortably specify rules based on the detected region types as well as their absolute and relative position.…”
Section: Future Work (mentioning)
confidence: 99%
“…The system is under active development which resulted in several improvements on the original approach. In [40] Fink et al added three major extensions: First, making the system more adaptive to manual interventions of the user increased the precision with respect to identifying erroneous OCR tokens. Second, the linguistic background resources were extended by new historical patterns which leads [to] a more successful discrimination of historical spelling from real OCR errors.…”
Section: Pocoto (mentioning)
confidence: 99%