Proceedings of the 9th ACM Symposium on Document Engineering 2009
DOI: 10.1145/1600193.1600236

On lexical resources for digitization of historical documents

Cited by 14 publications (7 citation statements)
References 9 publications

“…Despite high image quality and standard typeface, OCR word accuracy is only 86.65%, which is much worse than for modern languages; Holley [2] classifies this as "poor accuracy". It is due to the fact that the OCR software is unaware of the special characters and the lack of lexicons or other linguistic resources for historical language variants, which also do not have standardized orthographies (see Gotscharek et al [1] on the issue of language resources). While mif2html required software development, our experiments with text samples indicate that training the OCR software or correcting its output would both require an inordinate amount of human labor to produce any significant effect.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…1 Even though FrameMaker documents are generally portable between the Macintosh, UNIX, and Windows versions of FrameMaker, the character encoding is always platform-specific. The character encoding of our documents is MacRoman.…”
Section: Character Encoding
Citation type: mentioning
confidence: 99%
“…Just as humans rely on dictionaries to understand, access and navigate cultural heritage so, too, do machines. Presently, dictionaries play a crucial role in many kinds of computational systems; for example, multilingual data processing and interchange (see, inter alia, Dietrich 2010); machine translation (see, inter alia, Aljlayl et al 2011); and information retrieval (see, inter alia, Gotscharek et al 2009). While sophisticated search engines deliver an abundance of information, one of the greatest on-going challenges of modern day information technology is to make this abundance manageable.…”
Section: For Man and Machine: The Changing Role Of The Digital Dictio…
Citation type: mentioning
confidence: 99%
“…There have been a lot of work in preserving books in the form of scanned images, digitizing scanned pages using OCR engines and crowdsourcing, analyzing digital data, and building archive explorers [21,6,17,37,7,64,44,61,60]. While OCR engines offer a scalable mechanism to digitize scanned images, they have limited accuracy or no support for many of the world's popular languages, hence automating digitization work-flow [52] is not feasible.…”
Section: Related Work
Citation type: mentioning
confidence: 99%