Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval 2012
DOI: 10.1145/2348283.2348347

Finding translations in scanned book collections

Abstract: This paper describes an approach for identifying translations of books in large scanned book collections with OCR errors. The method is based on the idea that although individual sentences do not necessarily preserve the word order when translated, a book must preserve the linear progression of ideas for it to be a valid translation. Consider two books in two different languages, say English and German. The English book in the collection is represented by the sequence of words (in the order they appear in the …
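
The abstract is cut off before the method is fully described, so the following is only an illustrative sketch of the stated idea (a valid translation preserves the order in which ideas appear), not the authors' algorithm. It assumes a toy bilingual lexicon and uses the longest common subsequence (LCS) as one possible order-preserving measure; the names `lcs_length`, `order_preservation_score`, and the example word lists are hypothetical.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_preservation_score(
    source_words: List[str],
    target_words: List[str],
    lexicon: Dict[str, str],
) -> float:
    """Fraction of words that can be matched in the same order.

    `lexicon` is a hypothetical target-to-source word mapping; target
    words missing from it are dropped before the comparison.
    """
    mapped = [lexicon[w] for w in target_words if w in lexicon]
    if not source_words or not mapped:
        return 0.0
    return lcs_length(source_words, mapped) / min(len(source_words), len(mapped))


# Toy data (assumed for illustration, not taken from the paper).
lexicon = {
    "wal": "whale",
    "schiff": "ship",
    "sturm": "storm",
    "hafen": "harbor",
    "insel": "island",
}
english = ["whale", "ship", "storm", "harbor", "island"]
german_in_order = ["wal", "schiff", "sturm", "hafen", "insel"]
german_shuffled = ["sturm", "wal", "insel", "schiff", "hafen"]

score_in_order = order_preservation_score(english, german_in_order, lexicon)
score_shuffled = order_preservation_score(english, german_shuffled, lexicon)
print(score_in_order, score_shuffled)
```

A word sequence that keeps the source order scores higher than one with the same vocabulary in a different order, which is the property the abstract relies on. A real system would also have to cope with OCR errors and incomplete dictionaries, which this sketch ignores.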

Cited by 3 publications (2 citation statements)
References 23 publications (23 reference statements)
“…There have been a lot of work in preserving books in the form of scanned images, digitizing scanned pages using OCR engines and crowdsourcing, analyzing digital data, and building archive explorers [21,6,17,37,7,64,44,61,60]. While OCR engines offer a scalable mechanism to digitize scanned images, they have limited accuracy or no support for many of the world's popular languages, hence automating digitization work-flow [52] is not feasible.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our proposed approach is inspired in previous works where the alignment of text sequences is used to correct errors between different editions of the same book [20], or to align original [19] and translated editions [21]. In other cases the semantic information is used to retrieval purposes [16] words related semantically are given as similar, although the transcription and the shape of the word is different-.…”
Section: Introduction (mentioning)
confidence: 99%