2016 | DOI: 10.2352/issn.2470-1173.2016.17.drr-065

Revisiting Known-Item Retrieval in Degraded Document Collections

Abstract: Optical character recognition software converts an image of text to a text document but typically degrades the document's contents. Correcting such degradation so that the document set can be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context-aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5's Confusion Track, with estimated error rates of 5% and 20%. On the 5% datase…
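As a rough illustration of the fusion the abstract describes, the sketch below combines correction candidates from context-free substring substitution rules with candidates from a context-aware bigram model. All rules, names, and data here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch: fuse context-free substring-rule candidates with
# context-dependent (bigram) candidates to correct an OCR-degraded term.
from collections import Counter

# Example rules modeling common OCR confusions, e.g. 'm' misread as 'rn'
# or 'o' misread as '0'. Illustrative only.
SUBSTRING_RULES = {"rn": ["m"], "1": ["l", "i"], "0": ["o"], "vv": ["w"]}

def rule_candidates(term, lexicon):
    """Generate correction candidates by applying substring rules."""
    candidates = set()
    for wrong, rights in SUBSTRING_RULES.items():
        if wrong in term:
            for right in rights:
                fixed = term.replace(wrong, right)
                if fixed in lexicon:
                    candidates.add(fixed)
    return candidates

def context_candidates(prev_word, bigram_counts, lexicon):
    """Rank lexicon words by how often they follow the previous word."""
    scored = [(bigram_counts.get((prev_word, w), 0), w) for w in lexicon]
    return [w for count, w in sorted(scored, reverse=True) if count > 0]

def correct(term, prev_word, lexicon, bigram_counts, k=2):
    """Fuse k context-free and k context-dependent candidates."""
    if term in lexicon:
        return term
    fused = list(rule_candidates(term, lexicon))[:k]
    fused += context_candidates(prev_word, bigram_counts, lexicon)[:k]
    return fused[0] if fused else term

# Toy usage: 'rnodel' (OCR error for 'model') corrected via rules + context.
lexicon = {"model", "made", "the"}
bigrams = Counter({("the", "model"): 5})
print(correct("rnodel", "the", lexicon, bigrams))  # -> 'model'
```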

Cited by 2 publications (4 citation statements) | References 18 publications
“…This suggests that 2 may be the best candidate vector size for most applications. This is supported by previous research ([24], [19]), which shows that a balance of context-free and context-dependent candidates performs best. By using a fusion method with a candidate vector size of 2, we select 2 candidates based on context (selected using bigrams) and 2 candidates not based on context (selected using Segments' substring rules).…”
Section: Methods (supporting)
confidence: 85%
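The candidate vector this excerpt describes can be pictured as below: up to k candidates from the bigram context and up to k from the substring rules, merged into one fused vector. The interleaving strategy and all names are assumptions for illustration; the citing work does not specify this exact merging logic.

```python
# Minimal sketch of assembling a fused candidate vector of size k per source:
# k candidates from bigram context plus k from substring rules (assumed
# already ranked). Names are hypothetical, not from the cited implementation.

def fuse_candidates(context_ranked, rule_ranked, k=2):
    """Interleave the top-k candidates from each source, deduplicated."""
    fused, seen = [], set()
    # Pad with None so both top-k lists are fully consumed by zip.
    for pair in zip(context_ranked[:k] + [None] * k,
                    rule_ranked[:k] + [None] * k):
        for cand in pair:
            if cand is not None and cand not in seen:
                seen.add(cand)
                fused.append(cand)
    return fused

# With k=2 the vector holds up to 2 context-based and 2 rule-based guesses.
print(fuse_candidates(["model", "made"], ["model", "medal"], k=2))
# -> ['model', 'made', 'medal']
```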
“…For example, we don't treat symbols as whitespace when tokenizing, because OCR systems are more likely to replace characters with punctuation. A detailed explanation of the filtering process is described in [19]. We then iterate over all the filtered terms, tasking each evaluated approach to generate substitution candidates for the unrecognized terms.…”
Section: Methods (mentioning)
confidence: 99%
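A minimal sketch of the tokenization choice described in this excerpt: split on true whitespace only, so symbols stay attached to their tokens, since an OCR engine may have substituted a letter with punctuation (e.g. 'qu!ck' for 'quick'). The regex and lexicon check are assumptions, not the filtering process detailed in [19].

```python
import re

def tokenize(text):
    """Split on whitespace only; punctuation stays inside its token."""
    return re.split(r"\s+", text.strip())

def is_unrecognized(token, lexicon):
    """Flag tokens absent from the lexicon as correction targets."""
    return token.lower() not in lexicon

# Toy usage: degraded tokens survive intact for candidate generation.
lexicon = {"the", "quick", "brown", "fox"}
tokens = tokenize("the qu!ck brown f0x")
print([t for t in tokens if is_unrecognized(t, lexicon)])
# -> ['qu!ck', 'f0x']
```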